
drm/msm/a6xx: Fix recovery vs runpm race

Message ID 20231218155927.368881-1-robdclark@gmail.com (mailing list archive)
State New, archived
Series drm/msm/a6xx: Fix recovery vs runpm race

Commit Message

Rob Clark Dec. 18, 2023, 3:59 p.m. UTC
From: Rob Clark <robdclark@chromium.org>

a6xx_recover() relies on the gpu lock to serialize against incoming
submits doing a runpm get, as it tries to temporarily balance out the
runpm gets with puts in order to power off the GPU.  Unfortunately this
gets worse when (in a later patch) we move the runpm get out of the
scheduler thread/work, in order to get it out of the fence signaling
path.

Instead, we can simplify the whole thing by using force_suspend() /
force_resume() rather than trying to be clever.

Reported-by: David Heidelberg <david.heidelberg@collabora.com>
Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10272
Fixes: abe2023b4cea ("drm/msm/gpu: Push gpu lock down past runpm")
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)
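
For context, here is a condensed sketch of the recovery power cycle with
this patch applied, as it would sit inside a6xx_gpu.c.  The two force_*
calls are taken from the diff below; the wrapper name and the elided
surrounding recovery steps are paraphrased from the driver for
illustration and may not match it exactly:

/*
 * Condensed sketch of a6xx_recover()'s power-cycle steps with this
 * patch applied; unrelated recovery steps are elided.
 */
static void a6xx_recover_power_cycle_sketch(struct msm_gpu *gpu,
					    struct a6xx_gmu *gmu)
{
	int active_submits;

	mutex_lock(&gpu->active_lock);
	active_submits = gpu->active_submits;

	/* Temporarily clear active_submits across the power cycle: */
	gpu->active_submits = 0;

	/* Ask genpd to signal pd_gate when the cx domain collapses: */
	reinit_completion(&gmu->pd_gate);
	dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
	dev_pm_genpd_synced_poweroff(gmu->cxpd);

	/*
	 * Power off regardless of the rpm usage count, instead of
	 * balancing out the gets from in-flight submits with puts:
	 */
	pm_runtime_force_suspend(&gpu->pdev->dev);

	if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
		DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");

	dev_pm_genpd_remove_notifier(gmu->cxpd);

	pm_runtime_use_autosuspend(&gpu->pdev->dev);

	/*
	 * Power back on and re-enable runtime PM; the usage count held
	 * by in-flight submits was never dropped, so no re-get needed:
	 */
	pm_runtime_force_resume(&gpu->pdev->dev);

	gpu->active_submits = active_submits;
	mutex_unlock(&gpu->active_lock);
}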

Comments

Akhil P Oommen Dec. 22, 2023, 7:57 p.m. UTC | #1
On Mon, Dec 18, 2023 at 07:59:24AM -0800, Rob Clark wrote:
> 
> From: Rob Clark <robdclark@chromium.org>
> 
> a6xx_recover() relies on the gpu lock to serialize against incoming
> submits doing a runpm get, as it tries to temporarily balance out the
> runpm gets with puts in order to power off the GPU.  Unfortunately this
> gets worse when (in a later patch) we move the runpm get out of the
> scheduler thread/work, in order to get it out of the fence signaling
> path.
> 
> Instead, we can simplify the whole thing by using force_suspend() /
> force_resume() rather than trying to be clever.

In some places we take a pm_runtime vote and access the GPU registers
assuming it will stay powered until we drop the vote;
a6xx_get_timestamp() is an example.  If we do a force suspend, it may
cause bus errors from those threads.  Then you would have to serialize
every place we do a runtime get/put with a mutex.  Or is there a better
way to handle the 'later patch' you mentioned?

-Akhil.

> 
> Reported-by: David Heidelberg <david.heidelberg@collabora.com>
> Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10272
> Fixes: abe2023b4cea ("drm/msm/gpu: Push gpu lock down past runpm")
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 268737e59131..a5660d63535b 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1244,12 +1244,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
>  	dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
>  	dev_pm_genpd_synced_poweroff(gmu->cxpd);
>  
> -	/* Drop the rpm refcount from active submits */
> -	if (active_submits)
> -		pm_runtime_put(&gpu->pdev->dev);
> -
> -	/* And the final one from recover worker */
> -	pm_runtime_put_sync(&gpu->pdev->dev);
> +	pm_runtime_force_suspend(&gpu->pdev->dev);
>  
>  	if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
>  		DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
> @@ -1258,10 +1253,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
>  
>  	pm_runtime_use_autosuspend(&gpu->pdev->dev);
>  
> -	if (active_submits)
> -		pm_runtime_get(&gpu->pdev->dev);
> -
> -	pm_runtime_get_sync(&gpu->pdev->dev);
> +	pm_runtime_force_resume(&gpu->pdev->dev);
>  
>  	gpu->active_submits = active_submits;
>  	mutex_unlock(&gpu->active_lock);
> -- 
> 2.43.0
>
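
To make the hazard Akhil describes concrete, here is a minimal,
hypothetical sketch of the pattern he is pointing at.  It is not the
actual a6xx_get_timestamp() implementation; the function name, register,
and surrounding details are illustrative:

/* Hypothetical reader that votes for power around a register read: */
static int read_counter_sketch(struct msm_gpu *gpu, uint64_t *value)
{
	/*
	 * Vote for power.  This blocks the *normal* suspend path, which
	 * checks the usage count before suspending:
	 */
	pm_runtime_get_sync(&gpu->pdev->dev);

	/*
	 * But pm_runtime_force_suspend() ignores the usage count, so a
	 * concurrent a6xx_recover() can still power the GPU off right
	 * here, and the register read below can then cause a bus error.
	 */
	*value = gpu_read64(gpu, REG_A6XX_CP_ALWAYS_ON_COUNTER);

	pm_runtime_put(&gpu->pdev->dev);

	return 0;
}

Serializing recovery against every such reader, as Akhil notes, would
mean taking a shared lock around each get/put pair like this one.
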
Rob Clark Dec. 22, 2023, 8:15 p.m. UTC | #2
On Fri, Dec 22, 2023 at 11:58 AM Akhil P Oommen
<quic_akhilpo@quicinc.com> wrote:
>
> On Mon, Dec 18, 2023 at 07:59:24AM -0800, Rob Clark wrote:
> >
> > From: Rob Clark <robdclark@chromium.org>
> >
> > a6xx_recover() relies on the gpu lock to serialize against incoming
> > submits doing a runpm get, as it tries to temporarily balance out the
> > runpm gets with puts in order to power off the GPU.  Unfortunately this
> > gets worse when (in a later patch) we move the runpm get out of the
> > scheduler thread/work, in order to get it out of the fence signaling
> > path.
> >
> > Instead, we can simplify the whole thing by using force_suspend() /
> > force_resume() rather than trying to be clever.
>
> In some places we take a pm_runtime vote and access the GPU registers
> assuming it will stay powered until we drop the vote;
> a6xx_get_timestamp() is an example.  If we do a force suspend, it may
> cause bus errors from those threads.  Then you would have to serialize
> every place we do a runtime get/put with a mutex.  Or is there a better
> way to handle the 'later patch' you mentioned?

So when I started adding an igt test to stress-test recovery vs
multi-threaded submit, I was running into issues with cxpd not always
suspending, and with "cx gdsc did not collapse" errors, which may be
related.

I was considering using force_suspend() on the gmu and cxpd if
gpu->hang==true, but I'm not sure; I ran out of time to play with this
when I was in the office.

The issue the 'later patch' is trying to deal with is getting memory
allocations out of the "fence signaling path", ie. out of the drm/sched
kthread/worker.  One way to do that, without dragging all of
runpm/device-link/etc into it, is to do the runpm get in the submit
ioctl before enqueuing the job to the scheduler.  And since the ioctl
is process context rather than the fence signaling path, we can hold a
lock there to protect against racing with recovery.

BR,
-R

> -Akhil.
>
> >
> > Reported-by: David Heidelberg <david.heidelberg@collabora.com>
> > Closes: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10272
> > Fixes: abe2023b4cea ("drm/msm/gpu: Push gpu lock down past runpm")
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 12 ++----------
> >  1 file changed, 2 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > index 268737e59131..a5660d63535b 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > @@ -1244,12 +1244,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >       dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
> >       dev_pm_genpd_synced_poweroff(gmu->cxpd);
> >
> > -     /* Drop the rpm refcount from active submits */
> > -     if (active_submits)
> > -             pm_runtime_put(&gpu->pdev->dev);
> > -
> > -     /* And the final one from recover worker */
> > -     pm_runtime_put_sync(&gpu->pdev->dev);
> > +     pm_runtime_force_suspend(&gpu->pdev->dev);
> >
> >       if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
> >               DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
> > @@ -1258,10 +1253,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >
> >       pm_runtime_use_autosuspend(&gpu->pdev->dev);
> >
> > -     if (active_submits)
> > -             pm_runtime_get(&gpu->pdev->dev);
> > -
> > -     pm_runtime_get_sync(&gpu->pdev->dev);
> > +     pm_runtime_force_resume(&gpu->pdev->dev);
> >
> >       gpu->active_submits = active_submits;
> >       mutex_unlock(&gpu->active_lock);
> > --
> > 2.43.0
> >
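
To make the 'later patch' idea Rob describes concrete, here is a
hypothetical sketch of taking the runpm reference in the submit ioctl,
before the job is handed to drm/sched.  The function name and the
accounting details are illustrative, not the actual later patch:

/* Called from the submit ioctl, i.e. process context: */
static void submit_enqueue_sketch(struct msm_gpu *gpu,
				  struct msm_gem_submit *submit)
{
	/*
	 * Resuming here may allocate memory, which is fine in ioctl
	 * context but not in the fence signaling path; this is also a
	 * context where a lock can serialize against recovery:
	 */
	pm_runtime_get_sync(&gpu->pdev->dev);

	mutex_lock(&gpu->active_lock);
	gpu->active_submits++;
	mutex_unlock(&gpu->active_lock);

	/*
	 * The drm/sched worker that eventually runs the job no longer
	 * needs to touch runpm; the reference taken above is dropped
	 * when the job is retired:
	 */
	drm_sched_entity_push_job(&submit->base);
}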

Patch

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 268737e59131..a5660d63535b 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1244,12 +1244,7 @@  static void a6xx_recover(struct msm_gpu *gpu)
 	dev_pm_genpd_add_notifier(gmu->cxpd, &gmu->pd_nb);
 	dev_pm_genpd_synced_poweroff(gmu->cxpd);
 
-	/* Drop the rpm refcount from active submits */
-	if (active_submits)
-		pm_runtime_put(&gpu->pdev->dev);
-
-	/* And the final one from recover worker */
-	pm_runtime_put_sync(&gpu->pdev->dev);
+	pm_runtime_force_suspend(&gpu->pdev->dev);
 
 	if (!wait_for_completion_timeout(&gmu->pd_gate, msecs_to_jiffies(1000)))
 		DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
@@ -1258,10 +1253,7 @@  static void a6xx_recover(struct msm_gpu *gpu)
 
 	pm_runtime_use_autosuspend(&gpu->pdev->dev);
 
-	if (active_submits)
-		pm_runtime_get(&gpu->pdev->dev);
-
-	pm_runtime_get_sync(&gpu->pdev->dev);
+	pm_runtime_force_resume(&gpu->pdev->dev);
 
 	gpu->active_submits = active_submits;
 	mutex_unlock(&gpu->active_lock);
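
For reference, the property the patch relies on, and that Akhil's
concern hinges on: pm_runtime_force_suspend() powers the device down
regardless of its runtime PM usage count, and pm_runtime_force_resume()
undoes it.  Below is a heavily simplified, hypothetical sketch of the
suspend side; see drivers/base/power/runtime.c for the real
implementation, which selects the callback from the PM domain, type,
class, bus, or driver and handles the error paths:

static int force_suspend_sketch(struct device *dev)
{
	/* Stop the normal runtime PM state machine first: */
	pm_runtime_disable(dev);

	/*
	 * Then suspend unconditionally.  Unlike pm_runtime_put_sync(),
	 * there is no usage-count check here, which is why this works
	 * for recovery (gets from in-flight submits are ignored) and
	 * also why concurrent register access can hit a powered-off GPU.
	 */
	if (!pm_runtime_status_suspended(dev)) {
		dev->driver->pm->runtime_suspend(dev);
		pm_runtime_set_suspended(dev);
	}

	return 0;
}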