diff mbox series

[2/4] drm/msm/a5xx: properly clear preemption records on resume

Message ID 20240711100038.268803-3-vladimir.lypak@gmail.com (mailing list archive)
State New, archived
Headers show
Series fixes for Adreno A5Xx preemption | expand

Commit Message

Vladimir Lypak July 11, 2024, 10 a.m. UTC
Two fields of preempt_record which are used by CP aren't reset on
resume: "data" and "info". This is the reason behind faults which happen
when we try to switch to the ring that was active last before suspend.
In addition those faults can't be recovered from because we use suspend
and resume to do so (keeping values of those fields again).

Fixes: b1fc2839d2f9 ("drm/msm: Implement preemption for A5XX targets")
Signed-off-by: Vladimir Lypak <vladimir.lypak@gmail.com>
---
 drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Konrad Dybcio July 11, 2024, 10:42 a.m. UTC | #1
On 11.07.2024 12:00 PM, Vladimir Lypak wrote:
> Two fields of preempt_record which are used by CP aren't reset on
> resume: "data" and "info". This is the reason behind faults which happen
> when we try to switch to the ring that was active last before suspend.
> In addition those faults can't be recovered from because we use suspend
> and resume to do so (keeping values of those fields again).
> 
> Fixes: b1fc2839d2f9 ("drm/msm: Implement preemption for A5XX targets")
> Signed-off-by: Vladimir Lypak <vladimir.lypak@gmail.com>
> ---

Reviewed-by: Konrad Dybcio <konrad.dybcio@linaro.org>

Konrad
Akhil P Oommen Aug. 1, 2024, 1:16 p.m. UTC | #2
On Thu, Jul 11, 2024 at 10:00:19AM +0000, Vladimir Lypak wrote:
> Two fields of preempt_record which are used by CP aren't reset on
> resume: "data" and "info". This is the reason behind faults which happen
> when we try to switch to the ring that was active last before suspend.
> In addition those faults can't be recovered from because we use suspend
> and resume to do so (keeping values of those fields again).
> 
> Fixes: b1fc2839d2f9 ("drm/msm: Implement preemption for A5XX targets")
> Signed-off-by: Vladimir Lypak <vladimir.lypak@gmail.com>
> ---
>  drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> index f58dd564d122..67a8ef4adf6b 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> @@ -204,6 +204,8 @@ void a5xx_preempt_hw_init(struct msm_gpu *gpu)
>  		return;
>  
>  	for (i = 0; i < gpu->nr_rings; i++) {
> +		a5xx_gpu->preempt[i]->data = 0;
> +		a5xx_gpu->preempt[i]->info = 0;

I don't see this bit in the downstream driver. Just curious, do we need
to clear both fields to avoid the gpu faults?

-Akhil
>  		a5xx_gpu->preempt[i]->wptr = 0;
>  		a5xx_gpu->preempt[i]->rptr = 0;
>  		a5xx_gpu->preempt[i]->rbase = gpu->rb[i]->iova;
> -- 
> 2.45.2
>
Vladimir Lypak Aug. 2, 2024, 1:41 p.m. UTC | #3
On Thu, Aug 01, 2024 at 06:46:10PM +0530, Akhil P Oommen wrote:
> On Thu, Jul 11, 2024 at 10:00:19AM +0000, Vladimir Lypak wrote:
> > Two fields of preempt_record which are used by CP aren't reset on
> > resume: "data" and "info". This is the reason behind faults which happen
> > when we try to switch to the ring that was active last before suspend.
> > In addition those faults can't be recovered from because we use suspend
> > and resume to do so (keeping values of those fields again).
> > 
> > Fixes: b1fc2839d2f9 ("drm/msm: Implement preemption for A5XX targets")
> > Signed-off-by: Vladimir Lypak <vladimir.lypak@gmail.com>
> > ---
> >  drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> > index f58dd564d122..67a8ef4adf6b 100644
> > --- a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> > +++ b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> > @@ -204,6 +204,8 @@ void a5xx_preempt_hw_init(struct msm_gpu *gpu)
> >  		return;
> >  
> >  	for (i = 0; i < gpu->nr_rings; i++) {
> > +		a5xx_gpu->preempt[i]->data = 0;
> > +		a5xx_gpu->preempt[i]->info = 0;
> 
> I don't see this bit in the downstream driver. Just curious, do we need
> to clear both fields to avoid the gpu faults?

Downstream gets away without doing so because it resumes on the same
ring that it suspended on. On mainline we always do GPU resume on first
ring. It was enough to zero info field to avoid faults but clearing
both shouldn't hurt.

I have tried to replicate faults again with local preemption disabled
and unmodified mesa and couldn't do so. It only happens when fine-grain
preemption is used and there was a switch from IB1.
This made me come up with explanation of what could be happening.
If preemption switch is initiated on a some ring at checkpoint in IB1,
CP should save position of that checkpoint in the preemption record and
set some flag in "info" field which will tell it to continue from that
checkpoint when switching back.
When switching back to that ring we program address of its preemption
record to CP_CONTEXT_SWITCH_RESTORE_ADDR. Apparently this won't remove
the flag from "info" field because the preemption record is only being
read from. This leaves preemption record outdated on that ring until
next switch will override it. This doesn't cause issues on downstream
because it won't try to restore from that record since it's ignored
during GPU power-up.

Vladimir

> 
> -Akhil
> >  		a5xx_gpu->preempt[i]->wptr = 0;
> >  		a5xx_gpu->preempt[i]->rptr = 0;
> >  		a5xx_gpu->preempt[i]->rbase = gpu->rb[i]->iova;
> > -- 
> > 2.45.2
> >
Akhil P Oommen Aug. 5, 2024, 7:07 p.m. UTC | #4
On Fri, Aug 02, 2024 at 01:41:32PM +0000, Vladimir Lypak wrote:
> On Thu, Aug 01, 2024 at 06:46:10PM +0530, Akhil P Oommen wrote:
> > On Thu, Jul 11, 2024 at 10:00:19AM +0000, Vladimir Lypak wrote:
> > > Two fields of preempt_record which are used by CP aren't reset on
> > > resume: "data" and "info". This is the reason behind faults which happen
> > > when we try to switch to the ring that was active last before suspend.
> > > In addition those faults can't be recovered from because we use suspend
> > > and resume to do so (keeping values of those fields again).
> > > 
> > > Fixes: b1fc2839d2f9 ("drm/msm: Implement preemption for A5XX targets")
> > > Signed-off-by: Vladimir Lypak <vladimir.lypak@gmail.com>
> > > ---
> > >  drivers/gpu/drm/msm/adreno/a5xx_preempt.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> > > index f58dd564d122..67a8ef4adf6b 100644
> > > --- a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> > > +++ b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
> > > @@ -204,6 +204,8 @@ void a5xx_preempt_hw_init(struct msm_gpu *gpu)
> > >  		return;
> > >  
> > >  	for (i = 0; i < gpu->nr_rings; i++) {
> > > +		a5xx_gpu->preempt[i]->data = 0;
> > > +		a5xx_gpu->preempt[i]->info = 0;
> > 
> > I don't see this bit in the downstream driver. Just curious, do we need
> > to clear both fields to avoid the gpu faults?
> 
> Downstream gets away without doing so because it resumes on the same
> ring that it suspended on. On mainline we always do GPU resume on first
> ring. It was enough to zero info field to avoid faults but clearing
> both shouldn't hurt.
> 
> I have tried to replicate faults again with local preemption disabled
> and unmodified mesa and couldn't do so. It only happens when fine-grain
> preemption is used and there was a switch from IB1.

So, I guess gpu is going to rpm suspend while there is pending
(preempted) submits present in another ringbuffer. Probably the other
fixes you have in this series make this not necessary during rpm suspend.
But we can keep as it is harmless and might help during gpu recovery.

> This made me come up with explanation of what could be happening.
> If preemption switch is initiated on a some ring at checkpoint in IB1,
> CP should save position of that checkpoint in the preemption record and
> set some flag in "info" field which will tell it to continue from that
> checkpoint when switching back.
> When switching back to that ring we program address of its preemption
> record to CP_CONTEXT_SWITCH_RESTORE_ADDR. Apparently this won't remove
> the flag from "info" field because the preemption record is only being
> read from. This leaves preemption record outdated on that ring until
> next switch will override it. This doesn't cause issues on downstream
> because it won't try to restore from that record since it's ignored
> during GPU power-up.

I guess it is fine if you never go to rpm suspend without idling all
RBs!

-Akhil

> 
> Vladimir
> 
> > 
> > -Akhil
> > >  		a5xx_gpu->preempt[i]->wptr = 0;
> > >  		a5xx_gpu->preempt[i]->rptr = 0;
> > >  		a5xx_gpu->preempt[i]->rbase = gpu->rb[i]->iova;
> > > -- 
> > > 2.45.2
> > >
diff mbox series

Patch

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
index f58dd564d122..67a8ef4adf6b 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_preempt.c
@@ -204,6 +204,8 @@  void a5xx_preempt_hw_init(struct msm_gpu *gpu)
 		return;
 
 	for (i = 0; i < gpu->nr_rings; i++) {
+		a5xx_gpu->preempt[i]->data = 0;
+		a5xx_gpu->preempt[i]->info = 0;
 		a5xx_gpu->preempt[i]->wptr = 0;
 		a5xx_gpu->preempt[i]->rptr = 0;
 		a5xx_gpu->preempt[i]->rbase = gpu->rb[i]->iova;