diff mbox series

[v1,1/1] drm/i915/gt: Increase a time to retry RING_HEAD reset

Message ID 20241205115736.1420991-1-nitin.r.gote@intel.com (mailing list archive)
State New
Headers show
Series [v1,1/1] drm/i915/gt: Increase a time to retry RING_HEAD reset | expand

Commit Message

Gote, Nitin R Dec. 5, 2024, 11:57 a.m. UTC
Issue is seen again where engine resets fails because the engine resumes
from an incorrect RING_HEAD. So, increase a time if at first the
write doesn't succeed and retry again.

Fixes: 6ef0e3ef2662 ("drm/i915/gt: Retry RING_HEAD reset until it get sticks")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12806
Signed-off-by: Nitin Gote <nitin.r.gote@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_ring_submission.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Andi Shyti Dec. 5, 2024, 1:05 p.m. UTC | #1
Hi Nitin,

On Thu, Dec 05, 2024 at 05:27:36PM +0530, Nitin Gote wrote:
> Issue is seen again where engine resets fails because the engine resumes
> from an incorrect RING_HEAD. So, increase a time if at first the
> write doesn't succeed and retry again.
> 
> Fixes: 6ef0e3ef2662 ("drm/i915/gt: Retry RING_HEAD reset until it get sticks")

Is this a real fix or is it more of a fine tuning?

> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12806
> Signed-off-by: Nitin Gote <nitin.r.gote@intel.com>

...

> @@ -231,7 +231,7 @@ static int xcs_resume(struct intel_engine_cs *engine)
>  	set_pp_dir(engine);
>  
>  	/* First wake the ring up to an empty/idle ring */
> -	for ((kt) = ktime_get() + (2 * NSEC_PER_MSEC);
> +	for ((kt) = ktime_get() + (50 * NSEC_PER_MSEC);

Where is this 50 coming from?

Thanks,
Andi

>  			ktime_before(ktime_get(), (kt)); cpu_relax()) {
>  		/*
>  		 * In case of resets fails because engine resumes from
> -- 
> 2.25.1
Gote, Nitin R Dec. 5, 2024, 4:03 p.m. UTC | #2
Hi Andi,

> -----Original Message-----
> From: Andi Shyti <andi.shyti@linux.intel.com>
> Sent: Thursday, December 5, 2024 6:35 PM
> To: Gote, Nitin R <nitin.r.gote@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Wilson, Chris P <chris.p.wilson@intel.com>
> Subject: Re: [PATCH v1 1/1] drm/i915/gt: Increase a time to retry RING_HEAD
> reset
> 
> Hi Nitin,
> 
> On Thu, Dec 05, 2024 at 05:27:36PM +0530, Nitin Gote wrote:
> > Issue is seen again where engine resets fails because the engine
> > resumes from an incorrect RING_HEAD. So, increase a time if at first
> > the write doesn't succeed and retry again.
> >
> > Fixes: 6ef0e3ef2662 ("drm/i915/gt: Retry RING_HEAD reset until it get
> > sticks")
> 
> Is this a real fix or is it more of a fine tuning?

Here we can say this for more fine tuning as issue seen again and 
that's why I have added fixes : 6ef0e3ef2662.

> 
> > Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12806
> > Signed-off-by: Nitin Gote <nitin.r.gote@intel.com>
> 
> ...
> 
> > @@ -231,7 +231,7 @@ static int xcs_resume(struct intel_engine_cs *engine)
> >  	set_pp_dir(engine);
> >
> >  	/* First wake the ring up to an empty/idle ring */
> > -	for ((kt) = ktime_get() + (2 * NSEC_PER_MSEC);
> > +	for ((kt) = ktime_get() + (50 * NSEC_PER_MSEC);
> 
> Where is this 50 coming from?

Well, here HEAD is still not 0 even after writing in it. 
So, it could be the timing issue. I discussed this with Chris and we thought
It is better to add 50ms instead of 2ms delay here to let HEAD write complete.
I tested this on trybot for Haswell/Ivybridge platform https://patchwork.freedesktop.org/series/141779/ and 
I see BAT is successful and shards issues are not related.

> 
> Thanks,
> Andi
> 
> >  			ktime_before(ktime_get(), (kt)); cpu_relax()) {
> >  		/*
> >  		 * In case of resets fails because engine resumes from
> > --
> > 2.25.1
Andi Shyti Dec. 5, 2024, 4:26 p.m. UTC | #3
Hi Nitin,

On Thu, Dec 05, 2024 at 04:03:38PM +0000, Gote, Nitin R wrote:
> > On Thu, Dec 05, 2024 at 05:27:36PM +0530, Nitin Gote wrote:
> > > Issue is seen again where engine resets fails because the engine
> > > resumes from an incorrect RING_HEAD. So, increase a time if at first
> > > the write doesn't succeed and retry again.
> > >
> > > Fixes: 6ef0e3ef2662 ("drm/i915/gt: Retry RING_HEAD reset until it get
> > > sticks")
> > 
> > Is this a real fix or is it more of a fine tuning?
> 
> Here we can say this for more fine tuning as issue seen again and 
> that's why I have added fixes : 6ef0e3ef2662.

I thinkt he Fixes: tag is not necessary, here. You are not really
fixing a bug.

> > > Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12806
> > > Signed-off-by: Nitin Gote <nitin.r.gote@intel.com>
> > 
> > ...
> > 
> > > @@ -231,7 +231,7 @@ static int xcs_resume(struct intel_engine_cs *engine)
> > >  	set_pp_dir(engine);
> > >
> > >  	/* First wake the ring up to an empty/idle ring */
> > > -	for ((kt) = ktime_get() + (2 * NSEC_PER_MSEC);
> > > +	for ((kt) = ktime_get() + (50 * NSEC_PER_MSEC);
> > 
> > Where is this 50 coming from?
> 
> Well, here HEAD is still not 0 even after writing in it. 
> So, it could be the timing issue. I discussed this with Chris and we thought
> It is better to add 50ms instead of 2ms delay here to let HEAD write complete.
> I tested this on trybot for Haswell/Ivybridge platform https://patchwork.freedesktop.org/series/141779/ and 
> I see BAT is successful and shards issues are not related.

Can you please add a comment explaining why you are using 50ms?
You can also mention that you experimented with different values
and determined that 50ms works best based on testing.

Andi
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
index 32f3b52a183a..bc9f28524713 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
@@ -231,7 +231,7 @@  static int xcs_resume(struct intel_engine_cs *engine)
 	set_pp_dir(engine);
 
 	/* First wake the ring up to an empty/idle ring */
-	for ((kt) = ktime_get() + (2 * NSEC_PER_MSEC);
+	for ((kt) = ktime_get() + (50 * NSEC_PER_MSEC);
 			ktime_before(ktime_get(), (kt)); cpu_relax()) {
 		/*
 		 * In case of resets fails because engine resumes from