
drm/i915/execlists: Skip lite restore on the currently executing request

Message ID 20180425105923.13430-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson April 25, 2018, 10:59 a.m. UTC
When WaIdleLiteRestore isn't enough.

Fixes an odd hang on gen8 (both bsw and bdw) during gem_ctx_switch,
where, for all intents and purposes, if we trigger a lite-restore while
the engine is processing the pipecontrol flushes, the RING is restored
to the oword following the command and the engine tries to execute the
destination address of the pipecontrol rather than a valid command. The
theory is that the hardware does not like RING_HEAD being within a
cacheline of the restored RING_TAIL, so we can evade the issue by not
triggering a lite-restore when we know we are already inside the last
request.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)
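
For reference, the skip check in the patch below leans on the wrap-safe
seqno comparison used throughout i915 (i915_seqno_passed): once everything
before the last queued request has completed, the engine must already be
inside that last request, so a lite-restore gains nothing. A minimal
standalone sketch of that logic, with an illustrative helper and example
values rather than the kernel's own code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for i915_seqno_passed(): wrap-safe u32 comparison. */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

int main(void)
{
	uint32_t engine_seqno = 41;	/* last breadcrumb completed by the engine (example) */
	uint32_t last_seqno = 42;	/* global_seqno of the final queued request (example) */

	/*
	 * If everything before the last request has completed, the engine
	 * is already inside (or past) that last request, so resubmitting
	 * the same context as a lite-restore is skipped.
	 */
	if (seqno_passed(engine_seqno, last_seqno - 1))
		printf("skip the lite-restore\n");
	else
		printf("issue the ELSP update with the new RING_TAIL\n");

	return 0;
}
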

Comments

Mika Kuoppala April 25, 2018, 11:19 a.m. UTC | #1
Chris Wilson <chris@chris-wilson.co.uk> writes:

> When WaIdleLiteRestore isn't enough.
>
> Fixes an odd hang on gen8 (both bsw and bdw) during gem_ctx_switch,

Do you have a testcase name? (testcase tag would be nice too)

> where by all intents and purposes if we trigger a lite-restore as it is
> processing the pipecontrol flushes, the RING is restored to the oword

s/RING/RING_HEAD ?

> following the command and tries to execute the destination address for
> the pipecontrol rather than a valid command. With the theory being that
> it doesn't like RING_HEAD being within a cacheline of the restored
> RING_TAIL, we can evade that issue by not triggering a lite-restore if
> we know we are inside the last request.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 029901a8fa38..5c50263e45d3 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -639,6 +639,19 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
>  		if (port_count(&port[1]))
>  			goto unlock;
>  
> +		/*
> +		 * Skip invoking a lite-restore if we know we have already
> +		 * started processing the last request queued to HW. This
> +		 * prevents a mystery *unrecoverable* hang on gen8, maybe
> +		 * related to updating TAIL within a cacheline of HEAD? (As

Did you try with WA_TAIL_DWORDS 16?

-Mika

> +		 * there is still a delay between submitting the ELSP update
> +		 * and HW responding, we may still encounter whatever condition
> +		 * trips up, just less often.)
> +		 */
> +		if (i915_seqno_passed(intel_engine_get_seqno(engine),
> +				      last->global_seqno - 1))
> +			goto unlock;
> +
>  		/*
>  		 * WaIdleLiteRestore:bdw,skl
>  		 * Apply the wa NOOPs to prevent
> -- 
> 2.17.0
>
Chris Wilson April 25, 2018, 11:23 a.m. UTC | #2
Quoting Mika Kuoppala (2018-04-25 12:19:08)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > When WaIdleLiteRestore isn't enough.
> >
> > Fixes an odd hang on gen8 (both bsw and bdw) during gem_ctx_switch,
> 
> Do you have a testcase name? (testcase tag would be nice too)

Just keep running gem_ctx_switch. Switching between the static set of
contexts, no rebind pressure whatsoever.

> > where by all intents and purposes if we trigger a lite-restore as it is
> > processing the pipecontrol flushes, the RING is restored to the oword
> 
> s/RING/RING_HEAD ?
> 
> > following the command and tries to execute the destination address for
> > the pipecontrol rather than a valid command. With the theory being that
> > it doesn't like RING_HEAD being within a cacheline of the restored
> > RING_TAIL, we can evade that issue by not triggering a lite-restore if
> > we know we are inside the last request.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > ---
> >  drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 029901a8fa38..5c50263e45d3 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -639,6 +639,19 @@ static void execlists_dequeue(struct intel_engine_cs *engine)
> >               if (port_count(&port[1]))
> >                       goto unlock;
> >  
> > +             /*
> > +              * Skip invoking a lite-restore if we know we have already
> > +              * started processing the last request queued to HW. This
> > +              * prevents a mystery *unrecoverable* hang on gen8, maybe
> > +              * related to updating TAIL within a cacheline of HEAD? (As
> 
> Did you try with WA_TAIL_DWORDS 16?

Sure can try, but the error state doesn't indicate TAIL==HEAD as would
be the issue with WaIdleLiteRestore (restoring to an idle ring wouldn't
generate the arbitration event, so no more context switches).
-Chris
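
For context on the WaIdleLiteRestore discussion: the workaround pads each
request's tail with MI_NOOP dwords so that a lite-restore never resubmits a
ring whose TAIL coincides with the HEAD the hardware is parked at (which, as
noted above, would suppress the arbitration event and stall further context
switches). A simplified, standalone sketch of that padding idea; the
WA_TAIL_DWORDS value, MI_NOOP encoding and helper below are illustrative,
not the i915 implementation:

#include <stdint.h>

#define MI_NOOP		0u	/* illustrative encoding */
#define WA_TAIL_DWORDS	2	/* illustrative; Mika suggests trying 16 above */

/*
 * Simplified sketch: append workaround NOOPs after each request's breadcrumb
 * so that the TAIL written by a lite-restore never lands on the HEAD the
 * hardware is idling at.  The real code (roughly) emits into the context's
 * ring and records the padded offset as the request's wa_tail.
 */
static uint32_t *emit_wa_tail(uint32_t *cs)
{
	int i;

	for (i = 0; i < WA_TAIL_DWORDS; i++)
		*cs++ = MI_NOOP;

	return cs;
}

int main(void)
{
	uint32_t ring[WA_TAIL_DWORDS] = { 0 };

	return emit_wa_tail(ring) == ring + WA_TAIL_DWORDS ? 0 : 1;
}
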
Chris Wilson April 25, 2018, 11:31 a.m. UTC | #3
Quoting Chris Wilson (2018-04-25 12:23:30)
> Quoting Mika Kuoppala (2018-04-25 12:19:08)
> > Did you try with WA_TAIL_DWORDS 16?
> 
> Sure can try, but the error state doesn't indicate TAIL==HEAD as would
> be the issue with WaIdleLiteRestore (restoring to an idle ring wouldn't
> generate the arbitration event, so no more context switches).

Watch https://patchwork.freedesktop.org/series/42284/

Actually surviving on my bdw
-Chris
Chris Wilson April 25, 2018, 12:45 p.m. UTC | #4
Quoting Chris Wilson (2018-04-25 12:31:29)
> Quoting Chris Wilson (2018-04-25 12:23:30)
> > Quoting Mika Kuoppala (2018-04-25 12:19:08)
> > > Did you try with WA_TAIL_DWORDS 16?
> > 
> > Sure can try, but the error state doesn't indicate TAIL==HEAD as would
> > be the issue with WaIdleLiteRestore (restoring to an idle ring wouldn't
> > generate the arbitration event, so no more context switches).
> 
> Watch https://patchwork.freedesktop.org/series/42284/
> 
> Actually surviving on my bdw

And what appears to work just as well is

@@ -1387,7 +1387,7 @@  static int execlists_request_alloc(struct i915_request *request)
 	 */
 	request->reserved_space += EXECLISTS_REQUEST_SIZE;
 
-	ret = intel_ring_wait_for_space(request->ring, request->reserved_space);
+	ret = intel_ring_cacheline_align(request);
 	if (ret)
 		return ret;

If that doesn't hint towards some weirdness in HW, I don't know what
would ;)
-Chris
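
For reference, intel_ring_cacheline_align() pads the ring with NOOPs up to
the next cacheline boundary before a request's commands are emitted, so a
request never starts in the same cacheline as the previous tail. A standalone
sketch of that padding arithmetic; CACHELINE_BYTES, the buffer and the helper
below are illustrative, and the real helper emits into the request's ring and
can fail while waiting for space:

#include <stdint.h>
#include <stdio.h>

#define CACHELINE_BYTES	64
#define MI_NOOP		0u	/* illustrative encoding */

/*
 * Pad from the current emit offset (in bytes) up to the next cacheline
 * boundary with NOOPs, mirroring the intent of intel_ring_cacheline_align():
 * the next request's commands then never start in the same cacheline as the
 * previous request's tail.
 */
static unsigned int cacheline_align(uint32_t *ring, unsigned int emit)
{
	while (emit & (CACHELINE_BYTES - 1)) {
		ring[emit / sizeof(uint32_t)] = MI_NOOP;
		emit += sizeof(uint32_t);
	}

	return emit;
}

int main(void)
{
	uint32_t ring[256] = { 0 };

	printf("72 -> %u\n", cacheline_align(ring, 72));	/* prints 72 -> 128 */
	return 0;
}
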
Chris Wilson April 25, 2018, 1:14 p.m. UTC | #5
Quoting Chris Wilson (2018-04-25 13:45:45)
> Quoting Chris Wilson (2018-04-25 12:31:29)
> > Quoting Chris Wilson (2018-04-25 12:23:30)
> > > Quoting Mika Kuoppala (2018-04-25 12:19:08)
> > > > Did you try with WA_TAIL_DWORDS 16?
> > > 
> > > Sure can try, but the error state doesn't indicate TAIL==HEAD as would
> > > be the issue with WaIdleLiteRestore (restoring to an idle ring wouldn't
> > > generate the arbitration event, so no more context switches).
> > 
> > Watch https://patchwork.freedesktop.org/series/42284/
> > 
> > Actually surviving on my bdw
> 
> And what appears to work just as well is
> 
> @@ -1387,7 +1387,7 @@  static int execlists_request_alloc(struct i915_request *request)
>          */
>         request->reserved_space += EXECLISTS_REQUEST_SIZE;
>  
> -       ret = intel_ring_wait_for_space(request->ring, request->reserved_space);
> +       ret = intel_ring_cacheline_align(request);
>         if (ret)
>                 return ret;
> 
> If that doesn't hint towards some weirdness in HW, I don't know what
> would ;)

Fwiw, I've now hit GPU hangs with WA_TAIL_DWORDS set to 16 and with the
cacheline_align; scratch that theory.
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 029901a8fa38..5c50263e45d3 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -639,6 +639,19 @@  static void execlists_dequeue(struct intel_engine_cs *engine)
 		if (port_count(&port[1]))
 			goto unlock;
 
+		/*
+		 * Skip invoking a lite-restore if we know we have already
+		 * started processing the last request queued to HW. This
+		 * prevents a mystery *unrecoverable* hang on gen8, maybe
+		 * related to updating TAIL within a cacheline of HEAD? (As
+		 * there is still a delay between submitting the ELSP update
+		 * and HW responding, we may still encounter whatever condition
+		 * trips up, just less often.)
+		 */
+		if (i915_seqno_passed(intel_engine_get_seqno(engine),
+				      last->global_seqno - 1))
+			goto unlock;
+
 		/*
 		 * WaIdleLiteRestore:bdw,skl
 		 * Apply the wa NOOPs to prevent