diff mbox series

drm/i915/execlists: Clear STOP_RING bit before restoring the context

Message ID 20180814131150.9784-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show
Series drm/i915/execlists: Clear STOP_RING bit before restoring the context | expand

Commit Message

Chris Wilson Aug. 14, 2018, 1:11 p.m. UTC
Before a reset, we set the STOP_RING bit of RING_MI_MODE to freeze the
engine. However, sometimes we observe that upon restart, the engine
freezes again with the STOP_RING bit still asserted. By inspection, we
know that the register itself is cleared by the GPU reset, so that bit
must be preserved inside the context image and reloaded from there. A
simple fix (as the RING_MI_MODE is at a fixed offset in a valid context)
is to clobber the STOP_RING bit inside the image before replaying the
request.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c     | 17 +++++++++++++++--
 drivers/gpu/drm/i915/intel_lrc_reg.h |  2 ++
 2 files changed, 17 insertions(+), 2 deletions(-)

Comments

Mika Kuoppala Aug. 14, 2018, 1:42 p.m. UTC | #1
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Before a reset, we set the STOP_RING bit of RING_MI_MODE to freeze the
> engine. However, sometimes we observe that upon restart, the engine
> freezes again with the STOP_RING bit still asserted. By inspection, we
> know that the register itself is cleared by the GPU reset, so that bit
> must be preserved inside the context image and reloaded from there. A
> simple fix (as the RING_MI_MODE is at a fixed offset in a valid context)
> is to clobber the STOP_RING bit inside the image before replaying the
> request.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c     | 17 +++++++++++++++--
>  drivers/gpu/drm/i915/intel_lrc_reg.h |  2 ++
>  2 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index 3f90c74038ef..37fe842de639 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1918,6 +1918,20 @@ static void execlists_reset(struct intel_engine_cs *engine,
>  
>  	spin_unlock_irqrestore(&engine->timeline.lock, flags);
>  
> +	if (!request)
> +		return;
> +
> +	regs = request->hw_context->lrc_reg_state;
> +
> +	/*
> +	 * After a reset, the context may have preserved the STOP bit
> +	 * of RING_MI_MODE we used to freeze the active engine before
> +	 * the reset. If that bit is restored the ring stops instead
> +	 * of being executed.
> +	 */
> +	regs[CTX_MI_MODE + 1] |= STOP_RING << 16;
> +	regs[CTX_MI_MODE + 1] &= ~STOP_RING;

Ok, I did go and pondered if this is truely the simplest
way. Forcing the start outside a context would not
be simpler and would be slower.

Nice find and it will has potential to explain our some troubles
of re-enabling engines.

Did you find out this by looking at the error states or what?

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

> +
>  	/*
>  	 * If the request was innocent, we leave the request in the ELSP
>  	 * and will try to replay it on restarting. The context image may
> @@ -1929,7 +1943,7 @@ static void execlists_reset(struct intel_engine_cs *engine,
>  	 * and have to at least restore the RING register in the context
>  	 * image back to the expected values to skip over the guilty request.
>  	 */
> -	if (!request || request->fence.error != -EIO)
> +	if (request->fence.error != -EIO)
>  		return;
>  
>  	/*
> @@ -1940,7 +1954,6 @@ static void execlists_reset(struct intel_engine_cs *engine,
>  	 * future request will be after userspace has had the opportunity
>  	 * to recreate its own state.
>  	 */
> -	regs = request->hw_context->lrc_reg_state;
>  	if (engine->pinned_default_state) {
>  		memcpy(regs, /* skip restoring the vanilla PPHWSP */
>  		       engine->pinned_default_state + LRC_STATE_PN * PAGE_SIZE,
> diff --git a/drivers/gpu/drm/i915/intel_lrc_reg.h b/drivers/gpu/drm/i915/intel_lrc_reg.h
> index 5ef932d810a7..3b155ecbfa92 100644
> --- a/drivers/gpu/drm/i915/intel_lrc_reg.h
> +++ b/drivers/gpu/drm/i915/intel_lrc_reg.h
> @@ -39,6 +39,8 @@
>  #define CTX_R_PWR_CLK_STATE		0x42
>  #define CTX_END				0x44
>  
> +#define CTX_MI_MODE			0x54
> +
>  #define CTX_REG(reg_state, pos, reg, val) do { \
>  	u32 *reg_state__ = (reg_state); \
>  	const u32 pos__ = (pos); \
> -- 
> 2.18.0
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Chris Wilson Aug. 14, 2018, 1:47 p.m. UTC | #2
Quoting Chris Wilson (2018-08-14 14:11:50)
> Before a reset, we set the STOP_RING bit of RING_MI_MODE to freeze the
> engine. However, sometimes we observe that upon restart, the engine
> freezes again with the STOP_RING bit still asserted. By inspection, we
> know that the register itself is cleared by the GPU reset, so that bit
> must be preserved inside the context image and reloaded from there. A
> simple fix (as the RING_MI_MODE is at a fixed offset in a valid context)
> is to clobber the STOP_RING bit inside the image before replaying the
> request.
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>

Note that we need to improve the active request tracking before this is
100% reliable. One patch at a time; though you can look at the
i915_reset.c restructuring series to see what I have in mind.
-Chris
Chris Wilson Aug. 14, 2018, 1:49 p.m. UTC | #3
Quoting Mika Kuoppala (2018-08-14 14:42:41)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Before a reset, we set the STOP_RING bit of RING_MI_MODE to freeze the
> > engine. However, sometimes we observe that upon restart, the engine
> > freezes again with the STOP_RING bit still asserted. By inspection, we
> > know that the register itself is cleared by the GPU reset, so that bit
> > must be preserved inside the context image and reloaded from there. A
> > simple fix (as the RING_MI_MODE is at a fixed offset in a valid context)
> > is to clobber the STOP_RING bit inside the image before replaying the
> > request.
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> > Cc: Michał Winiarski <michal.winiarski@intel.com>
> > ---
> >  drivers/gpu/drm/i915/intel_lrc.c     | 17 +++++++++++++++--
> >  drivers/gpu/drm/i915/intel_lrc_reg.h |  2 ++
> >  2 files changed, 17 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> > index 3f90c74038ef..37fe842de639 100644
> > --- a/drivers/gpu/drm/i915/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/intel_lrc.c
> > @@ -1918,6 +1918,20 @@ static void execlists_reset(struct intel_engine_cs *engine,
> >  
> >       spin_unlock_irqrestore(&engine->timeline.lock, flags);
> >  
> > +     if (!request)
> > +             return;
> > +
> > +     regs = request->hw_context->lrc_reg_state;
> > +
> > +     /*
> > +      * After a reset, the context may have preserved the STOP bit
> > +      * of RING_MI_MODE we used to freeze the active engine before
> > +      * the reset. If that bit is restored the ring stops instead
> > +      * of being executed.
> > +      */
> > +     regs[CTX_MI_MODE + 1] |= STOP_RING << 16;
> > +     regs[CTX_MI_MODE + 1] &= ~STOP_RING;
> 
> Ok, I did go and pondered if this is truely the simplest
> way. Forcing the start outside a context would not
> be simpler and would be slower.
> 
> Nice find and it will has potential to explain our some troubles
> of re-enabling engines.
> 
> Did you find out this by looking at the error states or what?

We're dumping sufficient info between the set-wedged and GEM_TRACE to
track the MI_MODE bits; e.g.

commit 9a4dc80399b1630cea0f1ad8ef0417436cbb95d0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 18 11:09:33 2018 +0100

    drm/i915: Flush the ring stop bit after clearing RING_HEAD in reset

    Inside the live_hangcheck (reset) selftests, we occasionally see
    failures like

    <7>[  239.094840] i915_gem_set_wedged rcs0
    <7>[  239.094843] i915_gem_set_wedged   current seqno 19a98, last 19a9a, hangcheck 0 [5158 ms]
    <7>[  239.094846] i915_gem_set_wedged   Reset count: 6239 (global 1)
    <7>[  239.094848] i915_gem_set_wedged   Requests:
    <7>[  239.095052] i915_gem_set_wedged           first  19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
    <7>[  239.095056] i915_gem_set_wedged           last   19a9a [e81:1a] prio=139 @ 5159ms: igt/rcs0[5977]/1
    <7>[  239.095059] i915_gem_set_wedged           active 19a99 [e8c:5f] prio=1024 @ 5159ms: (null)
    <7>[  239.095062] i915_gem_set_wedged           [head 0220, postfix 0280, tail 02a8, batch 0xffffffff_ffffffff]
    <7>[  239.100050] i915_gem_set_wedged           ring->start:  0x00283000
    <7>[  239.100053] i915_gem_set_wedged           ring->head:   0x000001f8
    <7>[  239.100055] i915_gem_set_wedged           ring->tail:   0x000002a8
    <7>[  239.100057] i915_gem_set_wedged           ring->emit:   0x000002a8
    <7>[  239.100059] i915_gem_set_wedged           ring->space:  0x00000f10
    <7>[  239.100085] i915_gem_set_wedged   RING_START: 0x00283000
    <7>[  239.100088] i915_gem_set_wedged   RING_HEAD:  0x00000260
    <7>[  239.100091] i915_gem_set_wedged   RING_TAIL:  0x000002a8
    <7>[  239.100094] i915_gem_set_wedged   RING_CTL:   0x00000001
    <7>[  239.100097] i915_gem_set_wedged   RING_MODE:  0x00000300 [idle]
    <7>[  239.100100] i915_gem_set_wedged   RING_IMR: fffffefe
    <7>[  239.100104] i915_gem_set_wedged   ACTHD:  0x00000000_0000609c
    <7>[  239.100108] i915_gem_set_wedged   BBADDR: 0x00000000_0000609d
    <7>[  239.100111] i915_gem_set_wedged   DMA_FADDR: 0x00000000_00283260
    <7>[  239.100114] i915_gem_set_wedged   IPEIR: 0x00000000
    <7>[  239.100117] i915_gem_set_wedged   IPEHR: 0x02800000
    <7>[  239.100120] i915_gem_set_wedged   Execlist status: 0x00044052 00000002
    <7>[  239.100124] i915_gem_set_wedged   Execlist CSB read 5 [5 cached], write 5 [5 from hws], interrupt posted? no, tasklet queued? no (enabled)
    <7>[  239.100128] i915_gem_set_wedged           ELSP[0] count=1, ring->start=00283000, rq: 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
    <7>[  239.100132] i915_gem_set_wedged           ELSP[1] count=1, ring->start=00257000, rq: 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
    <7>[  239.100135] i915_gem_set_wedged           HW active? 0x5
    <7>[  239.100250] i915_gem_set_wedged           E 19a99 [e8c:5f] prio=1024 @ 5164ms: (null)
    <7>[  239.100338] i915_gem_set_wedged           E 19a9a [e81:1a] prio=139 @ 5164ms: igt/rcs0[5977]/1
    <7>[  239.100340] i915_gem_set_wedged           Queue priority: 139
    <7>[  239.100343] i915_gem_set_wedged           Q 0 [e98:19] prio=132 @ 5164ms: igt/rcs0[5977]/8
    <7>[  239.100346] i915_gem_set_wedged           Q 0 [e84:19] prio=121 @ 5165ms: igt/rcs0[5977]/2
    <7>[  239.100349] i915_gem_set_wedged           Q 0 [e87:19] prio=82 @ 5165ms: igt/rcs0[5977]/3
    <7>[  239.100352] i915_gem_set_wedged           Q 0 [e84:1a] prio=44 @ 5164ms: igt/rcs0[5977]/2
    <7>[  239.100356] i915_gem_set_wedged           Q 0 [e8b:19] prio=20 @ 5165ms: igt/rcs0[5977]/4
    <7>[  239.100362] i915_gem_set_wedged   drv_selftest [5894] waiting for 19a99

    where the GPU saw an arbitration point and idles; AND HAS NOT BEEN RESET!
    The RING_MODE indicates that is idle and has the STOP_RING bit set, so
    try clearing it.

    v2: Only clear the bit on restarting the ring, as we want to be sure the
    STOP_RING bit is kept if reset fails on wedging.
    v3: Spot when the ring state doesn't make sense when re-initialising the
    engine and dump it to the logs so that we don't have to wait for an
    error later and try to guess what happened earlier.
    v4: Prepare to print all the unexpected state, not just the first.

and that patch proves that the register state across reset is fine; ergo
context image it is.
-Chris
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 3f90c74038ef..37fe842de639 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1918,6 +1918,20 @@  static void execlists_reset(struct intel_engine_cs *engine,
 
 	spin_unlock_irqrestore(&engine->timeline.lock, flags);
 
+	if (!request)
+		return;
+
+	regs = request->hw_context->lrc_reg_state;
+
+	/*
+	 * After a reset, the context may have preserved the STOP bit
+	 * of RING_MI_MODE we used to freeze the active engine before
+	 * the reset. If that bit is restored the ring stops instead
+	 * of being executed.
+	 */
+	regs[CTX_MI_MODE + 1] |= STOP_RING << 16;
+	regs[CTX_MI_MODE + 1] &= ~STOP_RING;
+
 	/*
 	 * If the request was innocent, we leave the request in the ELSP
 	 * and will try to replay it on restarting. The context image may
@@ -1929,7 +1943,7 @@  static void execlists_reset(struct intel_engine_cs *engine,
 	 * and have to at least restore the RING register in the context
 	 * image back to the expected values to skip over the guilty request.
 	 */
-	if (!request || request->fence.error != -EIO)
+	if (request->fence.error != -EIO)
 		return;
 
 	/*
@@ -1940,7 +1954,6 @@  static void execlists_reset(struct intel_engine_cs *engine,
 	 * future request will be after userspace has had the opportunity
 	 * to recreate its own state.
 	 */
-	regs = request->hw_context->lrc_reg_state;
 	if (engine->pinned_default_state) {
 		memcpy(regs, /* skip restoring the vanilla PPHWSP */
 		       engine->pinned_default_state + LRC_STATE_PN * PAGE_SIZE,
diff --git a/drivers/gpu/drm/i915/intel_lrc_reg.h b/drivers/gpu/drm/i915/intel_lrc_reg.h
index 5ef932d810a7..3b155ecbfa92 100644
--- a/drivers/gpu/drm/i915/intel_lrc_reg.h
+++ b/drivers/gpu/drm/i915/intel_lrc_reg.h
@@ -39,6 +39,8 @@ 
 #define CTX_R_PWR_CLK_STATE		0x42
 #define CTX_END				0x44
 
+#define CTX_MI_MODE			0x54
+
 #define CTX_REG(reg_state, pos, reg, val) do { \
 	u32 *reg_state__ = (reg_state); \
 	const u32 pos__ = (pos); \