drm/i915: Only try to stop engines after a failed reset

Message ID	20190213232047.8486-1-chris@chris-wilson.co.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Chris Wilson <chris@chris-wilson.co.uk> To: intel-gfx@lists.freedesktop.org Date: Wed, 13 Feb 2019 23:20:47 +0000 Message-Id: <20190213232047.8486-1-chris@chris-wilson.co.uk> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH] drm/i915: Only try to stop engines after a failed reset Precedence: list Cc: Mika Kuoppala <mika.kuoppala@intel.com> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	drm/i915: Only try to stop engines after a failed reset \| expand drm/i915: Only try to stop engines after a failed reset

Message ID

20190213232047.8486-1-chris@chris-wilson.co.uk (mailing list archive)

State

New, archived

Headers

From: Chris Wilson <chris@chris-wilson.co.uk>
To: intel-gfx@lists.freedesktop.org
Date: Wed, 13 Feb 2019 23:20:47 +0000
Message-Id: <20190213232047.8486-1-chris@chris-wilson.co.uk>
MIME-Version: 1.0
Subject: [Intel-gfx] [PATCH] drm/i915: Only try to stop engines after a
 failed reset
Precedence: list
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Series

drm/i915: Only try to stop engines after a failed reset | expand

Commit Message

Chris Wilson Feb. 13, 2019, 11:20 p.m. UTC

Currently we try to stop the engine by programming the ring registers to
be disabled before we perform the reset. Sometimes, we see the context
image also have invalid ring registers, which one presumes may be
actually caused by us doing so. Lets risk not doing programming the
ring to zero on the first attempt to avoid preserving that corruption
into the context image, leaving the w/a in place for subsequent
reset attempts.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_reset.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

Comments

Mika Kuoppala Feb. 14, 2019, 11:29 a.m. UTC | #1

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Currently we try to stop the engine by programming the ring registers to
> be disabled before we perform the reset. Sometimes, we see the context
> image also have invalid ring registers, which one presumes may be
> actually caused by us doing so. Lets risk not doing programming the
> ring to zero on the first attempt to avoid preserving that corruption
> into the context image, leaving the w/a in place for subsequent
> reset attempts.

Might lead to some failed attempts but the
sentiment that use finesse instead of biggest hammer
in arsenal is sane.

It makes me also ponder if we can fight against this
in other side of the fence. Doing a precursory check,
for debug builds, on (first) submit that the lrc is sane?
Using the lrc_regs_ok()?

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

-Mika

>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_reset.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> index 12e74decd7a2..dfced5be1042 100644
> --- a/drivers/gpu/drm/i915/i915_reset.c
> +++ b/drivers/gpu/drm/i915/i915_reset.c
> @@ -119,8 +119,10 @@ static void gen3_stop_engine(struct intel_engine_cs *engine)
>  	struct drm_i915_private *dev_priv = engine->i915;
>  	const u32 base = engine->mmio_base;
>  
> +	GEM_TRACE("%s\n", engine->name);
> +
>  	if (intel_engine_stop_cs(engine))
> -		DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n", engine->name);
> +		GEM_TRACE("%s: timed out on STOP_RING\n", engine->name);
>  
>  	I915_WRITE_FW(RING_HEAD(base), I915_READ_FW(RING_TAIL(base)));
>  	POSTING_READ_FW(RING_HEAD(base)); /* paranoia */
> @@ -133,9 +135,9 @@ static void gen3_stop_engine(struct intel_engine_cs *engine)
>  	I915_WRITE_FW(RING_CTL(base), 0);
>  
>  	/* Check acts as a post */
> -	if (I915_READ_FW(RING_HEAD(base)) != 0)
> -		DRM_DEBUG_DRIVER("%s: ring head not parked\n",
> -				 engine->name);
> +	if (I915_READ_FW(RING_HEAD(base)))
> +		GEM_TRACE("%s: ring head [%x] not parked\n",
> +			  engine->name, I915_READ_FW(RING_HEAD(base)));
>  }
>  
>  static void i915_stop_engines(struct drm_i915_private *i915,
> @@ -579,7 +581,8 @@ int intel_gpu_reset(struct drm_i915_private *i915, unsigned int engine_mask)
>  		 *
>  		 * FIXME: Wa for more modern gens needs to be validated
>  		 */
> -		i915_stop_engines(i915, engine_mask);
> +		if (retry)
> +			i915_stop_engines(i915, engine_mask);
>  
>  		GEM_TRACE("engine_mask=%x\n", engine_mask);
>  		preempt_disable();
> -- 
> 2.20.1
>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Chris Wilson Feb. 14, 2019, 11:39 a.m. UTC | #2

Quoting Mika Kuoppala (2019-02-14 11:29:13)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Currently we try to stop the engine by programming the ring registers to
> > be disabled before we perform the reset. Sometimes, we see the context
> > image also have invalid ring registers, which one presumes may be
> > actually caused by us doing so. Lets risk not doing programming the
> > ring to zero on the first attempt to avoid preserving that corruption
> > into the context image, leaving the w/a in place for subsequent
> > reset attempts.
> 
> Might lead to some failed attempts but the
> sentiment that use finesse instead of biggest hammer
> in arsenal is sane.

Yeah, the warning we have there is nasty, but fingers crossed this
balance works. So far gen9 seems happy, haven't dug out ctg yet.
Hopefully survives across the farm. I'm left again pondering if we can
increase the variety of hangs we induce.
 
> It makes me also ponder if we can fight against this
> in other side of the fence. Doing a precursory check,
> for debug builds, on (first) submit that the lrc is sane?
> Using the lrc_regs_ok()?

But we write those during first submit and before that the entire page
is invalid :)
-Chris

diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
index 12e74decd7a2..dfced5be1042 100644
--- a/drivers/gpu/drm/i915/i915_reset.c
+++ b/drivers/gpu/drm/i915/i915_reset.c
@@ -119,8 +119,10 @@  static void gen3_stop_engine(struct intel_engine_cs *engine)
 	struct drm_i915_private *dev_priv = engine->i915;
 	const u32 base = engine->mmio_base;
 
+	GEM_TRACE("%s\n", engine->name);
+
 	if (intel_engine_stop_cs(engine))
-		DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n", engine->name);
+		GEM_TRACE("%s: timed out on STOP_RING\n", engine->name);
 
 	I915_WRITE_FW(RING_HEAD(base), I915_READ_FW(RING_TAIL(base)));
 	POSTING_READ_FW(RING_HEAD(base)); /* paranoia */
@@ -133,9 +135,9 @@  static void gen3_stop_engine(struct intel_engine_cs *engine)
 	I915_WRITE_FW(RING_CTL(base), 0);
 
 	/* Check acts as a post */
-	if (I915_READ_FW(RING_HEAD(base)) != 0)
-		DRM_DEBUG_DRIVER("%s: ring head not parked\n",
-				 engine->name);
+	if (I915_READ_FW(RING_HEAD(base)))
+		GEM_TRACE("%s: ring head [%x] not parked\n",
+			  engine->name, I915_READ_FW(RING_HEAD(base)));
 }
 
 static void i915_stop_engines(struct drm_i915_private *i915,
@@ -579,7 +581,8 @@  int intel_gpu_reset(struct drm_i915_private *i915, unsigned int engine_mask)
 		 *
 		 * FIXME: Wa for more modern gens needs to be validated
 		 */
-		i915_stop_engines(i915, engine_mask);
+		if (retry)
+			i915_stop_engines(i915, engine_mask);
 
 		GEM_TRACE("engine_mask=%x\n", engine_mask);
 		preempt_disable();

drm/i915: Only try to stop engines after a failed reset

Commit Message

Comments

Patch