drm/i915: Stop ring before doing readiness check

Message ID	20170913140117.11072-1-mika.kuoppala@intel.com (mailing list archive)
State	New, archived
Headers
Return-Path: <intel-gfx-bounces@lists.freedesktop.org>
From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
To: intel-gfx@lists.freedesktop.org
Date: Wed, 13 Sep 2017 17:01:17 +0300
Message-Id: <20170913140117.11072-1-mika.kuoppala@intel.com>
Subject: [Intel-gfx] [PATCH] drm/i915: Stop ring before doing readiness check
Precedence: list
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Commit Message

Mika Kuoppala Sept. 13, 2017, 2:01 p.m. UTC

Evidence indicates that even when the hardware happily
tells us to proceed with the reset, it really isn't ready.
Resetting a freely running batchbuffer after we have
received the readiness ack can still cause a system hang.

Attempt to stop the ring before proceeding to the readiness
check and reset, to avoid losing the machine.

Testcase: igt/prime_busy/hang-* # kbl
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 54 ++++++++++++++++++++++---------------
 1 file changed, 32 insertions(+), 22 deletions(-)
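
The ordering change described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions: every name here (toy_engine, toy_stop_ring, toy_request_reset, toy_do_reset) is hypothetical and none of it is i915 code. The model treats the "can cause a system hang" case as a deterministic worst case, and it assumes the readiness ack is given whether or not a batch is still running, which is the behaviour the commit message reports.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of one engine; not the i915 data structures. */
struct toy_engine {
	bool batch_running;   /* a batchbuffer is still executing */
	bool reset_requested; /* the reset request has been issued */
};

enum toy_result { TOY_RESET_OK, TOY_NOT_READY, TOY_SYSTEM_HANG };

/* Models stopping the ring: nothing executes afterwards. */
static void toy_stop_ring(struct toy_engine *e)
{
	e->batch_running = false;
}

/*
 * Models the readiness handshake. Assumption of this sketch: the ack
 * is returned even while a batch is running.
 */
static bool toy_request_reset(struct toy_engine *e)
{
	e->reset_requested = true;
	return true; /* "ready" ack */
}

/* Worst case: resetting under a running batch hangs the machine. */
static enum toy_result toy_do_reset(struct toy_engine *e)
{
	assert(e->reset_requested); /* reset must follow a request */
	return e->batch_running ? TOY_SYSTEM_HANG : TOY_RESET_OK;
}

/* Old order: readiness check, then reset. */
static enum toy_result toy_reset_without_stop(struct toy_engine *e)
{
	if (!toy_request_reset(e))
		return TOY_NOT_READY;
	return toy_do_reset(e);
}

/* New order: stop the ring, then readiness check, then reset. */
static enum toy_result toy_reset_with_stop(struct toy_engine *e)
{
	toy_stop_ring(e);
	if (!toy_request_reset(e))
		return TOY_NOT_READY;
	return toy_do_reset(e);
}
```

In this model an engine with batch_running set hangs under toy_reset_without_stop() and resets cleanly under toy_reset_with_stop(). The ack alone does not distinguish the two cases, which is why the stop has to come first.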

Comments

Chris Wilson Sept. 13, 2017, 2:08 p.m. UTC | #1

Quoting Mika Kuoppala (2017-09-13 15:01:17)
> Evidence indicates that even if the hardware happily
> tells us to proceed with reset, it really isn't ready.
> Resetting a freely running batchbuffer after we have
> got ack for readiness, still can cause a system hang.

Hmm, so we see it on early gen and late gen. I suggest we do it
universally (except gen2 which is lacking the mechanism). It's unlikely
that the requirement disappeared just for a couple of gen, more likely
that we simply haven't triggered the pathological behaviour.

Other than,
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
for the find.
-Chris

Ville Syrjälä Sept. 13, 2017, 2:13 p.m. UTC | #2

On Wed, Sep 13, 2017 at 03:08:06PM +0100, Chris Wilson wrote:
> Quoting Mika Kuoppala (2017-09-13 15:01:17)
> > Evidence indicates that even if the hardware happily
> > tells us to proceed with reset, it really isn't ready.
> > Resetting a freely running batchbuffer after we have
> > got ack for readiness, still can cause a system hang.
> 
> Hmm, so we see it on early gen and late gen. I suggest we do it
> universally (except gen2 which is lacking the mechanism). It's unlikely
> that the requirement disappeared just for a couple of gen, more likely
> that we simply haven't triggered the pathological behaviour.

Could just try setting ring enable=false on gen2 maybe? But we don't have
GPU reset for gen2 anyway so I guess it doesn't matter.

Mika Kuoppala Sept. 13, 2017, 2:15 p.m. UTC | #3

Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2017-09-13 15:01:17)
>> Evidence indicates that even if the hardware happily
>> tells us to proceed with reset, it really isn't ready.
>> Resetting a freely running batchbuffer after we have
>> got ack for readiness, still can cause a system hang.
>
> Hmm, so we see it on early gen and late gen. I suggest we do it
> universally (except gen2 which is lacking the mechanism). It's unlikely
> that the requirement disappeared just for a couple of gen, more likely
> that we simply haven't triggered the pathological behaviour.
>

Agreed that we should do a blanket approach. I was in a hurry
to post a proposed fix as I heard the prime_* are not yet
blacklisted on shards. So let's hope this helps.

> Other than,
> Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
> for the find.

Ta.

-Mika

Chris Wilson Sept. 14, 2017, 10:33 a.m. UTC | #4

Quoting Patchwork (2017-09-14 01:07:40)
> == Series Details ==
> 
> Series: drm/i915: Stop ring before doing readiness check
> URL   : https://patchwork.freedesktop.org/series/30298/
> State : failure
> 
> == Summary ==
> 
> Test kms_cursor_legacy:
>         Subgroup cursorA-vs-flipA-atomic-transitions:
>                 pass       -> FAIL       (shard-hsw)
> Test drv_missed_irq:
>                 pass       -> FAIL       (shard-hsw)
> Test kms_setmode:
>         Subgroup basic:
>                 pass       -> FAIL       (shard-hsw) fdo#99912
> 
> fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
> 
> shard-hsw        total:2313 pass:1242 dwarn:0   dfail:0   fail:16  skip:1055 time:9618s
> 
> == Logs ==
> 
> For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_5685/shards.html

Of course, it decided not to run the prime_busy hang tests!!!
-Chris

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 1b38eb94d461..f9ef1931516c 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1361,33 +1361,38 @@  int i915_reg_read_ioctl(struct drm_device *dev,
 	return ret;
 }
 
+static void gen3_stop_ring(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+	const u32 base = engine->mmio_base;
+	const i915_reg_t mode = RING_MI_MODE(base);
+
+	I915_WRITE_FW(mode, _MASKED_BIT_ENABLE(STOP_RING));
+	if (intel_wait_for_register_fw(dev_priv,
+				       mode,
+				       MODE_IDLE,
+				       MODE_IDLE,
+				       500))
+		DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
+				 engine->name);
+
+	I915_WRITE_FW(RING_CTL(base), 0);
+	I915_WRITE_FW(RING_HEAD(base), 0);
+	I915_WRITE_FW(RING_TAIL(base), 0);
+
+	/* Check acts as a post */
+	if (I915_READ_FW(RING_HEAD(base)) != 0)
+		DRM_DEBUG_DRIVER("%s: ring head not parked\n",
+				 engine->name);
+}
+
 static void gen3_stop_rings(struct drm_i915_private *dev_priv)
 {
 	struct intel_engine_cs *engine;
 	enum intel_engine_id id;
 
-	for_each_engine(engine, dev_priv, id) {
-		const u32 base = engine->mmio_base;
-		const i915_reg_t mode = RING_MI_MODE(base);
-
-		I915_WRITE_FW(mode, _MASKED_BIT_ENABLE(STOP_RING));
-		if (intel_wait_for_register_fw(dev_priv,
-					       mode,
-					       MODE_IDLE,
-					       MODE_IDLE,
-					       500))
-			DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
-					 engine->name);
-
-		I915_WRITE_FW(RING_CTL(base), 0);
-		I915_WRITE_FW(RING_HEAD(base), 0);
-		I915_WRITE_FW(RING_TAIL(base), 0);
-
-		/* Check acts as a post */
-		if (I915_READ_FW(RING_HEAD(base)) != 0)
-			DRM_DEBUG_DRIVER("%s: ring head not parked\n",
-					 engine->name);
-	}
+	for_each_engine(engine, dev_priv, id)
+		gen3_stop_ring(engine);
 }
 
 static bool i915_reset_complete(struct pci_dev *pdev)
@@ -1668,6 +1673,11 @@  static int gen8_reset_engine_start(struct intel_engine_cs *engine)
 	struct drm_i915_private *dev_priv = engine->i915;
 	int ret;
 
+	/* If the bb is still running at this stage, forcing a
+	 * reset risks a system hang.
+	 */
+	gen3_stop_ring(engine);
+
 	I915_WRITE_FW(RING_RESET_CTL(engine->mmio_base),
 		      _MASKED_BIT_ENABLE(RESET_CTL_REQUEST_RESET));
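
For readers unfamiliar with the two idioms gen3_stop_ring() relies on, the following is a minimal user-space sketch, not the driver's implementation. It models a masked register write (the upper 16 bits of the written value select which of the lower 16 bits are updated) and a bounded poll that waits until (reg & mask) == value. All toy_* names are hypothetical, the bit positions are illustrative assumptions, and the poll counts reads rather than milliseconds.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions are illustrative assumptions for this sketch. */
#define TOY_STOP_RING (1u << 8)
#define TOY_MODE_IDLE (1u << 9)

/* Masked-write encoding: upper half is the write-enable mask. */
static uint32_t toy_masked_field(uint32_t mask, uint32_t value)
{
	assert((mask >> 16) == 0); /* only the low 16 bits are maskable */
	return (mask << 16) | value;
}

static uint32_t toy_masked_bit_enable(uint32_t bit)
{
	return toy_masked_field(bit, bit);
}

static uint32_t toy_masked_bit_disable(uint32_t bit)
{
	return toy_masked_field(bit, 0);
}

/* Hypothetical model of how a masked register applies a write. */
static uint32_t toy_apply_masked_write(uint32_t old, uint32_t written)
{
	uint32_t mask = written >> 16;
	uint32_t value = written & 0xffffu;

	/* Unmasked bits keep their old value; masked bits take the new one. */
	return (old & ~mask & 0xffffu) | (value & mask);
}

/* Toy register that reports idle a few reads after STOP_RING is set. */
struct toy_reg {
	uint32_t value;
	int polls_until_idle; /* must start > 0 for idle to ever appear */
};

static uint32_t toy_read(struct toy_reg *r)
{
	if ((r->value & TOY_STOP_RING) && r->polls_until_idle > 0 &&
	    --r->polls_until_idle == 0)
		r->value |= TOY_MODE_IDLE;
	return r->value;
}

/* Returns 0 once (reg & mask) == value, -1 after max_polls reads. */
static int toy_wait_for_register(struct toy_reg *r, uint32_t mask,
				 uint32_t value, int max_polls)
{
	for (int i = 0; i < max_polls; i++)
		if ((toy_read(r) & mask) == value)
			return 0;
	return -1;
}
```

A timeout in this sketch is reported and nothing more, which matches the patch: gen3_stop_ring() only logs "timed out on STOP_RING" and carries on zeroing the ring registers.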