diff mbox series

[2/2] drm/i915: Force reset on unready engine

Message ID 20180813104242.32651-1-mika.kuoppala@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series None | expand

Commit Message

Mika Kuoppala Aug. 13, 2018, 10:42 a.m. UTC
If engine reports that it is not ready for reset, we
give up. Evidence shows that forcing a per engine reset
on an engine which is not reporting to be ready for reset,
can bring it back into a working order. There is risk that
we corrupt the context image currently executing on that
engine. But that is a risk worth taking as if we unblock
the engine, we prevent a whole device wedging in a case
of full gpu reset.

Reset individual engine even if it reports that it is not
prepared for reset, but only if we aim for full gpu reset
and not on first reset attempt.

v2: force reset only on later attempts, readability (Chris)
v3: simplify with adequate caffeine levels (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 36 ++++++++++++++++++-----------
 1 file changed, 23 insertions(+), 13 deletions(-)

Comments

Chris Wilson Aug. 13, 2018, 10:51 a.m. UTC | #1
Quoting Mika Kuoppala (2018-08-13 11:42:42)
> If engine reports that it is not ready for reset, we
> give up. Evidence shows that forcing a per engine reset
> on an engine which is not reporting to be ready for reset,
> can bring it back into a working order. There is risk that
> we corrupt the context image currently executing on that
> engine. But that is a risk worth taking as if we unblock
> the engine, we prevent a whole device wedging in a case
> of full gpu reset.
> 
> Reset individual engine even if it reports that it is not
> prepared for reset, but only if we aim for full gpu reset
> and not on first reset attempt.
> 
> v2: force reset only on later attempts, readability (Chris)
> v3: simplify with adequate caffeine levels (Chris)
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

One last thing, you said you recalled one of the reasons for its
existence was to prevent machine lockups on kbl. Is the recollection
true? Do we want to leave a comment in case of fire?
-Chris
Mika Kuoppala Aug. 13, 2018, 11:02 a.m. UTC | #2
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2018-08-13 11:42:42)
>> If engine reports that it is not ready for reset, we
>> give up. Evidence shows that forcing a per engine reset
>> on an engine which is not reporting to be ready for reset,
>> can bring it back into a working order. There is risk that
>> we corrupt the context image currently executing on that
>> engine. But that is a risk worth taking as if we unblock
>> the engine, we prevent a whole device wedging in a case
>> of full gpu reset.
>> 
>> Reset individual engine even if it reports that it is not
>> prepared for reset, but only if we aim for full gpu reset
>> and not on first reset attempt.
>> 
>> v2: force reset only on later attempts, readability (Chris)
>> v3: simplify with adequate caffeine levels (Chris)
>> 
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
>
> One last thing, you said you recalled one of the reasons for its
> existence was to prevent machine lockups on kbl. Is the recollection
> true? Do we want to leave a comment in case of fire?

We got machine lockups if we did reset a non stopped, active
engine inside a batchbuffer. i915_stop_engines() arise from that
and we have a comment in intel_gpu_reset explaining it. That lockup
did apparently happen regardless of ready-to-reset ack.

How I read it is that we got ready-to-reset acks on
active engines, which then died if we proceed. So this
patch should not make things worse as i915_stop_engines
have hold water.

-Mika *knocks on wood*
Chris Wilson Aug. 13, 2018, 11:08 a.m. UTC | #3
Quoting Mika Kuoppala (2018-08-13 12:02:51)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Quoting Mika Kuoppala (2018-08-13 11:42:42)
> >> If engine reports that it is not ready for reset, we
> >> give up. Evidence shows that forcing a per engine reset
> >> on an engine which is not reporting to be ready for reset,
> >> can bring it back into a working order. There is risk that
> >> we corrupt the context image currently executing on that
> >> engine. But that is a risk worth taking as if we unblock
> >> the engine, we prevent a whole device wedging in a case
> >> of full gpu reset.
> >> 
> >> Reset individual engine even if it reports that it is not
> >> prepared for reset, but only if we aim for full gpu reset
> >> and not on first reset attempt.
> >> 
> >> v2: force reset only on later attempts, readability (Chris)
> >> v3: simplify with adequate caffeine levels (Chris)
> >> 
> >> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >> Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
> >
> > One last thing, you said you recalled one of the reasons for its
> > existence was to prevent machine lockups on kbl. Is the recollection
> > true? Do we want to leave a comment in case of fire?
> 
> We got machine lockups if we did reset a non stopped, active
> engine inside a batchbuffer. i915_stop_engines() arise from that
> and we have a comment in intel_gpu_reset explaining it. That lockup
> did apparently happen regardless of ready-to-reset ack.
> 
> How I read it is that we got ready-to-reset acks on
> active engines, which then died if we proceed. So this
> patch should not make things worse as i915_stop_engines
> have hold water.

Ok, I still think a comment about the danger for last-chance with a
reference back to i915_stop_engines will be useful for future
archaeology.
-Chris
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 027d14574bfa..7cba14ae82a1 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -2085,7 +2085,7 @@  int __intel_wait_for_register(struct drm_i915_private *dev_priv,
 	return ret;
 }
 
-static int gen8_reset_engine_start(struct intel_engine_cs *engine)
+static int gen8_engine_reset_prepare(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 	int ret;
@@ -2105,7 +2105,7 @@  static int gen8_reset_engine_start(struct intel_engine_cs *engine)
 	return ret;
 }
 
-static void gen8_reset_engine_cancel(struct intel_engine_cs *engine)
+static void gen8_engine_reset_cancel(struct intel_engine_cs *engine)
 {
 	struct drm_i915_private *dev_priv = engine->i915;
 
@@ -2113,29 +2113,36 @@  static void gen8_reset_engine_cancel(struct intel_engine_cs *engine)
 		      _MASKED_BIT_DISABLE(RESET_CTL_REQUEST_RESET));
 }
 
+static int reset_engines(struct drm_i915_private *i915,
+			 unsigned int engine_mask,
+			 unsigned int retry)
+{
+	if (INTEL_GEN(i915) >= 11)
+		return gen11_reset_engines(i915, engine_mask);
+	else
+		return gen6_reset_engines(i915, engine_mask, retry);
+}
+
 static int gen8_reset_engines(struct drm_i915_private *dev_priv,
 			      unsigned int engine_mask,
 			      unsigned int retry)
 {
 	struct intel_engine_cs *engine;
+	const bool reset_non_ready = retry >= 1;
 	unsigned int tmp;
 	int ret;
 
 	for_each_engine_masked(engine, dev_priv, engine_mask, tmp) {
-		if (gen8_reset_engine_start(engine)) {
-			ret = -EIO;
-			goto not_ready;
-		}
+		ret = gen8_engine_reset_prepare(engine);
+		if (ret && !reset_non_ready)
+			goto skip_reset;
 	}
 
-	if (INTEL_GEN(dev_priv) >= 11)
-		ret = gen11_reset_engines(dev_priv, engine_mask);
-	else
-		ret = gen6_reset_engines(dev_priv, engine_mask, retry);
+	ret = reset_engines(dev_priv, engine_mask, retry);
 
-not_ready:
+skip_reset:
 	for_each_engine_masked(engine, dev_priv, engine_mask, tmp)
-		gen8_reset_engine_cancel(engine);
+		gen8_engine_reset_cancel(engine);
 
 	return ret;
 }
@@ -2164,12 +2171,15 @@  static reset_func intel_get_gpu_reset(struct drm_i915_private *dev_priv)
 		return NULL;
 }
 
-int intel_gpu_reset(struct drm_i915_private *dev_priv, unsigned int engine_mask)
+int intel_gpu_reset(struct drm_i915_private *dev_priv,
+		    const unsigned int engine_mask)
 {
 	reset_func reset = intel_get_gpu_reset(dev_priv);
 	unsigned int retry;
 	int ret;
 
+	GEM_BUG_ON(!engine_mask);
+
 	/*
 	 * We want to perform per-engine reset from atomic context (e.g.
 	 * softirq), which imposes the constraint that we cannot sleep.