diff mbox

drm/i915: Treat i915_reset_engine() as guilty until proven innocent

Message ID 20180406210014.31335-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson April 6, 2018, 9 p.m. UTC
If we are resetting just one engine, we know it has stalled. So we can
pass the stalled parameter directly to i915_gem_reset_engine(), which
alleviates the necessity to poke at the generic engine->hangcheck.stalled
magic variable, leaving that under control of hangcheck as its name
implies. Other than simplifying by removing the indirect parameter along
this path, this allows us to introduce new reset mechanisms that run
independently of hangcheck.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Jeff McGee <jeff.mcgee@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c               |  2 +-
 drivers/gpu/drm/i915/i915_drv.h               |  3 +-
 drivers/gpu/drm/i915/i915_gem.c               | 36 +++++++++----------
 .../gpu/drm/i915/selftests/intel_hangcheck.c  |  9 -----
 4 files changed, 20 insertions(+), 30 deletions(-)

Comments

Michel Thierry April 6, 2018, 9:23 p.m. UTC | #1
And I thought we believed in presumption of innocence...

On 4/6/2018 2:00 PM, Chris Wilson wrote:
> If we are resetting just one engine, we know it has stalled. So we can
> pass the stalled parameter directly to i915_gem_reset_engine(), which
> alleviates the necessity to poke at the generic engine->hangcheck.stalled
> magic variable, leaving that under control of hangcheck as its name
> implies. Other than simplifying by removing the indirect parameter along
> this path, this allows us to introduce new reset mechanisms that run
> independently of hangcheck.
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Jeff McGee <jeff.mcgee@intel.com>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/i915_drv.c               |  2 +-
>   drivers/gpu/drm/i915/i915_drv.h               |  3 +-
>   drivers/gpu/drm/i915/i915_gem.c               | 36 +++++++++----------
>   .../gpu/drm/i915/selftests/intel_hangcheck.c  |  9 -----
>   4 files changed, 20 insertions(+), 30 deletions(-)
> 
...
> @@ -774,7 +766,6 @@ static int __igt_reset_engines(struct drm_i915_private *i915,
>   				break;
>   			}
>   
> -			engine->hangcheck.stalled = false;
>   			count++;
>   
>   			if (rq) {
> 

Are the ones in igt_handle_error() still needed?
   hangcheck.stalled = true;
   hangcheck.seqno = intel_engine_get_seqno(engine);

Because igt_handle_error is sending a real request.
(I think the only ones remaining in the selftest should be in 
fake_hangcheck).

Reviewed-by: Michel Thierry <michel.thierry@intel.com>
Chris Wilson April 6, 2018, 9:30 p.m. UTC | #2
Quoting Michel Thierry (2018-04-06 22:23:21)
> And I thought we believed in presumption of innocence...
> 
> On 4/6/2018 2:00 PM, Chris Wilson wrote:
> > If we are resetting just one engine, we know it has stalled. So we can
> > pass the stalled parameter directly to i915_gem_reset_engine(), which
> > alleviates the necessity to poke at the generic engine->hangcheck.stalled
> > magic variable, leaving that under control of hangcheck as its name
> > implies. Other than simplifying by removing the indirect parameter along
> > this path, this allows us to introduce new reset mechanisms that run
> > independently of hangcheck.
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Michel Thierry <michel.thierry@intel.com>
> > Cc: Jeff McGee <jeff.mcgee@intel.com>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_drv.c               |  2 +-
> >   drivers/gpu/drm/i915/i915_drv.h               |  3 +-
> >   drivers/gpu/drm/i915/i915_gem.c               | 36 +++++++++----------
> >   .../gpu/drm/i915/selftests/intel_hangcheck.c  |  9 -----
> >   4 files changed, 20 insertions(+), 30 deletions(-)
> > 
> ...
> > @@ -774,7 +766,6 @@ static int __igt_reset_engines(struct drm_i915_private *i915,
> >                               break;
> >                       }
> >   
> > -                     engine->hangcheck.stalled = false;
> >                       count++;
> >   
> >                       if (rq) {
> > 
> 
> Are the ones in igt_handle_error() still needed?
>    hangcheck.stalled = true;
>    hangcheck.seqno = intel_engine_get_seqno(engine);
> 
> Because igt_handle_error is sending a real request.

> (I think the only ones remaining in the selftest should be in 
> fake_hangcheck).

Right, fake_hangcheck definitely still needs it to behave like
hangcheck.

i915_handle_error() is still "odd". At the moment, yes we still need to
be poking where we shouldn't. If i915_handle_error() uses the
engine_mask to do per-engine resets, no, we don't need the magic
hangcheck.stalled. But, if it falls back to the full device level, we
loose the guilty reset.  So we do get a difference in behaviour, that
really hasn't been noticed before as the only real caller is from
hangcheck. (i915_wedged, I dare anyone to say what they expect ;)

I think the answer will be to pass engine_mask to i915_reset. But I
haven't fleshed that out yet. I think it means we do away with
hangcheck.seqno as well, so bonus?
-Chris
Michel Thierry April 6, 2018, 9:44 p.m. UTC | #3
On 4/6/2018 2:30 PM, Chris Wilson wrote:
> Quoting Michel Thierry (2018-04-06 22:23:21)
>> And I thought we believed in presumption of innocence...
>>
>> On 4/6/2018 2:00 PM, Chris Wilson wrote:
>>> If we are resetting just one engine, we know it has stalled. So we can
>>> pass the stalled parameter directly to i915_gem_reset_engine(), which
>>> alleviates the necessity to poke at the generic engine->hangcheck.stalled
>>> magic variable, leaving that under control of hangcheck as its name
>>> implies. Other than simplifying by removing the indirect parameter along
>>> this path, this allows us to introduce new reset mechanisms that run
>>> independently of hangcheck.
>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Michel Thierry <michel.thierry@intel.com>
>>> Cc: Jeff McGee <jeff.mcgee@intel.com>
>>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_drv.c               |  2 +-
>>>    drivers/gpu/drm/i915/i915_drv.h               |  3 +-
>>>    drivers/gpu/drm/i915/i915_gem.c               | 36 +++++++++----------
>>>    .../gpu/drm/i915/selftests/intel_hangcheck.c  |  9 -----
>>>    4 files changed, 20 insertions(+), 30 deletions(-)
>>>
>> ...
>>> @@ -774,7 +766,6 @@ static int __igt_reset_engines(struct drm_i915_private *i915,
>>>                                break;
>>>                        }
>>>    
>>> -                     engine->hangcheck.stalled = false;
>>>                        count++;
>>>    
>>>                        if (rq) {
>>>
>>
>> Are the ones in igt_handle_error() still needed?
>>     hangcheck.stalled = true;
>>     hangcheck.seqno = intel_engine_get_seqno(engine);
>>
>> Because igt_handle_error is sending a real request.
> 
>> (I think the only ones remaining in the selftest should be in
>> fake_hangcheck).
> 
> Right, fake_hangcheck definitely still needs it to behave like
> hangcheck.
> 
> i915_handle_error() is still "odd". At the moment, yes we still need to
> be poking where we shouldn't. If i915_handle_error() uses the
> engine_mask to do per-engine resets, no, we don't need the magic
> hangcheck.stalled. But, if it falls back to the full device level, we
> loose the guilty reset.  So we do get a difference in behaviour, that
> really hasn't been noticed before as the only real caller is from
> hangcheck. (i915_wedged, I dare anyone to say what they expect ;)
> 

Not the first time I forget about the full reset path.
Thanks for explaining it.

> I think the answer will be to pass engine_mask to i915_reset. But I
> haven't fleshed that out yet. I think it means we do away with
> hangcheck.seqno as well, so bonus?
Chris Wilson April 6, 2018, 10:05 p.m. UTC | #4
Quoting Michel Thierry (2018-04-06 22:44:34)
> On 4/6/2018 2:30 PM, Chris Wilson wrote:
> > Quoting Michel Thierry (2018-04-06 22:23:21)
> >> And I thought we believed in presumption of innocence...
> >>
> >> On 4/6/2018 2:00 PM, Chris Wilson wrote:
> >>> If we are resetting just one engine, we know it has stalled. So we can
> >>> pass the stalled parameter directly to i915_gem_reset_engine(), which
> >>> alleviates the necessity to poke at the generic engine->hangcheck.stalled
> >>> magic variable, leaving that under control of hangcheck as its name
> >>> implies. Other than simplifying by removing the indirect parameter along
> >>> this path, this allows us to introduce new reset mechanisms that run
> >>> independently of hangcheck.
> >>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Michel Thierry <michel.thierry@intel.com>
> >>> Cc: Jeff McGee <jeff.mcgee@intel.com>
> >>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >>> ---
> >>>    drivers/gpu/drm/i915/i915_drv.c               |  2 +-
> >>>    drivers/gpu/drm/i915/i915_drv.h               |  3 +-
> >>>    drivers/gpu/drm/i915/i915_gem.c               | 36 +++++++++----------
> >>>    .../gpu/drm/i915/selftests/intel_hangcheck.c  |  9 -----
> >>>    4 files changed, 20 insertions(+), 30 deletions(-)
> >>>
> >> ...
> >>> @@ -774,7 +766,6 @@ static int __igt_reset_engines(struct drm_i915_private *i915,
> >>>                                break;
> >>>                        }
> >>>    
> >>> -                     engine->hangcheck.stalled = false;
> >>>                        count++;
> >>>    
> >>>                        if (rq) {
> >>>
> >>
> >> Are the ones in igt_handle_error() still needed?
> >>     hangcheck.stalled = true;
> >>     hangcheck.seqno = intel_engine_get_seqno(engine);
> >>
> >> Because igt_handle_error is sending a real request.
> > 
> >> (I think the only ones remaining in the selftest should be in
> >> fake_hangcheck).
> > 
> > Right, fake_hangcheck definitely still needs it to behave like
> > hangcheck.
> > 
> > i915_handle_error() is still "odd". At the moment, yes we still need to
> > be poking where we shouldn't. If i915_handle_error() uses the
> > engine_mask to do per-engine resets, no, we don't need the magic
> > hangcheck.stalled. But, if it falls back to the full device level, we
> > loose the guilty reset.  So we do get a difference in behaviour, that
> > really hasn't been noticed before as the only real caller is from
> > hangcheck. (i915_wedged, I dare anyone to say what they expect ;)
> > 
> 
> Not the first time I forget about the full reset path.
> Thanks for explaining it.
> 
> > I think the answer will be to pass engine_mask to i915_reset. But I
> > haven't fleshed that out yet. I think it means we do away with
> > hangcheck.seqno as well, so bonus?

Wasn't quite as hairy as I feared... At least the selftests were easy
enough to convert over!
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 684060ed8db6..7ce229c6f424 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -2050,7 +2050,7 @@  int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
 	 * active request and can drop it, adjust head to skip the offending
 	 * request to resume executing remaining requests in the queue.
 	 */
-	i915_gem_reset_engine(engine, active_request);
+	i915_gem_reset_engine(engine, active_request, true);
 
 	/*
 	 * The engine and its registers (and workarounds in case of render)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5373b171bb96..6b3f2f651def 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3132,7 +3132,8 @@  void i915_gem_reset_finish(struct drm_i915_private *dev_priv);
 void i915_gem_set_wedged(struct drm_i915_private *dev_priv);
 bool i915_gem_unset_wedged(struct drm_i915_private *dev_priv);
 void i915_gem_reset_engine(struct intel_engine_cs *engine,
-			   struct i915_request *request);
+			   struct i915_request *request,
+			   bool stalled);
 
 void i915_gem_init_mmio(struct drm_i915_private *i915);
 int __must_check i915_gem_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index a69dc19a0bdb..306d7a805eb7 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2990,20 +2990,6 @@  i915_gem_find_active_request(struct intel_engine_cs *engine)
 	return active;
 }
 
-static bool engine_stalled(struct intel_engine_cs *engine)
-{
-	if (!engine->hangcheck.stalled)
-		return false;
-
-	/* Check for possible seqno movement after hang declaration */
-	if (engine->hangcheck.seqno != intel_engine_get_seqno(engine)) {
-		DRM_DEBUG_DRIVER("%s pardoned\n", engine->name);
-		return false;
-	}
-
-	return true;
-}
-
 /*
  * Ensure irq handler finishes, and not run again.
  * Also return the active request so that we only search for it once.
@@ -3142,7 +3128,8 @@  static void engine_skip_context(struct i915_request *request)
 /* Returns the request if it was guilty of the hang */
 static struct i915_request *
 i915_gem_reset_request(struct intel_engine_cs *engine,
-		       struct i915_request *request)
+		       struct i915_request *request,
+		       bool stalled)
 {
 	/* The guilty request will get skipped on a hung engine.
 	 *
@@ -3165,7 +3152,15 @@  i915_gem_reset_request(struct intel_engine_cs *engine,
 	 * subsequent hangs.
 	 */
 
-	if (engine_stalled(engine)) {
+	if (i915_request_completed(request)) {
+		GEM_TRACE("%s pardoned global=%d (fence %llx:%d), current %d\n",
+			  engine->name, request->global_seqno,
+			  request->fence.context, request->fence.seqno,
+			  intel_engine_get_seqno(engine));
+		stalled = false;
+	}
+
+	if (stalled) {
 		i915_gem_context_mark_guilty(request->ctx);
 		skip_request(request);
 
@@ -3196,7 +3191,8 @@  i915_gem_reset_request(struct intel_engine_cs *engine,
 }
 
 void i915_gem_reset_engine(struct intel_engine_cs *engine,
-			   struct i915_request *request)
+			   struct i915_request *request,
+			   bool stalled)
 {
 	/*
 	 * Make sure this write is visible before we re-enable the interrupt
@@ -3206,7 +3202,7 @@  void i915_gem_reset_engine(struct intel_engine_cs *engine,
 	smp_store_mb(engine->irq_posted, 0);
 
 	if (request)
-		request = i915_gem_reset_request(engine, request);
+		request = i915_gem_reset_request(engine, request, stalled);
 
 	if (request) {
 		DRM_DEBUG_DRIVER("resetting %s to restart from tail of request 0x%x\n",
@@ -3229,7 +3225,9 @@  void i915_gem_reset(struct drm_i915_private *dev_priv)
 	for_each_engine(engine, dev_priv, id) {
 		struct i915_gem_context *ctx;
 
-		i915_gem_reset_engine(engine, engine->hangcheck.active_request);
+		i915_gem_reset_engine(engine,
+				      engine->hangcheck.active_request,
+				      engine->hangcheck.stalled);
 		ctx = fetch_and_zero(&engine->last_retired_context);
 		if (ctx)
 			engine->context_unpin(engine, ctx);
diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index 8650853c8cb3..acfb4dcc9fb5 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -522,9 +522,6 @@  static int __igt_reset_engine(struct drm_i915_private *i915, bool active)
 				i915_request_put(rq);
 			}
 
-			engine->hangcheck.stalled = true;
-			engine->hangcheck.seqno = seqno;
-
 			err = i915_reset_engine(engine, NULL);
 			if (err) {
 				pr_err("i915_reset_engine failed\n");
@@ -545,8 +542,6 @@  static int __igt_reset_engine(struct drm_i915_private *i915, bool active)
 				err = -EINVAL;
 				break;
 			}
-
-			engine->hangcheck.stalled = false;
 		} while (time_before(jiffies, end_time));
 		clear_bit(I915_RESET_ENGINE + id, &i915->gpu_error.flags);
 
@@ -764,9 +759,6 @@  static int __igt_reset_engines(struct drm_i915_private *i915,
 				seqno = rq->global_seqno - 1;
 			}
 
-			engine->hangcheck.stalled = true;
-			engine->hangcheck.seqno = seqno;
-
 			err = i915_reset_engine(engine, NULL);
 			if (err) {
 				pr_err("i915_reset_engine(%s:%s): failed, err=%d\n",
@@ -774,7 +766,6 @@  static int __igt_reset_engines(struct drm_i915_private *i915,
 				break;
 			}
 
-			engine->hangcheck.stalled = false;
 			count++;
 
 			if (rq) {