diff mbox

[02/11] drm/i915: Only reset hangcheck at the start of an activity cycle

Message ID 20180709130208.11730-2-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Chris Wilson July 9, 2018, 1:01 p.m. UTC
Across a reset, the seqno (and thus hangcheck) should restart and the
hangcheck naturally progress, for when it does not, we want to declare an
emergency. Currently, we only detect if reset and reinit fails, but we
do not detect if the call to reinit succeeds but the HW is fried - as we
are resetting hangcheck on initialisation the engine. Remove that and
rely on the natural progress to reset the hangcheck timer.

References: e21b141376f9 ("drm/i915: Mark the hangcheck as idle when unparking the engines")
References: 1fd00c0faeec ("drm/i915: Declare the driver wedged if hangcheck makes no progress")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c        | 1 -
 drivers/gpu/drm/i915/intel_ringbuffer.c | 2 --
 2 files changed, 3 deletions(-)

Comments

Mika Kuoppala July 9, 2018, 2:13 p.m. UTC | #1
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Across a reset, the seqno (and thus hangcheck) should restart and the
> hangcheck naturally progress, for when it does not, we want to declare an
> emergency. Currently, we only detect if reset and reinit fails, but we
> do not detect if the call to reinit succeeds but the HW is fried - as we
> are resetting hangcheck on initialisation the engine. Remove that and
> rely on the natural progress to reset the hangcheck timer.

I take it that the intention is not to give reset
any special leeway wrt to request completion. So
we now assume that reset/recovery must fit inside
one hangcheck tick?

-Mika


>
> References: e21b141376f9 ("drm/i915: Mark the hangcheck as idle when unparking the engines")
> References: 1fd00c0faeec ("drm/i915: Declare the driver wedged if hangcheck makes no progress")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/intel_lrc.c        | 1 -
>  drivers/gpu/drm/i915/intel_ringbuffer.c | 2 --
>  2 files changed, 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
> index ab89dabc2965..933495996e91 100644
> --- a/drivers/gpu/drm/i915/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/intel_lrc.c
> @@ -1852,7 +1852,6 @@ static int gen8_init_common_ring(struct intel_engine_cs *engine)
>  		return ret;
>  
>  	intel_engine_reset_breadcrumbs(engine);
> -	intel_engine_init_hangcheck(engine);
>  
>  	if (GEM_SHOW_DEBUG() && unexpected_starting_state(engine)) {
>  		struct drm_printer p = drm_debug_printer(__func__);
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index 700f94c371b3..f4bd185c9369 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -524,8 +524,6 @@ static int init_ring_common(struct intel_engine_cs *engine)
>  		goto out;
>  	}
>  
> -	intel_engine_init_hangcheck(engine);
> -
>  	if (INTEL_GEN(dev_priv) > 2)
>  		I915_WRITE_MODE(engine, _MASKED_BIT_DISABLE(STOP_RING));
>  
> -- 
> 2.18.0
Chris Wilson July 9, 2018, 2:27 p.m. UTC | #2
Quoting Mika Kuoppala (2018-07-09 15:13:44)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Across a reset, the seqno (and thus hangcheck) should restart and the
> > hangcheck naturally progress, for when it does not, we want to declare an
> > emergency. Currently, we only detect if reset and reinit fails, but we
> > do not detect if the call to reinit succeeds but the HW is fried - as we
> > are resetting hangcheck on initialisation the engine. Remove that and
> > rely on the natural progress to reset the hangcheck timer.
> 
> I take it that the intention is not to give reset
> any special leeway wrt to request completion. So
> we now assume that reset/recovery must fit inside
> one hangcheck tick?

We call the synchronous i915_handle_error() from inside hangcheck, so we
know the reset is completed before we schedule the next tick. So yes it
seems fair that the recovery should always be expected to complete
within that tick as we would expect any other batch to complete (and the
recovery request is just to advance the breadcrumb, no batch).

So yes, reset/recovery must fit inside the tick.
-Chris
Mika Kuoppala July 9, 2018, 2:35 p.m. UTC | #3
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2018-07-09 15:13:44)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > Across a reset, the seqno (and thus hangcheck) should restart and the
>> > hangcheck naturally progress, for when it does not, we want to declare an
>> > emergency. Currently, we only detect if reset and reinit fails, but we
>> > do not detect if the call to reinit succeeds but the HW is fried - as we
>> > are resetting hangcheck on initialisation the engine. Remove that and
>> > rely on the natural progress to reset the hangcheck timer.
>> 
>> I take it that the intention is not to give reset
>> any special leeway wrt to request completion. So
>> we now assume that reset/recovery must fit inside
>> one hangcheck tick?
>
> We call the synchronous i915_handle_error() from inside hangcheck, so we
> know the reset is completed before we schedule the next tick. So yes it
> seems fair that the recovery should always be expected to complete
> within that tick as we would expect any other batch to complete (and the
> recovery request is just to advance the breadcrumb, no batch).
>
> So yes, reset/recovery must fit inside the tick.

Worthy goal. And yes it explains the natural progression in
the commit message.

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index ab89dabc2965..933495996e91 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1852,7 +1852,6 @@  static int gen8_init_common_ring(struct intel_engine_cs *engine)
 		return ret;
 
 	intel_engine_reset_breadcrumbs(engine);
-	intel_engine_init_hangcheck(engine);
 
 	if (GEM_SHOW_DEBUG() && unexpected_starting_state(engine)) {
 		struct drm_printer p = drm_debug_printer(__func__);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 700f94c371b3..f4bd185c9369 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -524,8 +524,6 @@  static int init_ring_common(struct intel_engine_cs *engine)
 		goto out;
 	}
 
-	intel_engine_init_hangcheck(engine);
-
 	if (INTEL_GEN(dev_priv) > 2)
 		I915_WRITE_MODE(engine, _MASKED_BIT_DISABLE(STOP_RING));