diff mbox series

drm/i915: Clear stop-engine for a pardoned reset

Message ID 20180814171857.24673-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived
Headers show
Series drm/i915: Clear stop-engine for a pardoned reset | expand

Commit Message

Chris Wilson Aug. 14, 2018, 5:18 p.m. UTC
If we pardon a per-engine reset, we may leave the STOP_RING bit asserted
in RING_MI_MODE resulting in the engine hanging. Unconditionally clear
it on the per-engine exit path as we know that either we skipped the
reset and so need the cancellation, or the reset was successful and the
cancellation is a no-op, or there was an error and we will follow up
with a full-reset or wedging (both of which will stop the engines again
as required).

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c         |  1 +
 drivers/gpu/drm/i915/intel_engine_cs.c  | 10 ++++++++++
 drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
 3 files changed, 12 insertions(+)

Comments

Chris Wilson Aug. 14, 2018, 5:53 p.m. UTC | #1
Quoting Patchwork (2018-08-14 18:48:08)
> == Series Details ==
> 
> Series: drm/i915: Clear stop-engine for a pardoned reset
> URL   : https://patchwork.freedesktop.org/series/48202/
> State : success
> 
> == Summary ==
> 
> = CI Bug Log - changes from CI_DRM_4666 -> Patchwork_9943 =
> 
> == Summary - SUCCESS ==
> 
>   No regressions found.
> 
>   External URL: https://patchwork.freedesktop.org/api/1.0/series/48202/revisions/1/mbox/
> 
> == Known issues ==
> 
>   Here are the changes found in Patchwork_9943 that come from known issues:
> 
>   === IGT changes ===
> 
>     ==== Issues hit ====
> 
>     igt@drv_selftest@live_hangcheck:
>       {fi-cfl-8109u}:     PASS -> DMESG-FAIL (fdo#106560)

A different bug at last! That's a missed CS interrupt of some
description instead.
-Chris
Mika Kuoppala Aug. 15, 2018, 8:52 a.m. UTC | #2
Chris Wilson <chris@chris-wilson.co.uk> writes:

> If we pardon a per-engine reset, we may leave the STOP_RING bit asserted
> in RING_MI_MODE resulting in the engine hanging. Unconditionally clear
> it on the per-engine exit path as we know that either we skipped the
> reset and so need the cancellation, or the reset was successful and the
> cancellation is a no-op, or there was an error and we will follow up
> with a full-reset or wedging (both of which will stop the engines again
> as required).
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c         |  1 +
>  drivers/gpu/drm/i915/intel_engine_cs.c  | 10 ++++++++++
>  drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
>  3 files changed, 12 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 9dce55182c3a..41111f2a9c39 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -2079,6 +2079,7 @@ int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
>  		goto out;
>  
>  out:
> +	intel_engine_cancel_stop_cs(engine);
>  	i915_gem_reset_finish_engine(engine);

Should we just lift the whole stop/start dance into
gem_reset_prepare|finish_engine()s?

-Mika

>  	return ret;
>  }
> diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
> index 99d5a24219c1..8628567d8f6e 100644
> --- a/drivers/gpu/drm/i915/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/intel_engine_cs.c
> @@ -788,6 +788,16 @@ int intel_engine_stop_cs(struct intel_engine_cs *engine)
>  	return err;
>  }
>  
> +void intel_engine_cancel_stop_cs(struct intel_engine_cs *engine)
> +{
> +	struct drm_i915_private *dev_priv = engine->i915;
> +
> +	GEM_TRACE("%s\n", engine->name);
> +
> +	I915_WRITE_FW(RING_MI_MODE(engine->mmio_base),
> +		      _MASKED_BIT_DISABLE(STOP_RING));
> +}
> +
>  const char *i915_cache_level_str(struct drm_i915_private *i915, int type)
>  {
>  	switch (type) {
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
> index 9090885d57de..3f6920dd7880 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.h
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
> @@ -906,6 +906,7 @@ int intel_init_blt_ring_buffer(struct intel_engine_cs *engine);
>  int intel_init_vebox_ring_buffer(struct intel_engine_cs *engine);
>  
>  int intel_engine_stop_cs(struct intel_engine_cs *engine);
> +void intel_engine_cancel_stop_cs(struct intel_engine_cs *engine);
>  
>  u64 intel_engine_get_active_head(const struct intel_engine_cs *engine);
>  u64 intel_engine_get_last_batch_head(const struct intel_engine_cs *engine);
> -- 
> 2.18.0
Chris Wilson Aug. 15, 2018, 9 a.m. UTC | #3
Quoting Mika Kuoppala (2018-08-15 09:52:18)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > If we pardon a per-engine reset, we may leave the STOP_RING bit asserted
> > in RING_MI_MODE resulting in the engine hanging. Unconditionally clear
> > it on the per-engine exit path as we know that either we skipped the
> > reset and so need the cancellation, or the reset was successful and the
> > cancellation is a no-op, or there was an error and we will follow up
> > with a full-reset or wedging (both of which will stop the engines again
> > as required).
> >
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > ---
> >  drivers/gpu/drm/i915/i915_drv.c         |  1 +
> >  drivers/gpu/drm/i915/intel_engine_cs.c  | 10 ++++++++++
> >  drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
> >  3 files changed, 12 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> > index 9dce55182c3a..41111f2a9c39 100644
> > --- a/drivers/gpu/drm/i915/i915_drv.c
> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> > @@ -2079,6 +2079,7 @@ int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
> >               goto out;
> >  
> >  out:
> > +     intel_engine_cancel_stop_cs(engine);
> >       i915_gem_reset_finish_engine(engine);
> 
> Should we just lift the whole stop/start dance into
> gem_reset_prepare|finish_engine()s?

No, because it is also used by wedging where we do want the asymmetry.
Been there, done that, have the gem_eio bruises.
-Chris
Mika Kuoppala Aug. 15, 2018, 9:08 a.m. UTC | #4
Chris Wilson <chris@chris-wilson.co.uk> writes:

> Quoting Mika Kuoppala (2018-08-15 09:52:18)
>> Chris Wilson <chris@chris-wilson.co.uk> writes:
>> 
>> > If we pardon a per-engine reset, we may leave the STOP_RING bit asserted
>> > in RING_MI_MODE resulting in the engine hanging. Unconditionally clear
>> > it on the per-engine exit path as we know that either we skipped the
>> > reset and so need the cancellation, or the reset was successful and the
>> > cancellation is a no-op, or there was an error and we will follow up
>> > with a full-reset or wedging (both of which will stop the engines again
>> > as required).
>> >
>> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> > ---
>> >  drivers/gpu/drm/i915/i915_drv.c         |  1 +
>> >  drivers/gpu/drm/i915/intel_engine_cs.c  | 10 ++++++++++
>> >  drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
>> >  3 files changed, 12 insertions(+)
>> >
>> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> > index 9dce55182c3a..41111f2a9c39 100644
>> > --- a/drivers/gpu/drm/i915/i915_drv.c
>> > +++ b/drivers/gpu/drm/i915/i915_drv.c
>> > @@ -2079,6 +2079,7 @@ int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
>> >               goto out;
>> >  
>> >  out:
>> > +     intel_engine_cancel_stop_cs(engine);
>> >       i915_gem_reset_finish_engine(engine);
>> 
>> Should we just lift the whole stop/start dance into
>> gem_reset_prepare|finish_engine()s?
>
> No, because it is also used by wedging where we do want the asymmetry.
> Been there, done that, have the gem_eio bruises.

On wedge the submission path is blocked but yeah,
gpu can be in any state and enabling the engine at that
point is asking for trouble.

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

> -Chris
Chris Wilson Aug. 15, 2018, 9:18 a.m. UTC | #5
Quoting Mika Kuoppala (2018-08-15 10:08:09)
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > Quoting Mika Kuoppala (2018-08-15 09:52:18)
> >> Chris Wilson <chris@chris-wilson.co.uk> writes:
> >> 
> >> > If we pardon a per-engine reset, we may leave the STOP_RING bit asserted
> >> > in RING_MI_MODE resulting in the engine hanging. Unconditionally clear
> >> > it on the per-engine exit path as we know that either we skipped the
> >> > reset and so need the cancellation, or the reset was successful and the
> >> > cancellation is a no-op, or there was an error and we will follow up
> >> > with a full-reset or wedging (both of which will stop the engines again
> >> > as required).
> >> >
> >> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> >> > ---
> >> >  drivers/gpu/drm/i915/i915_drv.c         |  1 +
> >> >  drivers/gpu/drm/i915/intel_engine_cs.c  | 10 ++++++++++
> >> >  drivers/gpu/drm/i915/intel_ringbuffer.h |  1 +
> >> >  3 files changed, 12 insertions(+)
> >> >
> >> > diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> >> > index 9dce55182c3a..41111f2a9c39 100644
> >> > --- a/drivers/gpu/drm/i915/i915_drv.c
> >> > +++ b/drivers/gpu/drm/i915/i915_drv.c
> >> > @@ -2079,6 +2079,7 @@ int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
> >> >               goto out;
> >> >  
> >> >  out:
> >> > +     intel_engine_cancel_stop_cs(engine);
> >> >       i915_gem_reset_finish_engine(engine);
> >> 
> >> Should we just lift the whole stop/start dance into
> >> gem_reset_prepare|finish_engine()s?
> >
> > No, because it is also used by wedging where we do want the asymmetry.
> > Been there, done that, have the gem_eio bruises.
> 
> On wedge the submission path is blocked but yeah,
> gpu can be in any state and enabling the engine at that
> point is asking for trouble.
> 
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Grabbed, pushed, and scarpered.

Although this should fix the silly STOP_RING hangs that have been
plaguing us for the last few months, there still seems to be more fun to
be had here. /o\
-Chris
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 9dce55182c3a..41111f2a9c39 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -2079,6 +2079,7 @@  int i915_reset_engine(struct intel_engine_cs *engine, const char *msg)
 		goto out;
 
 out:
+	intel_engine_cancel_stop_cs(engine);
 	i915_gem_reset_finish_engine(engine);
 	return ret;
 }
diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index 99d5a24219c1..8628567d8f6e 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -788,6 +788,16 @@  int intel_engine_stop_cs(struct intel_engine_cs *engine)
 	return err;
 }
 
+void intel_engine_cancel_stop_cs(struct intel_engine_cs *engine)
+{
+	struct drm_i915_private *dev_priv = engine->i915;
+
+	GEM_TRACE("%s\n", engine->name);
+
+	I915_WRITE_FW(RING_MI_MODE(engine->mmio_base),
+		      _MASKED_BIT_DISABLE(STOP_RING));
+}
+
 const char *i915_cache_level_str(struct drm_i915_private *i915, int type)
 {
 	switch (type) {
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 9090885d57de..3f6920dd7880 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -906,6 +906,7 @@  int intel_init_blt_ring_buffer(struct intel_engine_cs *engine);
 int intel_init_vebox_ring_buffer(struct intel_engine_cs *engine);
 
 int intel_engine_stop_cs(struct intel_engine_cs *engine);
+void intel_engine_cancel_stop_cs(struct intel_engine_cs *engine);
 
 u64 intel_engine_get_active_head(const struct intel_engine_cs *engine);
 u64 intel_engine_get_last_batch_head(const struct intel_engine_cs *engine);