[1/2] drm/i915: Pull sync_scru for device reset outside of wedge_mutex

Message ID 20190211135040.1234-1-chris@chris-wilson.co.uk
State New

Commit Message

Chris Wilson Feb. 11, 2019, 1:50 p.m. UTC
We need to flush our srcu protecting resources about to be clobbered
by the reset, inside of our timer failsafe but outside of the
error->wedge_mutex, so that the failsafe can run in case the
synchronize_srcu() takes too long (hits a shrinker deadlock?).

Fixes: 72eb16df010a ("drm/i915: Serialise resets with wedging")
References: https://bugs.freedesktop.org/show_bug.cgi?id=109605
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_reset.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
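
A toy userspace model of the ordering argument above may help; it is only a
sketch and not the i915 code. It assumes (this is an inference from the commit
message, not something shown in the diff) that the timer failsafe must take
wedge_mutex before it can wedge the device. The names reset_model,
failsafe_fire and failsafe_recovers are hypothetical and exist only for
illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the reset path; none of this is driver code. */
struct reset_model {
	bool wedge_mutex_held;	/* stands in for error->wedge_mutex */
	bool wedged;		/* set once the failsafe has wedged the device */
};

/*
 * Failsafe timer callback. Assumption: it needs wedge_mutex to wedge
 * the device, so it cannot make progress while the reset path holds it.
 */
bool failsafe_fire(struct reset_model *m)
{
	if (m->wedge_mutex_held)
		return false;	/* would block behind the stalled reset */

	m->wedged = true;
	return true;
}

/*
 * Model a flush (the synchronize_srcu() stand-in) that never returns,
 * and report whether the failsafe can still recover the device.
 */
bool failsafe_recovers(bool flush_under_mutex)
{
	struct reset_model m = { false, false };

	/* Old order: take the mutex first, then flush inside do_reset(). */
	if (flush_under_mutex)
		m.wedge_mutex_held = true;

	/* The flush stalls here; the timeout expires and the timer fires. */
	return failsafe_fire(&m);
}
```

In this model failsafe_recovers(true) corresponds to the old placement of the
flush in do_reset(), and failsafe_recovers(false) to the new placement in
i915_reset_device() ahead of mutex_lock(). Only the latter lets the failsafe
make progress when the flush stalls.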

Comments

Mika Kuoppala Feb. 11, 2019, 3:09 p.m. UTC | #1
Chris Wilson <chris@chris-wilson.co.uk> writes:

> diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
> index 9494b015185a..c2b7570730c2 100644
> --- a/drivers/gpu/drm/i915/i915_reset.c
> +++ b/drivers/gpu/drm/i915/i915_reset.c
> @@ -941,9 +941,6 @@ static int do_reset(struct drm_i915_private *i915, unsigned int stalled_mask)
>  {
>  	int err, i;
>  
> -	/* Flush everyone currently using a resource about to be clobbered */
> -	synchronize_srcu(&i915->gpu_error.reset_backoff_srcu);
> -
>  	err = intel_gpu_reset(i915, ALL_ENGINES);
>  	for (i = 0; err && i < RESET_MAX_RETRIES; i++) {
>  		msleep(10 * (i + 1));
> @@ -1140,6 +1137,9 @@ static void i915_reset_device(struct drm_i915_private *i915,
>  	i915_wedge_on_timeout(&w, i915, 5 * HZ) {
>  		intel_prepare_reset(i915);
>  
> +		/* Flush everyone using a resource about to be clobbered */
> +		synchronize_srcu(&error->reset_backoff_srcu);
> +

Do we easily see which one it will be? This one or
the block below to timeout on wedge?

Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

>  		mutex_lock(&error->wedge_mutex);
>  		i915_reset(i915, engine_mask, reason);
>  		mutex_unlock(&error->wedge_mutex);
> -- 
> 2.20.1
>
Chris Wilson Feb. 11, 2019, 3:14 p.m. UTC | #2
Quoting Mika Kuoppala (2019-02-11 15:09:48)
> Do we easily see which one it will be? This one or
> the block below to timeout on wedge?

It would be easy to reconstruct if we had all the stack traces, so we
could see which process is stuck where, but we do not. Failing that, my
hunch is that it's sync_srcu taking too long, and by design we know it
can deadlock around an unfortunate shrinker interaction :( But I'm not
entirely convinced we're hitting that.
-Chris
Chris Wilson Feb. 11, 2019, 9:12 p.m. UTC | #3
Quoting Patchwork (2019-02-11 19:38:40)
> == Series Details ==
> 
> Series: series starting with [1/2] drm/i915: Pull sync_scru for device reset outside of wedge_mutex
> URL   : https://patchwork.freedesktop.org/series/56496/
> State : success
> 
> == Summary ==
> 
> CI Bug Log - changes from CI_DRM_5588_full -> Patchwork_12194_full
> ====================================================
> 
> Summary
> -------
> 
>   **SUCCESS**
> 
>   No regressions found.

Not mentioned was that this didn't fix the hang. Nevertheless, it should
prevent one corner case from ending up in a deadlock, and it does mean
that the srcu is not responsible for the timeout here. Exactly what is
responsible remains a mystery.
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/i915_reset.c b/drivers/gpu/drm/i915/i915_reset.c
index 9494b015185a..c2b7570730c2 100644
--- a/drivers/gpu/drm/i915/i915_reset.c
+++ b/drivers/gpu/drm/i915/i915_reset.c
@@ -941,9 +941,6 @@  static int do_reset(struct drm_i915_private *i915, unsigned int stalled_mask)
 {
 	int err, i;
 
-	/* Flush everyone currently using a resource about to be clobbered */
-	synchronize_srcu(&i915->gpu_error.reset_backoff_srcu);
-
 	err = intel_gpu_reset(i915, ALL_ENGINES);
 	for (i = 0; err && i < RESET_MAX_RETRIES; i++) {
 		msleep(10 * (i + 1));
@@ -1140,6 +1137,9 @@  static void i915_reset_device(struct drm_i915_private *i915,
 	i915_wedge_on_timeout(&w, i915, 5 * HZ) {
 		intel_prepare_reset(i915);
 
+		/* Flush everyone using a resource about to be clobbered */
+		synchronize_srcu(&error->reset_backoff_srcu);
+
 		mutex_lock(&error->wedge_mutex);
 		i915_reset(i915, engine_mask, reason);
 		mutex_unlock(&error->wedge_mutex);