Patchwork [5/9] drm/i915: More surgically unbreak the modeset vs reset deadlock

Submitter Daniel Vetter
Date July 19, 2017, 12:54 p.m.
Message ID <20170719125502.25696-6-daniel.vetter@ffwll.ch>
Permalink /patch/9851949/
State New

Comments

Daniel Vetter - July 19, 2017, 12:54 p.m.
There's no reason to entirely wedge the gpu: for the minimal deadlock
bugfix we only need to unbreak/decouple the atomic commit from the gpu
reset. The simplest way to fix that is to replace the unconditional
fence wait at the top of commit_tail with a wait which completes either
when the fences are done (the normal case, or when a reset doesn't need
to touch the display state), or when the gpu reset needs to
force-unblock all pending modeset states.

Note that in both cases TDR itself keeps working, so from a userspace
pov this trickery isn't observable. Users themselves might spot a short
glitch while the rendering is catching up again, but that's still
better than pre-TDR, where we threw away all the rendering, including
innocent batches. Also, this fixes the regression TDR introduced of
making gpu resets deadlock-prone when we do need to touch the display.

One thing I noticed is that gpu_error.flags seems to use both our own
wait-queue in gpu_error.wait_queue and the generic wait_on_bit
facilities. Not entirely sure why this inconsistency exists; I just
picked one style.

A possible future avenue could be to insert the gpu reset in-between
ongoing modeset changes, which would avoid the momentary glitch. But
that's a lot more work to implement in the atomic commit machinery,
and given that we only need this for pre-g4x hw, of questionable
utility just for the sake of polishing gpu reset even more on those
old boxes. It might be useful for other features though.

v2: Rebase onto 4.13 with a s/wait_queue_t/struct wait_queue_entry/.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h      |  1 +
 drivers/gpu/drm/i915/intel_display.c | 35 ++++++++++++++++++++++++++++++-----
 2 files changed, 31 insertions(+), 5 deletions(-)
Chris Wilson - July 19, 2017, 1:42 p.m.
Quoting Daniel Vetter (2017-07-19 13:54:58)
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 07e98b07c5bc..369968539b40 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1564,6 +1564,7 @@ struct i915_gpu_error {
>         unsigned long flags;
>  #define I915_RESET_BACKOFF     0
>  #define I915_RESET_HANDOFF     1
> +#define I915_RESET_MODESET     2
>  #define I915_WEDGED            (BITS_PER_LONG - 1)
>  #define I915_RESET_ENGINE      (I915_WEDGED - I915_NUM_ENGINES)
>  
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 5aa7ca1ab592..4762f158032d 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -3471,10 +3471,9 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>             !gpu_reset_clobbers_display(dev_priv))
>                 return;
>  
> -       /* We have a modeset vs reset deadlock, defensively unbreak it.
> -        *
> -        * FIXME: We can do a _lot_ better, this is just a first iteration.*/
> -       i915_gem_set_wedged(dev_priv);
> +       /* We have a modeset vs reset deadlock, defensively unbreak it. */
> +       set_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
> +       wake_up_all(&dev_priv->gpu_error.wait_queue);

How are we breaking the

	modeset_lock -> struct_mutex -> wait_on_reset ?

We wait on the modeset_lock next, which stops the reset from
proceeding, and so the deadlock persists until the wedge-me timeout?
-Chris
Daniel Vetter - July 19, 2017, 2:05 p.m.
On Wed, Jul 19, 2017 at 3:42 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> Quoting Daniel Vetter (2017-07-19 13:54:58)
>> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
>> index 5aa7ca1ab592..4762f158032d 100644
>> --- a/drivers/gpu/drm/i915/intel_display.c
>> +++ b/drivers/gpu/drm/i915/intel_display.c
>> @@ -3471,10 +3471,9 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>>             !gpu_reset_clobbers_display(dev_priv))
>>                 return;
>>
>> -       /* We have a modeset vs reset deadlock, defensively unbreak it.
>> -        *
>> -        * FIXME: We can do a _lot_ better, this is just a first iteration.*/
>> -       i915_gem_set_wedged(dev_priv);
>> +       /* We have a modeset vs reset deadlock, defensively unbreak it. */
>> +       set_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
>> +       wake_up_all(&dev_priv->gpu_error.wait_queue);
>
> How are we breaking the
>
>         modeset_lock -> struct_mutex -> wait_on_reset ?
>
> We wait the modeset_lock next which stops the reset from
> proceeding, and so the deadlock persists until the wedge-me timeout?

Hm indeed, I didn't check my logs carefully enough and there's still
"i915_reset_device timed out" in them. But I also thought the only real
wait we have left for the gpu is the one under i915_sw_fence. I think
we could simply switch the i915_mutex_lock_interruptible calls in atomic
modeset over to mutex_lock_interruptible? Or is there another can of
worms I'm missing?
-Daniel
Daniel Vetter - July 19, 2017, 2:11 p.m.
On Wed, Jul 19, 2017 at 4:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Wed, Jul 19, 2017 at 3:42 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> Quoting Daniel Vetter (2017-07-19 13:54:58)
>>> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
>>> index 5aa7ca1ab592..4762f158032d 100644
>>> --- a/drivers/gpu/drm/i915/intel_display.c
>>> +++ b/drivers/gpu/drm/i915/intel_display.c
>>> @@ -3471,10 +3471,9 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>>>             !gpu_reset_clobbers_display(dev_priv))
>>>                 return;
>>>
>>> -       /* We have a modeset vs reset deadlock, defensively unbreak it.
>>> -        *
>>> -        * FIXME: We can do a _lot_ better, this is just a first iteration.*/
>>> -       i915_gem_set_wedged(dev_priv);
>>> +       /* We have a modeset vs reset deadlock, defensively unbreak it. */
>>> +       set_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
>>> +       wake_up_all(&dev_priv->gpu_error.wait_queue);
>>
>> How are we breaking the
>>
>>         modeset_lock -> struct_mutex -> wait_on_reset ?
>>
>> We wait the modeset_lock next which stops the reset from
>> proceeding, and so the deadlock persists until the wedge-me timeout?
>
> Hm indeed, I didn't check my logs carefully enough and there's still
> "i915_reset_device timed out" in it. But I also thought the only real
> wait we have left for the gpu is the one under i915_sw_fence. I think
> we could simply switch i915_mutex_lock_interruptible calls in atomic
> modeset over mutex_lock_interruptible? Or is there another can of
> worms I'm missing?

Obviously, because that's what we're doing already. But I don't have
the DRM_ERROR from i915_gem_wait_for_error anywhere in my logs either,
so not clear what exactly is going on ...
-Daniel

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 07e98b07c5bc..369968539b40 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1564,6 +1564,7 @@  struct i915_gpu_error {
 	unsigned long flags;
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_HANDOFF	1
+#define I915_RESET_MODESET	2
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 #define I915_RESET_ENGINE	(I915_WEDGED - I915_NUM_ENGINES)
 
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 5aa7ca1ab592..4762f158032d 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3471,10 +3471,9 @@  void intel_prepare_reset(struct drm_i915_private *dev_priv)
 	    !gpu_reset_clobbers_display(dev_priv))
 		return;
 
-	/* We have a modeset vs reset deadlock, defensively unbreak it.
-	 *
-	 * FIXME: We can do a _lot_ better, this is just a first iteration.*/
-	i915_gem_set_wedged(dev_priv);
+	/* We have a modeset vs reset deadlock, defensively unbreak it. */
+	set_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
+	wake_up_all(&dev_priv->gpu_error.wait_queue);
 
 	/*
 	 * Need mode_config.mutex so that we don't
@@ -3569,6 +3568,8 @@  void intel_finish_reset(struct drm_i915_private *dev_priv)
 	drm_modeset_drop_locks(ctx);
 	drm_modeset_acquire_fini(ctx);
 	mutex_unlock(&dev->mode_config.mutex);
+
+	clear_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
 }
 
 static bool abort_flip_on_reset(struct intel_crtc *crtc)
@@ -12384,6 +12385,30 @@  static void intel_atomic_helper_free_state_worker(struct work_struct *work)
 	intel_atomic_helper_free_state(dev_priv);
 }
 
+static void intel_atomic_commit_fence_wait(struct intel_atomic_state *intel_state)
+{
+	struct wait_queue_entry wait_fence, wait_reset;
+	struct drm_i915_private *dev_priv = to_i915(intel_state->base.dev);
+
+	init_wait_entry(&wait_fence, 0);
+	init_wait_entry(&wait_reset, 0);
+	for (;;) {
+		prepare_to_wait(&intel_state->commit_ready.wait,
+				&wait_fence, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(&dev_priv->gpu_error.wait_queue,
+				&wait_reset, TASK_UNINTERRUPTIBLE);
+
+
+		if (i915_sw_fence_done(&intel_state->commit_ready)
+		    || test_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags))
+			break;
+
+		schedule();
+	}
+	finish_wait(&intel_state->commit_ready.wait, &wait_fence);
+	finish_wait(&dev_priv->gpu_error.wait_queue, &wait_reset);
+}
+
 static void intel_atomic_commit_tail(struct drm_atomic_state *state)
 {
 	struct drm_device *dev = state->dev;
@@ -12397,7 +12422,7 @@  static void intel_atomic_commit_tail(struct drm_atomic_state *state)
 	unsigned crtc_vblank_mask = 0;
 	int i;
 
-	i915_sw_fence_wait(&intel_state->commit_ready);
+	intel_atomic_commit_fence_wait(intel_state);
 
 	drm_atomic_helper_wait_for_dependencies(state);