diff mbox

[1/7] drm/i915: Avoid the gpu reset vs. modeset deadlock

Message ID 20170720175754.30751-2-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Headers show

Commit Message

Daniel Vetter July 20, 2017, 5:57 p.m. UTC
... using the biggest hammer we have. This is essentially a weaponized
version of the timeout-based wedging Chris added in

commit 36703e79a982c8ce5a8e43833291f2719e92d0d1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jun 22 11:56:25 2017 +0100

    drm/i915: Break modeset deadlocks on reset

Because defense-in-depth is good it's good to still have both. Also
note that with the locking change we can now restrict this a lot (old
gpus and special testing only), so this doesn't kill the TDR benefits
on at least anything remotely modern.

And futuremore with a few tricks it should be possible to make a much
more educated guess about whether an atomic commit is stuck waiting on
the gpu (atomic_t counting the pending i915_sw_fence used by the
atomic modeset code should do it), so we can improve this.

But for now just start with something that is guaranteed to recover
faster, for much better CI througput.

This defacto reverts TDR on these platforms, but there's not really a
single commit to specify as the sole offender.

v2: Add a debug message to explain what's going on. We can't DRM_ERROR
because that spams CI. And the timeout based fallback still prints a
DRM_ERROR, in case something goes wrong.

Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/i915/intel_display.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Chris Wilson July 20, 2017, 7:47 p.m. UTC | #1
Quoting Daniel Vetter (2017-07-20 18:57:48)
> ... using the biggest hammer we have. This is essentially a weaponized
> version of the timeout-based wedging Chris added in
> 
> commit 36703e79a982c8ce5a8e43833291f2719e92d0d1
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Jun 22 11:56:25 2017 +0100
> 
>     drm/i915: Break modeset deadlocks on reset
> 
> Because defense-in-depth is good it's good to still have both. Also
> note that with the locking change we can now restrict this a lot (old
> gpus and special testing only), so this doesn't kill the TDR benefits
> on at least anything remotely modern.
> 
> And futuremore with a few tricks it should be possible to make a much
> more educated guess about whether an atomic commit is stuck waiting on
> the gpu (atomic_t counting the pending i915_sw_fence used by the
> atomic modeset code should do it), so we can improve this.
> 
> But for now just start with something that is guaranteed to recover
> faster, for much better CI througput.
> 
> This defacto reverts TDR on these platforms, but there's not really a
> single commit to specify as the sole offender.
> 
> v2: Add a debug message to explain what's going on. We can't DRM_ERROR
> because that spams CI. And the timeout based fallback still prints a
> DRM_ERROR, in case something goes wrong.
> 
> Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
> Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_display.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 02b1f4966049..995522e40ec1 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -3471,6 +3471,12 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>             !gpu_reset_clobbers_display(dev_priv))
>                 return;
>  
> +       /* We have a modeset vs reset deadlock, defensively unbreak it.
> +        *
> +        * FIXME: We can do a _lot_ better, this is just a first iteration.*/
> +       i915_gem_set_wedged(dev_priv);
> +       DRM_DEBUG_DRIVER("Wedging GPU to avoid deadlocks with pending modeset updates\n");

I meant this a dev_err(). It has user visible impact and user data loss.
-Chris
Daniel Vetter July 20, 2017, 8:04 p.m. UTC | #2
On Thu, Jul 20, 2017 at 9:47 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
>> index 02b1f4966049..995522e40ec1 100644
>> --- a/drivers/gpu/drm/i915/intel_display.c
>> +++ b/drivers/gpu/drm/i915/intel_display.c
>> @@ -3471,6 +3471,12 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>>             !gpu_reset_clobbers_display(dev_priv))
>>                 return;
>>
>> +       /* We have a modeset vs reset deadlock, defensively unbreak it.
>> +        *
>> +        * FIXME: We can do a _lot_ better, this is just a first iteration.*/
>> +       i915_gem_set_wedged(dev_priv);
>> +       DRM_DEBUG_DRIVER("Wedging GPU to avoid deadlocks with pending modeset updates\n");
>
> I meant this a dev_err(). It has user visible impact and user data loss.

stop_rings is gone so I can't differentiate, and DRM_ERROR breaks CI.
Which after your timeout is the only point of this patch really. I
guess I could throw a pr_notice or so in there, but we already do
that. This was all removed in

commit 7b4d3a16dd97be0ebc793ea046b9af9d5c9b1b1a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 4 08:08:37 2016 +0100

    drm/i915: Remove stop-rings debugfs interface

Should I explain this better in the commit message? I tried already ...
-Daniel
Chris Wilson July 20, 2017, 8:16 p.m. UTC | #3
Quoting Daniel Vetter (2017-07-20 21:04:50)
> On Thu, Jul 20, 2017 at 9:47 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> >> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> >> index 02b1f4966049..995522e40ec1 100644
> >> --- a/drivers/gpu/drm/i915/intel_display.c
> >> +++ b/drivers/gpu/drm/i915/intel_display.c
> >> @@ -3471,6 +3471,12 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
> >>             !gpu_reset_clobbers_display(dev_priv))
> >>                 return;
> >>
> >> +       /* We have a modeset vs reset deadlock, defensively unbreak it.
> >> +        *
> >> +        * FIXME: We can do a _lot_ better, this is just a first iteration.*/
> >> +       i915_gem_set_wedged(dev_priv);
> >> +       DRM_DEBUG_DRIVER("Wedging GPU to avoid deadlocks with pending modeset updates\n");
> >
> > I meant this a dev_err(). It has user visible impact and user data loss.
> 
> stop_rings is gone so I can't differentiate, and DRM_ERROR breaks CI.
> Which after your timeout is the only point of this patch really.

So drop this patch. If the only reason is to sweep the bug under the
carpet, don't. I thought this patch was to move the wedge to where you
planned to replace it with the refined approach to abort the modeset.	

> guess I could throw a pr_notice or so in there, but we already do
> that. This was all removed in
> 
> commit 7b4d3a16dd97be0ebc793ea046b9af9d5c9b1b1a
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Mon Jul 4 08:08:37 2016 +0100
> 
>     drm/i915: Remove stop-rings debugfs interface
> 
> Should I explain this better in the commit message? I tried already ...

No idea what you mean. If you mean you want to know whether this error
was simulated or not, look in the error state. But the point of this
series is to avoid the dev_err() in the first place, in which case why
do we care whether it was simulated or not, if we hit the err it is a
bug and CI should be flagging it.
-Chris
Daniel Vetter July 20, 2017, 8:18 p.m. UTC | #4
On Thu, Jul 20, 2017 at 10:16 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> Quoting Daniel Vetter (2017-07-20 21:04:50)
>> On Thu, Jul 20, 2017 at 9:47 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> >> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
>> >> index 02b1f4966049..995522e40ec1 100644
>> >> --- a/drivers/gpu/drm/i915/intel_display.c
>> >> +++ b/drivers/gpu/drm/i915/intel_display.c
>> >> @@ -3471,6 +3471,12 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>> >>             !gpu_reset_clobbers_display(dev_priv))
>> >>                 return;
>> >>
>> >> +       /* We have a modeset vs reset deadlock, defensively unbreak it.
>> >> +        *
>> >> +        * FIXME: We can do a _lot_ better, this is just a first iteration.*/
>> >> +       i915_gem_set_wedged(dev_priv);
>> >> +       DRM_DEBUG_DRIVER("Wedging GPU to avoid deadlocks with pending modeset updates\n");
>> >
>> > I meant this a dev_err(). It has user visible impact and user data loss.
>>
>> stop_rings is gone so I can't differentiate, and DRM_ERROR breaks CI.
>> Which after your timeout is the only point of this patch really.
>
> So drop this patch. If the only reason is to sweep the bug under the
> carpet, don't. I thought this patch was to move the wedge to where you
> planned to replace it with the refined approach to abort the modeset.

It fixes a regression we have in CI, because the dmesg-warn can hide
other stuff. If you want, resurrect some better way to handle that,
meanwhile this here gets the job done. Since the latter patches don't
work I'm proposing just this one here for merging.
-Daniel
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 02b1f4966049..995522e40ec1 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3471,6 +3471,12 @@  void intel_prepare_reset(struct drm_i915_private *dev_priv)
 	    !gpu_reset_clobbers_display(dev_priv))
 		return;
 
+	/* We have a modeset vs reset deadlock, defensively unbreak it.
+	 *
+	 * FIXME: We can do a _lot_ better, this is just a first iteration.*/
+	i915_gem_set_wedged(dev_priv);
+	DRM_DEBUG_DRIVER("Wedging GPU to avoid deadlocks with pending modeset updates\n");
+
 	/*
 	 * Need mode_config.mutex so that we don't
 	 * trample ongoing ->detect() and whatnot.