[3/9] drm/i915: Avoid the gpu reset vs. modeset deadlock
diff mbox

Message ID 20170719125502.25696-4-daniel.vetter@ffwll.ch
State New
Headers show

Commit Message

Daniel Vetter July 19, 2017, 12:54 p.m. UTC
... using the biggest hammer we have. This is essentially a weaponized
version of the timeout-based wedging Chris added in

commit 36703e79a982c8ce5a8e43833291f2719e92d0d1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jun 22 11:56:25 2017 +0100

    drm/i915: Break modeset deadlocks on reset

Because defense-in-depth is good it's good to still have both. Also
note that with the locking change we can now restrict this a lot (old
gpus and special testing only), so this doesn't kill the TDR benefits
on at least anything remotely modern.

And futuremore with a few tricks it should be possible to make a much
more educated guess about whether an atomic commit is stuck waiting on
the gpu (atomic_t counting the pending i915_sw_fence used by the
atomic modeset code should do it), so we can improve this.

But for now just start with something that is guaranteed to recover
faster, for much better CI througput.

This defacto reverts TDR on these platforms, but there's not really a
single commit to specify as the sole offender.

Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/i915/intel_display.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Chris Wilson July 19, 2017, 1:32 p.m. UTC | #1
Quoting Daniel Vetter (2017-07-19 13:54:56)
> ... using the biggest hammer we have. This is essentially a weaponized
> version of the timeout-based wedging Chris added in
> 
> commit 36703e79a982c8ce5a8e43833291f2719e92d0d1
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Jun 22 11:56:25 2017 +0100
> 
>     drm/i915: Break modeset deadlocks on reset
> 
> Because defense-in-depth is good it's good to still have both. Also
> note that with the locking change we can now restrict this a lot (old
> gpus and special testing only), so this doesn't kill the TDR benefits
> on at least anything remotely modern.
> 
> And futuremore with a few tricks it should be possible to make a much
> more educated guess about whether an atomic commit is stuck waiting on
> the gpu (atomic_t counting the pending i915_sw_fence used by the
> atomic modeset code should do it), so we can improve this.
> 
> But for now just start with something that is guaranteed to recover
> faster, for much better CI througput.
> 
> This defacto reverts TDR on these platforms, but there's not really a
> single commit to specify as the sole offender.
> 
> Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
> Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_display.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 97777ffa1566..010a1f3e000c 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -3471,6 +3471,11 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>             !gpu_reset_clobbers_display(dev_priv))
>                 return;
>  
> +       /* We have a modeset vs reset deadlock, defensively unbreak it.
> +        *
> +        * FIXME: We can do a _lot_ better, this is just a first iteration.*/

You should keep the error message for wedging the device. If all goes
well it is removed in a few patches, so shouldn't be an eyesore for
long.
-Chris
Daniel Vetter July 19, 2017, 1:44 p.m. UTC | #2
On Wed, Jul 19, 2017 at 3:32 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> Quoting Daniel Vetter (2017-07-19 13:54:56)
>> ... using the biggest hammer we have. This is essentially a weaponized
>> version of the timeout-based wedging Chris added in
>>
>> commit 36703e79a982c8ce5a8e43833291f2719e92d0d1
>> Author: Chris Wilson <chris@chris-wilson.co.uk>
>> Date:   Thu Jun 22 11:56:25 2017 +0100
>>
>>     drm/i915: Break modeset deadlocks on reset
>>
>> Because defense-in-depth is good it's good to still have both. Also
>> note that with the locking change we can now restrict this a lot (old
>> gpus and special testing only), so this doesn't kill the TDR benefits
>> on at least anything remotely modern.
>>
>> And futuremore with a few tricks it should be possible to make a much
>> more educated guess about whether an atomic commit is stuck waiting on
>> the gpu (atomic_t counting the pending i915_sw_fence used by the
>> atomic modeset code should do it), so we can improve this.
>>
>> But for now just start with something that is guaranteed to recover
>> faster, for much better CI througput.
>>
>> This defacto reverts TDR on these platforms, but there's not really a
>> single commit to specify as the sole offender.
>>
>> Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
>> Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>> ---
>>  drivers/gpu/drm/i915/intel_display.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
>> index 97777ffa1566..010a1f3e000c 100644
>> --- a/drivers/gpu/drm/i915/intel_display.c
>> +++ b/drivers/gpu/drm/i915/intel_display.c
>> @@ -3471,6 +3471,11 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>>             !gpu_reset_clobbers_display(dev_priv))
>>                 return;
>>
>> +       /* We have a modeset vs reset deadlock, defensively unbreak it.
>> +        *
>> +        * FIXME: We can do a _lot_ better, this is just a first iteration.*/
>
> You should keep the error message for wedging the device. If all goes
> well it is removed in a few patches, so shouldn't be an eyesore for
> long.

Yeah makes sense, just in case the next patches need to be reverted
for some reasons. That's why I split it ou. Something like

DRM_ERROR("Wedging gpu to unblock modesets\n");

and then remove that again 2 patches later?
-Daniel
Daniel Vetter July 19, 2017, 6:44 p.m. UTC | #3
On Wed, Jul 19, 2017 at 3:44 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Wed, Jul 19, 2017 at 3:32 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> Quoting Daniel Vetter (2017-07-19 13:54:56)
>>> ... using the biggest hammer we have. This is essentially a weaponized
>>> version of the timeout-based wedging Chris added in
>>>
>>> commit 36703e79a982c8ce5a8e43833291f2719e92d0d1
>>> Author: Chris Wilson <chris@chris-wilson.co.uk>
>>> Date:   Thu Jun 22 11:56:25 2017 +0100
>>>
>>>     drm/i915: Break modeset deadlocks on reset
>>>
>>> Because defense-in-depth is good it's good to still have both. Also
>>> note that with the locking change we can now restrict this a lot (old
>>> gpus and special testing only), so this doesn't kill the TDR benefits
>>> on at least anything remotely modern.
>>>
>>> And futuremore with a few tricks it should be possible to make a much
>>> more educated guess about whether an atomic commit is stuck waiting on
>>> the gpu (atomic_t counting the pending i915_sw_fence used by the
>>> atomic modeset code should do it), so we can improve this.
>>>
>>> But for now just start with something that is guaranteed to recover
>>> faster, for much better CI througput.
>>>
>>> This defacto reverts TDR on these platforms, but there's not really a
>>> single commit to specify as the sole offender.
>>>
>>> Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
>>> Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>> ---
>>>  drivers/gpu/drm/i915/intel_display.c | 5 +++++
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
>>> index 97777ffa1566..010a1f3e000c 100644
>>> --- a/drivers/gpu/drm/i915/intel_display.c
>>> +++ b/drivers/gpu/drm/i915/intel_display.c
>>> @@ -3471,6 +3471,11 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>>>             !gpu_reset_clobbers_display(dev_priv))
>>>                 return;
>>>
>>> +       /* We have a modeset vs reset deadlock, defensively unbreak it.
>>> +        *
>>> +        * FIXME: We can do a _lot_ better, this is just a first iteration.*/
>>
>> You should keep the error message for wedging the device. If all goes
>> well it is removed in a few patches, so shouldn't be an eyesore for
>> long.
>
> Yeah makes sense, just in case the next patches need to be reverted
> for some reasons. That's why I split it ou. Something like
>
> DRM_ERROR("Wedging gpu to unblock modesets\n");
>
> and then remove that again 2 patches later?

After thinking about this a bit more, s/DRM_ERROR/DRM_DEBUG_KMS/ or
so. We know we can deadlock here, complaining about that only spams
dmesg and results in noise in CI, hiding worse stuff. The timeout
based wedging as fallback should have a DRM_ERROR since it's
unexpected, this one here imo sholdn't. And with the follow-up patches
it won't be unconditional anymore (if we don't have to revert them
again).
-Daniel

Patch
diff mbox

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 97777ffa1566..010a1f3e000c 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3471,6 +3471,11 @@  void intel_prepare_reset(struct drm_i915_private *dev_priv)
 	    !gpu_reset_clobbers_display(dev_priv))
 		return;
 
+	/* We have a modeset vs reset deadlock, defensively unbreak it.
+	 *
+	 * FIXME: We can do a _lot_ better, this is just a first iteration.*/
+	i915_gem_set_wedged(dev_priv);
+
 	/*
 	 * Need mode_config.mutex so that we don't
 	 * trample ongoing ->detect() and whatnot.