Message ID | 20170720175754.30751-2-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Quoting Daniel Vetter (2017-07-20 18:57:48) > ... using the biggest hammer we have. This is essentially a weaponized > version of the timeout-based wedging Chris added in > > commit 36703e79a982c8ce5a8e43833291f2719e92d0d1 > Author: Chris Wilson <chris@chris-wilson.co.uk> > Date: Thu Jun 22 11:56:25 2017 +0100 > > drm/i915: Break modeset deadlocks on reset > > Because defense-in-depth is good it's good to still have both. Also > note that with the locking change we can now restrict this a lot (old > gpus and special testing only), so this doesn't kill the TDR benefits > on at least anything remotely modern. > > And futuremore with a few tricks it should be possible to make a much > more educated guess about whether an atomic commit is stuck waiting on > the gpu (atomic_t counting the pending i915_sw_fence used by the > atomic modeset code should do it), so we can improve this. > > But for now just start with something that is guaranteed to recover > faster, for much better CI througput. > > This defacto reverts TDR on these platforms, but there's not really a > single commit to specify as the sole offender. > > v2: Add a debug message to explain what's going on. We can't DRM_ERROR > because that spams CI. And the timeout based fallback still prints a > DRM_ERROR, in case something goes wrong. > > Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion") > Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter") > Cc: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Mika Kuoppala <mika.kuoppala@intel.com> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> > --- > drivers/gpu/drm/i915/intel_display.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c > index 02b1f4966049..995522e40ec1 100644 > --- a/drivers/gpu/drm/i915/intel_display.c > +++ b/drivers/gpu/drm/i915/intel_display.c > @@ -3471,6 +3471,12 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv) > !gpu_reset_clobbers_display(dev_priv)) > return; > > + /* We have a modeset vs reset deadlock, defensively unbreak it. > + * > + * FIXME: We can do a _lot_ better, this is just a first iteration.*/ > + i915_gem_set_wedged(dev_priv); > + DRM_DEBUG_DRIVER("Wedging GPU to avoid deadlocks with pending modeset updates\n"); I meant this a dev_err(). It has user visible impact and user data loss. -Chris
On Thu, Jul 20, 2017 at 9:47 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote: >> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c >> index 02b1f4966049..995522e40ec1 100644 >> --- a/drivers/gpu/drm/i915/intel_display.c >> +++ b/drivers/gpu/drm/i915/intel_display.c >> @@ -3471,6 +3471,12 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv) >> !gpu_reset_clobbers_display(dev_priv)) >> return; >> >> + /* We have a modeset vs reset deadlock, defensively unbreak it. >> + * >> + * FIXME: We can do a _lot_ better, this is just a first iteration.*/ >> + i915_gem_set_wedged(dev_priv); >> + DRM_DEBUG_DRIVER("Wedging GPU to avoid deadlocks with pending modeset updates\n"); > > I meant this a dev_err(). It has user visible impact and user data loss. stop_rings is gone so I can't differentiate, and DRM_ERROR breaks CI. Which after your timeout is the only point of this patch really. I guess I could throw a pr_notice or so in there, but we already do that. This was all removed in commit 7b4d3a16dd97be0ebc793ea046b9af9d5c9b1b1a Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Jul 4 08:08:37 2016 +0100 drm/i915: Remove stop-rings debugfs interface Should I explain this better in the commit message? I tried already ... -Daniel
Quoting Daniel Vetter (2017-07-20 21:04:50) > On Thu, Jul 20, 2017 at 9:47 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote: > >> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c > >> index 02b1f4966049..995522e40ec1 100644 > >> --- a/drivers/gpu/drm/i915/intel_display.c > >> +++ b/drivers/gpu/drm/i915/intel_display.c > >> @@ -3471,6 +3471,12 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv) > >> !gpu_reset_clobbers_display(dev_priv)) > >> return; > >> > >> + /* We have a modeset vs reset deadlock, defensively unbreak it. > >> + * > >> + * FIXME: We can do a _lot_ better, this is just a first iteration.*/ > >> + i915_gem_set_wedged(dev_priv); > >> + DRM_DEBUG_DRIVER("Wedging GPU to avoid deadlocks with pending modeset updates\n"); > > > > I meant this a dev_err(). It has user visible impact and user data loss. > > stop_rings is gone so I can't differentiate, and DRM_ERROR breaks CI. > Which after your timeout is the only point of this patch really. So drop this patch. If the only reason is to sweep the bug under the carpet, don't. I thought this patch was to move the wedge to where you planned to replace it with the refined approach to abort the modeset. > guess I could throw a pr_notice or so in there, but we already do > that. This was all removed in > > commit 7b4d3a16dd97be0ebc793ea046b9af9d5c9b1b1a > Author: Chris Wilson <chris@chris-wilson.co.uk> > Date: Mon Jul 4 08:08:37 2016 +0100 > > drm/i915: Remove stop-rings debugfs interface > > Should I explain this better in the commit message? I tried already ... No idea what you mean. If you mean you want to know whether this error was simulated or not, look in the error state. But the point of this series is to avoid the dev_err() in the first place, in which case why do we care whether it was simulated or not, if we hit the err it is a bug and CI should be flagging it. -Chris
On Thu, Jul 20, 2017 at 10:16 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote: > Quoting Daniel Vetter (2017-07-20 21:04:50) >> On Thu, Jul 20, 2017 at 9:47 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote: >> >> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c >> >> index 02b1f4966049..995522e40ec1 100644 >> >> --- a/drivers/gpu/drm/i915/intel_display.c >> >> +++ b/drivers/gpu/drm/i915/intel_display.c >> >> @@ -3471,6 +3471,12 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv) >> >> !gpu_reset_clobbers_display(dev_priv)) >> >> return; >> >> >> >> + /* We have a modeset vs reset deadlock, defensively unbreak it. >> >> + * >> >> + * FIXME: We can do a _lot_ better, this is just a first iteration.*/ >> >> + i915_gem_set_wedged(dev_priv); >> >> + DRM_DEBUG_DRIVER("Wedging GPU to avoid deadlocks with pending modeset updates\n"); >> > >> > I meant this a dev_err(). It has user visible impact and user data loss. >> >> stop_rings is gone so I can't differentiate, and DRM_ERROR breaks CI. >> Which after your timeout is the only point of this patch really. > > So drop this patch. If the only reason is to sweep the bug under the > carpet, don't. I thought this patch was to move the wedge to where you > planned to replace it with the refined approach to abort the modeset. It fixes a regression we have in CI, because the dmesg-warn can hide other stuff. If you want, resurrect some better way to handle that, meanwhile this here gets the job done. Since the latter patches don't work I'm proposing just this one here for merging. -Daniel
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index 02b1f4966049..995522e40ec1 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -3471,6 +3471,12 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv) !gpu_reset_clobbers_display(dev_priv)) return; + /* We have a modeset vs reset deadlock, defensively unbreak it. + * + * FIXME: We can do a _lot_ better, this is just a first iteration.*/ + i915_gem_set_wedged(dev_priv); + DRM_DEBUG_DRIVER("Wedging GPU to avoid deadlocks with pending modeset updates\n"); + /* * Need mode_config.mutex so that we don't * trample ongoing ->detect() and whatnot.
... using the biggest hammer we have. This is essentially a weaponized version of the timeout-based wedging Chris added in commit 36703e79a982c8ce5a8e43833291f2719e92d0d1 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Thu Jun 22 11:56:25 2017 +0100 drm/i915: Break modeset deadlocks on reset Because defense-in-depth is good it's good to still have both. Also note that with the locking change we can now restrict this a lot (old gpus and special testing only), so this doesn't kill the TDR benefits on at least anything remotely modern. And futuremore with a few tricks it should be possible to make a much more educated guess about whether an atomic commit is stuck waiting on the gpu (atomic_t counting the pending i915_sw_fence used by the atomic modeset code should do it), so we can improve this. But for now just start with something that is guaranteed to recover faster, for much better CI througput. This defacto reverts TDR on these platforms, but there's not really a single commit to specify as the sole offender. v2: Add a debug message to explain what's going on. We can't DRM_ERROR because that spams CI. And the timeout based fallback still prints a DRM_ERROR, in case something goes wrong. Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion") Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter") Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Mika Kuoppala <mika.kuoppala@intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> --- drivers/gpu/drm/i915/intel_display.c | 6 ++++++ 1 file changed, 6 insertions(+)