Message ID | 20170719125502.25696-6-daniel.vetter@ffwll.ch (mailing list archive) |
---|---|
State | New, archived |
Quoting Daniel Vetter (2017-07-19 13:54:58)
> There's no reason to entirely wedge the gpu, for the minimal deadlock
> bugfix we only need to unbreak/decouple the atomic commit from the gpu
> reset. The simplest way to fix that is by replacing the unconditional
> fence wait at the top of commit_tail by a wait which completes either
> when the fences are done (normal case, or when a reset doesn't need to
> touch the display state), or when the gpu reset needs to force-unblock
> all pending modeset states.
>
> Note that in both cases TDR itself keeps working, so from a userspace
> pov this trickery isn't observable. Users themselves might spot a short
> glitch while the rendering is catching up again, but that's still
> better than pre-TDR where we've thrown away all the rendering,
> including innocent batches. Also, this fixes the regression TDR
> introduced of making gpu resets deadlock-prone when we do need to
> touch the display.
>
> One thing I noticed is that gpu_error.flags seems to use both our own
> wait-queue in gpu_error.wait_queue, and the generic wait_on_bit
> facilities. Not entirely sure why this inconsistency exists, I just
> picked one style.
>
> A possible future avenue could be to insert the gpu reset in-between
> ongoing modeset changes, which would avoid the momentary glitch. But
> that's a lot more work to implement in the atomic commit machinery,
> and given that we only need this for pre-g4x hw, it's of questionable
> utility just for the sake of polishing gpu reset even more on those
> old boxes. It might be useful for other features though.
>
> v2: Rebase onto 4.13 with a s/wait_queue_t/struct wait_queue_entry/.
>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.h      |  1 +
>  drivers/gpu/drm/i915/intel_display.c | 35 ++++++++++++++++++++++++++++++-----
>  2 files changed, 31 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
> index 07e98b07c5bc..369968539b40 100644
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -1564,6 +1564,7 @@ struct i915_gpu_error {
>  	unsigned long flags;
>  #define I915_RESET_BACKOFF	0
>  #define I915_RESET_HANDOFF	1
> +#define I915_RESET_MODESET	2
>  #define I915_WEDGED		(BITS_PER_LONG - 1)
>  #define I915_RESET_ENGINE	(I915_WEDGED - I915_NUM_ENGINES)
>
> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
> index 5aa7ca1ab592..4762f158032d 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -3471,10 +3471,9 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>  	    !gpu_reset_clobbers_display(dev_priv))
>  		return;
>
> -	/* We have a modeset vs reset deadlock, defensively unbreak it.
> -	 *
> -	 * FIXME: We can do a _lot_ better, this is just a first iteration.*/
> -	i915_gem_set_wedged(dev_priv);
> +	/* We have a modeset vs reset deadlock, defensively unbreak it. */
> +	set_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
> +	wake_up_all(&dev_priv->gpu_error.wait_queue);

How are we breaking the

	modeset_lock -> struct_mutex -> wait_on_reset ?

We wait on the modeset_lock next, which stops the reset from
proceeding, and so the deadlock persists until the wedge-me timeout?
-Chris
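For reference, the lock/wait cycle being described is roughly the following (an illustrative sketch reconstructed from the discussion, not code taken from the driver):

	/*
	 * atomic commit path                    gpu reset path
	 * ------------------                    --------------
	 * drm_modeset_lock(...)                 set_bit(I915_RESET_BACKOFF)
	 * i915_mutex_lock_interruptible()
	 *   -> i915_gem_wait_for_error()        intel_prepare_reset() wants the
	 *      blocks until the reset is done   modeset locks held on the left,
	 *                                       so neither side can make progress
	 */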
On Wed, Jul 19, 2017 at 3:42 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> Quoting Daniel Vetter (2017-07-19 13:54:58)
>> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
>> index 5aa7ca1ab592..4762f158032d 100644
>> --- a/drivers/gpu/drm/i915/intel_display.c
>> +++ b/drivers/gpu/drm/i915/intel_display.c
>> @@ -3471,10 +3471,9 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>>  	    !gpu_reset_clobbers_display(dev_priv))
>>  		return;
>>
>> -	/* We have a modeset vs reset deadlock, defensively unbreak it.
>> -	 *
>> -	 * FIXME: We can do a _lot_ better, this is just a first iteration.*/
>> -	i915_gem_set_wedged(dev_priv);
>> +	/* We have a modeset vs reset deadlock, defensively unbreak it. */
>> +	set_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
>> +	wake_up_all(&dev_priv->gpu_error.wait_queue);
>
> How are we breaking the
>
> 	modeset_lock -> struct_mutex -> wait_on_reset ?
>
> We wait on the modeset_lock next, which stops the reset from
> proceeding, and so the deadlock persists until the wedge-me timeout?

Hm indeed, I didn't check my logs carefully enough and there's still
"i915_reset_device timed out" in it. But I also thought the only real
wait we have left for the gpu is the one under i915_sw_fence. I think
we could simply switch the i915_mutex_lock_interruptible calls in the
atomic modeset paths over to mutex_lock_interruptible? Or is there
another can of worms I'm missing?
-Daniel
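To make the suggestion concrete: i915_mutex_lock_interruptible() at the time roughly did an i915_gem_wait_for_error() before taking struct_mutex, so the proposed switch amounts to taking the lock directly. A simplified, hypothetical fragment (not part of the posted patch; dev/ret come from the surrounding caller):

	/* current style (simplified): blocks behind a pending gpu reset
	 * before acquiring struct_mutex */
	ret = i915_mutex_lock_interruptible(dev);
	if (ret)
		return ret;

	/* proposed style (sketch): take struct_mutex directly, without
	 * waiting for the reset to complete first */
	ret = mutex_lock_interruptible(&dev->struct_mutex);
	if (ret)
		return ret;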
On Wed, Jul 19, 2017 at 4:05 PM, Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Wed, Jul 19, 2017 at 3:42 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> Quoting Daniel Vetter (2017-07-19 13:54:58)
>>> diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
>>> index 5aa7ca1ab592..4762f158032d 100644
>>> --- a/drivers/gpu/drm/i915/intel_display.c
>>> +++ b/drivers/gpu/drm/i915/intel_display.c
>>> @@ -3471,10 +3471,9 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
>>>  	    !gpu_reset_clobbers_display(dev_priv))
>>>  		return;
>>>
>>> -	/* We have a modeset vs reset deadlock, defensively unbreak it.
>>> -	 *
>>> -	 * FIXME: We can do a _lot_ better, this is just a first iteration.*/
>>> -	i915_gem_set_wedged(dev_priv);
>>> +	/* We have a modeset vs reset deadlock, defensively unbreak it. */
>>> +	set_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
>>> +	wake_up_all(&dev_priv->gpu_error.wait_queue);
>>
>> How are we breaking the
>>
>> 	modeset_lock -> struct_mutex -> wait_on_reset ?
>>
>> We wait on the modeset_lock next, which stops the reset from
>> proceeding, and so the deadlock persists until the wedge-me timeout?
>
> Hm indeed, I didn't check my logs carefully enough and there's still
> "i915_reset_device timed out" in it. But I also thought the only real
> wait we have left for the gpu is the one under i915_sw_fence. I think
> we could simply switch the i915_mutex_lock_interruptible calls in the
> atomic modeset paths over to mutex_lock_interruptible? Or is there
> another can of worms I'm missing?

Obviously, because that's what we're doing already. But I don't have
the DRM_ERROR from i915_gem_wait_for_error anywhere in my logs either,
so it's not clear what exactly is going on ...
-Daniel
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 07e98b07c5bc..369968539b40 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1564,6 +1564,7 @@ struct i915_gpu_error {
 	unsigned long flags;
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_HANDOFF	1
+#define I915_RESET_MODESET	2
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 #define I915_RESET_ENGINE	(I915_WEDGED - I915_NUM_ENGINES)
 
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 5aa7ca1ab592..4762f158032d 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3471,10 +3471,9 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv)
 	    !gpu_reset_clobbers_display(dev_priv))
 		return;
 
-	/* We have a modeset vs reset deadlock, defensively unbreak it.
-	 *
-	 * FIXME: We can do a _lot_ better, this is just a first iteration.*/
-	i915_gem_set_wedged(dev_priv);
+	/* We have a modeset vs reset deadlock, defensively unbreak it. */
+	set_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
+	wake_up_all(&dev_priv->gpu_error.wait_queue);
 
 	/*
 	 * Need mode_config.mutex so that we don't
@@ -3569,6 +3568,8 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
 	drm_modeset_drop_locks(ctx);
 	drm_modeset_acquire_fini(ctx);
 	mutex_unlock(&dev->mode_config.mutex);
+
+	clear_bit(I915_RESET_MODESET, &dev_priv->gpu_error.flags);
 }
 
 static bool abort_flip_on_reset(struct intel_crtc *crtc)
@@ -12384,6 +12385,30 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work)
 	intel_atomic_helper_free_state(dev_priv);
 }
 
+static void intel_atomic_commit_fence_wait(struct intel_atomic_state *intel_state)
+{
+	struct wait_queue_entry wait_fence, wait_reset;
+	struct drm_i915_private *dev_priv = to_i915(intel_state->base.dev);
+
+	init_wait_entry(&wait_fence, 0);
+	init_wait_entry(&wait_reset, 0);
+	for (;;) {
+		prepare_to_wait(&intel_state->commit_ready.wait,
+				&wait_fence, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(&dev_priv->gpu_error.wait_queue,
+				&wait_reset, TASK_UNINTERRUPTIBLE);
+
+
+		if (i915_sw_fence_done(&intel_state->commit_ready)
+		    || (dev_priv->gpu_error.flags & I915_RESET_MODESET))
+			break;
+
+		schedule();
+	}
+	finish_wait(&intel_state->commit_ready.wait, &wait_fence);
+	finish_wait(&dev_priv->gpu_error.wait_queue, &wait_reset);
+}
+
 static void intel_atomic_commit_tail(struct drm_atomic_state *state)
 {
 	struct drm_device *dev = state->dev;
@@ -12397,7 +12422,7 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
 	unsigned crtc_vblank_mask = 0;
 	int i;
 
-	i915_sw_fence_wait(&intel_state->commit_ready);
+	intel_atomic_commit_fence_wait(intel_state);
 
 	drm_atomic_helper_wait_for_dependencies(state);
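As an aside on intel_atomic_commit_fence_wait() in the patch above: both prepare_to_wait() calls are made before either condition is tested, which is the usual lost-wakeup precaution. Once the task is queued on both wait queues, a wake_up_all() on either source (fence completion, or intel_prepare_reset() setting I915_RESET_MODESET) after the check has already put the task back to TASK_RUNNING, so schedule() returns promptly and the loop re-checks. A stripped-down sketch of that shape, with placeholder names (wq_a/wq_b, done_a()/done_b() are not driver code):

	#include <linux/wait.h>
	#include <linux/sched.h>

	static void wait_for_either(struct wait_queue_head *wq_a,
				    struct wait_queue_head *wq_b)
	{
		struct wait_queue_entry wait_a, wait_b;

		init_wait_entry(&wait_a, 0);
		init_wait_entry(&wait_b, 0);
		for (;;) {
			/* queue ourselves on both wake-up sources first ... */
			prepare_to_wait(wq_a, &wait_a, TASK_UNINTERRUPTIBLE);
			prepare_to_wait(wq_b, &wait_b, TASK_UNINTERRUPTIBLE);

			/* ... then test the conditions; a racing wake-up has
			 * already made us runnable, so we cannot sleep past it */
			if (done_a() || done_b())
				break;

			schedule();
		}
		finish_wait(wq_a, &wait_a);
		finish_wait(wq_b, &wait_b);
	}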
There's no reason to entirely wedge the gpu, for the minimal deadlock
bugfix we only need to unbreak/decouple the atomic commit from the gpu
reset. The simplest way to fix that is by replacing the unconditional
fence wait at the top of commit_tail by a wait which completes either
when the fences are done (normal case, or when a reset doesn't need to
touch the display state), or when the gpu reset needs to force-unblock
all pending modeset states.

Note that in both cases TDR itself keeps working, so from a userspace
pov this trickery isn't observable. Users themselves might spot a short
glitch while the rendering is catching up again, but that's still
better than pre-TDR where we've thrown away all the rendering,
including innocent batches. Also, this fixes the regression TDR
introduced of making gpu resets deadlock-prone when we do need to
touch the display.

One thing I noticed is that gpu_error.flags seems to use both our own
wait-queue in gpu_error.wait_queue, and the generic wait_on_bit
facilities. Not entirely sure why this inconsistency exists, I just
picked one style.

A possible future avenue could be to insert the gpu reset in-between
ongoing modeset changes, which would avoid the momentary glitch. But
that's a lot more work to implement in the atomic commit machinery,
and given that we only need this for pre-g4x hw, it's of questionable
utility just for the sake of polishing gpu reset even more on those
old boxes. It might be useful for other features though.

v2: Rebase onto 4.13 with a s/wait_queue_t/struct wait_queue_entry/.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h      |  1 +
 drivers/gpu/drm/i915/intel_display.c | 35 ++++++++++++++++++++++++++++++-----
 2 files changed, 31 insertions(+), 5 deletions(-)
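On the wait-queue vs. wait_on_bit inconsistency mentioned in the commit message, the two styles look roughly like this side by side (illustrative fragment only, not code from the patch; dev_priv is assumed from context):

	/* style 1: driver-owned wait queue, woken with wake_up_all() */
	wait_event(dev_priv->gpu_error.wait_queue,
		   !test_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags));

	/* style 2: generic bit-wait, woken with clear_bit() + wake_up_bit() */
	wait_on_bit(&dev_priv->gpu_error.flags, I915_RESET_BACKOFF,
		    TASK_UNINTERRUPTIBLE);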