Message ID | 1422017048-4792-1-git-send-email-mika.kuoppala@intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Fri, Jan 23, 2015 at 02:44:07PM +0200, Mika Kuoppala wrote: > From: Chris Wilson <chris@chris-wilson.co.uk> > > When run as a timer, i915_hangcheck_elapsed() must adhere to all the > rules of running in a softirq context. This is advantageous to us as we > want to minimise the risk that a driver bug will prevent us from > detecting a hung GPU. However, that is irrelevant if the driver bug > prevents us from resetting and recovering. Still it is prudent not to > rely on mutexes inside the checker, but given the coarseness of > dev->struct_mutex doing so is extremely hard. > > Give in and run from a work queue, i.e. outside of softirq. > > v2: > > The conversion does have one significant change, from the use of > mod_timer to schedule_delayed_work, means that the time that we execute > the first hangcheck is fixed and not continually deferred by later work. > This has the advantage of not allowing userspace to fill the ring before > hangcheck can finally run. At the same time, it removes the ability for > the interrupt to defer the hangcheck as well. This is sensible for that > an interrupt is only for a single engine, whereas we perform hangcheck > globally, so whilst one ring may have hung, the other could be running > normally and preventing the hangcheck from firing. > > Cc: Jani Nikula <jani.nikula@intel.com> > Cc: Daniel Vetter <dnaiel.vetter@ffwll.chm> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> (v2) > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com> One thing special with timers is that they'll always run, and if you do a del_timer_sync from process context you can't deadlock with timer A because you hold some locks for timer B that's in front of A in the queues. With workqueues that's not the case, and it's really easy to cause deadlocks by blocking some random work item in front of the queue by accident. I think for this switch we need our own, dedicated hangcheck work queue, with it's own thread to make sure it gets run reliable. But besides that I really like this chnage. -Daniel
On Fri, Jan 23, 2015 at 02:44:07PM +0200, Mika Kuoppala wrote: > From: Chris Wilson <chris@chris-wilson.co.uk> > > When run as a timer, i915_hangcheck_elapsed() must adhere to all the > rules of running in a softirq context. This is advantageous to us as we > want to minimise the risk that a driver bug will prevent us from > detecting a hung GPU. However, that is irrelevant if the driver bug > prevents us from resetting and recovering. Still it is prudent not to > rely on mutexes inside the checker, but given the coarseness of > dev->struct_mutex doing so is extremely hard. > > Give in and run from a work queue, i.e. outside of softirq. > > v2: > > The conversion does have one significant change, from the use of > mod_timer to schedule_delayed_work, means that the time that we execute > the first hangcheck is fixed and not continually deferred by later work. > This has the advantage of not allowing userspace to fill the ring before > hangcheck can finally run. At the same time, it removes the ability for > the interrupt to defer the hangcheck as well. This is sensible for that > an interrupt is only for a single engine, whereas we perform hangcheck > globally, so whilst one ring may have hung, the other could be running > normally and preventing the hangcheck from firing. We can drop this comment since we have already applied this change in an earlier patch. > @@ -3097,17 +3099,11 @@ static void i915_hangcheck_elapsed(unsigned long data) > > void i915_queue_hangcheck(struct drm_device *dev) > { > - struct drm_i915_private *dev_priv = dev->dev_private; > - struct timer_list *timer = &dev_priv->gpu_error.hangcheck_timer; > - > if (!i915.enable_hangcheck) > return; > > - /* Don't continually defer the hangcheck, but make sure it is active */ Keep the comment, or more preferrably let's capture the reason why we don't want to continually defer the hangcheck: /* Don't continually defer the hangcheck so that it is always run at * least once after work has been scheduled on any ring. Otherwise, * we will ignore a hung ring if a second ring is kept busy. */ > - if (timer_pending(timer)) > - return; > - mod_timer(timer, > - round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES)); > + schedule_delayed_work(&to_i915(dev)->gpu_error.hangcheck_work, > + round_jiffies_up_relative(DRM_I915_HANGCHECK_JIFFIES)); > }
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c index 51e8fe5..7c64669 100644 --- a/drivers/gpu/drm/i915/i915_dma.c +++ b/drivers/gpu/drm/i915/i915_dma.c @@ -934,7 +934,7 @@ int i915_driver_unload(struct drm_device *dev) } /* Free error state after interrupts are fully disabled. */ - del_timer_sync(&dev_priv->gpu_error.hangcheck_timer); + cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work); cancel_work_sync(&dev_priv->gpu_error.work); i915_destroy_error_state(dev); diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index 66c72bd..cb1468d 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -1396,7 +1396,7 @@ static int intel_runtime_suspend(struct device *device) return ret; } - del_timer_sync(&dev_priv->gpu_error.hangcheck_timer); + cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work); intel_uncore_forcewake_reset(dev, false); dev_priv->pm.suspended = true; diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 0d67b17..ce2acdd 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -1345,7 +1345,7 @@ struct i915_gpu_error { /* Hang gpu twice in this window and your context gets banned */ #define DRM_I915_CTX_BAN_PERIOD DIV_ROUND_UP(8*DRM_I915_HANGCHECK_PERIOD, 1000) - struct timer_list hangcheck_timer; + struct delayed_work hangcheck_work; /* For reset and error_state handling. */ spinlock_t lock; diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index fc81889..8a178cd 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -4615,7 +4615,7 @@ i915_gem_suspend(struct drm_device *dev) i915_gem_stop_ringbuffers(dev); mutex_unlock(&dev->struct_mutex); - del_timer_sync(&dev_priv->gpu_error.hangcheck_timer); + cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work); cancel_delayed_work_sync(&dev_priv->mm.retire_work); flush_delayed_work(&dev_priv->mm.idle_work); diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 25b2299..a188c7d 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -2980,10 +2980,12 @@ ring_stuck(struct intel_engine_cs *ring, u64 acthd) * we kick the ring. If we see no progress on three subsequent calls * we assume chip is wedged and try to fix it by resetting the chip. */ -static void i915_hangcheck_elapsed(unsigned long data) +static void i915_hangcheck_elapsed(struct work_struct *work) { - struct drm_device *dev = (struct drm_device *)data; - struct drm_i915_private *dev_priv = dev->dev_private; + struct drm_i915_private *dev_priv = + container_of(work, typeof(*dev_priv), + gpu_error.hangcheck_work.work); + struct drm_device *dev = dev_priv->dev; struct intel_engine_cs *ring; int i; int busy_count = 0, rings_hung = 0; @@ -3097,17 +3099,11 @@ static void i915_hangcheck_elapsed(unsigned long data) void i915_queue_hangcheck(struct drm_device *dev) { - struct drm_i915_private *dev_priv = dev->dev_private; - struct timer_list *timer = &dev_priv->gpu_error.hangcheck_timer; - if (!i915.enable_hangcheck) return; - /* Don't continually defer the hangcheck, but make sure it is active */ - if (timer_pending(timer)) - return; - mod_timer(timer, - round_jiffies_up(jiffies + DRM_I915_HANGCHECK_JIFFIES)); + schedule_delayed_work(&to_i915(dev)->gpu_error.hangcheck_work, + round_jiffies_up_relative(DRM_I915_HANGCHECK_JIFFIES)); } static void ibx_irq_reset(struct drm_device *dev) @@ -4351,9 +4347,8 @@ void intel_irq_init(struct drm_i915_private *dev_priv) else dev_priv->pm_rps_events = GEN6_PM_RPS_EVENTS; - setup_timer(&dev_priv->gpu_error.hangcheck_timer, - i915_hangcheck_elapsed, - (unsigned long) dev); + INIT_DELAYED_WORK(&dev_priv->gpu_error.hangcheck_work, + i915_hangcheck_elapsed); INIT_DELAYED_WORK(&dev_priv->hotplug_reenable_work, intel_hpd_irq_reenable_work);