Message ID: 1452706112-8617-7-git-send-email-arun.siluvery@linux.intel.com (mailing list archive)
State: New, archived
On Wed, Jan 13, 2016 at 05:28:18PM +0000, Arun Siluvery wrote:
> From: Tomas Elf <tomas.elf@intel.com>
>
> There used to be a work queue separating the error handler from the hang
> recovery path, which was removed a while back in this commit:
>
>     commit b8d24a06568368076ebd5a858a011699a97bfa42
>     Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>     Date:   Wed Jan 28 17:03:14 2015 +0200
>
>         drm/i915: Remove nested work in gpu error handling
>
> Now we need to revert most of that commit, since the work queue separating
> hang detection from hang recovery is needed in preparation for the upcoming
> watchdog timeout feature. The watchdog interrupt service routine will be a
> second callsite of the error handler, alongside the periodic hang checker,
> which runs in a work queue context. Since the error handler will be serving
> a caller in a hard interrupt execution context, it must never end up in a
> situation where it needs to grab the struct_mutex. Unfortunately, that is
> exactly what we need to do first at the start of the hang recovery path,
> and that might sleep if the struct_mutex is already held by another
> thread. Not good when you're in a hard interrupt context.

We also would not dream of running i915_handle_error() from inside an
interrupt handler anyway, as the capture is too heavy...
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_dma.c |  1 +
>  drivers/gpu/drm/i915/i915_drv.h |  1 +
>  drivers/gpu/drm/i915/i915_irq.c | 31 ++++++++++++++++++++++++-------
>  3 files changed, 26 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index c45ec353..67003c2 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -1203,6 +1203,7 @@ int i915_driver_unload(struct drm_device *dev)
>  	/* Free error state after interrupts are fully disabled. */
>  	cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
>  	i915_destroy_error_state(dev);
> +	cancel_work_sync(&dev_priv->gpu_error.work);

This should be before the destroy, as we could be in the process of
resetting state, but after the cancel(hangcheck_work), as that may queue
the error_work.

> @@ -2827,7 +2830,21 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
>  		i915_error_wake_up(dev_priv, false);
>  	}
>
> -	i915_reset_and_wakeup(dev);
> +	/*
> +	 * Gen 7:
> +	 *

Gen 7? A little misleading.
-Chris
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index c45ec353..67003c2 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1203,6 +1203,7 @@ int i915_driver_unload(struct drm_device *dev)
 	/* Free error state after interrupts are fully disabled. */
 	cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
 	i915_destroy_error_state(dev);
+	cancel_work_sync(&dev_priv->gpu_error.work);
 
 	if (dev->pdev->msi_enabled)
 		pci_disable_msi(dev->pdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5be7d3e..072ca37 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1337,6 +1337,7 @@ struct i915_gpu_error {
 	spinlock_t lock;
 	/* Protected by the above dev->gpu_error.lock. */
 	struct drm_i915_error_state *first_error;
+	struct work_struct work;
 
 	unsigned long missed_irq_rings;
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index fef74cf..8937c82 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2457,16 +2457,19 @@ static void i915_error_wake_up(struct drm_i915_private *dev_priv,
 }
 
 /**
- * i915_reset_and_wakeup - do process context error handling work
- * @dev: drm device
+ * i915_error_work_func - do process context error handling work
+ * @work: work item containing error struct, passed by the error handler
  *
  * Fire an error uevent so userspace can see that a hang or error
  * was detected.
  */
-static void i915_reset_and_wakeup(struct drm_device *dev)
+static void i915_error_work_func(struct work_struct *work)
 {
-	struct drm_i915_private *dev_priv = to_i915(dev);
-	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct i915_gpu_error *error = container_of(work, struct i915_gpu_error,
+						    work);
+	struct drm_i915_private *dev_priv =
+		container_of(error, struct drm_i915_private, gpu_error);
+	struct drm_device *dev = dev_priv->dev;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
 	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
@@ -2827,7 +2830,21 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
 		i915_error_wake_up(dev_priv, false);
 	}
 
-	i915_reset_and_wakeup(dev);
+	/*
+	 * Gen 7:
+	 *
+	 * Our reset work can grab modeset locks (since it needs to reset the
+	 * state of outstanding pageflips). Hence it must not be run on our own
+	 * dev-priv->wq work queue for otherwise the flush_work in the pageflip
+	 * code will deadlock.
+	 * If error_work is already in the work queue then it will not be added
+	 * again. It hasn't yet executed so it will see the reset flags when
+	 * it is scheduled. If it isn't in the queue or it is currently
+	 * executing then this call will add it to the queue again so that
+	 * even if it misses the reset flags during the current call it is
+	 * guaranteed to see them on the next call.
+	 */
+	schedule_work(&dev_priv->gpu_error.work);
 }
 
 /* Called from drm generic code, passed 'crtc' which
@@ -4682,7 +4699,7 @@ void intel_irq_init(struct drm_i915_private *dev_priv)
 	struct drm_device *dev = dev_priv->dev;
 
 	intel_hpd_init_work(dev_priv);
-
+	INIT_WORK(&dev_priv->gpu_error.work, i915_error_work_func);
 	INIT_WORK(&dev_priv->rps.work, gen6_pm_rps_work);
 	INIT_WORK(&dev_priv->l3_parity.error_work, ivybridge_parity_work);