
drm/i915: Remove nested work in gpu error handling

Message ID 1422457394-27331-1-git-send-email-mika.kuoppala@intel.com (mailing list archive)
State New, archived

Commit Message

Mika Kuoppala Jan. 28, 2015, 3:03 p.m. UTC
Now that we declare gpu errors only through our own dedicated
hangcheck workqueue, there is no need for a separate workqueue
to handle the reset and wake up the clients, as the deadlock
concerns no longer apply.

The only exception is i915_debugfs::i915_set_wedged, which triggers
error handling from process context. However, as this is only used by the
test harness, it is the test harness's responsibility not to inject hangs
through both the debug interface and the hangcheck mechanism at the same time.

Remove gpu_error.work and let the hangcheck work do the tasks it used to.

v2: Add a big warning sign into i915_debugfs::i915_set_wedged (Chris)

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_debugfs.c | 11 +++++++++++
 drivers/gpu/drm/i915/i915_dma.c     |  1 -
 drivers/gpu/drm/i915/i915_drv.h     |  2 --
 drivers/gpu/drm/i915/i915_irq.c     | 34 +++++++++++++---------------------
 4 files changed, 24 insertions(+), 24 deletions(-)
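
The shape of the change can be sketched as a minimal user-space analogy (hypothetical names; the real code is in the patch below). Before the patch, i915_handle_error() finished by queuing gpu_error.work for a second work item to run; after it, the reset-and-wakeup step is a plain function called directly from the single hangcheck work context:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the driver entry points, showing only the
 * control-flow change. The nested work item is gone; the error path now
 * performs the reset and wakeup inline. */

static int resets_completed;

static void reset_and_wakeup(void)      /* i915_reset_and_wakeup() analogue */
{
        resets_completed++;             /* reset, uevents, waking clients */
}

static void handle_error(bool wedged)   /* i915_handle_error() analogue */
{
        if (wedged)
                reset_and_wakeup();     /* was: schedule_work(&error->work) */
}
```

The direct call is safe only because hangcheck is the sole producer of errors; that invariant is exactly what the i915_set_wedged caveat above is about.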

Comments

Chris Wilson Jan. 28, 2015, 3:30 p.m. UTC | #1
On Wed, Jan 28, 2015 at 05:03:14PM +0200, Mika Kuoppala wrote:
> Now when we declare gpu errors only through our own dedicated
> hangcheck workqueue there is no need to have a separate workqueue
> for handling the resetting and waking up the clients as the deadlock
> concerns are no more.
> 
> The only exception is i915_debugfs::i915_set_wedged, which triggers
> error handling through process context. However as this is only used through
> test harness it is responsibility for test harness not to introduce hangs
> through both debug interface and through hangcheck mechanism at the same time.
> 
> Remove gpu_error.work and let the hangcheck work do the tasks it used to.
> 
> v2: Add a big warning sign into i915_debugfs::i915_set_wedged (Chris)
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
Daniel Vetter Jan. 29, 2015, 5:03 p.m. UTC | #2
On Wed, Jan 28, 2015 at 03:30:35PM +0000, Chris Wilson wrote:
> On Wed, Jan 28, 2015 at 05:03:14PM +0200, Mika Kuoppala wrote:
> > Now when we declare gpu errors only through our own dedicated
> > hangcheck workqueue there is no need to have a separate workqueue
> > for handling the resetting and waking up the clients as the deadlock
> > concerns are no more.
> > 
> > The only exception is i915_debugfs::i915_set_wedged, which triggers
> > error handling through process context. However as this is only used through
> > test harness it is responsibility for test harness not to introduce hangs
> > through both debug interface and through hangcheck mechanism at the same time.
> > 
> > Remove gpu_error.work and let the hangcheck work do the tasks it used to.
> > 
> > v2: Add a big warning sign into i915_debugfs::i915_set_wedged (Chris)
> > 
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

Queued for -next, thanks for the patch.
-Daniel
Chris Wilson Feb. 2, 2015, 9:17 a.m. UTC | #3
On Wed, Jan 28, 2015 at 05:03:14PM +0200, Mika Kuoppala wrote:
> @@ -2616,6 +2612,9 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
>  	va_list args;
>  	char error_msg[80];
>  
> +	if (WARN_ON(mutex_is_locked(&dev_priv->dev->struct_mutex)))
> +		return;
> +

Oops, sorry, I should have realised this was wrong earlier. The mutex
breaking occurs later in i915_handle_error.
-Chris
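
Chris's objection can be illustrated with a small sketch (hypothetical names; plain atomics and a flag instead of the real waitqueues and struct_mutex). Waiters holding the lock only drop it *after* the handler bumps the reset counter and wakes them, so bailing out early because the mutex is currently held prevents the very lock breaking the handler is supposed to perform:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Analogy for the "mutex breaking" inside i915_handle_error(): an odd
 * reset_counter value means a reset is in progress; a woken waiter that
 * holds the lock notices this, backs off, and releases the lock. */

static atomic_uint reset_counter;
static bool mutex_held;                 /* struct_mutex stand-in */

static bool reset_in_progress(void)
{
        return atomic_load(&reset_counter) & 1;   /* odd = in progress */
}

static void wake_waiters(void)          /* i915_error_wake_up() analogue */
{
        if (mutex_held && reset_in_progress())
                mutex_held = false;     /* waiter drops the lock, retries later */
}

static void handle_error(void)
{
        /* Wrong place for "if (mutex_held) return;": the hold is released
         * as a *consequence* of the counter bump and wakeup below. */
        atomic_fetch_add(&reset_counter, 1);  /* declare reset in progress */
        wake_waiters();                       /* waiters see it and back off */
}
```

An early return on mutex_is_locked() therefore skips both the counter bump and the wakeup, which is why the stress tests below go from fail to timeout: the reset they depend on never runs.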
Daniel Vetter Feb. 2, 2015, 9:38 a.m. UTC | #4
On Mon, Feb 02, 2015 at 09:17:14AM +0000, Chris Wilson wrote:
> On Wed, Jan 28, 2015 at 05:03:14PM +0200, Mika Kuoppala wrote:
> > @@ -2616,6 +2612,9 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
> >  	va_list args;
> >  	char error_msg[80];
> >  
> > +	if (WARN_ON(mutex_is_locked(&dev_priv->dev->struct_mutex)))
> > +		return;
> > +
> 
> Oops, sorry, I should have realised this was wrong earlier. The mutex
> breaking occurs later in i915_handle_error.

Oh well, already merged. Also, prts seems to complain that a bunch of
hang stress-tests changed from fail to timeout because of this one here.
Did this patch accidentally fix a bug, meaning we just need to tune the tests,
or is there some new deadlock now? prts results are really thin, per usual
:(

Thanks, Daniel
Chris Wilson Feb. 2, 2015, 10:08 a.m. UTC | #5
On Mon, Feb 02, 2015 at 10:38:19AM +0100, Daniel Vetter wrote:
> On Mon, Feb 02, 2015 at 09:17:14AM +0000, Chris Wilson wrote:
> > On Wed, Jan 28, 2015 at 05:03:14PM +0200, Mika Kuoppala wrote:
> > > @@ -2616,6 +2612,9 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
> > >  	va_list args;
> > >  	char error_msg[80];
> > >  
> > > +	if (WARN_ON(mutex_is_locked(&dev_priv->dev->struct_mutex)))
> > > +		return;
> > > +
> > 
> > Oops, sorry, I should have realised this was wrong earlier. The mutex
> > breaking occurs later in i915_handle_error.
> 
> Oh well, already merged. Also, prts seems to complain that a bunch of
> hang stress-tests changed from fail to timeout because of this one here.
> Is this patch accidentally fix a bug and we just need to tune the tests,
> or is there some new deadlock now? prts results are really thin, per usual
> :(

Yes. It will also prevent the gpu reset which those tests depend upon.
-Chris
Mika Kuoppala Feb. 3, 2015, 2:05 p.m. UTC | #6
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Mon, Feb 02, 2015 at 10:38:19AM +0100, Daniel Vetter wrote:
>> On Mon, Feb 02, 2015 at 09:17:14AM +0000, Chris Wilson wrote:
>> > On Wed, Jan 28, 2015 at 05:03:14PM +0200, Mika Kuoppala wrote:
>> > > @@ -2616,6 +2612,9 @@ void i915_handle_error(struct drm_device *dev, bool wedged,
>> > >  	va_list args;
>> > >  	char error_msg[80];
>> > >  
>> > > +	if (WARN_ON(mutex_is_locked(&dev_priv->dev->struct_mutex)))
>> > > +		return;
>> > > +
>> > 
>> > Oops, sorry, I should have realised this was wrong earlier. The mutex
>> > breaking occurs later in i915_handle_error.
>> 
>> Oh well, already merged. Also, prts seems to complain that a bunch of
>> hang stress-tests changed from fail to timeout because of this one here.
>> Is this patch accidentally fix a bug and we just need to tune the tests,
>> or is there some new deadlock now? prts results are really thin, per usual
>> :(
>
> Yes. It will also prevent the gpu reset which those tests depend upon.

Jani pointed me at the dups. There will be more, I think. The check
should have been deeper in the reset handling, and instead of
bailing out we should at least have requeued ourselves.

Sorry.
-Mika

> -Chris
>
> -- 
> Chris Wilson, Intel Open Source Technology Centre

Patch

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index 3b332a4..211d494 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -3969,6 +3969,17 @@  i915_wedged_set(void *data, u64 val)
 	struct drm_device *dev = data;
 	struct drm_i915_private *dev_priv = dev->dev_private;
 
+	/*
+	 * There is no safeguard against this debugfs entry colliding
+	 * with the hangcheck calling same i915_handle_error() in
+	 * parallel, causing an explosion. For now we assume that the
+	 * test harness is responsible enough not to inject gpu hangs
+	 * while it is writing to 'i915_wedged'
+	 */
+
+	if (i915_reset_in_progress(&dev_priv->gpu_error))
+		return -EAGAIN;
+
 	intel_runtime_pm_get(dev_priv);
 
 	i915_handle_error(dev, val,
diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 6eaf795..1a46787 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -945,7 +945,6 @@  int i915_driver_unload(struct drm_device *dev)
 
 	/* Free error state after interrupts are fully disabled. */
 	cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
-	cancel_work_sync(&dev_priv->gpu_error.work);
 	i915_destroy_error_state(dev);
 
 	if (dev->pdev->msi_enabled)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index b09173f..07f99ca 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1352,8 +1352,6 @@  struct i915_gpu_error {
 	spinlock_t lock;
 	/* Protected by the above dev->gpu_error.lock. */
 	struct drm_i915_error_state *first_error;
-	struct work_struct work;
-
 
 	unsigned long missed_irq_rings;
 
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 234b1f7..44dbf78 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2421,19 +2421,15 @@  static void i915_error_wake_up(struct drm_i915_private *dev_priv,
 }
 
 /**
- * i915_error_work_func - do process context error handling work
- * @work: work struct
+ * i915_reset_and_wakeup - do process context error handling work
  *
  * Fire an error uevent so userspace can see that a hang or error
  * was detected.
  */
-static void i915_error_work_func(struct work_struct *work)
+static void i915_reset_and_wakeup(struct drm_device *dev)
 {
-	struct i915_gpu_error *error = container_of(work, struct i915_gpu_error,
-						    work);
-	struct drm_i915_private *dev_priv =
-		container_of(error, struct drm_i915_private, gpu_error);
-	struct drm_device *dev = dev_priv->dev;
+	struct drm_i915_private *dev_priv = to_i915(dev);
+	struct i915_gpu_error *error = &dev_priv->gpu_error;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
 	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
@@ -2600,10 +2596,10 @@  static void i915_report_and_clear_eir(struct drm_device *dev)
 }
 
 /**
- * i915_handle_error - handle an error interrupt
+ * i915_handle_error - handle a gpu error
  * @dev: drm device
  *
- * Do some basic checking of regsiter state at error interrupt time and
+ * Do some basic checking of regsiter state at error time and
  * dump it to the syslog.  Also call i915_capture_error_state() to make
  * sure we get a record and make it available in debugfs.  Fire a uevent
  * so userspace knows something bad happened (should trigger collection
@@ -2616,6 +2612,9 @@  void i915_handle_error(struct drm_device *dev, bool wedged,
 	va_list args;
 	char error_msg[80];
 
+	if (WARN_ON(mutex_is_locked(&dev_priv->dev->struct_mutex)))
+		return;
+
 	va_start(args, fmt);
 	vscnprintf(error_msg, sizeof(error_msg), fmt, args);
 	va_end(args);
@@ -2628,9 +2627,9 @@  void i915_handle_error(struct drm_device *dev, bool wedged,
 				&dev_priv->gpu_error.reset_counter);
 
 		/*
-		 * Wakeup waiting processes so that the reset work function
-		 * i915_error_work_func doesn't deadlock trying to grab various
-		 * locks. By bumping the reset counter first, the woken
+		 * Wakeup waiting processes so that the reset function
+		 * i915_reset_and_wakeup doesn't deadlock trying to grab
+		 * various locks. By bumping the reset counter first, the woken
 		 * processes will see a reset in progress and back off,
 		 * releasing their locks and then wait for the reset completion.
 		 * We must do this for _all_ gpu waiters that might hold locks
@@ -2643,13 +2642,7 @@  void i915_handle_error(struct drm_device *dev, bool wedged,
 		i915_error_wake_up(dev_priv, false);
 	}
 
-	/*
-	 * Our reset work can grab modeset locks (since it needs to reset the
-	 * state of outstanding pagelips). Hence it must not be run on our own
-	 * dev-priv->wq work queue for otherwise the flush_work in the pageflip
-	 * code will deadlock.
-	 */
-	schedule_work(&dev_priv->gpu_error.work);
+	i915_reset_and_wakeup(dev);
 }
 
 /* Called from drm generic code, passed 'crtc' which
@@ -4345,7 +4338,6 @@  void intel_irq_init(struct drm_i915_private *dev_priv)
 
 	INIT_WORK(&dev_priv->hotplug_work, i915_hotplug_work_func);
 	INIT_WORK(&dev_priv->dig_port_work, i915_digport_work_func);
-	INIT_WORK(&dev_priv->gpu_error.work, i915_error_work_func);
 	INIT_WORK(&dev_priv->rps.work, gen6_pm_rps_work);
 	INIT_WORK(&dev_priv->l3_parity.error_work, ivybridge_parity_work);