From patchwork Tue Jul 21 13:58:48 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Tomas Elf <tomas.elf@intel.com>
X-Patchwork-Id: 6836471
Return-Path: <intel-gfx-bounces@lists.freedesktop.org>
X-Original-To: patchwork-intel-gfx@patchwork.kernel.org
Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.136])
	by patchwork2.web.kernel.org (Postfix) with ESMTP id 7E13FC05AC
	for <patchwork-intel-gfx@patchwork.kernel.org>;
	Tue, 21 Jul 2015 13:59:56 +0000 (UTC)
Received: from mail.kernel.org (localhost [127.0.0.1])
	by mail.kernel.org (Postfix) with ESMTP id 80A3320670
	for <patchwork-intel-gfx@patchwork.kernel.org>;
	Tue, 21 Jul 2015 13:59:55 +0000 (UTC)
Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177])
	by mail.kernel.org (Postfix) with ESMTP id 73AE020648
	for <patchwork-intel-gfx@patchwork.kernel.org>;
	Tue, 21 Jul 2015 13:59:54 +0000 (UTC)
Received: from gabe.freedesktop.org (localhost [127.0.0.1])
	by gabe.freedesktop.org (Postfix) with ESMTP id DE3927208D;
	Tue, 21 Jul 2015 06:59:53 -0700 (PDT)
X-Original-To: Intel-GFX@lists.freedesktop.org
Delivered-To: Intel-GFX@lists.freedesktop.org
Received: from mga14.intel.com (mga14.intel.com [192.55.52.115])
	by gabe.freedesktop.org (Postfix) with ESMTP id BB5067208D
	for <Intel-GFX@lists.freedesktop.org>;
	Tue, 21 Jul 2015 06:59:52 -0700 (PDT)
Received: from orsmga001.jf.intel.com ([10.7.209.18])
	by fmsmga103.fm.intel.com with ESMTP; 21 Jul 2015 06:59:52 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.15,516,1432623600"; d="scan'208";a="732757481"
Received: from telf-linux2.isw.intel.com ([10.102.226.163])
	by orsmga001.jf.intel.com with ESMTP; 21 Jul 2015 06:59:51 -0700
From: Tomas Elf <tomas.elf@intel.com>
To: Intel-GFX@Lists.FreeDesktop.Org
Date: Tue, 21 Jul 2015 14:58:48 +0100
Message-Id: <1437487135-32520-6-git-send-email-tomas.elf@intel.com>
X-Mailer: git-send-email 1.9.1
In-Reply-To: <1437487135-32520-1-git-send-email-tomas.elf@intel.com>
References: <1437487135-32520-1-git-send-email-tomas.elf@intel.com>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Subject: [Intel-gfx] [RFCv2 05/12] drm/i915: Reinstate hang recovery work
	queue.
X-BeenThere: intel-gfx@lists.freedesktop.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: Intel graphics driver community testing & development
	<intel-gfx.lists.freedesktop.org>
List-Unsubscribe: <http://lists.freedesktop.org/mailman/options/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=unsubscribe>
List-Archive: <http://lists.freedesktop.org/archives/intel-gfx>
List-Post: <mailto:intel-gfx@lists.freedesktop.org>
List-Help: <mailto:intel-gfx-request@lists.freedesktop.org?subject=help>
List-Subscribe: <http://lists.freedesktop.org/mailman/listinfo/intel-gfx>,
	<mailto:intel-gfx-request@lists.freedesktop.org?subject=subscribe>
MIME-Version: 1.0
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
X-Spam-Status: No, score=-5.4 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_MED,
	RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

There used to be a work queue separating the error handler from the hang
recovery path, which was removed a while back in this commit:

	commit b8d24a06568368076ebd5a858a011699a97bfa42
	Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
	Date:   Wed Jan 28 17:03:14 2015 +0200

	    drm/i915: Remove nested work in gpu error handling

Now we need to revert most of that commit since the work queue separating hang
detection from hang recovery is needed in preparation for the upcoming watchdog
timeout feature. The watchdog interrupt service routine will be a second
callsite of the error handler alongside the periodic hang checker, which runs
in a work queue context. Seeing as the error handler will be serving a caller
in a hard interrupt execution context that means that the error handler must
never end up in a situation where it needs to grab the struct_mutex.
Unfortunately, that is exactly what we need to do first at the start of the
hang recovery path, which might potentially sleep if the struct_mutex is
already held by another thread. Not good when you're in a hard interrupt
context.

Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_dma.c |  1 +
 drivers/gpu/drm/i915/i915_drv.h |  1 +
 drivers/gpu/drm/i915/i915_irq.c | 28 +++++++++++++++++++++++-----
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index cf01e84..39b8f5f 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1116,6 +1116,7 @@ int i915_driver_unload(struct drm_device *dev)
 	/* Free error state after interrupts are fully disabled. */
 	cancel_delayed_work_sync(&dev_priv->gpu_error.hangcheck_work);
 	i915_destroy_error_state(dev);
+	cancel_work_sync(&dev_priv->gpu_error.work);
 
 	if (dev->pdev->msi_enabled)
 		pci_disable_msi(dev->pdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index f4d0c1f..ef7c129 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -1245,6 +1245,7 @@ struct i915_gpu_error {
 	spinlock_t lock;
 	/* Protected by the above dev->gpu_error.lock. */
 	struct drm_i915_error_state *first_error;
+	struct work_struct work;
 
 	unsigned long missed_irq_rings;
 
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index e869823..4973826 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2300,15 +2300,18 @@ static void i915_error_wake_up(struct drm_i915_private *dev_priv,
 }
 
 /**
- * i915_reset_and_wakeup - do process context error handling work
+ * i915_error_work_func - do process context error handling work
  *
  * Fire an error uevent so userspace can see that a hang or error
  * was detected.
  */
-static void i915_reset_and_wakeup(struct drm_device *dev)
+static void i915_error_work_func(struct work_struct *work)
 {
-	struct drm_i915_private *dev_priv = to_i915(dev);
-	struct i915_gpu_error *error = &dev_priv->gpu_error;
+	struct i915_gpu_error *error = container_of(work, struct i915_gpu_error,
+	                                            work);
+	struct drm_i915_private *dev_priv =
+	        container_of(error, struct drm_i915_private, gpu_error);
+	struct drm_device *dev = dev_priv->dev;
 	char *error_event[] = { I915_ERROR_UEVENT "=1", NULL };
 	char *reset_event[] = { I915_RESET_UEVENT "=1", NULL };
 	char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
@@ -2668,7 +2671,21 @@ void i915_handle_error(struct drm_device *dev, u32 engine_mask, bool wedged,
 		i915_error_wake_up(dev_priv, false);
 	}
 
-	i915_reset_and_wakeup(dev);
+	/*
+	 * Gen 7:
+	 *
+	 * Our reset work can grab modeset locks (since it needs to reset the
+	 * state of outstanding pageflips). Hence it must not be run on our own
+	 * dev-priv->wq work queue for otherwise the flush_work in the pageflip
+	 * code will deadlock.
+	 * If error_work is already in the work queue then it will not be added
+	 * again. It hasn't yet executed so it will see the reset flags when
+	 * it is scheduled. If it isn't in the queue or it is currently
+	 * executing then this call will add it to the queue again so that
+	 * even if it misses the reset flags during the current call it is
+	 * guaranteed to see them on the next call.
+	 */
+	schedule_work(&dev_priv->gpu_error.work);
 }
 
 /* Called from drm generic code, passed 'crtc' which
@@ -4389,6 +4406,7 @@ void intel_irq_init(struct drm_i915_private *dev_priv)
 	struct drm_device *dev = dev_priv->dev;
 
 	INIT_WORK(&dev_priv->hotplug_work, i915_hotplug_work_func);
+	INIT_WORK(&dev_priv->gpu_error.work, i915_error_work_func);
 	INIT_WORK(&dev_priv->dig_port_work, i915_digport_work_func);
 	INIT_WORK(&dev_priv->rps.work, gen6_pm_rps_work);
 	INIT_WORK(&dev_priv->l3_parity.error_work, ivybridge_parity_work);