diff mbox

[v5,04/18] drm/i915/tdr: Modify error handler for per engine hang recovery

Message ID 20170325013010.36244-5-michel.thierry@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Michel Thierry March 25, 2017, 1:29 a.m. UTC
From: Arun Siluvery <arun.siluvery@linux.intel.com>

This is a preparatory patch which modifies error handler to do per engine
hang recovery. The actual patch which implements this sequence follows
later in the series. The aim is to prepare existing recovery function to
adapt to this new function where applicable (which fails at this point
because core implementation is lacking) and continue recovery using legacy
full gpu reset.

A helper function is also added to query the availability of engine
reset.

The error events behaviour that are used to notify user of reset are
adapted to engine reset such that it doesn't break users listening to these
events. In legacy we report an error event, a reset event before resetting
the gpu and a reset done event marking the completion of reset. The same
behaviour is adapted but reset event is only dispatched once even when
multiple engines are hung. Finally once reset is complete we send reset
done event as usual.

Note that this implementation of engine reset is for i915 directly
submitting to the ELSP, where the driver manages the hang detection,
recovery and resubmission. With GuC submission these tasks are shared
between driver and firmware; i915 will still responsible for detecting a
hang, and when it does it will have to request GuC to reset that Engine and
remind the firmware about the outstanding submissions. This will be
added in different patch.

v2: rebase, advertise engine reset availability in platform definition,
add note about GuC submission.
v3: s/*engine_reset*/*reset_engine*/. (Chris)
Handle reset as 2 level resets, by first going to engine only and fall
backing to full/chip reset as needed, i.e. reset_engine will need the
struct_mutex.
v4: Pass the engine mask to i915_reset. (Chris)
v5: Rebase, update selftests.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ian Lister <ian.lister@intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c                  | 49 +++++++++++++++++++++++-
 drivers/gpu/drm/i915/i915_drv.h                  |  7 +++-
 drivers/gpu/drm/i915/i915_gem_request.c          |  3 +-
 drivers/gpu/drm/i915/i915_irq.c                  | 19 ++++++++-
 drivers/gpu/drm/i915/i915_pci.c                  |  5 ++-
 drivers/gpu/drm/i915/intel_ringbuffer.h          | 16 ++++++++
 drivers/gpu/drm/i915/intel_uncore.c              | 11 ++++++
 drivers/gpu/drm/i915/selftests/intel_hangcheck.c |  4 +-
 8 files changed, 105 insertions(+), 9 deletions(-)

Comments

Chris Wilson March 25, 2017, 9:10 a.m. UTC | #1
On Fri, Mar 24, 2017 at 06:29:56PM -0700, Michel Thierry wrote:
> From: Arun Siluvery <arun.siluvery@linux.intel.com>
> 
> This is a preparatory patch which modifies error handler to do per engine
> hang recovery. The actual patch which implements this sequence follows
> later in the series. The aim is to prepare existing recovery function to
> adapt to this new function where applicable (which fails at this point
> because core implementation is lacking) and continue recovery using legacy
> full gpu reset.
> 
> A helper function is also added to query the availability of engine
> reset.
> 
> The error events behaviour that are used to notify user of reset are
> adapted to engine reset such that it doesn't break users listening to these
> events. In legacy we report an error event, a reset event before resetting
> the gpu and a reset done event marking the completion of reset. The same
> behaviour is adapted but reset event is only dispatched once even when
> multiple engines are hung. Finally once reset is complete we send reset
> done event as usual.
> 
> Note that this implementation of engine reset is for i915 directly
> submitting to the ELSP, where the driver manages the hang detection,
> recovery and resubmission. With GuC submission these tasks are shared
> between driver and firmware; i915 will still responsible for detecting a
> hang, and when it does it will have to request GuC to reset that Engine and
> remind the firmware about the outstanding submissions. This will be
> added in different patch.
> 
> v2: rebase, advertise engine reset availability in platform definition,
> add note about GuC submission.
> v3: s/*engine_reset*/*reset_engine*/. (Chris)
> Handle reset as 2 level resets, by first going to engine only and fall
> backing to full/chip reset as needed, i.e. reset_engine will need the
> struct_mutex.
> v4: Pass the engine mask to i915_reset. (Chris)
> v5: Rebase, update selftests.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Ian Lister <ian.lister@intel.com>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> Signed-off-by: Michel Thierry <michel.thierry@intel.com>

4 authors in and this patch is still trying to do reset_engine until the
mutex, requiring the handoff. Why? We should be ready now to be able to
do the first pass of resets before the mutex - we don't need to even
prepare the display for the per-engine resets and should be able to do
it much lighter. We can land the per-engine struct_mutex resets for
hangcheck as soon as it is ready as it would see immediate testing.
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 6d9944a00b7d..c7eb43deb33e 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1794,7 +1794,7 @@  static int i915_resume_switcheroo(struct drm_device *dev)
 }
 
 /**
- * i915_reset - reset chip after a hang
+ * i915_reset_chip - reset chip after a hang
  * @dev_priv: device private to reset
  *
  * Reset the chip.  Useful if a hang is detected. Marks the device as wedged
@@ -1810,7 +1810,7 @@  static int i915_resume_switcheroo(struct drm_device *dev)
  *   - re-init interrupt state
  *   - re-init display
  */
-void i915_reset(struct drm_i915_private *dev_priv)
+void i915_reset_chip(struct drm_i915_private *dev_priv)
 {
 	struct i915_gpu_error *error = &dev_priv->gpu_error;
 	int ret;
@@ -1821,6 +1821,8 @@  void i915_reset(struct drm_i915_private *dev_priv)
 	if (!test_bit(I915_RESET_HANDOFF, &error->flags))
 		return;
 
+	DRM_DEBUG_DRIVER("resetting chip\n");
+
 	/* Clear any previous failed attempts at recovery. Time to try again. */
 	if (!i915_gem_unset_wedged(dev_priv))
 		goto wakeup;
@@ -1884,6 +1886,49 @@  void i915_reset(struct drm_i915_private *dev_priv)
 	goto finish;
 }
 
+/**
+ * i915_reset_engine - reset GPU engine to recover from a hang
+ * @engine: engine to reset
+ *
+ * Reset a specific GPU engine. Useful if a hang is detected.
+ * Returns zero on successful reset or otherwise an error code.
+ */
+int i915_reset_engine(struct intel_engine_cs *engine)
+{
+	/* FIXME: replace me with engine reset sequence */
+	return -ENODEV;
+}
+
+/**
+ * i915_reset - start either engine or full GPU reset to recover from a hang
+ * @dev_priv: device private
+ * @engine_mask: mask representing engines that are hung
+ *
+ * Wrapper function to initiate a GPU reset. When platform supports it, attempt
+ * to reset the hung engine only. If engine reset fails (or is not supported),
+ * reset the full GPU. If more than one engine is hung, the speed gains of
+ * reset_engine are negligible, thus promote to full reset.
+ *
+ * Caller must hold the struct_mutex.
+ */
+void i915_reset(struct drm_i915_private *dev_priv, u32 engine_mask)
+{
+	/* try engine reset first */
+	if (intel_has_reset_engine(dev_priv) &&
+	    !(engine_mask & (engine_mask - 1))) {
+		struct intel_engine_cs *engine =
+			dev_priv->engine[intel_engineid_from_flag(engine_mask)];
+
+		if (i915_reset_engine(engine))
+			goto reset_chip;
+
+		return;
+	}
+
+reset_chip:
+	i915_reset_chip(dev_priv);
+}
+
 static int i915_pm_suspend(struct device *kdev)
 {
 	struct pci_dev *pdev = to_pci_dev(kdev);
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6f19f5467f74..b02393f33753 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -832,6 +832,7 @@  struct intel_csr {
 	func(has_ddi); \
 	func(has_decoupled_mmio); \
 	func(has_dp_mst); \
+	func(has_reset_engine); \
 	func(has_fbc); \
 	func(has_fpga_dbg); \
 	func(has_full_ppgtt); \
@@ -1636,6 +1637,9 @@  struct i915_gpu_error {
 #define I915_RESET_HANDOFF	1
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
+	/* if available, engine-specific reset is tried before full gpu reset */
+	u32 reset_engine_mask;
+
 	/**
 	 * Waitqueue to signal when a hang is detected. Used to for waiters
 	 * to release the struct_mutex for the reset to procede.
@@ -3038,7 +3042,8 @@  extern int i915_driver_load(struct pci_dev *pdev,
 extern void i915_driver_unload(struct drm_device *dev);
 extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
-extern void i915_reset(struct drm_i915_private *dev_priv);
+extern void i915_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
+extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
 extern void intel_hangcheck_init(struct drm_i915_private *dev_priv);
diff --git a/drivers/gpu/drm/i915/i915_gem_request.c b/drivers/gpu/drm/i915/i915_gem_request.c
index dbfa9db2419d..5ecb5a7a0c64 100644
--- a/drivers/gpu/drm/i915/i915_gem_request.c
+++ b/drivers/gpu/drm/i915/i915_gem_request.c
@@ -1016,7 +1016,8 @@  static bool __i915_wait_request_check_and_reset(struct drm_i915_gem_request *req
 		return false;
 
 	__set_current_state(TASK_RUNNING);
-	i915_reset(request->i915);
+	i915_reset(request->i915,
+		   request->i915->gpu_error.reset_engine_mask);
 	return true;
 }
 
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 3d2e6232f17a..87e76ef589b1 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2647,7 +2647,15 @@  static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 
 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
 
-	DRM_DEBUG_DRIVER("resetting chip\n");
+	/*
+	 * This event needs to be sent before performing gpu reset. When
+	 * engine resets are supported we iterate through all engines and
+	 * reset hung engines individually. To keep the event dispatch
+	 * mechanism consistent with full gpu reset, this is only sent once
+	 * even when multiple engines are hung. It is also safe to move this
+	 * here because when we are in this function, we will definitely
+	 * perform gpu reset.
+	 */
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
 
 	intel_prepare_reset(dev_priv);
@@ -2663,7 +2671,8 @@  static void i915_reset_and_wakeup(struct drm_i915_private *dev_priv)
 		 * deadlocks with the reset work.
 		 */
 		if (mutex_trylock(&dev_priv->drm.struct_mutex)) {
-			i915_reset(dev_priv);
+			i915_reset(dev_priv,
+				   dev_priv->gpu_error.reset_engine_mask);
 			mutex_unlock(&dev_priv->drm.struct_mutex);
 		}
 
@@ -2780,6 +2789,12 @@  void i915_handle_error(struct drm_i915_private *dev_priv,
 			     &dev_priv->gpu_error.flags))
 		goto out;
 
+	/*
+	 * Save which engines need reset; if engine support is available,
+	 * we can just reset the hung engines.
+	 */
+	dev_priv->gpu_error.reset_engine_mask = engine_mask;
+
 	i915_reset_and_wakeup(dev_priv);
 
 out:
diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c
index 732101ed57fb..da8b82acd055 100644
--- a/drivers/gpu/drm/i915/i915_pci.c
+++ b/drivers/gpu/drm/i915/i915_pci.c
@@ -308,7 +308,8 @@  static const struct intel_device_info intel_haswell_info = {
 	BDW_COLORS, \
 	.has_logical_ring_contexts = 1, \
 	.has_full_48bit_ppgtt = 1, \
-	.has_64bit_reloc = 1
+	.has_64bit_reloc = 1, \
+	.has_reset_engine = 1
 
 static const struct intel_device_info intel_broadwell_info = {
 	BDW_FEATURES,
@@ -340,6 +341,7 @@  static const struct intel_device_info intel_cherryview_info = {
 	.has_gmch_display = 1,
 	.has_aliasing_ppgtt = 1,
 	.has_full_ppgtt = 1,
+	.has_reset_engine = 1,
 	.display_mmio_offset = VLV_DISPLAY_BASE,
 	GEN_CHV_PIPEOFFSETS,
 	CURSOR_OFFSETS,
@@ -389,6 +391,7 @@  static const struct intel_device_info intel_skylake_gt3_info = {
 	.has_aliasing_ppgtt = 1, \
 	.has_full_ppgtt = 1, \
 	.has_full_48bit_ppgtt = 1, \
+	.has_reset_engine = 1, \
 	GEN_DEFAULT_PIPEOFFSETS, \
 	IVB_CURSOR_OFFSETS, \
 	BDW_COLORS
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h
index 2ecb41788fb6..e8faf2c34c97 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.h
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.h
@@ -438,6 +438,22 @@  intel_engine_flag(const struct intel_engine_cs *engine)
 	return 1 << engine->id;
 }
 
+/* works only for engine_mask with 1 engine bit set */
+static inline unsigned
+intel_engineid_from_flag(unsigned engine_mask)
+{
+	unsigned engine_id = 0;
+
+	GEM_BUG_ON(engine_mask & (engine_mask - 1));
+
+	while (engine_mask >>= 1)
+		engine_id++;
+
+	GEM_BUG_ON(engine_id >= I915_NUM_ENGINES);
+
+	return engine_id;
+}
+
 static inline void
 intel_flush_status_page(struct intel_engine_cs *engine, int reg)
 {
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 5f815a100a8d..be51b5a0689a 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1751,6 +1751,17 @@  bool intel_has_gpu_reset(struct drm_i915_private *dev_priv)
 	return intel_get_gpu_reset(dev_priv) != NULL;
 }
 
+/*
+ * When GuC submission is enabled, GuC manages ELSP and can initiate the
+ * engine reset too. For now, fall back to full GPU reset if it is enabled.
+ */
+bool intel_has_reset_engine(struct drm_i915_private *dev_priv)
+{
+	return (dev_priv->info.has_reset_engine &&
+		!dev_priv->guc.execbuf_client &&
+		i915.reset == 2);
+}
+
 int intel_guc_reset(struct drm_i915_private *dev_priv)
 {
 	int ret;
diff --git a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
index 6ec7c731a267..76d5f78c1724 100644
--- a/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
+++ b/drivers/gpu/drm/i915/selftests/intel_hangcheck.c
@@ -307,7 +307,7 @@  static int igt_global_reset(void *arg)
 	mutex_lock(&i915->drm.struct_mutex);
 	reset_count = i915_reset_count(&i915->gpu_error);
 
-	i915_reset(i915);
+	i915_reset(i915, ALL_ENGINES);
 
 	if (i915_reset_count(&i915->gpu_error) == reset_count) {
 		pr_err("No GPU reset recorded!\n");
@@ -472,7 +472,7 @@  static int igt_reset_queue(void *arg)
 
 			reset_count = fake_hangcheck(prev);
 
-			i915_reset(i915);
+			i915_reset(i915, ALL_ENGINES);
 
 			GEM_BUG_ON(test_bit(I915_RESET_HANDOFF,
 					    &i915->gpu_error.flags));