diff mbox

[07/11] drm/i915/tdr: Add support for per engine reset recovery

Message ID 1469551257-26803-8-git-send-email-arun.siluvery@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

arun.siluvery@linux.intel.com July 26, 2016, 4:40 p.m. UTC
This change implements support for per-engine reset as an initial, less
intrusive hang recovery option to be attempted before falling back to the
legacy full GPU reset recovery mode if necessary. This is only supported
from Gen8 onwards.

Hangchecker determines which engines are hung and invokes error handler to
recover from it. Error handler schedules recovery for each of those engines
that are hung. The recovery procedure is as follows,
 - force engine to idle: this is done by issuing a reset request
 - identifies the request that caused the hang and it is dropped
 - reset and re-init engine
 - restart submissions to the engine

If engine reset fails then we fall back to heavy weight full gpu reset
which resets all engines and reinitiazes complete state of HW and SW.

Possible reasons for failure,
 - engine not ready for reset
 - if the HW and SW doesn't agree on the context that caused the hang
 - reset itself failed for some reason

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Tomas Elf <tomas.elf@intel.com>
Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.c     | 55 +++++++++++++++++++++++++++++++++----
 drivers/gpu/drm/i915/i915_drv.h     |  3 ++
 drivers/gpu/drm/i915/i915_gem.c     |  2 +-
 drivers/gpu/drm/i915/intel_lrc.h    |  4 +++
 drivers/gpu/drm/i915/intel_uncore.c | 33 ++++++++++++++++++++++
 5 files changed, 90 insertions(+), 7 deletions(-)

Comments

Chris Wilson July 26, 2016, 9:51 p.m. UTC | #1
On Tue, Jul 26, 2016 at 05:40:53PM +0100, Arun Siluvery wrote:
> This change implements support for per-engine reset as an initial, less
> intrusive hang recovery option to be attempted before falling back to the
> legacy full GPU reset recovery mode if necessary. This is only supported
> from Gen8 onwards.
> 
> Hangchecker determines which engines are hung and invokes error handler to
> recover from it. Error handler schedules recovery for each of those engines
> that are hung. The recovery procedure is as follows,
>  - force engine to idle: this is done by issuing a reset request
>  - identifies the request that caused the hang and it is dropped
>  - reset and re-init engine
>  - restart submissions to the engine
> 
> If engine reset fails then we fall back to heavy weight full gpu reset
> which resets all engines and reinitiazes complete state of HW and SW.
> 
> Possible reasons for failure,
>  - engine not ready for reset
>  - if the HW and SW doesn't agree on the context that caused the hang
>  - reset itself failed for some reason
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_drv.c     | 55 +++++++++++++++++++++++++++++++++----
>  drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>  drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>  drivers/gpu/drm/i915/intel_lrc.h    |  4 +++
>  drivers/gpu/drm/i915/intel_uncore.c | 33 ++++++++++++++++++++++
>  5 files changed, 90 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
> index 5c20e5d..8151aa9 100644
> --- a/drivers/gpu/drm/i915/i915_drv.c
> +++ b/drivers/gpu/drm/i915/i915_drv.c
> @@ -1824,17 +1824,60 @@ error:
>   * Returns zero on successful reset or otherwise an error code.
>   *
>   * Procedure is fairly simple:
> - *  - force engine to idle
> - *  - save current state which includes head and current request
> - *  - reset engine
> - *  - restore saved state and resubmit context
> + *    - force engine to idle: this is done by issuing a reset request
> + *    - identifies the request that caused the hang and it is retired
> + *    - reset engine
> + *    - restart submissions to the engine
>   */
>  int i915_reset_engine(struct intel_engine_cs *engine)
>  {
> +	struct drm_i915_private *dev_priv = engine->i915;
>  	int ret;
>  
> -	/* FIXME: replace me with engine reset sequence */
> -	ret = -ENODEV;
> +	/* Ensure irq handler finishes or is cancelled. */
> +	tasklet_kill(&engine->irq_tasklet);
> +
> +	i915_gem_reset_engine_status(engine);

Not yet safe to be used lockless.

> +	/*Take wake lock to prevent power saving mode */

Why else would you be taking it? We know the wakelock prevents the GT
power well disappearing, the question is why is that important now
across the entirety of this sequence? That's what you explain in
comments.

> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
> +
> +	ret = intel_request_for_reset(engine);
> +	if (ret) {
> +		DRM_ERROR("Failed to disable %s\n", engine->name);
> +		goto out;
> +	}
> +
> +	ret = intel_execlists_reset_prepare(engine);
> +	if (ret)
> +		goto enable_engine;
> +
> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
> +	if (ret) {
> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
> +		goto enable_engine;
> +	}
> +
> +	ret = engine->init_hw(engine);
> +	if (ret)
> +		goto out;
> +
> +	/* Restart submissions to the engine after reset */
> +	intel_execlists_restart_submission(engine);

Clumsy. See above as we already have an entry point that can initiate
the hw.

> +enable_engine:
> +	/*
> +	 * we only need to enable engine if we cannot prepare engine for
> +	 * reset or reset fails. If the reset is successful, engine gets
> +	 * enabled automatically so we can skip this step.
> +	 */
> +	if (ret)

Then why is the normal path coming through here?

> +		intel_clear_reset_request(engine);
> +
> +out:
> +	/* Wake up anything waiting on this engine's queue */
> +	intel_engine_wakeup(engine);

? Random call with incorrect locking. What do you think you are
achieving here because it's not what the comment implies.

> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>  
>  	return ret;
>  }
> +/*
> + * On gen8+ a reset request has to be issued via the reset control register
> + * before a GPU engine can be reset in order to stop the command streamer
> + * and idle the engine. This replaces the legacy way of stopping an engine
> + * by writing to the stop ring bit in the MI_MODE register.
> + */
> +int intel_request_for_reset(struct intel_engine_cs *engine)

intel_engine_reset_begin()

It would be preferrable if we can avoid using request here as we are
talking about reseting the hung request elsewhere in this chain.

> +{
> +	if (!intel_has_engine_reset(engine->i915)) {
> +		DRM_ERROR("Engine Reset not supported on Gen%d\n",
> +			  INTEL_INFO(engine->i915)->gen);

ERROR?

> +		return -EINVAL;
return -ENODEV;

> +	}
> +
> +	return gen8_request_engine_reset(engine);
> +}

gen8_engine_reset_begin()

> +/*
> + * It is possible to back off from a previously issued reset request by simply
> + * clearing the reset request bit in the reset control register.
> + */
> +int intel_clear_reset_request(struct intel_engine_cs *engine)

intel_engine_reset_cancel()

> +{
> +{
> +	if (!intel_has_engine_reset(engine->i915)) {
> +		DRM_ERROR("Request to clear reset not supported on Gen%d\n",
> +			  INTEL_INFO(engine->i915)->gen);
> +		return -EINVAL;
> +	}
> +
> +	gen8_unrequest_engine_reset(engine);

gen8_engine_reset_cancel()

(only suggestions for avoiding having confusion over requests)
arun.siluvery@linux.intel.com July 27, 2016, 11:49 a.m. UTC | #2
On 26/07/2016 22:51, Chris Wilson wrote:
> On Tue, Jul 26, 2016 at 05:40:53PM +0100, Arun Siluvery wrote:
>> This change implements support for per-engine reset as an initial, less
>> intrusive hang recovery option to be attempted before falling back to the
>> legacy full GPU reset recovery mode if necessary. This is only supported
>> from Gen8 onwards.
>>
>> Hangchecker determines which engines are hung and invokes error handler to
>> recover from it. Error handler schedules recovery for each of those engines
>> that are hung. The recovery procedure is as follows,
>>   - force engine to idle: this is done by issuing a reset request
>>   - identifies the request that caused the hang and it is dropped
>>   - reset and re-init engine
>>   - restart submissions to the engine
>>
>> If engine reset fails then we fall back to heavy weight full gpu reset
>> which resets all engines and reinitiazes complete state of HW and SW.
>>
>> Possible reasons for failure,
>>   - engine not ready for reset
>>   - if the HW and SW doesn't agree on the context that caused the hang
>>   - reset itself failed for some reason
>>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>> Signed-off-by: Tomas Elf <tomas.elf@intel.com>
>> Signed-off-by: Arun Siluvery <arun.siluvery@linux.intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_drv.c     | 55 +++++++++++++++++++++++++++++++++----
>>   drivers/gpu/drm/i915/i915_drv.h     |  3 ++
>>   drivers/gpu/drm/i915/i915_gem.c     |  2 +-
>>   drivers/gpu/drm/i915/intel_lrc.h    |  4 +++
>>   drivers/gpu/drm/i915/intel_uncore.c | 33 ++++++++++++++++++++++
>>   5 files changed, 90 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
>> index 5c20e5d..8151aa9 100644
>> --- a/drivers/gpu/drm/i915/i915_drv.c
>> +++ b/drivers/gpu/drm/i915/i915_drv.c
>> @@ -1824,17 +1824,60 @@ error:
>>    * Returns zero on successful reset or otherwise an error code.
>>    *
>>    * Procedure is fairly simple:
>> - *  - force engine to idle
>> - *  - save current state which includes head and current request
>> - *  - reset engine
>> - *  - restore saved state and resubmit context
>> + *    - force engine to idle: this is done by issuing a reset request
>> + *    - identifies the request that caused the hang and it is retired
>> + *    - reset engine
>> + *    - restart submissions to the engine
>>    */
>>   int i915_reset_engine(struct intel_engine_cs *engine)
>>   {
>> +	struct drm_i915_private *dev_priv = engine->i915;
>>   	int ret;
>>
>> -	/* FIXME: replace me with engine reset sequence */
>> -	ret = -ENODEV;
>> +	/* Ensure irq handler finishes or is cancelled. */
>> +	tasklet_kill(&engine->irq_tasklet);
>> +
>> +	i915_gem_reset_engine_status(engine);
>
> Not yet safe to be used lockless.
>
engine is hung and intel_lrc_irq_handler() also won't be running at this 
point and we should've received all seqno updates, is it still not safe? 
do I really need struct_mutex for this? only caller is i915_gem_reset() 
and it takes it.

>> +	/*Take wake lock to prevent power saving mode */
>
> Why else would you be taking it? We know the wakelock prevents the GT
> power well disappearing, the question is why is that important now
> across the entirety of this sequence? That's what you explain in
> comments.
>
yes, not useful. It is carried over from initial patch but I agree I 
should've updated this.

>> +	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
>> +
>> +	ret = intel_request_for_reset(engine);
>> +	if (ret) {
>> +		DRM_ERROR("Failed to disable %s\n", engine->name);
>> +		goto out;
>> +	}
>> +
>> +	ret = intel_execlists_reset_prepare(engine);
>> +	if (ret)
>> +		goto enable_engine;
>> +
>> +	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
>> +	if (ret) {
>> +		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
>> +		goto enable_engine;
>> +	}
>> +
>> +	ret = engine->init_hw(engine);
>> +	if (ret)
>> +		goto out;
>> +
>> +	/* Restart submissions to the engine after reset */
>> +	intel_execlists_restart_submission(engine);
>
> Clumsy. See above as we already have an entry point that can initiate
> the hw.
>
I think this is a separate step from init_hw().
In init_hw() we initialize hw state and keep it ready then we feed 
submissions.

>> +enable_engine:
>> +	/*
>> +	 * we only need to enable engine if we cannot prepare engine for
>> +	 * reset or reset fails. If the reset is successful, engine gets
>> +	 * enabled automatically so we can skip this step.
>> +	 */
>> +	if (ret)
>
> Then why is the normal path coming through here?
>
>> +		intel_clear_reset_request(engine);
>> +
there is no harm in doing this even when reset is successful, hence 
included in the normal path. I will remove ret check to avoid confusion.

>> +out:
>> +	/* Wake up anything waiting on this engine's queue */
>> +	intel_engine_wakeup(engine);
>
> ? Random call with incorrect locking. What do you think you are
> achieving here because it's not what the comment implies.

sorry, I will come back on this one, I need to better understand few 
more things.
>
>> +	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
>>
>>   	return ret;
>>   }
>> +/*
>> + * On gen8+ a reset request has to be issued via the reset control register
>> + * before a GPU engine can be reset in order to stop the command streamer
>> + * and idle the engine. This replaces the legacy way of stopping an engine
>> + * by writing to the stop ring bit in the MI_MODE register.
>> + */
>> +int intel_request_for_reset(struct intel_engine_cs *engine)
>
> intel_engine_reset_begin()
>
> It would be preferrable if we can avoid using request here as we are
> talking about reseting the hung request elsewhere in this chain.
>
the term request is overloaded but here infact we are actually 
requesting the hw to prepare for reset and wait for it to acknowledge.

intel_engine_reset_begin() sounds fine.

>> +{
>> +	if (!intel_has_engine_reset(engine->i915)) {
>> +		DRM_ERROR("Engine Reset not supported on Gen%d\n",
>> +			  INTEL_INFO(engine->i915)->gen);
>
> ERROR?
This msg is probably unnecessary as we go for engine reset only when it 
is available.
>
>> +		return -EINVAL;
> return -ENODEV;

ok.

>
>> +	}
>> +
>> +	return gen8_request_engine_reset(engine);
>> +}
>
> gen8_engine_reset_begin()
>
>> +/*
>> + * It is possible to back off from a previously issued reset request by simply
>> + * clearing the reset request bit in the reset control register.
>> + */
>> +int intel_clear_reset_request(struct intel_engine_cs *engine)
>
> intel_engine_reset_cancel()
>
>> +{
>> +{
>> +	if (!intel_has_engine_reset(engine->i915)) {
>> +		DRM_ERROR("Request to clear reset not supported on Gen%d\n",
>> +			  INTEL_INFO(engine->i915)->gen);
>> +		return -EINVAL;
>> +	}
>> +
>> +	gen8_unrequest_engine_reset(engine);
>
> gen8_engine_reset_cancel()
>
> (only suggestions for avoiding having confusion over requests)

ok, thanks for the renaming suggestions.

regards
Arun

>
>
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 5c20e5d..8151aa9 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -1824,17 +1824,60 @@  error:
  * Returns zero on successful reset or otherwise an error code.
  *
  * Procedure is fairly simple:
- *  - force engine to idle
- *  - save current state which includes head and current request
- *  - reset engine
- *  - restore saved state and resubmit context
+ *    - force engine to idle: this is done by issuing a reset request
+ *    - identifies the request that caused the hang and it is retired
+ *    - reset engine
+ *    - restart submissions to the engine
  */
 int i915_reset_engine(struct intel_engine_cs *engine)
 {
+	struct drm_i915_private *dev_priv = engine->i915;
 	int ret;
 
-	/* FIXME: replace me with engine reset sequence */
-	ret = -ENODEV;
+	/* Ensure irq handler finishes or is cancelled. */
+	tasklet_kill(&engine->irq_tasklet);
+
+	i915_gem_reset_engine_status(engine);
+
+	/*Take wake lock to prevent power saving mode */
+	intel_uncore_forcewake_get(dev_priv, FORCEWAKE_ALL);
+
+	ret = intel_request_for_reset(engine);
+	if (ret) {
+		DRM_ERROR("Failed to disable %s\n", engine->name);
+		goto out;
+	}
+
+	ret = intel_execlists_reset_prepare(engine);
+	if (ret)
+		goto enable_engine;
+
+	ret = intel_gpu_reset(dev_priv, intel_engine_flag(engine));
+	if (ret) {
+		DRM_ERROR("Failed to reset %s, ret=%d\n", engine->name, ret);
+		goto enable_engine;
+	}
+
+	ret = engine->init_hw(engine);
+	if (ret)
+		goto out;
+
+	/* Restart submissions to the engine after reset */
+	intel_execlists_restart_submission(engine);
+
+enable_engine:
+	/*
+	 * we only need to enable engine if we cannot prepare engine for
+	 * reset or reset fails. If the reset is successful, engine gets
+	 * enabled automatically so we can skip this step.
+	 */
+	if (ret)
+		intel_clear_reset_request(engine);
+
+out:
+	/* Wake up anything waiting on this engine's queue */
+	intel_engine_wakeup(engine);
+	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
 
 	return ret;
 }
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 61e57a8b..ac7a42c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2837,6 +2837,8 @@  extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask);
 extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv);
 extern int i915_reset(struct drm_i915_private *dev_priv);
 extern bool intel_has_engine_reset(struct drm_i915_private *dev_priv);
+extern int intel_request_for_reset(struct intel_engine_cs *engine);
+extern int intel_clear_reset_request(struct intel_engine_cs *engine);
 extern int i915_reset_engine(struct intel_engine_cs *engine);
 extern int intel_guc_reset(struct drm_i915_private *dev_priv);
 extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine);
@@ -3231,6 +3233,7 @@  static inline bool i915_engine_reset_in_progress(struct i915_gpu_error *error,
 }
 
 void i915_gem_reset(struct drm_device *dev);
+void i915_gem_reset_engine_status(struct intel_engine_cs *ring);
 bool i915_gem_clflush_object(struct drm_i915_gem_object *obj, bool force);
 int __must_check i915_gem_init(struct drm_device *dev);
 int __must_check i915_gem_init_hw(struct drm_device *dev);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 443570f..e34effb 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2468,7 +2468,7 @@  i915_gem_find_active_request(struct intel_engine_cs *engine)
 	return NULL;
 }
 
-static void i915_gem_reset_engine_status(struct intel_engine_cs *engine)
+void i915_gem_reset_engine_status(struct intel_engine_cs *engine)
 {
 	struct drm_i915_gem_request *request;
 	bool ring_hung;
diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h
index 1171ea1..4035f63 100644
--- a/drivers/gpu/drm/i915/intel_lrc.h
+++ b/drivers/gpu/drm/i915/intel_lrc.h
@@ -132,4 +132,8 @@  int intel_execlists_submission(struct i915_execbuffer_params *params,
 
 void intel_execlists_cancel_requests(struct intel_engine_cs *engine);
 
+/* Engine reset */
+int intel_execlists_reset_prepare(struct intel_engine_cs *engine);
+void intel_execlists_restart_submission(struct intel_engine_cs *engine);
+
 #endif /* _INTEL_LRC_H_ */
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 418fd0d..d807871 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -1799,6 +1799,39 @@  int intel_guc_reset(struct drm_i915_private *dev_priv)
 	return ret;
 }
 
+/*
+ * On gen8+ a reset request has to be issued via the reset control register
+ * before a GPU engine can be reset in order to stop the command streamer
+ * and idle the engine. This replaces the legacy way of stopping an engine
+ * by writing to the stop ring bit in the MI_MODE register.
+ */
+int intel_request_for_reset(struct intel_engine_cs *engine)
+{
+	if (!intel_has_engine_reset(engine->i915)) {
+		DRM_ERROR("Engine Reset not supported on Gen%d\n",
+			  INTEL_INFO(engine->i915)->gen);
+		return -EINVAL;
+	}
+
+	return gen8_request_engine_reset(engine);
+}
+
+/*
+ * It is possible to back off from a previously issued reset request by simply
+ * clearing the reset request bit in the reset control register.
+ */
+int intel_clear_reset_request(struct intel_engine_cs *engine)
+{
+	if (!intel_has_engine_reset(engine->i915)) {
+		DRM_ERROR("Request to clear reset not supported on Gen%d\n",
+			  INTEL_INFO(engine->i915)->gen);
+		return -EINVAL;
+	}
+
+	gen8_unrequest_engine_reset(engine);
+	return 0;
+}
+
 bool intel_uncore_unclaimed_mmio(struct drm_i915_private *dev_priv)
 {
 	return check_for_unclaimed_mmio(dev_priv);