diff mbox series

[3/4] drm/i915/guc: Flag an error if an engine reset fails

Message ID 20211211065859.2248188-4-John.C.Harrison@Intel.com (mailing list archive)
State New, archived
Headers show
Series More fixes/tweaks to GuC support | expand

Commit Message

John Harrison Dec. 11, 2021, 6:58 a.m. UTC
From: John Harrison <John.C.Harrison@Intel.com>

If GuC encounters an error during engine reset, the i915 driver
promotes to full GT reset. This includes an info message about why the
reset is happening. However, that is not treated as a failure by any
of the CI systems because resets are an expected occurrance during
testing. This kind of failure is a major problem and should never
happen. So, complain more loudly and make sure CI notices.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

Comments

Daniele Ceraolo Spurio Dec. 16, 2021, 1:16 a.m. UTC | #1
On 12/10/2021 10:58 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> If GuC encounters an error during engine reset, the i915 driver
> promotes to full GT reset. This includes an info message about why the
> reset is happening. However, that is not treated as a failure by any
> of the CI systems because resets are an expected occurrance during
> testing. This kind of failure is a major problem and should never
> happen. So, complain more loudly and make sure CI notices.
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 14 +++++++++++---
>   1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 97311119da6f..6015815f1da0 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4018,11 +4018,12 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   					 const u32 *msg, u32 len)
>   {
>   	struct intel_engine_cs *engine;
> +	struct intel_gt *gt = guc_to_gt(guc);
>   	u8 guc_class, instance;
>   	u32 reason;
>   
>   	if (unlikely(len != 3)) {
> -		drm_err(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len);
> +		drm_err(&gt->i915->drm, "Invalid length %u", len);
>   		return -EPROTO;
>   	}
>   
> @@ -4032,12 +4033,19 @@ int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>   
>   	engine = guc_lookup_engine(guc, guc_class, instance);
>   	if (unlikely(!engine)) {
> -		drm_err(&guc_to_gt(guc)->i915->drm,
> +		drm_err(&gt->i915->drm,
>   			"Invalid engine %d:%d", guc_class, instance);
>   		return -EPROTO;
>   	}
>   
> -	intel_gt_handle_error(guc_to_gt(guc), engine->mask,
> +	/*
> +	 * This is an unexpected failure of a hardware feature. So, log a real
> +	 * error message not just the informational that comes with the reset.
> +	 */
> +	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because %d",

In the error handling called below, the reason is logged as 0x%08x, so 
IMO we should do the same here for consistency.
With that:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> +		guc_class, instance, engine->name, reason);
> +
> +	intel_gt_handle_error(gt, engine->mask,
>   			      I915_ERROR_CAPTURE,
>   			      "GuC failed to reset %s (reason=0x%08x)\n",
>   			      engine->name, reason);
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 97311119da6f..6015815f1da0 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4018,11 +4018,12 @@  int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 					 const u32 *msg, u32 len)
 {
 	struct intel_engine_cs *engine;
+	struct intel_gt *gt = guc_to_gt(guc);
 	u8 guc_class, instance;
 	u32 reason;
 
 	if (unlikely(len != 3)) {
-		drm_err(&guc_to_gt(guc)->i915->drm, "Invalid length %u", len);
+		drm_err(&gt->i915->drm, "Invalid length %u", len);
 		return -EPROTO;
 	}
 
@@ -4032,12 +4033,19 @@  int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
 
 	engine = guc_lookup_engine(guc, guc_class, instance);
 	if (unlikely(!engine)) {
-		drm_err(&guc_to_gt(guc)->i915->drm,
+		drm_err(&gt->i915->drm,
 			"Invalid engine %d:%d", guc_class, instance);
 		return -EPROTO;
 	}
 
-	intel_gt_handle_error(guc_to_gt(guc), engine->mask,
+	/*
+	 * This is an unexpected failure of a hardware feature. So, log a real
+	 * error message not just the informational that comes with the reset.
+	 */
+	drm_err(&gt->i915->drm, "GuC engine reset request failed on %d:%d (%s) because %d",
+		guc_class, instance, engine->name, reason);
+
+	intel_gt_handle_error(gt, engine->mask,
 			      I915_ERROR_CAPTURE,
 			      "GuC failed to reset %s (reason=0x%08x)\n",
 			      engine->name, reason);