diff mbox series

drm/i915: Set wedged if enable guc communication failed

Message ID 20230302215020.1307608-1-zhanjun.dong@intel.com (mailing list archive)
State New, archived
Headers show
Series drm/i915: Set wedged if enable guc communication failed | expand

Commit Message

Dong, Zhanjun March 2, 2023, 9:50 p.m. UTC
Add err code check for enable_communication on resume path. When resume failed, we can no longer use the GPU, marking the GPU as wedged.

Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_gt_pm.c | 7 ++++++-
 drivers/gpu/drm/i915/gt/uc/intel_uc.c | 9 +++++++--
 2 files changed, 13 insertions(+), 3 deletions(-)

Comments

Daniele Ceraolo Spurio March 4, 2023, 12:10 a.m. UTC | #1
On 3/2/2023 1:50 PM, Zhanjun Dong wrote:
> Add err code check for enable_communication on resume path. When resume failed, we can no longer use the GPU, marking the GPU as wedged.

This is slightly incorrect. If we fail to enable communication, the 
consequence is that we can't use the GuC. That translates to a failure 
to use the GT only if GuC submission is enabled; in execlists submission 
mode we can keep going, although we might end up losing HuC 
functionality (which is not considered a fatal error). Therefore, the 
code below should be updated to reflect this.

>
> Signed-off-by: Zhanjun Dong <zhanjun.dong@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_gt_pm.c | 7 ++++++-
>   drivers/gpu/drm/i915/gt/uc/intel_uc.c | 9 +++++++--
>   2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> index cef3d6f5c34e..42a7d75ce39e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
> @@ -401,8 +401,13 @@ int intel_gt_runtime_resume(struct intel_gt *gt)
>   	intel_ggtt_restore_fences(gt->ggtt);
>   
>   	ret = intel_uc_runtime_resume(&gt->uc);
> -	if (ret)
> +	if (ret) {
> +		/* Resume failed, we can no longer use the GPU, marking the GPU
> +		 * as wedged.
> +		 */
> +		intel_gt_set_wedged(gt);

intel_gt_set_wedged calls intel_runtime_pm_get, so it will deadlock if 
you call if from within the runtime_resume flow.

>   		return ret;
> +	}
>   
>   	return 0;
>   }
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> index 6648691bd645..d4f428acf20a 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
> @@ -698,8 +698,13 @@ static int __uc_resume(struct intel_uc *uc, bool enable_communication)
>   	/* Make sure we enable communication if and only if it's disabled */
>   	GEM_BUG_ON(enable_communication == intel_guc_ct_enabled(&guc->ct));
>   
> -	if (enable_communication)
> -		guc_enable_communication(guc);
> +	if (enable_communication) {
> +		err = guc_enable_communication(guc);
> +		if (err) {
> +			guc_dbg(guc, "Failed to resume, %pe", ERR_PTR(err));

nit: this isn't exactly a failure to resume because the GuC is running. 
It's more a failure to re-establish communication with the GuC.

Daniele

> +			return err;
> +		}
> +	}
>   
>   	/* If we are only resuming GuC communication but not reloading
>   	 * GuC, we need to ensure the ARAT timer interrupt is enabled
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index cef3d6f5c34e..42a7d75ce39e 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -401,8 +401,13 @@  int intel_gt_runtime_resume(struct intel_gt *gt)
 	intel_ggtt_restore_fences(gt->ggtt);
 
 	ret = intel_uc_runtime_resume(&gt->uc);
-	if (ret)
+	if (ret) {
+		/* Resume failed, we can no longer use the GPU, marking the GPU
+		 * as wedged.
+		 */
+		intel_gt_set_wedged(gt);
 		return ret;
+	}
 
 	return 0;
 }
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_uc.c b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
index 6648691bd645..d4f428acf20a 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_uc.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_uc.c
@@ -698,8 +698,13 @@  static int __uc_resume(struct intel_uc *uc, bool enable_communication)
 	/* Make sure we enable communication if and only if it's disabled */
 	GEM_BUG_ON(enable_communication == intel_guc_ct_enabled(&guc->ct));
 
-	if (enable_communication)
-		guc_enable_communication(guc);
+	if (enable_communication) {
+		err = guc_enable_communication(guc);
+		if (err) {
+			guc_dbg(guc, "Failed to resume, %pe", ERR_PTR(err));
+			return err;
+		}
+	}
 
 	/* If we are only resuming GuC communication but not reloading
 	 * GuC, we need to ensure the ARAT timer interrupt is enabled