diff mbox series

drm/i915/reset: Handle reset timeouts under unrelated kernel hangs

Message ID 20220628191741.28866-1-ashutosh.dixit@intel.com (mailing list archive)
State New, archived
Headers show
Series drm/i915/reset: Handle reset timeouts under unrelated kernel hangs | expand

Commit Message

Dixit, Ashutosh June 28, 2022, 7:17 p.m. UTC
From: Chris Wilson <chris@chris-wilson.co.uk>

When resuming after hibernate sometimes we see hangs in unrelated kernel
subsystems. These hangs often result in the following i915 trace:

i915 0000:00:02.0: [drm] \
	*ERROR* intel_gt_reset_global timed out, cancelling all in-flight rendering.

implying our reset task has been starved by the hanging kernel subsystem,
causing us to inappropiately declare the system as wedged beyond recovery.

The trace would be caused by our synchronize_srcu_expedited() taking more
than the allowed 5s due to the unrelated kernel hang. But we neither need
to perform that synchronisation inside the reset watchdog, nor do we need
such a short timeout before declaring the device as unrecoverable.

Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/3575
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

Comments

Dixit, Ashutosh June 29, 2022, 5:35 a.m. UTC | #1
On Tue, 28 Jun 2022 12:17:41 -0700, Ashutosh Dixit wrote:
>
> From: Chris Wilson <chris@chris-wilson.co.uk>
>
> When resuming after hibernate sometimes we see hangs in unrelated kernel
> subsystems. These hangs often result in the following i915 trace:
>
> i915 0000:00:02.0: [drm] \
>	*ERROR* intel_gt_reset_global timed out, cancelling all in-flight rendering.
>
> implying our reset task has been starved by the hanging kernel subsystem,
> causing us to inappropiately declare the system as wedged beyond recovery.
>
> The trace would be caused by our synchronize_srcu_expedited() taking more
> than the allowed 5s due to the unrelated kernel hang. But we neither need
> to perform that synchronisation inside the reset watchdog, nor do we need
> such a short timeout before declaring the device as unrecoverable.
>
> Bug: https://gitlab.freedesktop.org/drm/intel/-/issues/3575
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index a5338c3fde7a0..e72744f6faedc 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -1259,12 +1259,9 @@ static void intel_gt_reset_global(struct intel_gt *gt,
>	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
>
>	/* Use a watchdog to ensure that our reset completes */
> -	intel_wedge_on_timeout(&w, gt, 5 * HZ) {
> +	intel_wedge_on_timeout(&w, gt, 60 * HZ) {

How about we take one step at a time so if we are moving
synchronize_srcu_expedited() out of the reset watchdog, we leave the
timeout to the previous 5s? With the original timeout restored this patch
is:

Reviewed-by: Ashutosh Dixit <ashutosh.dixit@intel.com>

>		intel_display_prepare_reset(gt->i915);
>
> -		/* Flush everyone using a resource about to be clobbered */
> -		synchronize_srcu_expedited(&gt->reset.backoff_srcu);
> -
>		intel_gt_reset(gt, engine_mask, reason);
>
>		intel_display_finish_reset(gt->i915);
> @@ -1373,6 +1370,9 @@ void intel_gt_handle_error(struct intel_gt *gt,
>		}
>	}
>
> +	/* Flush everyone using a resource about to be clobbered */
> +	synchronize_srcu_expedited(&gt->reset.backoff_srcu);
> +
>	intel_gt_reset_global(gt, engine_mask, msg);
>
>	if (!intel_uc_uses_guc_submission(&gt->uc)) {
> --
> 2.36.1
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index a5338c3fde7a0..e72744f6faedc 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1259,12 +1259,9 @@  static void intel_gt_reset_global(struct intel_gt *gt,
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
 
 	/* Use a watchdog to ensure that our reset completes */
-	intel_wedge_on_timeout(&w, gt, 5 * HZ) {
+	intel_wedge_on_timeout(&w, gt, 60 * HZ) {
 		intel_display_prepare_reset(gt->i915);
 
-		/* Flush everyone using a resource about to be clobbered */
-		synchronize_srcu_expedited(&gt->reset.backoff_srcu);
-
 		intel_gt_reset(gt, engine_mask, reason);
 
 		intel_display_finish_reset(gt->i915);
@@ -1373,6 +1370,9 @@  void intel_gt_handle_error(struct intel_gt *gt,
 		}
 	}
 
+	/* Flush everyone using a resource about to be clobbered */
+	synchronize_srcu_expedited(&gt->reset.backoff_srcu);
+
 	intel_gt_reset_global(gt, engine_mask, msg);
 
 	if (!intel_uc_uses_guc_submission(&gt->uc)) {