[v8,2/2] drm/i915/guc: Close deregister-context race against CT-loss

Message ID 20231212165716.57493-3-alan.previn.teres.alexis@intel.com (mailing list archive)
State New, archived
Series Resolve suspend-resume racing with GuC destroy-context-worker

Commit Message

Teres Alexis, Alan Previn Dec. 12, 2023, 4:57 p.m. UTC
If we are at the end of suspend or very early in resume,
it's possible that an async fence signal (via call_rcu) triggers
free_engines, which could lead to execution of the context
destruction worker (after a prior worker flush).

Thus, when suspending, insert rcu_barriers at the start
of i915_gem_suspend (part of the driver's suspend prepare) and
again in i915_gem_suspend_late so that all such cases have
completed and the context destruction list isn't missing anything.
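
As an illustrative sketch (simplified; the free_engines_rcu hookup
is paraphrased from the i915 context code, not part of this patch),
the barrier must come before the workqueue flush because the RCU
callback is what queues the destroy work - a flush alone can run
before a still-pending callback has queued anything:

   /* fence-signal path: engines are freed asynchronously via RCU */
   call_rcu(&engines->rcu, free_engines_rcu);

   /* i915_gem_suspend(): drain RCU callbacks, then flush the wq */
   rcu_barrier();              /* waits for all pending call_rcu() callbacks */
   flush_workqueue(i915->wq);  /* now sees any work those callbacks queued */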

In destroyed_worker_func, close the race against CT-loss
by checking that CT is enabled before calling into
deregister_destroyed_contexts.

Based on testing, guc_lrc_desc_unpin may still race and fail
as we traverse the GuC's context-destroy list because the
CT could be disabled right before calling GuC's CT send function.

We've witnessed this race condition once every ~6000-8000
suspend-resume cycles while ensuring that workloads which render
something onscreen are continuously started just before
we suspend (and the workload is small enough to complete
and trigger the queued engine/context free-up either very
late in suspend or very early in resume).

In such a case, we need to unroll the entire process because
guc_lrc_desc_unpin takes a gt wakeref which only gets released in
the G2H IRQ reply that never comes through in this corner
case. Without the unroll, the taken wakeref is leaked and will
cascade into a kernel hang later at the tail end of suspend in
this function:

   intel_wakeref_wait_for_idle(&gt->wakeref)
   (called by) - intel_gt_pm_wait_for_idle
   (called by) - wait_for_suspend
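
Sketched below (names simplified; exactly where the G2H handler
drops the reference is assumed for illustration) is the pairing
that leaks when the reply never arrives:

   /* guc_lrc_desc_unpin(): */
   __intel_gt_pm_get(gt);                  /* gt wakeref taken           */
   deregister_context(ce, ce->guc_id.id);  /* H2G request sent to GuC    */

   /* G2H "deregister done" reply handler - never runs if CT is lost: */
   intel_gt_pm_put_async(gt);              /* wakeref released only here */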

Thus, do an unroll in guc_lrc_desc_unpin and
deregister_destroyed_contexts if guc_lrc_desc_unpin fails due to a
CT send failure. When unrolling, keep the context in the GuC's
destroy-list so it can get picked up on the next destroy worker
invocation (if suspend is aborted) or get fully purged as part of a
GuC sanitization (end of suspend) or a reset flow.

Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com>
Tested-by: Mousumi Jana <mousumi.jana@intel.com>
Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 drivers/gpu/drm/i915/gem/i915_gem_pm.c        | 10 +++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 73 +++++++++++++++++--
 2 files changed, 78 insertions(+), 5 deletions(-)

Comments

Rodrigo Vivi Dec. 13, 2023, 9:23 p.m. UTC | #1
On Tue, Dec 12, 2023 at 08:57:16AM -0800, Alan Previn wrote:
> If we are at the end of suspend or very early in resume,
> it's possible that an async fence signal (via call_rcu) triggers
> free_engines, which could lead to execution of the context
> destruction worker (after a prior worker flush).
> 
> Thus, when suspending, insert rcu_barriers at the start
> of i915_gem_suspend (part of the driver's suspend prepare) and
> again in i915_gem_suspend_late so that all such cases have
> completed and the context destruction list isn't missing anything.
> 
> In destroyed_worker_func, close the race against CT-loss
> by checking that CT is enabled before calling into
> deregister_destroyed_contexts.
> 
> Based on testing, guc_lrc_desc_unpin may still race and fail
> as we traverse the GuC's context-destroy list because the
> CT could be disabled right before calling GuC's CT send function.
> 
> We've witnessed this race condition once every ~6000-8000
> suspend-resume cycles while ensuring that workloads which render
> something onscreen are continuously started just before
> we suspend (and the workload is small enough to complete
> and trigger the queued engine/context free-up either very
> late in suspend or very early in resume).
> 
> In such a case, we need to unroll the entire process because
> guc_lrc_desc_unpin takes a gt wakeref which only gets released in
> the G2H IRQ reply that never comes through in this corner
> case. Without the unroll, the taken wakeref is leaked and will
> cascade into a kernel hang later at the tail end of suspend in
> this function:
> 
>    intel_wakeref_wait_for_idle(&gt->wakeref)
>    (called by) - intel_gt_pm_wait_for_idle
>    (called by) - wait_for_suspend
> 
> Thus, do an unroll in guc_lrc_desc_unpin and
> deregister_destroyed_contexts if guc_lrc_desc_unpin fails due to a
> CT send failure. When unrolling, keep the context in the GuC's
> destroy-list so it can get picked up on the next destroy worker
> invocation (if suspend is aborted) or get fully purged as part of a
> GuC sanitization (end of suspend) or a reset flow.
> 
> Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com>
> Tested-by: Mousumi Jana <mousumi.jana@intel.com>
> Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Thanks for all the explanations, patience and great work!

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

> ---
>  drivers/gpu/drm/i915/gem/i915_gem_pm.c        | 10 +++
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 73 +++++++++++++++++--
>  2 files changed, 78 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pm.c b/drivers/gpu/drm/i915/gem/i915_gem_pm.c
> index 0d812f4d787d..3b27218aabe2 100644
> --- a/drivers/gpu/drm/i915/gem/i915_gem_pm.c
> +++ b/drivers/gpu/drm/i915/gem/i915_gem_pm.c
> @@ -28,6 +28,13 @@ void i915_gem_suspend(struct drm_i915_private *i915)
>  	GEM_TRACE("%s\n", dev_name(i915->drm.dev));
>  
>  	intel_wakeref_auto(&i915->runtime_pm.userfault_wakeref, 0);
> +	/*
> +	 * On rare occasions, we've observed that fence completion triggers
> +	 * free_engines asynchronously via call_rcu. Ensure those are done.
> +	 * This path is only called on suspend, so it's an acceptable cost.
> +	 */
> +	rcu_barrier();
> +
>  	flush_workqueue(i915->wq);
>  
>  	/*
> @@ -160,6 +167,9 @@ void i915_gem_suspend_late(struct drm_i915_private *i915)
>  	 * machine in an unusable condition.
>  	 */
>  
> +	/* Like i915_gem_suspend, flush tasks staged from fence triggers */
> +	rcu_barrier();
> +
>  	for_each_gt(gt, i915, i)
>  		intel_gt_suspend_late(gt);
>  
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 9c64ae0766cc..cae637fc3ead 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -236,6 +236,13 @@ set_context_destroyed(struct intel_context *ce)
>  	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
>  }
>  
> +static inline void
> +clr_context_destroyed(struct intel_context *ce)
> +{
> +	lockdep_assert_held(&ce->guc_state.lock);
> +	ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED;
> +}
> +
>  static inline bool context_pending_disable(struct intel_context *ce)
>  {
>  	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
> @@ -613,6 +620,8 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
>  					 u32 g2h_len_dw,
>  					 bool loop)
>  {
> +	int ret;
> +
>  	/*
>  	 * We always loop when a send requires a reply (i.e. g2h_len_dw > 0),
>  	 * so we don't handle the case where we don't get a reply because we
> @@ -623,7 +632,11 @@ static int guc_submission_send_busy_loop(struct intel_guc *guc,
>  	if (g2h_len_dw)
>  		atomic_inc(&guc->outstanding_submission_g2h);
>  
> -	return intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> +	ret = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
> +	if (ret)
> +		atomic_dec(&guc->outstanding_submission_g2h);
> +
> +	return ret;
>  }
>  
>  int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
> @@ -3288,12 +3301,13 @@ static void guc_context_close(struct intel_context *ce)
>  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
>  }
>  
> -static inline void guc_lrc_desc_unpin(struct intel_context *ce)
> +static inline int guc_lrc_desc_unpin(struct intel_context *ce)
>  {
>  	struct intel_guc *guc = ce_to_guc(ce);
>  	struct intel_gt *gt = guc_to_gt(guc);
>  	unsigned long flags;
>  	bool disabled;
> +	int ret;
>  
>  	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
>  	GEM_BUG_ON(!ctx_id_mapped(guc, ce->guc_id.id));
> @@ -3304,18 +3318,41 @@ static inline void guc_lrc_desc_unpin(struct intel_context *ce)
>  	spin_lock_irqsave(&ce->guc_state.lock, flags);
>  	disabled = submission_disabled(guc);
>  	if (likely(!disabled)) {
> +		/*
> +		 * Take a gt-pm ref and change context state to be destroyed.
> +		 * NOTE: a G2H IRQ that comes after will put this gt-pm ref back
> +		 */
>  		__intel_gt_pm_get(gt);
>  		set_context_destroyed(ce);
>  		clr_context_registered(ce);
>  	}
>  	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
> +
>  	if (unlikely(disabled)) {
>  		release_guc_id(guc, ce);
>  		__guc_context_destroy(ce);
> -		return;
> +		return 0;
>  	}
>  
> -	deregister_context(ce, ce->guc_id.id);
> +	/*
> +	 * GuC is active, so let's destroy this context, but at this point we can still be racing
> +	 * with suspend, so we undo everything if the H2G fails in deregister_context so
> +	 * that GuC reset will find this context during clean up.
> +	 */
> +	ret = deregister_context(ce, ce->guc_id.id);
> +	if (ret) {
> +		spin_lock(&ce->guc_state.lock);
> +		set_context_registered(ce);
> +		clr_context_destroyed(ce);
> +		spin_unlock(&ce->guc_state.lock);
> +		/*
> +		 * As gt-pm is awake at function entry, intel_wakeref_put_async merely
> +		 * decrements the wakeref immediately; per its usage spec, call it after the unlock.
> +		 */
> +		intel_wakeref_put_async(&gt->wakeref);
> +	}
> +
> +	return ret;
>  }
>  
>  static void __guc_context_destroy(struct intel_context *ce)
> @@ -3383,7 +3420,22 @@ static void deregister_destroyed_contexts(struct intel_guc *guc)
>  		if (!ce)
>  			break;
>  
> -		guc_lrc_desc_unpin(ce);
> +		if (guc_lrc_desc_unpin(ce)) {
> +			/*
> +			 * This means the GuC's CT link was severed mid-way, which can happen
> +			 * in suspend-resume corner cases. In this case, put the
> +			 * context back into the destroyed_contexts list which will
> +			 * get picked up on the next context deregistration event or
> +			 * purged in a GuC sanitization event (reset/unload/wedged/...).
> +			 */
> +			spin_lock_irqsave(&guc->submission_state.lock, flags);
> +			list_add_tail(&ce->destroyed_link,
> +				      &guc->submission_state.destroyed_contexts);
> +			spin_unlock_irqrestore(&guc->submission_state.lock, flags);
> +			/* Bail now since the list might never be emptied if h2gs fail */
> +			break;
> +		}
> +
>  	}
>  }
>  
> @@ -3394,6 +3446,17 @@ static void destroyed_worker_func(struct work_struct *w)
>  	struct intel_gt *gt = guc_to_gt(guc);
>  	intel_wakeref_t wakeref;
>  
> +	/*
> +	 * In rare cases we can get here via async context-free fence-signals that
> +	 * come very late in the suspend flow or very early in the resume flow. In these
> +	 * cases, GuC won't be ready, but just skipping it here is fine as these
> +	 * pending-destroy-contexts get destroyed entirely at GuC reset time at the
> +	 * end of suspend, or this worker can be picked up later on the next
> +	 * context destruction trigger after resume completes.
> +	 */
> +	if (!intel_guc_is_ready(guc))
> +		return;
> +
>  	with_intel_gt_pm(gt, wakeref)
>  		deregister_destroyed_contexts(guc);
>  }
> -- 
> 2.39.0
>
Teres Alexis, Alan Previn Dec. 20, 2023, 11:08 p.m. UTC | #2
On Wed, 2023-12-13 at 16:23 -0500, Vivi, Rodrigo wrote:
> On Tue, Dec 12, 2023 at 08:57:16AM -0800, Alan Previn wrote:
> > If we are at the end of suspend or very early in resume,
> > it's possible that an async fence signal (via call_rcu) triggers
> > free_engines, which could lead to execution of the context
> > destruction worker (after a prior worker flush).
alan:snip
> 
> > Thus, do an unroll in guc_lrc_desc_unpin and
> > deregister_destroyed_contexts if guc_lrc_desc_unpin fails due to a
> > CT send failure. When unrolling, keep the context in the GuC's
> > destroy-list so it can get picked up on the next destroy worker
> > invocation (if suspend is aborted) or get fully purged as part of a
> > GuC sanitization (end of suspend) or a reset flow.
> > 
> > Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> > Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com>
> > Tested-by: Mousumi Jana <mousumi.jana@intel.com>
> > Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> 
> Thanks for all the explanations, patience and great work!
> 
> Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>

alan: Thanks Rodrigo for the RB last week, just a quick update:

I can't reproduce the BAT failures, which seem to be intermittent
across platform and test - however, a noticeable number of failures
do keep occurring on i915_selftest @live @requests where the
last test leaked a wakeref and the failing test hangs waiting
for gt to idle before starting its test.

I have to debug this further, although from code inspection it
appears unrelated to the patches in this series.
Hopefully it's a different issue.
Rodrigo Vivi Dec. 26, 2023, 3:11 p.m. UTC | #3
On Wed, Dec 20, 2023 at 11:08:59PM +0000, Teres Alexis, Alan Previn wrote:
> On Wed, 2023-12-13 at 16:23 -0500, Vivi, Rodrigo wrote:
> > On Tue, Dec 12, 2023 at 08:57:16AM -0800, Alan Previn wrote:
> > > If we are at the end of suspend or very early in resume,
> > > it's possible that an async fence signal (via call_rcu) triggers
> > > free_engines, which could lead to execution of the context
> > > destruction worker (after a prior worker flush).
> alan:snip
> > 
> > > Thus, do an unroll in guc_lrc_desc_unpin and
> > > deregister_destroyed_contexts if guc_lrc_desc_unpin fails due to a
> > > CT send failure. When unrolling, keep the context in the GuC's
> > > destroy-list so it can get picked up on the next destroy worker
> > > invocation (if suspend is aborted) or get fully purged as part of a
> > > GuC sanitization (end of suspend) or a reset flow.
> > > 
> > > Signed-off-by: Alan Previn <alan.previn.teres.alexis@intel.com>
> > > Signed-off-by: Anshuman Gupta <anshuman.gupta@intel.com>
> > > Tested-by: Mousumi Jana <mousumi.jana@intel.com>
> > > Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> > 
> > Thanks for all the explanations, patience and great work!
> > 
> > Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> 
> alan: Thanks Rodrigo for the RB last week, just a quick update:
> 
> I can't reproduce the BAT failures, which seem to be intermittent
> across platform and test - however, a noticeable number of failures
> do keep occurring on i915_selftest @live @requests where the
> last test leaked a wakeref and the failing test hangs waiting
> for gt to idle before starting its test.
> 
> I have to debug this further, although from code inspection it
> appears unrelated to the patches in this series.
> Hopefully it's a different issue.

Yeap, likely not related. Anyway, I'm sorry for not merging
this sooner. Could you please send a rebased version? This
one is not applying cleanly anymore.
Teres Alexis, Alan Previn Dec. 28, 2023, 4 a.m. UTC | #4
On Tue, 2023-12-26 at 10:11 -0500, Vivi, Rodrigo wrote:
> On Wed, Dec 20, 2023 at 11:08:59PM +0000, Teres Alexis, Alan Previn wrote:
> > On Wed, 2023-12-13 at 16:23 -0500, Vivi, Rodrigo wrote:
alan:snip

> > 
> > 
> > alan: Thanks Rodrigo for the RB last week, just a quick update:
> > 
> > I can't reproduce the BAT failures, which seem to be intermittent
> > across platform and test - however, a noticeable number of failures
> > do keep occurring on i915_selftest @live @requests where the
> > last test leaked a wakeref and the failing test hangs waiting
> > for gt to idle before starting its test.
> > 
> > I have to debug this further, although from code inspection it
> > appears unrelated to the patches in this series.
> > Hopefully it's a different issue.
> 
> Yeap, likely not related. Anyway, I'm sorry for not merging
> this sooner. Could you please send a rebased version? This
> one is not applying cleanly anymore.

Hi Rodrigo, I will rebase it as soon as I do a bit more testing...
I realized I was using a slightly older GuC, and with the newer GuC
I'm seeing all kinds of failures, but they're trending as not an
issue with this series.

Patch

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pm.c b/drivers/gpu/drm/i915/gem/i915_gem_pm.c
index 0d812f4d787d..3b27218aabe2 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_pm.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pm.c
@@ -28,6 +28,13 @@  void i915_gem_suspend(struct drm_i915_private *i915)
 	GEM_TRACE("%s\n", dev_name(i915->drm.dev));
 
 	intel_wakeref_auto(&i915->runtime_pm.userfault_wakeref, 0);
+	/*
+	 * On rare occasions, we've observed that fence completion triggers
+	 * free_engines asynchronously via call_rcu. Ensure those are done.
+	 * This path is only called on suspend, so it's an acceptable cost.
+	 */
+	rcu_barrier();
+
 	flush_workqueue(i915->wq);
 
 	/*
@@ -160,6 +167,9 @@  void i915_gem_suspend_late(struct drm_i915_private *i915)
 	 * machine in an unusable condition.
 	 */
 
+	/* Like i915_gem_suspend, flush tasks staged from fence triggers */
+	rcu_barrier();
+
 	for_each_gt(gt, i915, i)
 		intel_gt_suspend_late(gt);
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9c64ae0766cc..cae637fc3ead 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -236,6 +236,13 @@  set_context_destroyed(struct intel_context *ce)
 	ce->guc_state.sched_state |= SCHED_STATE_DESTROYED;
 }
 
+static inline void
+clr_context_destroyed(struct intel_context *ce)
+{
+	lockdep_assert_held(&ce->guc_state.lock);
+	ce->guc_state.sched_state &= ~SCHED_STATE_DESTROYED;
+}
+
 static inline bool context_pending_disable(struct intel_context *ce)
 {
 	return ce->guc_state.sched_state & SCHED_STATE_PENDING_DISABLE;
@@ -613,6 +620,8 @@  static int guc_submission_send_busy_loop(struct intel_guc *guc,
 					 u32 g2h_len_dw,
 					 bool loop)
 {
+	int ret;
+
 	/*
 	 * We always loop when a send requires a reply (i.e. g2h_len_dw > 0),
 	 * so we don't handle the case where we don't get a reply because we
@@ -623,7 +632,11 @@  static int guc_submission_send_busy_loop(struct intel_guc *guc,
 	if (g2h_len_dw)
 		atomic_inc(&guc->outstanding_submission_g2h);
 
-	return intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
+	ret = intel_guc_send_busy_loop(guc, action, len, g2h_len_dw, loop);
+	if (ret)
+		atomic_dec(&guc->outstanding_submission_g2h);
+
+	return ret;
 }
 
 int intel_guc_wait_for_pending_msg(struct intel_guc *guc,
@@ -3288,12 +3301,13 @@  static void guc_context_close(struct intel_context *ce)
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
 }
 
-static inline void guc_lrc_desc_unpin(struct intel_context *ce)
+static inline int guc_lrc_desc_unpin(struct intel_context *ce)
 {
 	struct intel_guc *guc = ce_to_guc(ce);
 	struct intel_gt *gt = guc_to_gt(guc);
 	unsigned long flags;
 	bool disabled;
+	int ret;
 
 	GEM_BUG_ON(!intel_gt_pm_is_awake(gt));
 	GEM_BUG_ON(!ctx_id_mapped(guc, ce->guc_id.id));
@@ -3304,18 +3318,41 @@  static inline void guc_lrc_desc_unpin(struct intel_context *ce)
 	spin_lock_irqsave(&ce->guc_state.lock, flags);
 	disabled = submission_disabled(guc);
 	if (likely(!disabled)) {
+		/*
+		 * Take a gt-pm ref and change context state to be destroyed.
+		 * NOTE: a G2H IRQ that comes after will put this gt-pm ref back
+		 */
 		__intel_gt_pm_get(gt);
 		set_context_destroyed(ce);
 		clr_context_registered(ce);
 	}
 	spin_unlock_irqrestore(&ce->guc_state.lock, flags);
+
 	if (unlikely(disabled)) {
 		release_guc_id(guc, ce);
 		__guc_context_destroy(ce);
-		return;
+		return 0;
 	}
 
-	deregister_context(ce, ce->guc_id.id);
+	/*
+	 * GuC is active, so let's destroy this context, but at this point we can still be racing
+	 * with suspend, so we undo everything if the H2G fails in deregister_context so
+	 * that GuC reset will find this context during clean up.
+	 */
+	ret = deregister_context(ce, ce->guc_id.id);
+	if (ret) {
+		spin_lock(&ce->guc_state.lock);
+		set_context_registered(ce);
+		clr_context_destroyed(ce);
+		spin_unlock(&ce->guc_state.lock);
+		/*
+		 * As gt-pm is awake at function entry, intel_wakeref_put_async merely
+		 * decrements the wakeref immediately; per its usage spec, call it after the unlock.
+		 */
+		intel_wakeref_put_async(&gt->wakeref);
+	}
+
+	return ret;
 }
 
 static void __guc_context_destroy(struct intel_context *ce)
@@ -3383,7 +3420,22 @@  static void deregister_destroyed_contexts(struct intel_guc *guc)
 		if (!ce)
 			break;
 
-		guc_lrc_desc_unpin(ce);
+		if (guc_lrc_desc_unpin(ce)) {
+			/*
+			 * This means the GuC's CT link was severed mid-way, which can happen
+			 * in suspend-resume corner cases. In this case, put the
+			 * context back into the destroyed_contexts list which will
+			 * get picked up on the next context deregistration event or
+			 * purged in a GuC sanitization event (reset/unload/wedged/...).
+			 */
+			spin_lock_irqsave(&guc->submission_state.lock, flags);
+			list_add_tail(&ce->destroyed_link,
+				      &guc->submission_state.destroyed_contexts);
+			spin_unlock_irqrestore(&guc->submission_state.lock, flags);
+			/* Bail now since the list might never be emptied if h2gs fail */
+			break;
+		}
+
 	}
 }
 
@@ -3394,6 +3446,17 @@  static void destroyed_worker_func(struct work_struct *w)
 	struct intel_gt *gt = guc_to_gt(guc);
 	intel_wakeref_t wakeref;
 
+	/*
+	 * In rare cases we can get here via async context-free fence-signals that
+	 * come very late in the suspend flow or very early in the resume flow. In these
+	 * cases, GuC won't be ready, but just skipping it here is fine as these
+	 * pending-destroy-contexts get destroyed entirely at GuC reset time at the
+	 * end of suspend, or this worker can be picked up later on the next
+	 * context destruction trigger after resume completes.
+	 */
+	if (!intel_guc_is_ready(guc))
+		return;
+
 	with_intel_gt_pm(gt, wakeref)
 		deregister_destroyed_contexts(guc);
 }