| Message ID | 20230117213630.2897570-5-John.C.Harrison@Intel.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Allow error capture without a request / on reset failure |
On 17/01/2023 21:36, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> Engine resets are supposed to never fail. But in the case when one
> does (due to unknown reasons that normally come down to a missing
> w/a), it is useful to get as much information out of the system as
> possible. Given that the GuC effectively dies on such a situation, it
> is not possible to get a guilty context notification back. So do a
> manual search instead. Given that GuC is dead, this is safe because
> GuC won't be changing the engine state asynchronously.
>
> v2: Change comment to be less alarming (Tvrtko)
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>  .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 3b34a82d692be..9bc80b807dbcc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct work_struct *w)
>  	guc->submission_state.reset_fail_mask = 0;
>  	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>
> -	if (likely(reset_fail_mask))
> +	if (likely(reset_fail_mask)) {
> +		struct intel_engine_cs *engine;
> +		enum intel_engine_id id;
> +
> +		/*
> +		 * GuC is toast at this point - it dead loops after sending the failed
> +		 * reset notification. So need to manually determine the guilty context.
> +		 * Note that it should be reliable to do this here because the GuC is
> +		 * toast and will not be scheduling behind the KMD's back.
> +		 */
> +		for_each_engine_masked(engine, gt, reset_fail_mask, id)
> +			intel_guc_find_hung_context(engine);
> +
>  		intel_gt_handle_error(gt, reset_fail_mask,
>  				      I915_ERROR_CAPTURE,
> -				      "GuC failed to reset engine mask=0x%x\n",
> +				      "GuC failed to reset engine mask=0x%x",
>  				      reset_fail_mask);
> +	}
>  }
>
>  int intel_guc_engine_failure_process_msg(struct intel_guc *guc,

Assuming 1/5 gets blessed by GuC experts this would then look safe to:

Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
```diff
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3b34a82d692be..9bc80b807dbcc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct work_struct *w)
 	guc->submission_state.reset_fail_mask = 0;
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
-	if (likely(reset_fail_mask))
+	if (likely(reset_fail_mask)) {
+		struct intel_engine_cs *engine;
+		enum intel_engine_id id;
+
+		/*
+		 * GuC is toast at this point - it dead loops after sending the failed
+		 * reset notification. So need to manually determine the guilty context.
+		 * Note that it should be reliable to do this here because the GuC is
+		 * toast and will not be scheduling behind the KMD's back.
+		 */
+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
+			intel_guc_find_hung_context(engine);
+
 		intel_gt_handle_error(gt, reset_fail_mask,
 				      I915_ERROR_CAPTURE,
-				      "GuC failed to reset engine mask=0x%x\n",
+				      "GuC failed to reset engine mask=0x%x",
 				      reset_fail_mask);
+	}
 }
 
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
```
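For readers less familiar with the flow, the sketch below illustrates the idea behind the "manual search" described in the commit message: once the GuC has dead-looped after a failed engine reset, it cannot deliver a guilty-context notification and it also cannot reschedule contexts behind the KMD's back, so the driver itself can safely walk the contexts associated with the failed engine and flag the one still holding unfinished work before triggering error capture. This is an illustrative sketch only, under assumed names; `guc_ctx_stub`, `active_engine_mask`, `has_unfinished_work` and `mark_hung_context()` are hypothetical stand-ins, not the actual `intel_guc_find_hung_context()` implementation.

```c
#include <linux/list.h>
#include <linux/types.h>

/* Hypothetical stand-in for a GuC-registered context. */
struct guc_ctx_stub {
	struct list_head link;
	u32 active_engine_mask;		/* engines this context was running on */
	bool has_unfinished_work;	/* a request was submitted but never completed */
	bool guilty;			/* flagged for the subsequent error capture */
};

/*
 * Walk the registered contexts and flag the first one that still holds
 * unfinished work on the failed engine. Because the GuC has dead-looped
 * after sending the reset-failure notification, it will not move contexts
 * around asynchronously, which is what makes this scan reliable here.
 */
static void mark_hung_context(struct list_head *ctx_list, u32 engine_bit)
{
	struct guc_ctx_stub *ctx;

	list_for_each_entry(ctx, ctx_list, link) {
		if (!(ctx->active_engine_mask & engine_bit))
			continue;
		if (!ctx->has_unfinished_work)
			continue;

		ctx->guilty = true;
		break;
	}
}
```

The patch itself achieves this by calling `intel_guc_find_hung_context()` for each engine set in `reset_fail_mask` before `intel_gt_handle_error()`, as shown in the diff above.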