[2/2] drm/i915/guc: Look for a guilty context when an engine reset fails

Message ID 20221129211253.3183480-3-John.C.Harrison@Intel.com
State New, archived
Series Allow error capture without a request / on reset failure

Commit Message

John Harrison Nov. 29, 2022, 9:12 p.m. UTC
From: John Harrison <John.C.Harrison@Intel.com>

Engine resets are never supposed to fail. But when one does (due to
unknown reasons that normally come down to a missing w/a), it is
useful to get as much information out of the system as possible.
Given that the GuC effectively dies in such a situation, it is not
possible to get a guilty context notification back. So do a manual
search instead. Since the GuC is dead, this is safe because it won't
be changing the engine state asynchronously.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)
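
For illustration, here is a minimal sketch of the search the commit
message describes. It is hypothetical: the iterator and helpers below
(guc_for_each_context(), context_has_unfinished_request(),
mark_context_hung()) are placeholder names, not the real i915 API;
the actual logic lives in intel_guc_find_hung_context().

/*
 * Hypothetical sketch - not the actual i915 implementation.
 * With the GuC dead-looped, no guilty-context notification will
 * arrive, so walk the contexts the GuC backend knows about and
 * flag the first one that still has unfinished work on the failed
 * engine. The error capture that follows can then report it.
 */
static void find_hung_context_sketch(struct intel_engine_cs *engine)
{
	struct intel_context *ce;

	/* Placeholder iterator over the GuC's registered contexts */
	guc_for_each_context(&engine->gt->uc.guc, ce) {
		/* Only consider contexts pinned to the failed engine */
		if (ce->engine != engine)
			continue;

		/* Placeholder check for an incomplete request */
		if (context_has_unfinished_request(ce)) {
			/* Placeholder: record ce as guilty for capture */
			mark_context_hung(engine, ce);
			return;
		}
	}
}

Because the GuC is dead-looping and cannot change engine state behind
the KMD's back, a walk like this does not race with the scheduler.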

Comments

Tvrtko Ursulin Nov. 30, 2022, 8:30 a.m. UTC | #1
On 29/11/2022 21:12, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> Engine resets are supposed to never happen. But in the case when one

Engine resets or engine reset failures? Hopefully the latter.

> does (due to unknwon reasons that normally come down to a missing
> w/a), it is useful to get as much information out of the system as
> possible. Given that the GuC effectively dies on such a situation, it
> is not possible to get a guilty context notification back. So do a
> manual search instead. Given that GuC is dead, this is safe because
> GuC won't be changing the engine state asynchronously.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
>   1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 0a42f1807f52c..c82730804a1c4 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct work_struct *w)
>   	guc->submission_state.reset_fail_mask = 0;
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
> -	if (likely(reset_fail_mask))
> +	if (likely(reset_fail_mask)) {
> +		struct intel_engine_cs *engine;
> +		enum intel_engine_id id;
> +
> +		/*
> +		 * GuC is toast at this point - it dead loops after sending the failed
> +		 * reset notification. So need to manually determine the guilty context.
> +		 * Note that it should be safe/reliable to do this here because the GuC
> +		 * is toast and will not be scheduling behind the KMD's back.
> +		 */
> +		for_each_engine_masked(engine, gt, reset_fail_mask, id)
> +			intel_guc_find_hung_context(engine);
> +
>   		intel_gt_handle_error(gt, reset_fail_mask,
>   				      I915_ERROR_CAPTURE,
>   				      "GuC failed to reset engine mask=0x%x\n",
>   				      reset_fail_mask);

If GuC is defined by ABI contract to be dead, should the flow be 
attempting to do a full GPU reset here, or maybe it happens somewhere 
else as a consequence anyway? (In which case is the engine reset here 
even needed?)

Regards,

Tvrtko

> +	}
>   }
>   
>   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
John Harrison Nov. 30, 2022, 9:04 p.m. UTC | #2
On 11/30/2022 00:30, Tvrtko Ursulin wrote:
> On 29/11/2022 21:12, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> Engine resets are supposed to never happen. But in the case when one
>
> Engine resets or engine reset failures? Hopefully the latter.
>
Oops. Yes, that was meant to say "engine resets are never supposed to fail."

>> does (due to unknwon reasons that normally come down to a missing
unknwon -> unknown

>> w/a), it is useful to get as much information out of the system as
>> possible. Given that the GuC effectively dies on such a situation, it
>> is not possible to get a guilty context notification back. So do a
>> manual search instead. Given that GuC is dead, this is safe because
>> GuC won't be changing the engine state asynchronously.
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 0a42f1807f52c..c82730804a1c4 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct 
>> work_struct *w)
>>       guc->submission_state.reset_fail_mask = 0;
>>       spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>   -    if (likely(reset_fail_mask))
>> +    if (likely(reset_fail_mask)) {
>> +        struct intel_engine_cs *engine;
>> +        enum intel_engine_id id;
>> +
>> +        /*
>> +         * GuC is toast at this point - it dead loops after sending 
>> the failed
>> +         * reset notification. So need to manually determine the 
>> guilty context.
>> +         * Note that it should be safe/reliable to do this here 
>> because the GuC
>> +         * is toast and will not be scheduling behind the KMD's back.
>> +         */
>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>> +            intel_guc_find_hung_context(engine);
>> +
>>           intel_gt_handle_error(gt, reset_fail_mask,
>>                         I915_ERROR_CAPTURE,
>>                         "GuC failed to reset engine mask=0x%x\n",
>>                         reset_fail_mask);
>
> If GuC is defined by ABI contract to be dead, should the flow be 
> attempting to do a full GPU reset here, or maybe it happens somewhere 
> else as a consequence anyway? (In which case is the engine reset here 
> even needed?)
This is a full GT reset. i915 is not allowed to perform an engine reset 
when using GuC submission. Those can only be done by GuC. So any forced 
reset by i915 will be escalated to full GT internally.

John.

>
> Regards,
>
> Tvrtko
>
>> +    }
>>   }
>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
Tvrtko Ursulin Dec. 1, 2022, 10:21 a.m. UTC | #3
On 30/11/2022 21:04, John Harrison wrote:
> On 11/30/2022 00:30, Tvrtko Ursulin wrote:
>> On 29/11/2022 21:12, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> Engine resets are supposed to never happen. But in the case when one
>>
>> Engine resets or engine reset failures? Hopefully the latter.
>>
> Oops. Yes, that was meant to say "engine resets are never supposed to 
> fail."
> 
>>> does (due to unknwon reasons that normally come down to a missing
> unknwon -> unknown
> 
>>> w/a), it is useful to get as much information out of the system as
>>> possible. Given that the GuC effectively dies on such a situation, it
>>> is not possible to get a guilty context notification back. So do a
>>> manual search instead. Given that GuC is dead, this is safe because
>>> GuC won't be changing the engine state asynchronously.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
>>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 0a42f1807f52c..c82730804a1c4 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct 
>>> work_struct *w)
>>>       guc->submission_state.reset_fail_mask = 0;
>>>       spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>   -    if (likely(reset_fail_mask))
>>> +    if (likely(reset_fail_mask)) {
>>> +        struct intel_engine_cs *engine;
>>> +        enum intel_engine_id id;
>>> +
>>> +        /*
>>> +         * GuC is toast at this point - it dead loops after sending 
>>> the failed
>>> +         * reset notification. So need to manually determine the 
>>> guilty context.
>>> +         * Note that it should be safe/reliable to do this here 
>>> because the GuC
>>> +         * is toast and will not be scheduling behind the KMD's back.
>>> +         */
>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>> +            intel_guc_find_hung_context(engine);
>>> +
>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>                         I915_ERROR_CAPTURE,
>>>                         "GuC failed to reset engine mask=0x%x\n",
>>>                         reset_fail_mask);
>>
>> If GuC is defined by ABI contract to be dead, should the flow be 
>> attempting to do a full GPU reset here, or maybe it happens somewhere 
>> else as a consequence anyway? (In which case is the engine reset here 
>> even needed?)
> This is a full GT reset. i915 is not allowed to perform an engine reset 
> when using GuC submission. Those can only be done by GuC. So any forced 
> reset by i915 will be escalated to full GT internally.

Okay, I saw the engine mask being passed in and drew the wrong conclusion.

Regards,

Tvrtko
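
To make the escalation John describes concrete, here is a rough
sketch. It is not the real intel_gt_handle_error() body:
full_gt_reset() and reset_engines() are placeholder names, while
intel_uc_uses_guc_submission() and ALL_ENGINES are the real i915
identifiers.

/*
 * Rough sketch of the escalation - not the actual i915 code.
 * Under GuC submission, engine resets belong to the GuC, so any
 * engine-reset request from i915 is promoted to a full GT reset,
 * which also reloads the (dead) GuC firmware.
 */
static void handle_error_sketch(struct intel_gt *gt, u32 engine_mask)
{
	if (intel_uc_uses_guc_submission(&gt->uc))
		engine_mask = ALL_ENGINES;	/* GuC owns engine resets */

	if (engine_mask == ALL_ENGINES)
		full_gt_reset(gt);		/* placeholder */
	else
		reset_engines(gt, engine_mask);	/* placeholder */
}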
Umesh Nerlige Ramappa Dec. 13, 2022, 2 a.m. UTC | #4
On Tue, Nov 29, 2022 at 01:12:53PM -0800, John.C.Harrison@Intel.com wrote:
>From: John Harrison <John.C.Harrison@Intel.com>
>
>Engine resets are supposed to never happen. But in the case when one
>does (due to unknwon reasons that normally come down to a missing
>w/a), it is useful to get as much information out of the system as
>possible. Given that the GuC effectively dies on such a situation, it
>is not possible to get a guilty context notification back. So do a
>manual search instead. Given that GuC is dead, this is safe because
>GuC won't be changing the engine state asynchronously.
>
>Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>---
> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>index 0a42f1807f52c..c82730804a1c4 100644
>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>@@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct work_struct *w)
> 	guc->submission_state.reset_fail_mask = 0;
> 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>
>-	if (likely(reset_fail_mask))
>+	if (likely(reset_fail_mask)) {
>+		struct intel_engine_cs *engine;
>+		enum intel_engine_id id;
>+
>+		/*
>+		 * GuC is toast at this point - it dead loops after sending the failed
>+		 * reset notification. So need to manually determine the guilty context.
>+		 * Note that it should be safe/reliable to do this here because the GuC
>+		 * is toast and will not be scheduling behind the KMD's back.
>+		 */

Is it defined by the KMD-GuC interface that, following a failed reset
notification, the GuC will always dead-loop OR not schedule anything
(even on other engines) until the KMD takes some action? What action
should the KMD take?

Regards,
Umesh

>+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
>+			intel_guc_find_hung_context(engine);
>+
> 		intel_gt_handle_error(gt, reset_fail_mask,
> 				      I915_ERROR_CAPTURE,
> 				      "GuC failed to reset engine mask=0x%x\n",
> 				      reset_fail_mask);
>+	}
> }
>
> int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>-- 
>2.37.3
>
Patch

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0a42f1807f52c..c82730804a1c4 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4751,11 +4751,24 @@  static void reset_fail_worker_func(struct work_struct *w)
 	guc->submission_state.reset_fail_mask = 0;
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
-	if (likely(reset_fail_mask))
+	if (likely(reset_fail_mask)) {
+		struct intel_engine_cs *engine;
+		enum intel_engine_id id;
+
+		/*
+		 * GuC is toast at this point - it dead loops after sending the failed
+		 * reset notification. So need to manually determine the guilty context.
+		 * Note that it should be safe/reliable to do this here because the GuC
+		 * is toast and will not be scheduling behind the KMD's back.
+		 */
+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
+			intel_guc_find_hung_context(engine);
+
 		intel_gt_handle_error(gt, reset_fail_mask,
 				      I915_ERROR_CAPTURE,
 				      "GuC failed to reset engine mask=0x%x\n",
 				      reset_fail_mask);
+	}
 }
 
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,