diff mbox series

[2/4] drm/i915: Allow error capture of a pending request

Message ID 20230112025311.2577084-3-John.C.Harrison@Intel.com (mailing list archive)
State New, archived
Headers show
Series Allow error capture without a request / on reset failure | expand

Commit Message

John Harrison Jan. 12, 2023, 2:53 a.m. UTC
From: John Harrison <John.C.Harrison@Intel.com>

A hang situation has been observed where the only requests on the
context were either completed or not yet started according to the
breaadcrumbs. However, the register state claimed a batch was (maybe)
in progress. So, allow capture of the pending request on the grounds
that this might be better than nothing.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

Comments

Tvrtko Ursulin Jan. 12, 2023, 10:06 a.m. UTC | #1
On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> A hang situation has been observed where the only requests on the
> context were either completed or not yet started according to the
> breaadcrumbs. However, the register state claimed a batch was (maybe)
> in progress. So, allow capture of the pending request on the grounds
> that this might be better than nothing.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.c | 8 +++-----
>   1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index bd2cf7d235df0..2e338a9667a4b 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1628,11 +1628,9 @@ capture_engine(struct intel_engine_cs *engine,
>   	if (ce) {
>   		intel_engine_clear_hung_context(engine);
>   		rq = intel_context_find_active_request(ce);
> -		if (rq && !i915_request_started(rq)) {
> -			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
> -				 engine->name);
> -			rq = NULL;
> -		}
> +		if (rq && !i915_request_started(rq))
> +			drm_info(&engine->gt->i915->drm, "Confused - active request not yet started: %lld:%lld, ce = 0x%04X/%s!\n",
> +				 rq->fence.context, rq->fence.seqno, ce->guc_id.id, engine->name);

Ah you change active to started in this patch! :)

I suggest no "ce" in user visible messages and maybe stick with the 
convention grep suggest is already established:

"Hung context with active request %lld:%lld [0x%04X] not started!"

Regards,

Tvrtko

>   	} else {
>   		/*
>   		 * Getting here with GuC enabled means it is a forced error capture
John Harrison Jan. 12, 2023, 8:46 p.m. UTC | #2
On 1/12/2023 02:06, Tvrtko Ursulin wrote:
> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> A hang situation has been observed where the only requests on the
>> context were either completed or not yet started according to the
>> breaadcrumbs. However, the register state claimed a batch was (maybe)
>> in progress. So, allow capture of the pending request on the grounds
>> that this might be better than nothing.
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_gpu_error.c | 8 +++-----
>>   1 file changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index bd2cf7d235df0..2e338a9667a4b 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1628,11 +1628,9 @@ capture_engine(struct intel_engine_cs *engine,
>>       if (ce) {
>>           intel_engine_clear_hung_context(engine);
>>           rq = intel_context_find_active_request(ce);
>> -        if (rq && !i915_request_started(rq)) {
>> -            drm_info(&engine->gt->i915->drm, "Got hung context on %s 
>> with no active request!\n",
>> -                 engine->name);
>> -            rq = NULL;
>> -        }
>> +        if (rq && !i915_request_started(rq))
>> +            drm_info(&engine->gt->i915->drm, "Confused - active 
>> request not yet started: %lld:%lld, ce = 0x%04X/%s!\n",
>> +                 rq->fence.context, rq->fence.seqno, ce->guc_id.id, 
>> engine->name);
>
> Ah you change active to started in this patch! :)
Yeah, I'm wanting to keep these two patches separate. This one is a more 
questionable change in actual behaviour. The previous patch just allows 
capturing the context when the request has been rejected. Whereas this 
one changes the request acceptance criteria. With the potential to start 
blaming innocent requests. It seems plausible to me, especially with the 
warning message. We know the context owning the request is guilty so why 
wouldn't we blame that request just because the tracking is off (maybe 
due to some driver bug). But I could see someone objecting on grounds of 
being super strict about who/what gets blamed for a hang and either 
nacks or maybe wants this change reverted some time later.

>
> I suggest no "ce" in user visible messages and maybe stick with the 
> convention grep suggest is already established:
>
> "Hung context with active request %lld:%lld [0x%04X] not started!"
>
Are you also meaning to drop the engine name? I think it is important to 
keep the '%s' in there somewhere.

John.


> Regards,
>
> Tvrtko
>
>>       } else {
>>           /*
>>            * Getting here with GuC enabled means it is a forced error 
>> capture
Tvrtko Ursulin Jan. 13, 2023, 9:10 a.m. UTC | #3
On 12/01/2023 20:46, John Harrison wrote:
> On 1/12/2023 02:06, Tvrtko Ursulin wrote:
>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> A hang situation has been observed where the only requests on the
>>> context were either completed or not yet started according to the
>>> breaadcrumbs. However, the register state claimed a batch was (maybe)
>>> in progress. So, allow capture of the pending request on the grounds
>>> that this might be better than nothing.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_gpu_error.c | 8 +++-----
>>>   1 file changed, 3 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> index bd2cf7d235df0..2e338a9667a4b 100644
>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> @@ -1628,11 +1628,9 @@ capture_engine(struct intel_engine_cs *engine,
>>>       if (ce) {
>>>           intel_engine_clear_hung_context(engine);
>>>           rq = intel_context_find_active_request(ce);
>>> -        if (rq && !i915_request_started(rq)) {
>>> -            drm_info(&engine->gt->i915->drm, "Got hung context on %s 
>>> with no active request!\n",
>>> -                 engine->name);
>>> -            rq = NULL;
>>> -        }
>>> +        if (rq && !i915_request_started(rq))
>>> +            drm_info(&engine->gt->i915->drm, "Confused - active 
>>> request not yet started: %lld:%lld, ce = 0x%04X/%s!\n",
>>> +                 rq->fence.context, rq->fence.seqno, ce->guc_id.id, 
>>> engine->name);
>>
>> Ah you change active to started in this patch! :)
> Yeah, I'm wanting to keep these two patches separate. This one is a more 
> questionable change in actual behaviour. The previous patch just allows 
> capturing the context when the request has been rejected. Whereas this 
> one changes the request acceptance criteria. With the potential to start 
> blaming innocent requests. It seems plausible to me, especially with the 
> warning message. We know the context owning the request is guilty so why 
> wouldn't we blame that request just because the tracking is off (maybe 
> due to some driver bug). But I could see someone objecting on grounds of 
> being super strict about who/what gets blamed for a hang and either 
> nacks or maybe wants this change reverted some time later.
> 
>>
>> I suggest no "ce" in user visible messages and maybe stick with the 
>> convention grep suggest is already established:
>>
>> "Hung context with active request %lld:%lld [0x%04X] not started!"
>>
> Are you also meaning to drop the engine name? I think it is important to 
> keep the '%s' in there somewhere.

No sorry, just an oversight.

"Hung context on %s with active request %lld:%lld [0x%04X] not started!"

Doesn't have to be exactly that, only trying to illustrate what style 
looks better to me when user facing - not mentioning confusing and fewer 
special characters.

Regards,

Tvrtko

> 
> John.
> 
> 
>> Regards,
>>
>> Tvrtko
>>
>>>       } else {
>>>           /*
>>>            * Getting here with GuC enabled means it is a forced error 
>>> capture
>
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index bd2cf7d235df0..2e338a9667a4b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1628,11 +1628,9 @@  capture_engine(struct intel_engine_cs *engine,
 	if (ce) {
 		intel_engine_clear_hung_context(engine);
 		rq = intel_context_find_active_request(ce);
-		if (rq && !i915_request_started(rq)) {
-			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
-				 engine->name);
-			rq = NULL;
-		}
+		if (rq && !i915_request_started(rq))
+			drm_info(&engine->gt->i915->drm, "Confused - active request not yet started: %lld:%lld, ce = 0x%04X/%s!\n",
+				 rq->fence.context, rq->fence.seqno, ce->guc_id.id, engine->name);
 	} else {
 		/*
 		 * Getting here with GuC enabled means it is a forced error capture