[3/4] drm/i915/guc: Look for a guilty context when an engine reset fails

Message ID 20230112025311.2577084-4-John.C.Harrison@Intel.com (mailing list archive)
State New, archived
Series Allow error capture without a request / on reset failure

Commit Message

John Harrison Jan. 12, 2023, 2:53 a.m. UTC
From: John Harrison <John.C.Harrison@Intel.com>

Engine resets are supposed to never fail. But in the case when one
does (due to unknown reasons that normally come down to a missing
w/a), it is useful to get as much information out of the system as
possible. Given that the GuC effectively dies in such a situation, it
is not possible to get a guilty context notification back. So do a
manual search instead. Given that GuC is dead, this is safe because
GuC won't be changing the engine state asynchronously.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

Comments

Tvrtko Ursulin Jan. 12, 2023, 10:15 a.m. UTC | #1
On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> Engine resets are supposed to never fail. But in the case when one
> does (due to unknown reasons that normally come down to a missing
> w/a), it is useful to get as much information out of the system as
> possible. Given that the GuC effectively dies on such a situation, it
> is not possible to get a guilty context notification back. So do a
> manual search instead. Given that GuC is dead, this is safe because
> GuC won't be changing the engine state asynchronously.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
>   1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index b436dd7f12e42..99d09e3394597 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct work_struct *w)
>   	guc->submission_state.reset_fail_mask = 0;
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
> -	if (likely(reset_fail_mask))
> +	if (likely(reset_fail_mask)) {
> +		struct intel_engine_cs *engine;
> +		enum intel_engine_id id;
> +
> +		/*
> +		 * GuC is toast at this point - it dead loops after sending the failed
> +		 * reset notification. So need to manually determine the guilty context.
> +		 * Note that it should be safe/reliable to do this here because the GuC
> +		 * is toast and will not be scheduling behind the KMD's back.
> +		 */
> +		for_each_engine_masked(engine, gt, reset_fail_mask, id)
> +			intel_guc_find_hung_context(engine);
> +
>   		intel_gt_handle_error(gt, reset_fail_mask,
>   				      I915_ERROR_CAPTURE,
> -				      "GuC failed to reset engine mask=0x%x\n",
> +				      "GuC failed to reset engine mask=0x%x",
>   				      reset_fail_mask);
> +	}
>   }
>   
>   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,

This one I don't feel "at home" enough to r-b. Just a question - can we 
be sure at this point that GuC is 100% stuck and there isn't a chance it 
somehow comes alive and starts running in parallel (being driven in 
parallel by a different "thread" in i915), interfering with the 
assumption made in the comment?

Regards,

Tvrtko
John Harrison Jan. 12, 2023, 8:59 p.m. UTC | #2
On 1/12/2023 02:15, Tvrtko Ursulin wrote:
> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> Engine resets are supposed to never fail. But in the case when one
>> does (due to unknown reasons that normally come down to a missing
>> w/a), it is useful to get as much information out of the system as
>> possible. Given that the GuC effectively dies on such a situation, it
>> is not possible to get a guilty context notification back. So do a
>> manual search instead. Given that GuC is dead, this is safe because
>> GuC won't be changing the engine state asynchronously.
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index b436dd7f12e42..99d09e3394597 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct 
>> work_struct *w)
>>       guc->submission_state.reset_fail_mask = 0;
>>       spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>   -    if (likely(reset_fail_mask))
>> +    if (likely(reset_fail_mask)) {
>> +        struct intel_engine_cs *engine;
>> +        enum intel_engine_id id;
>> +
>> +        /*
>> +         * GuC is toast at this point - it dead loops after sending 
>> the failed
>> +         * reset notification. So need to manually determine the 
>> guilty context.
>> +         * Note that it should be safe/reliable to do this here 
>> because the GuC
>> +         * is toast and will not be scheduling behind the KMD's back.
>> +         */
>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>> +            intel_guc_find_hung_context(engine);
>> +
>>           intel_gt_handle_error(gt, reset_fail_mask,
>>                         I915_ERROR_CAPTURE,
>> -                      "GuC failed to reset engine mask=0x%x\n",
>> +                      "GuC failed to reset engine mask=0x%x",
>>                         reset_fail_mask);
>> +    }
>>   }
>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>
> This one I don't feel "at home" enough to r-b. Just a question - can 
> we be sure at this point that GuC is 100% stuck and there isn't a 
> chance it somehow comes alive and starts running in parallel (being 
> driven in parallel by a different "thread" in i915), interfering with 
> the assumption made in the comment?
The GuC API definition for the engine reset failure notification is that 
GuC will dead loop itself after sending - to quote "This is a 
catastrophic failure that requires a full GT reset, or FLR to recover.". 
So yes, GuC is 100% stuck and is not going to self recover. Guaranteed. 
If that changes in the future then that would be a backwards breaking 
API change and would require a corresponding driver update to go with 
supporting the new GuC firmware version.

There is the potential for a GT reset to maybe occur in parallel and 
resurrect the GuC that way. Not sure how that could happen though. The 
heartbeat timeout is significantly longer than the GuC's pre-emption 
timeout + engine reset timeout. That just leaves manual resets from the 
user or maybe from a selftest. If the user is manually poking reset 
debugfs files then it is already known that all bets are off in terms of 
getting an accurate error capture. And if a selftest is triggering GT 
resets in parallel with engine resets then either it is a broken test or 
it is attempting to test an evil corner case in which it is expected 
that error capture results will be unreliable. Having said all that, 
given that the submission_state lock is held here, such a GT reset would 
not get very far in bringing the GuC back up anyway. Certainly, it would 
not be able to get as far as submitting new work and thus potentially 
changing the engine state.

So yes, if multiple impossible events occur back to back then the error 
capture may be wonky. Where wonky means a potentially innocent 
context/request gets blamed for breaking the hardware. Oh dear. I can 
live with that.

John.


>
> Regards,
>
> Tvrtko
Tvrtko Ursulin Jan. 13, 2023, 9:22 a.m. UTC | #3
On 12/01/2023 20:59, John Harrison wrote:
> On 1/12/2023 02:15, Tvrtko Ursulin wrote:
>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> Engine resets are supposed to never fail. But in the case when one
>>> does (due to unknown reasons that normally come down to a missing
>>> w/a), it is useful to get as much information out of the system as
>>> possible. Given that the GuC effectively dies on such a situation, it
>>> is not possible to get a guilty context notification back. So do a
>>> manual search instead. Given that GuC is dead, this is safe because
>>> GuC won't be changing the engine state asynchronously.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> ---
>>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index b436dd7f12e42..99d09e3394597 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct 
>>> work_struct *w)
>>>       guc->submission_state.reset_fail_mask = 0;
>>>       spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>   -    if (likely(reset_fail_mask))
>>> +    if (likely(reset_fail_mask)) {
>>> +        struct intel_engine_cs *engine;
>>> +        enum intel_engine_id id;
>>> +
>>> +        /*
>>> +         * GuC is toast at this point - it dead loops after sending 
>>> the failed
>>> +         * reset notification. So need to manually determine the 
>>> guilty context.
>>> +         * Note that it should be safe/reliable to do this here 
>>> because the GuC
>>> +         * is toast and will not be scheduling behind the KMD's back.
>>> +         */
>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>> +            intel_guc_find_hung_context(engine);
>>> +
>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>                         I915_ERROR_CAPTURE,
>>> -                      "GuC failed to reset engine mask=0x%x\n",
>>> +                      "GuC failed to reset engine mask=0x%x",
>>>                         reset_fail_mask);
>>> +    }
>>>   }
>>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>
>> This one I don't feel "at home" enough to r-b. Just a question - can 
>> we be sure at this point that GuC is 100% stuck and there isn't a 
>> chance it somehow comes alive and starts running in parallel (being 
>> driven in parallel by a different "thread" in i915), interfering with 
>> the assumption made in the comment?
> The GuC API definition for the engine reset failure notification is that 
> GuC will dead loop itself after sending - to quote "This is a 
> catastrophic failure that requires a full GT reset, or FLR to recover.". 
> So yes, GuC is 100% stuck and is not going to self recover. Guaranteed. 
> If that changes in the future then that would be a backwards breaking 
> API change and would require a corresponding driver update to go with 
> supporting the new GuC firmware version.
> 
> There is the potential for a GT reset to maybe occur in parallel and 
> resurrect the GuC that way. Not sure how that could happen though. The 
> heartbeat timeout is significantly longer than the GuC's pre-emption 
> timeout + engine reset timeout. That just leaves manual resets from the 
> user or maybe from a selftest. If the user is manually poking reset 
> debugfs files then it is already known that all bets are off in terms of 
> getting an accurate error capture. And if a selftest is triggering GT 
> resets in parallel with engine resets then either it is a broken test or 
> it is attempting to test an evil corner case in which it is expected 
> that error capture results will be unreliable. Having said all that, 
> given that the submission_state lock is held here, such a GT reset would 
> not get very far in bring the GuC back up anyway. Certainly, it would 
> not be able to get as far as submitting new work and thus potentially 
> changing the engine state.
> 
> So yes, if multiple impossible events occur back to back then the error 
> capture may be wonky. Where wonky means a potentially innocent 
> context/request gets blamed for breaking the hardware. Oh dear. I can 
> live with that.

Okay, so I was triggered by the "safe/reliable" qualification from the 
comment. I agree "reliable" does not have to be and was mostly worried 
about the "safe" part.

From what you explain, the question is whether a short heartbeat, or a 
manual reset invocation, could actually mess up any of the data structures 
which the added intel_guc_find_hung_context call walks, and so crash the kernel.

Looking inside, there is some lock dropping going on (and undocumented 
irqsave games), and walking of the list while unlocked. So the question is 
whether or not that can go bang if a full reset happens in parallel and 
re-activates the normal driver flows.

Regards,

Tvrtko
John Harrison Jan. 14, 2023, 1:27 a.m. UTC | #4
On 1/13/2023 01:22, Tvrtko Ursulin wrote:
> On 12/01/2023 20:59, John Harrison wrote:
>> On 1/12/2023 02:15, Tvrtko Ursulin wrote:
>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>
>>>> Engine resets are supposed to never fail. But in the case when one
>>>> does (due to unknown reasons that normally come down to a missing
>>>> w/a), it is useful to get as much information out of the system as
>>>> possible. Given that the GuC effectively dies on such a situation, it
>>>> is not possible to get a guilty context notification back. So do a
>>>> manual search instead. Given that GuC is dead, this is safe because
>>>> GuC won't be changing the engine state asynchronously.
>>>>
>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>> ---
>>>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 
>>>> +++++++++++++++--
>>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> index b436dd7f12e42..99d09e3394597 100644
>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct 
>>>> work_struct *w)
>>>>       guc->submission_state.reset_fail_mask = 0;
>>>> spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>>   -    if (likely(reset_fail_mask))
>>>> +    if (likely(reset_fail_mask)) {
>>>> +        struct intel_engine_cs *engine;
>>>> +        enum intel_engine_id id;
>>>> +
>>>> +        /*
>>>> +         * GuC is toast at this point - it dead loops after 
>>>> sending the failed
>>>> +         * reset notification. So need to manually determine the 
>>>> guilty context.
>>>> +         * Note that it should be safe/reliable to do this here 
>>>> because the GuC
>>>> +         * is toast and will not be scheduling behind the KMD's back.
>>>> +         */
>>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>>> +            intel_guc_find_hung_context(engine);
>>>> +
>>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>>                         I915_ERROR_CAPTURE,
>>>> -                      "GuC failed to reset engine mask=0x%x\n",
>>>> +                      "GuC failed to reset engine mask=0x%x",
>>>>                         reset_fail_mask);
>>>> +    }
>>>>   }
>>>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>>
>>> This one I don't feel "at home" enough to r-b. Just a question - can 
>>> we be sure at this point that GuC is 100% stuck and there isn't a 
>>> chance it somehow comes alive and starts running in parallel (being 
>>> driven in parallel by a different "thread" in i915), interfering 
>>> with the assumption made in the comment?
>> The GuC API definition for the engine reset failure notification is 
>> that GuC will dead loop itself after sending - to quote "This is a 
>> catastrophic failure that requires a full GT reset, or FLR to 
>> recover.". So yes, GuC is 100% stuck and is not going to self 
>> recover. Guaranteed. If that changes in the future then that would be 
>> a backwards breaking API change and would require a corresponding 
>> driver update to go with supporting the new GuC firmware version.
>>
>> There is the potential for a GT reset to maybe occur in parallel and 
>> resurrect the GuC that way. Not sure how that could happen though. 
>> The heartbeat timeout is significantly longer than the GuC's 
>> pre-emption timeout + engine reset timeout. That just leaves manual 
>> resets from the user or maybe from a selftest. If the user is 
>> manually poking reset debugfs files then it is already known that all 
>> bets are off in terms of getting an accurate error capture. And if a 
>> selftest is triggering GT resets in parallel with engine resets then 
>> either it is a broken test or it is attempting to test an evil corner 
>> case in which it is expected that error capture results will be 
>> unreliable. Having said all that, given that the submission_state 
>> lock is held here, such a GT reset would not get very far in bring 
>> the GuC back up anyway. Certainly, it would not be able to get as far 
>> as submitting new work and thus potentially changing the engine state.
>>
>> So yes, if multiple impossible events occur back to back then the 
>> error capture may be wonky. Where wonky means a potentially innocent 
>> context/request gets blamed for breaking the hardware. Oh dear. I can 
>> live with that.
>
> Okay, so I was triggered by the "safe/reliable" qualification from the 
> comment. I agree "reliable" does not have to be and was mostly worried 
> about the "safe" part.
>
> From what you explain if short heartbeat, or manual reset invocation, 
> could actually mess up any of the data structures which added 
> intel_guc_find_hung_context walks and so crash the kernel.
>
> Looking inside, there is some lock dropping going on (and undocumented 
> irqsave games), and walking the list while unlocked. So whether or not 
> that can go bang if a full reset happens in parallel and re-activates 
> the normal driver flows.
There is no walking of unlocked lists. The xa_lock is held whenever the 
code looks at the xa structure itself; it is only released while analysing 
the context that was retrieved, and the context retrieval itself starts 
with a kref_get_unless_zero. So everything is only ever accessed while 
locked or reference counted. The xa is unlocked while analysing a context 
because the xa object can be accessed from interrupt code, and we don't 
want to hold it locked unnecessarily while scanning through the requests 
within a context (all of which is code with no connection to the GuC 
backend at all).
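
For reference, the pattern is roughly this (a simplified sketch, not the 
exact function body; member names such as guc->context_lookup and ce->ref 
are from memory and may not be exact):

	struct intel_context *ce;
	unsigned long index, flags;

	xa_lock_irqsave(&guc->context_lookup, flags);
	xa_for_each(&guc->context_lookup, index, ce) {
		/* Pin the context so it cannot be freed while we look at it */
		if (!kref_get_unless_zero(&ce->ref))
			continue;

		/* Drop the xa lock while analysing the pinned context */
		xa_unlock(&guc->context_lookup);

		/* ... scan this context's requests for an active/hung one ... */

		intel_context_put(ce);
		xa_lock(&guc->context_lookup);
	}
	xa_unlock_irqrestore(&guc->context_lookup, flags);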

I can drop the word 'safe' if it makes you nervous. That was only meant 
to refer to the possibility of such a scan returning bogus results due 
to contexts switching in/out of the hardware before/during/after the 
scan. There is no way for it to go bang.

John.


>
> Regards,
>
> Tvrtko
Tvrtko Ursulin Jan. 16, 2023, 12:43 p.m. UTC | #5
On 14/01/2023 01:27, John Harrison wrote:
> On 1/13/2023 01:22, Tvrtko Ursulin wrote:
>> On 12/01/2023 20:59, John Harrison wrote:
>>> On 1/12/2023 02:15, Tvrtko Ursulin wrote:
>>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> Engine resets are supposed to never fail. But in the case when one
>>>>> does (due to unknown reasons that normally come down to a missing
>>>>> w/a), it is useful to get as much information out of the system as
>>>>> possible. Given that the GuC effectively dies on such a situation, it
>>>>> is not possible to get a guilty context notification back. So do a
>>>>> manual search instead. Given that GuC is dead, this is safe because
>>>>> GuC won't be changing the engine state asynchronously.
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> ---
>>>>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 
>>>>> +++++++++++++++--
>>>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> index b436dd7f12e42..99d09e3394597 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct 
>>>>> work_struct *w)
>>>>>       guc->submission_state.reset_fail_mask = 0;
>>>>> spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>>>   -    if (likely(reset_fail_mask))
>>>>> +    if (likely(reset_fail_mask)) {
>>>>> +        struct intel_engine_cs *engine;
>>>>> +        enum intel_engine_id id;
>>>>> +
>>>>> +        /*
>>>>> +         * GuC is toast at this point - it dead loops after 
>>>>> sending the failed
>>>>> +         * reset notification. So need to manually determine the 
>>>>> guilty context.
>>>>> +         * Note that it should be safe/reliable to do this here 
>>>>> because the GuC
>>>>> +         * is toast and will not be scheduling behind the KMD's back.
>>>>> +         */
>>>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>>>> +            intel_guc_find_hung_context(engine);
>>>>> +
>>>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>>>                         I915_ERROR_CAPTURE,
>>>>> -                      "GuC failed to reset engine mask=0x%x\n",
>>>>> +                      "GuC failed to reset engine mask=0x%x",
>>>>>                         reset_fail_mask);
>>>>> +    }
>>>>>   }
>>>>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>>>
>>>> This one I don't feel "at home" enough to r-b. Just a question - can 
>>>> we be sure at this point that GuC is 100% stuck and there isn't a 
>>>> chance it somehow comes alive and starts running in parallel (being 
>>>> driven in parallel by a different "thread" in i915), interfering 
>>>> with the assumption made in the comment?
>>> The GuC API definition for the engine reset failure notification is 
>>> that GuC will dead loop itself after sending - to quote "This is a 
>>> catastrophic failure that requires a full GT reset, or FLR to 
>>> recover.". So yes, GuC is 100% stuck and is not going to self 
>>> recover. Guaranteed. If that changes in the future then that would be 
>>> a backwards breaking API change and would require a corresponding 
>>> driver update to go with supporting the new GuC firmware version.
>>>
>>> There is the potential for a GT reset to maybe occur in parallel and 
>>> resurrect the GuC that way. Not sure how that could happen though. 
>>> The heartbeat timeout is significantly longer than the GuC's 
>>> pre-emption timeout + engine reset timeout. That just leaves manual 
>>> resets from the user or maybe from a selftest. If the user is 
>>> manually poking reset debugfs files then it is already known that all 
>>> bets are off in terms of getting an accurate error capture. And if a 
>>> selftest is triggering GT resets in parallel with engine resets then 
>>> either it is a broken test or it is attempting to test an evil corner 
>>> case in which it is expected that error capture results will be 
>>> unreliable. Having said all that, given that the submission_state 
>>> lock is held here, such a GT reset would not get very far in bring 
>>> the GuC back up anyway. Certainly, it would not be able to get as far 
>>> as submitting new work and thus potentially changing the engine state.
>>>
>>> So yes, if multiple impossible events occur back to back then the 
>>> error capture may be wonky. Where wonky means a potentially innocent 
>>> context/request gets blamed for breaking the hardware. Oh dear. I can 
>>> live with that.
>>
>> Okay, so I was triggered by the "safe/reliable" qualification from the 
>> comment. I agree "reliable" does not have to be and was mostly worried 
>> about the "safe" part.
>>
>> From what you explain if short heartbeat, or manual reset invocation, 
>> could actually mess up any of the data structures which added 
>> intel_guc_find_hung_context walks and so crash the kernel.
>>
>> Looking inside, there is some lock dropping going on (and undocumented 
>> irqsave games), and walking the list while unlocked. So whether or not 
>> that can go bang if a full reset happens in parallel and re-activates 
>> the normal driver flows.
> There is no walking of unlocked lists. The xa_lock is held whenever it 
> looks at the xa structure itself. The release is only while analysing 
> the context that was retrieved. And the context retrieval itself starts 
> with a kref_get_unless_zero. So everything is only ever accessed while 
> locked or reference counted. The unlock of the xa while analysing a 
> context is because the xa object can be accessed from interrupt code and 
> so we don't want to hold it locked unnecessarily while scanning through 
> requests within a context (all code which has no connection to the GuC 
> backend at all).

AFAICS intel_guc_find_hung_context walks &ce->guc_state.requests with no 
locks held. Other places in the code appear to use &ce->guc_state.lock, 
or maybe &sched_engine->lock, not sure. Then we have request submission, 
retirement and a few other places modify that list. So *if* indeed hung 
GuC can get resurrected by a parallel full reset while 
reset_fail_worker_func is running, why couldn't that list walk explode?
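
To make that concrete, the walk I mean is roughly of this shape (a sketch 
from memory; the list member name, the request state check and the helper 
names may not be exact), with nothing taken around it while submission, 
retirement and the rest modify that same list under ce->guc_state.lock:

	struct i915_request *rq;

	/* intel_guc_find_hung_context() - no ce->guc_state.lock held here */
	list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
		if (!i915_request_is_active(rq))	/* approximation of the real check */
			continue;

		intel_engine_set_hung_context(engine, ce);
		break;
	}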

Regards,

Tvrtko

> I can drop the word 'safe' if it makes you nervous. That was only meant 
> to refer to the possibility of such a scan returning bogus results due 
> to contexts switching in/out of the hardware before/during/after the 
> scan. There is no way for it to go bang.
> 
> John.
> 
> 
>>
>> Regards,
>>
>> Tvrtko
>
John Harrison Jan. 17, 2023, 9:14 p.m. UTC | #6
On 1/16/2023 04:43, Tvrtko Ursulin wrote:
> On 14/01/2023 01:27, John Harrison wrote:
>> On 1/13/2023 01:22, Tvrtko Ursulin wrote:
>>> On 12/01/2023 20:59, John Harrison wrote:
>>>> On 1/12/2023 02:15, Tvrtko Ursulin wrote:
>>>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>
>>>>>> Engine resets are supposed to never fail. But in the case when one
>>>>>> does (due to unknown reasons that normally come down to a missing
>>>>>> w/a), it is useful to get as much information out of the system as
>>>>>> possible. Given that the GuC effectively dies on such a 
>>>>>> situation, it
>>>>>> is not possible to get a guilty context notification back. So do a
>>>>>> manual search instead. Given that GuC is dead, this is safe because
>>>>>> GuC won't be changing the engine state asynchronously.
>>>>>>
>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>> ---
>>>>>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 
>>>>>> +++++++++++++++--
>>>>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>> index b436dd7f12e42..99d09e3394597 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct 
>>>>>> work_struct *w)
>>>>>>       guc->submission_state.reset_fail_mask = 0;
>>>>>> spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>>>>   -    if (likely(reset_fail_mask))
>>>>>> +    if (likely(reset_fail_mask)) {
>>>>>> +        struct intel_engine_cs *engine;
>>>>>> +        enum intel_engine_id id;
>>>>>> +
>>>>>> +        /*
>>>>>> +         * GuC is toast at this point - it dead loops after 
>>>>>> sending the failed
>>>>>> +         * reset notification. So need to manually determine the 
>>>>>> guilty context.
>>>>>> +         * Note that it should be safe/reliable to do this here 
>>>>>> because the GuC
>>>>>> +         * is toast and will not be scheduling behind the KMD's 
>>>>>> back.
>>>>>> +         */
>>>>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>>>>> +            intel_guc_find_hung_context(engine);
>>>>>> +
>>>>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>>>>                         I915_ERROR_CAPTURE,
>>>>>> -                      "GuC failed to reset engine mask=0x%x\n",
>>>>>> +                      "GuC failed to reset engine mask=0x%x",
>>>>>>                         reset_fail_mask);
>>>>>> +    }
>>>>>>   }
>>>>>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>>>>
>>>>> This one I don't feel "at home" enough to r-b. Just a question - 
>>>>> can we be sure at this point that GuC is 100% stuck and there 
>>>>> isn't a chance it somehow comes alive and starts running in 
>>>>> parallel (being driven in parallel by a different "thread" in 
>>>>> i915), interfering with the assumption made in the comment?
>>>> The GuC API definition for the engine reset failure notification is 
>>>> that GuC will dead loop itself after sending - to quote "This is a 
>>>> catastrophic failure that requires a full GT reset, or FLR to 
>>>> recover.". So yes, GuC is 100% stuck and is not going to self 
>>>> recover. Guaranteed. If that changes in the future then that would 
>>>> be a backwards breaking API change and would require a 
>>>> corresponding driver update to go with supporting the new GuC 
>>>> firmware version.
>>>>
>>>> There is the potential for a GT reset to maybe occur in parallel 
>>>> and resurrect the GuC that way. Not sure how that could happen 
>>>> though. The heartbeat timeout is significantly longer than the 
>>>> GuC's pre-emption timeout + engine reset timeout. That just leaves 
>>>> manual resets from the user or maybe from a selftest. If the user 
>>>> is manually poking reset debugfs files then it is already known 
>>>> that all bets are off in terms of getting an accurate error 
>>>> capture. And if a selftest is triggering GT resets in parallel with 
>>>> engine resets then either it is a broken test or it is attempting 
>>>> to test an evil corner case in which it is expected that error 
>>>> capture results will be unreliable. Having said all that, given 
>>>> that the submission_state lock is held here, such a GT reset would 
>>>> not get very far in bring the GuC back up anyway. Certainly, it 
>>>> would not be able to get as far as submitting new work and thus 
>>>> potentially changing the engine state.
>>>>
>>>> So yes, if multiple impossible events occur back to back then the 
>>>> error capture may be wonky. Where wonky means a potentially 
>>>> innocent context/request gets blamed for breaking the hardware. Oh 
>>>> dear. I can live with that.
>>>
>>> Okay, so I was triggered by the "safe/reliable" qualification from 
>>> the comment. I agree "reliable" does not have to be and was mostly 
>>> worried about the "safe" part.
>>>
>>> From what you explain if short heartbeat, or manual reset 
>>> invocation, could actually mess up any of the data structures which 
>>> added intel_guc_find_hung_context walks and so crash the kernel.
>>>
>>> Looking inside, there is some lock dropping going on (and 
>>> undocumented irqsave games), and walking the list while unlocked. So 
>>> whether or not that can go bang if a full reset happens in parallel 
>>> and re-activates the normal driver flows.
>> There is no walking of unlocked lists. The xa_lock is held whenever 
>> it looks at the xa structure itself. The release is only while 
>> analysing the context that was retrieved. And the context retrieval 
>> itself starts with a kref_get_unless_zero. So everything is only ever 
>> accessed while locked or reference counted. The unlock of the xa 
>> while analysing a context is because the xa object can be accessed 
>> from interrupt code and so we don't want to hold it locked 
>> unnecessarily while scanning through requests within a context (all 
>> code which has no connection to the GuC backend at all).
>
> AFAICS intel_guc_find_hung_context walks &ce->guc_state.requests with 
> no locks held. Other places in the code appear to use 
> &ce->guc_state.lock, or maybe &sched_engine->lock, not sure. Then we 
> have request submission, retirement and a few other places modify that 
> list. So *if* indeed hung GuC can get resurrected by a parallel full 
> reset while reset_fail_worker_func is running, why couldn't that list 
> walk explode?
Blurgh. Didn't even notice that loop somehow. Or rather, was assuming 
the context reference count covered everything for some dumb reason and 
didn't look closely at the fact it was scanning something else entirely. 
Yeah, there should be a guc_state lock around that loop.
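
I.e. wrap the walk with the context lock, something like this (sketch 
only; interrupts should already be off at that point courtesy of the 
outer xa_lock_irqsave, so the plain spin_lock variant ought to do, but 
that wants double checking):

	spin_lock(&ce->guc_state.lock);
	list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
		/* ... same active-request check as before ... */
	}
	spin_unlock(&ce->guc_state.lock);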

John.

>
> Regards,
>
> Tvrtko
>
>> I can drop the word 'safe' if it makes you nervous. That was only 
>> meant to refer to the possibility of such a scan returning bogus 
>> results due to contexts switching in/out of the hardware 
>> before/during/after the scan. There is no way for it to go bang.
>>
>> John.
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>

Patch

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b436dd7f12e42..99d09e3394597 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4754,11 +4754,24 @@  static void reset_fail_worker_func(struct work_struct *w)
 	guc->submission_state.reset_fail_mask = 0;
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
-	if (likely(reset_fail_mask))
+	if (likely(reset_fail_mask)) {
+		struct intel_engine_cs *engine;
+		enum intel_engine_id id;
+
+		/*
+		 * GuC is toast at this point - it dead loops after sending the failed
+		 * reset notification. So need to manually determine the guilty context.
+		 * Note that it should be safe/reliable to do this here because the GuC
+		 * is toast and will not be scheduling behind the KMD's back.
+		 */
+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
+			intel_guc_find_hung_context(engine);
+
 		intel_gt_handle_error(gt, reset_fail_mask,
 				      I915_ERROR_CAPTURE,
-				      "GuC failed to reset engine mask=0x%x\n",
+				      "GuC failed to reset engine mask=0x%x",
 				      reset_fail_mask);
+	}
 }
 
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,