[0/3] Improve anti-pre-emption w/a for compute workloads

Message ID 20220218213307.1338478-1-John.C.Harrison@Intel.com

Message

John Harrison Feb. 18, 2022, 9:33 p.m. UTC
From: John Harrison <John.C.Harrison@Intel.com>

Compute workloads are inherently not pre-emptible on current hardware.
Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the heartbeat
and its (longer) timeout. This is undesirable with GuC submission as
the heartbeat is a full GT reset rather than a per engine reset and so
is much more destructive. Instead, just bump the pre-emption timeout
to a big value. Also, update the heartbeat to allow such a long
pre-emption delay in the final heartbeat period.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (3):
  drm/i915/guc: Limit scheduling properties to avoid overflow
  drm/i915/gt: Make the heartbeat play nice with long pre-emption
    timeouts
  drm/i915: Improve long running OCL w/a for GuC submission

 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 37 +++++++++++++++++--
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  | 16 ++++++++
 drivers/gpu/drm/i915/gt/sysfs_engines.c       | 14 +++++++
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  9 +++++
 4 files changed, 73 insertions(+), 3 deletions(-)
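
As a rough sketch of the heartbeat side of the change (illustrative only,
not the actual diff; on_final_pulse() and HEARTBEAT_MARGIN_MS are
placeholder names, while the two props fields are the existing per-engine
properties):

	/*
	 * Illustrative sketch only: before declaring a stopped heartbeat,
	 * the final (highest priority) heartbeat period must be long
	 * enough for the pre-emption it provokes to actually time out.
	 */
	static unsigned long heartbeat_delay_ms(struct intel_engine_cs *engine)
	{
		unsigned long delay = engine->props.heartbeat_interval_ms;

		if (on_final_pulse(engine) &&
		    engine->props.preempt_timeout_ms > delay)
			delay = engine->props.preempt_timeout_ms +
				HEARTBEAT_MARGIN_MS;

		return delay;
	}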

Comments

Tvrtko Ursulin Feb. 22, 2022, 9:53 a.m. UTC | #1
On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> Compute workloads are inherently not pre-emptible on current hardware.
> Thus the pre-emption timeout was disabled as a workaround to prevent
> unwanted resets. Instead, the hang detection was left to the heartbeat
> and its (longer) timeout. This is undesirable with GuC submission as
> the heartbeat is a full GT reset rather than a per engine reset and so
> is much more destructive. Instead, just bump the pre-emption timeout

Can we have a feature request to allow asking GuC for an engine reset?

Regards,

Tvrtko

> to a big value. Also, update the heartbeat to allow such a long
> pre-emption delay in the final heartbeat period.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> 
> 
> John Harrison (3):
>    drm/i915/guc: Limit scheduling properties to avoid overflow
>    drm/i915/gt: Make the heartbeat play nice with long pre-emption
>      timeouts
>    drm/i915: Improve long running OCL w/a for GuC submission
> 
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 37 +++++++++++++++++--
>   .../gpu/drm/i915/gt/intel_engine_heartbeat.c  | 16 ++++++++
>   drivers/gpu/drm/i915/gt/sysfs_engines.c       | 14 +++++++
>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  9 +++++
>   4 files changed, 73 insertions(+), 3 deletions(-)
>
John Harrison Feb. 23, 2022, 2:22 a.m. UTC | #2
On 2/22/2022 01:53, Tvrtko Ursulin wrote:
> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> Compute workloads are inherently not pre-emptible on current hardware.
>> Thus the pre-emption timeout was disabled as a workaround to prevent
>> unwanted resets. Instead, the hang detection was left to the heartbeat
>> and its (longer) timeout. This is undesirable with GuC submission as
>> the heartbeat is a full GT reset rather than a per engine reset and so
>> is much more destructive. Instead, just bump the pre-emption timeout
>
> Can we have a feature request to allow asking GuC for an engine reset?
For what purpose?

GuC manages the scheduling of contexts across engines. With virtual 
engines, the KMD has no knowledge of which engine a context might be 
executing on. Even without virtual engines, the KMD still has no 
knowledge of which context is currently executing on any given engine at 
any given time.

There is a reason why hang detection should be left to the entity that 
is doing the scheduling. Any other entity is second guessing at best.

The reason for keeping the heartbeat around even when GuC submission is 
enabled is for the case where the KMD/GuC have got out of sync with 
each other somehow or GuC itself has just crashed. I.e. when no 
submission at all is working and we need to reset the GuC itself and 
start over.

John.


>
> Regards,
>
> Tvrtko
>
>> to a big value. Also, update the heartbeat to allow such a long
>> pre-emption delay in the final heartbeat period.
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>
>>
>> John Harrison (3):
>>    drm/i915/guc: Limit scheduling properties to avoid overflow
>>    drm/i915/gt: Make the heartbeat play nice with long pre-emption
>>      timeouts
>>    drm/i915: Improve long running OCL w/a for GuC submission
>>
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 37 +++++++++++++++++--
>>   .../gpu/drm/i915/gt/intel_engine_heartbeat.c  | 16 ++++++++
>>   drivers/gpu/drm/i915/gt/sysfs_engines.c       | 14 +++++++
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  9 +++++
>>   4 files changed, 73 insertions(+), 3 deletions(-)
>>
Tvrtko Ursulin Feb. 23, 2022, noon UTC | #3
On 23/02/2022 02:22, John Harrison wrote:
> On 2/22/2022 01:53, Tvrtko Ursulin wrote:
>> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> Compute workloads are inherently not pre-emptible on current hardware.
>>> Thus the pre-emption timeout was disabled as a workaround to prevent
>>> unwanted resets. Instead, the hang detection was left to the heartbeat
>>> and its (longer) timeout. This is undesirable with GuC submission as
>>> the heartbeat is a full GT reset rather than a per engine reset and so
>>> is much more destructive. Instead, just bump the pre-emption timeout
>>
>> Can we have a feature request to allow asking GuC for an engine reset?
> For what purpose?

To allow "stopped heartbeat" to reset the engine, however..

> GuC manages the scheduling of contexts across engines. With virtual 
> engines, the KMD has no knowledge of which engine a context might be 
> executing on. Even without virtual engines, the KMD still has no 
> knowledge of which context is currently executing on any given engine at 
> any given time.
> 
> There is a reason why hang detection should be left to the entity that 
> is doing the scheduling. Any other entity is second guessing at best.
> 
> The reason for keeping the heartbeat around even when GuC submission is 
> enabled is for the case where the KMD/GuC have got out of sync with 
> each other somehow or GuC itself has just crashed. I.e. when no 
> submission at all is working and we need to reset the GuC itself and 
> start over.

.. I wasn't really up to speed to know/remember heartbeats are nerfed 
already in GuC mode.

I am not sure it was the best way since full reset penalizes everyone 
for one hanging engine. A better question would be why leave heartbeats 
around at all with GuC? If you want to use it to health check 
GuC, as you say, maybe just send a CT message and expect replies? Then 
full reset would make sense. It also achieves the goal of not 
second-guessing the submission backend you raise.

Like it is now, and the need for this series demonstrates it, the whole 
thing has a pretty poor "impedance" match. Not even sure what 
intel_guc_find_hung_context is doing in intel_engine_heartbeat.c - why is 
that not in intel_gt_handle_error at least? Why is heartbeat code special 
and other callers of intel_gt_handle_error don't need it?
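
Roughly I am thinking of something like this (all helper names and
constants below are made up, just to illustrate the idea):

	/*
	 * Hypothetical sketch of a CT based GuC health check - not
	 * existing i915 code. guc_send_ping(), health_work and the two
	 * constants are invented names.
	 */
	static void guc_health_check(struct work_struct *wrk)
	{
		struct intel_guc *guc =
			container_of(wrk, typeof(*guc), health_work.work);

		/* Any trivial request/response pair proves GuC is alive. */
		if (guc_send_ping(guc, GUC_PING_TIMEOUT_MS)) {
			/* No reply: GuC is dead, only a full GT reset helps. */
			intel_gt_handle_error(guc_to_gt(guc), ALL_ENGINES,
					      I915_ERROR_CAPTURE,
					      "GuC health check failed");
			return;
		}

		schedule_delayed_work(&guc->health_work,
				      msecs_to_jiffies(GUC_PING_INTERVAL_MS));
	}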

Regards,

Tvrtko
John Harrison Feb. 24, 2022, 8:02 p.m. UTC | #4
On 2/23/2022 04:00, Tvrtko Ursulin wrote:
> On 23/02/2022 02:22, John Harrison wrote:
>> On 2/22/2022 01:53, Tvrtko Ursulin wrote:
>>> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>
>>>> Compute workloads are inherently not pre-emptible on current hardware.
>>>> Thus the pre-emption timeout was disabled as a workaround to prevent
>>>> unwanted resets. Instead, the hang detection was left to the heartbeat
>>>> and its (longer) timeout. This is undesirable with GuC submission as
>>>> the heartbeat is a full GT reset rather than a per engine reset and so
>>>> is much more destructive. Instead, just bump the pre-emption timeout
>>>
>>> Can we have a feature request to allow asking GuC for an engine reset?
>> For what purpose?
>
> To allow "stopped heartbeat" to reset the engine, however..
>
>> GuC manages the scheduling of contexts across engines. With virtual 
>> engines, the KMD has no knowledge of which engine a context might be 
>> executing on. Even without virtual engines, the KMD still has no 
>> knowledge of which context is currently executing on any given engine 
>> at any given time.
>>
>> There is a reason why hang detection should be left to the entity 
>> that is doing the scheduling. Any other entity is second guessing at 
>> best.
>>
>> The reason for keeping the heartbeat around even when GuC submission 
>> is enabled is for the case where the KMD/GuC have got out of sync 
>> with each other somehow or GuC itself has just crashed. I.e. when 
>> no submission at all is working and we need to reset the GuC itself 
>> and start over.
>
> .. I wasn't really up to speed to know/remember heartbeats are nerfed 
> already in GuC mode.
Not sure what you mean by that claim. Engine resets are handled by GuC 
because GuC handles the scheduling. You can't do the former if you 
aren't doing the latter. However, the heartbeat is still present and is 
still the watchdog by which engine resets are triggered. As per the rest 
of the submission process, the hang detection and recovery is split 
between i915 and GuC.


>
> I am not sure it was the best way since full reset penalizes everyone 
> for one hanging engine. A better question would be why leave heartbeats 
> around at all with GuC? If you want to use it to health check 
> GuC, as you say, maybe just send a CT message and expect replies? Then 
> full reset would make sense. It also achieves the goal of not 
> second-guessing the submission backend you raise.
Adding yet another hang detection mechanism is yet more complication and 
is unnecessary when we already have one that works well enough. As 
above, the heartbeat is still required for sending the pulses that cause 
pre-emptions and so let GuC detect hangs. It also provides a fallback 
against a dead GuC by default. So why invent yet another wheel?

>
> Like it is now, and the need for this series demonstrates it, the 
> whole thing has a pretty poor "impedance" match. Not even sure what 
>> intel_guc_find_hung_context is doing in intel_engine_heartbeat.c - why 
>> is that not in intel_gt_handle_error at least? Why is heartbeat code 
> special and other callers of intel_gt_handle_error don't need it?
There is no guilty context if the reset was triggered via debugfs or 
similar. And as stated ad nauseam, i915 is no longer handling the 
scheduling and so cannot make assumptions about what is or is not 
running on the hardware at any given time. And obviously, if the reset 
was initiated by GuC itself then i915 should not be second-guessing the 
guilty context as the GuC notification has already told us who was 
responsible.

And to be clear, the 'poor impedance match' is purely because we don't 
have mid-thread pre-emption and so need a stupidly huge timeout on 
compute capable engines. Whereas, we don't want a total heartbeat timeout 
of a minute or more. That is the impedance mis-match. If the 640ms was 
acceptable for RCS then none of this hacky timeout algorithm mush would 
be needed.
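
(To put rough, purely illustrative numbers on it: with the default 2500ms 
heartbeat interval and a handful of escalating pulses, a dead GuC is 
declared within something like ten seconds. But if the final heartbeat 
period must out-wait a compute pre-emption timeout measured in minutes, 
the total detection time balloons to minutes as well. With a 640ms 
pre-emption timeout, the final period would barely need to grow at all.)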

John.


>
> Regards,
>
> Tvrtko
Tvrtko Ursulin Feb. 25, 2022, 4:36 p.m. UTC | #5
On 24/02/2022 20:02, John Harrison wrote:
> On 2/23/2022 04:00, Tvrtko Ursulin wrote:
>> On 23/02/2022 02:22, John Harrison wrote:
>>> On 2/22/2022 01:53, Tvrtko Ursulin wrote:
>>>> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> Compute workloads are inherently not pre-emptible on current hardware.
>>>>> Thus the pre-emption timeout was disabled as a workaround to prevent
>>>>> unwanted resets. Instead, the hang detection was left to the heartbeat
>>>>> and its (longer) timeout. This is undesirable with GuC submission as
>>>>> the heartbeat is a full GT reset rather than a per engine reset and so
>>>>> is much more destructive. Instead, just bump the pre-emption timeout
>>>>
>>>> Can we have a feature request to allow asking GuC for an engine reset?
>>> For what purpose?
>>
>> To allow "stopped heartbeat" to reset the engine, however..
>>
>>> GuC manages the scheduling of contexts across engines. With virtual 
>>> engines, the KMD has no knowledge of which engine a context might be 
>>> executing on. Even without virtual engines, the KMD still has no 
>>> knowledge of which context is currently executing on any given engine 
>>> at any given time.
>>>
>>> There is a reason why hang detection should be left to the entity 
>>> that is doing the scheduling. Any other entity is second guessing at 
>>> best.
>>>
>>> The reason for keeping the heartbeat around even when GuC submission 
>>> is enabled is for the case where the KMD/GuC have got out of sync 
>>> with each other somehow or GuC itself has just crashed. I.e. when 
>>> no submission at all is working and we need to reset the GuC itself 
>>> and start over.
>>
>> .. I wasn't really up to speed to know/remember heartbeats are nerfed 
>> already in GuC mode.
> Not sure what you mean by that claim. Engine resets are handled by GuC 
> because GuC handles the scheduling. You can't do the former if you 
> aren't doing the latter. However, the heartbeat is still present and is 
> still the watchdog by which engine resets are triggered. As per the rest 
> of the submission process, the hang detection and recovery is split 
> between i915 and GuC.

I meant that "stopped heartbeat on engine XXX" can only do a full GPU reset on GuC.

	intel_gt_handle_error(engine->gt, engine->mask,
			      I915_ERROR_CAPTURE,
			      "stopped heartbeat on %s",
			      engine->name);

intel_gt_handle_error:

	/*
	 * Try engine reset when available. We fall back to full reset if
	 * single reset fails.
	 */
	if (!intel_uc_uses_guc_submission(&gt->uc) &&
	    intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
		local_bh_disable();
		for_each_engine_masked(engine, gt, engine_mask, tmp) {

You said "However, the heartbeat is still present and is still the watchdog by which engine resets are triggered", now I don't know what you meant by this. It actually triggers a single engine reset in GuC mode? Where in code does that happen if this block above shows it not taking the engine reset path?

>> I am not sure it was the best way since full reset penalizes everyone 
>> for one hanging engine. A better question would be why leave heartbeats 
>> around at all with GuC? If you want to use it to health check 
>> GuC, as you say, maybe just send a CT message and expect replies? Then 
>> full reset would make sense. It also achieves the goal of not 
>> second-guessing the submission backend you raise.
> Adding yet another hang detection mechanism is yet more complication and 
> is unnecessary when we already have one that works well enough. As 
> above, the heartbeat is still required for sending the pulses that cause 
> pre-emptions and so let GuC detect hangs. It also provides a fallback 
> against a dead GuC by default. So why invent yet another wheel?

Lets first clarify the previous block to make sure there aren't any misunderstandings there.

Regards,

Tvrtko

>> Like it is now, and the need for this series demonstrates it, the 
>> whole thing has a pretty poor "impedance" match. Not even sure what 
>> intel_guc_find_hung_context is doing in intel_engine_heartbeat.c - why 
>> is that not in intel_gt_handle_error at least? Why is heartbeat code 
>> special and other callers of intel_gt_handle_error don't need it?
> There is no guilty context if the reset was triggered via debugfs or 
> similar. And as stated ad nauseam, i915 is no longer handling the 
> scheduling and so cannot make assumptions about what is or is not 
> running on the hardware at any given time. And obviously, if the reset 
> was initiated by GuC itself then i915 should not be second-guessing the 
> guilty context as the GuC notification has already told us who was 
> responsible.
> 
> And to be clear, the 'poor impedance match' is purely because we don't 
> have mid-thread pre-emption and so need a stupidly huge timeout on 
> compute capable engines. Whereas, we don't want a total heartbeat timeout 
> of a minute or more. That is the impedance mis-match. If the 640ms was 
> acceptable for RCS then none of this hacky timeout algorithm mush would 
> be needed.
> 
> John.
> 
> 
>>
>> Regards,
>>
>> Tvrtko
>
John Harrison Feb. 25, 2022, 5:11 p.m. UTC | #6
On 2/25/2022 08:36, Tvrtko Ursulin wrote:
> On 24/02/2022 20:02, John Harrison wrote:
>> On 2/23/2022 04:00, Tvrtko Ursulin wrote:
>>> On 23/02/2022 02:22, John Harrison wrote:
>>>> On 2/22/2022 01:53, Tvrtko Ursulin wrote:
>>>>> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>
>>>>>> Compute workloads are inherently not pre-emptible on current 
>>>>>> hardware.
>>>>>> Thus the pre-emption timeout was disabled as a workaround to prevent
>>>>>> unwanted resets. Instead, the hang detection was left to the 
>>>>>> heartbeat
>>>>>> and its (longer) timeout. This is undesirable with GuC submission as
>>>>>> the heartbeat is a full GT reset rather than a per engine reset 
>>>>>> and so
>>>>>> is much more destructive. Instead, just bump the pre-emption timeout
>>>>>
>>>>> Can we have a feature request to allow asking GuC for an engine 
>>>>> reset?
>>>> For what purpose?
>>>
>>> To allow "stopped heartbeat" to reset the engine, however..
>>>
>>>> GuC manages the scheduling of contexts across engines. With virtual 
>>>> engines, the KMD has no knowledge of which engine a context might 
>>>> be executing on. Even without virtual engines, the KMD still has no 
>>>> knowledge of which context is currently executing on any given 
>>>> engine at any given time.
>>>>
>>>> There is a reason why hang detection should be left to the entity 
>>>> that is doing the scheduling. Any other entity is second guessing 
>>>> at best.
>>>>
>>>> The reason for keeping the heartbeat around even when GuC 
>>>> submission is enabled is for the case where the KMD/GuC have got 
>>>> out of sync with each other somehow or GuC itself has just 
>>>> crashed. I.e. when no submission at all is working and we need to 
>>>> reset the GuC itself and start over.
>>>
>>> .. I wasn't really up to speed to know/remember heartbeats are 
>>> nerfed already in GuC mode.
>> Not sure what you mean by that claim. Engine resets are handled by 
>> GuC because GuC handles the scheduling. You can't do the former if 
>> you aren't doing the latter. However, the heartbeat is still present 
>> and is still the watchdog by which engine resets are triggered. As 
>> per the rest of the submission process, the hang detection and 
>> recovery is split between i915 and GuC.
>
> I meant that "stopped heartbeat on engine XXX" can only do a full GPU 
> reset on GuC.
I mean that there is no 'stopped heartbeat on engine XXX' when i915 is 
not handling the recovery part of the process.


>
>     intel_gt_handle_error(engine->gt, engine->mask,
>                   I915_ERROR_CAPTURE,
>                   "stopped heartbeat on %s",
>                   engine->name);
>
> intel_gt_handle_error:
>
>     /*
>      * Try engine reset when available. We fall back to full reset if
>      * single reset fails.
>      */
>     if (!intel_uc_uses_guc_submission(&gt->uc) &&
>         intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
>         local_bh_disable();
>         for_each_engine_masked(engine, gt, engine_mask, tmp) {
>
> You said "However, the heartbeat is still present and is still the 
> watchdog by which engine resets are triggered", now I don't know what 
> you meant by this. It actually triggers a single engine reset in GuC 
> mode? Where in code does that happen if this block above shows it not 
> taking the engine reset path?
i915 sends down the per engine pulse.
GuC schedules the pulse
GuC attempts to pre-empt the currently active context
GuC detects the pre-emption timeout
GuC resets the engine

The fundamental process is exactly the same as in execlist mode. It's 
just that the above blocks of code (calls to intel_gt_handle_error and 
such) are now inside the GuC not i915.

Without the heartbeat going ping, there would be no context switching 
and thus no pre-emption, no pre-emption timeout and so no hang and reset 
recovery. And GuC cannot send pulses by itself - it has no ability to 
generate context workloads. So we need i915 to send the pings and to 
gradually raise their priority. But the back half of the heartbeat code 
is now inside the GuC. It will simply never reach the i915 side timeout 
if GuC is doing the recovery (unless the heartbeat's final period is too 
short). We should only reach the i915 side timeout if GuC itself is 
toast. At which point we need the full GT reset to recover the GuC.
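
In sketch form (the helpers here are made-up names; only the
intel_gt_handle_error call is as in the real code):

	/*
	 * Sketch of the split described above. In GuC mode the engine
	 * reset itself happens inside the GuC, triggered by the
	 * pre-emption that the pulse provokes.
	 */
	static void heartbeat_tick(struct intel_engine_cs *engine)
	{
		if (last_pulse_completed(engine)) {
			/* Engine is alive: send a fresh low-priority pulse. */
			send_pulse(engine, LOW_PRIORITY);
		} else if (!pulse_at_max_priority(engine)) {
			/*
			 * Pulse is stuck behind a long-running context:
			 * escalate its priority so the scheduler (GuC)
			 * attempts a pre-emption, arms the pre-emption
			 * timeout and, on expiry, resets the engine itself.
			 */
			escalate_pulse_priority(engine);
		} else {
			/*
			 * Even a max-priority pulse went unanswered: GuC
			 * itself is toast, so fall back to the full GT reset.
			 */
			intel_gt_handle_error(engine->gt, engine->mask,
					      I915_ERROR_CAPTURE,
					      "stopped heartbeat on %s",
					      engine->name);
		}
	}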

John.


>
>>> I am not sure it was the best way since full reset penalizes 
>>> everyone for one hanging engine. A better question would be why leave 
>>> heartbeats around at all with GuC? If you want to use it to 
>>> health check GuC, as you say, maybe just send a CT message and 
>>> expect replies? Then full reset would make sense. It also achieves 
>>> the goal of not second-guessing the submission backend you raise.
>> Adding yet another hang detection mechanism is yet more complication 
>> and is unnecessary when we already have one that works well enough. 
>> As above, the heartbeat is still required for sending the pulses that 
>> cause pre-emptions and so let GuC detect hangs. It also provides a 
>> fallback against a dead GuC by default. So why invent yet another wheel?
>
> Lets first clarify the previous block to make sure there aren't any 
> misunderstandings there.
>
> Regards,
>
> Tvrtko
>
>>> Like it is now, and the need for this series demonstrates it, the 
>>> whole thing has a pretty poor "impedance" match. Not even sure what 
>>> intel_guc_find_hung_context is doing in intel_engine_heartbeat.c - 
>>> why is that not in intel_gt_handle_error at least? Why is heartbeat 
>>> code special and other callers of intel_gt_handle_error don't need it?
>> There is no guilty context if the reset was triggered via debugfs or 
>> similar. And as stated ad nauseam, i915 is no longer handling the 
>> scheduling and so cannot make assumptions about what is or is not 
>> running on the hardware at any given time. And obviously, if the 
>> reset was initiated by GuC itself then i915 should not be second-guessing 
>> the guilty context as the GuC notification has already told us who 
>> was responsible.
>>
>> And to be clear, the 'poor impedance match' is purely because we 
>> don't have mid-thread pre-emption and so need a stupidly huge timeout 
>> on compute capable engines. Whereas, we don't want a total heartbeat 
>> timeout of a minute or more. That is the impedance mis-match. If the 
>> 640ms was acceptable for RCS then none of this hacky timeout 
>> algorithm mush would be needed.
>>
>> John.
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>
Tvrtko Ursulin Feb. 25, 2022, 5:39 p.m. UTC | #7
On 25/02/2022 17:11, John Harrison wrote:
> On 2/25/2022 08:36, Tvrtko Ursulin wrote:
>> On 24/02/2022 20:02, John Harrison wrote:
>>> On 2/23/2022 04:00, Tvrtko Ursulin wrote:
>>>> On 23/02/2022 02:22, John Harrison wrote:
>>>>> On 2/22/2022 01:53, Tvrtko Ursulin wrote:
>>>>>> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>
>>>>>>> Compute workloads are inherently not pre-emptible on current 
>>>>>>> hardware.
>>>>>>> Thus the pre-emption timeout was disabled as a workaround to prevent
>>>>>>> unwanted resets. Instead, the hang detection was left to the 
>>>>>>> heartbeat
>>>>>>> and its (longer) timeout. This is undesirable with GuC submission as
>>>>>>> the heartbeat is a full GT reset rather than a per engine reset 
>>>>>>> and so
>>>>>>> is much more destructive. Instead, just bump the pre-emption timeout
>>>>>>
>>>>>> Can we have a feature request to allow asking GuC for an engine 
>>>>>> reset?
>>>>> For what purpose?
>>>>
>>>> To allow "stopped heartbeat" to reset the engine, however..
>>>>
>>>>> GuC manages the scheduling of contexts across engines. With virtual 
>>>>> engines, the KMD has no knowledge of which engine a context might 
>>>>> be executing on. Even without virtual engines, the KMD still has no 
>>>>> knowledge of which context is currently executing on any given 
>>>>> engine at any given time.
>>>>>
>>>>> There is a reason why hang detection should be left to the entity 
>>>>> that is doing the scheduling. Any other entity is second guessing 
>>>>> at best.
>>>>>
>>>>> The reason for keeping the heartbeat around even when GuC 
>>>>> submission is enabled is for the case where the KMD/GuC have got 
>>>>> out of sync with each other somehow or GuC itself has just 
>>>>> crashed. I.e. when no submission at all is working and we need to 
>>>>> reset the GuC itself and start over.
>>>>
>>>> .. I wasn't really up to speed to know/remember heartbeats are 
>>>> nerfed already in GuC mode.
>>> Not sure what you mean by that claim. Engine resets are handled by 
>>> GuC because GuC handles the scheduling. You can't do the former if 
>>> you aren't doing the latter. However, the heartbeat is still present 
>>> and is still the watchdog by which engine resets are triggered. As 
>>> per the rest of the submission process, the hang detection and 
>>> recovery is split between i915 and GuC.
>>
>> I meant that "stopped heartbeat on engine XXX" can only do a full GPU 
>> reset on GuC.
> I mean that there is no 'stopped heartbeat on engine XXX' when i915 is 
> not handling the recovery part of the process.

Hmmmm?

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
{
	if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
		show_heartbeat(rq, engine);

	if (intel_engine_uses_guc(engine))
		/*
		 * GuC itself is toast or GuC's hang detection
		 * is disabled. Either way, need to find the
		 * hang culprit manually.
		 */
		intel_guc_find_hung_context(engine);

	intel_gt_handle_error(engine->gt, engine->mask,
			      I915_ERROR_CAPTURE,
			      "stopped heartbeat on %s",
			      engine->name);
}

How is there no "stopped heartbeat" in GuC mode? From this code it certainly looks like there is.

You say below heartbeats are going in GuC mode. Now I totally don't understand how they are going but there is allegedly no "stopped heartbeat".

>>
>>     intel_gt_handle_error(engine->gt, engine->mask,
>>                   I915_ERROR_CAPTURE,
>>                   "stopped heartbeat on %s",
>>                   engine->name);
>>
>> intel_gt_handle_error:
>>
>>     /*
>>      * Try engine reset when available. We fall back to full reset if
>>      * single reset fails.
>>      */
>>     if (!intel_uc_uses_guc_submission(&gt->uc) &&
>>         intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
>>         local_bh_disable();
>>         for_each_engine_masked(engine, gt, engine_mask, tmp) {
>>
>> You said "However, the heartbeat is still present and is still the 
>> watchdog by which engine resets are triggered", now I don't know what 
>> you meant by this. It actually triggers a single engine reset in GuC 
>> mode? Where in code does that happen if this block above shows it not 
>> taking the engine reset path?
> i915 sends down the per engine pulse.
> GuC schedules the pulse
> GuC attempts to pre-empt the currently active context
> GuC detects the pre-emption timeout
> GuC resets the engine
> 
> The fundamental process is exactly the same as in execlist mode. It's 
> just that the above blocks of code (calls to intel_gt_handle_error and 
> such) are now inside the GuC not i915.
> 
> Without the heartbeat going ping, there would be no context switching 
> and thus no pre-emption, no pre-emption timeout and so no hang and reset 
> recovery. And GuC cannot send pulses by itself - it has no ability to 
> generate context workloads. So we need i915 to send the pings and to 
> gradually raise their priority. But the back half of the heartbeat code 
> is now inside the GuC. It will simply never reach the i915 side timeout 
> if GuC is doing the recovery (unless the heartbeat's final period is too 
> short). We should only reach the i915 side timeout if GuC itself is 
> toast. At which point we need the full GT reset to recover the GuC.

If the workload is not preempting and the reset does not work, i.e. the engine is truly stuck, does the current code hit "stopped heartbeat" or not in GuC mode?

Regards,

Tvrtko
John Harrison Feb. 25, 2022, 6:01 p.m. UTC | #8
On 2/25/2022 09:39, Tvrtko Ursulin wrote:
> On 25/02/2022 17:11, John Harrison wrote:
>> On 2/25/2022 08:36, Tvrtko Ursulin wrote:
>>> On 24/02/2022 20:02, John Harrison wrote:
>>>> On 2/23/2022 04:00, Tvrtko Ursulin wrote:
>>>>> On 23/02/2022 02:22, John Harrison wrote:
>>>>>> On 2/22/2022 01:53, Tvrtko Ursulin wrote:
>>>>>>> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>
>>>>>>>> Compute workloads are inherently not pre-emptible on current 
>>>>>>>> hardware.
>>>>>>>> Thus the pre-emption timeout was disabled as a workaround to 
>>>>>>>> prevent
>>>>>>>> unwanted resets. Instead, the hang detection was left to the 
>>>>>>>> heartbeat
>>>>>>>> and its (longer) timeout. This is undesirable with GuC 
>>>>>>>> submission as
>>>>>>>> the heartbeat is a full GT reset rather than a per engine reset 
>>>>>>>> and so
>>>>>>>> is much more destructive. Instead, just bump the pre-emption 
>>>>>>>> timeout
>>>>>>>
>>>>>>> Can we have a feature request to allow asking GuC for an engine 
>>>>>>> reset?
>>>>>> For what purpose?
>>>>>
>>>>> To allow "stopped heartbeat" to reset the engine, however..
>>>>>
>>>>>> GuC manages the scheduling of contexts across engines. With 
>>>>>> virtual engines, the KMD has no knowledge of which engine a 
>>>>>> context might be executing on. Even without virtual engines, the 
>>>>>> KMD still has no knowledge of which context is currently 
>>>>>> executing on any given engine at any given time.
>>>>>>
>>>>>> There is a reason why hang detection should be left to the entity 
>>>>>> that is doing the scheduling. Any other entity is second guessing 
>>>>>> at best.
>>>>>>
>>>>>> The reason for keeping the heartbeat around even when GuC 
>>>>>> submission is enabled is for the case where the KMD/GuC have got 
>>>>>> out of sync with each other somehow or GuC itself has just 
>>>>>> crashed. I.e. when no submission at all is working and we need to 
>>>>>> reset the GuC itself and start over.
>>>>>
>>>>> .. I wasn't really up to speed to know/remember heartbeats are 
>>>>> nerfed already in GuC mode.
>>>> Not sure what you mean by that claim. Engine resets are handled by 
>>>> GuC because GuC handles the scheduling. You can't do the former if 
>>>> you aren't doing the latter. However, the heartbeat is still 
>>>> present and is still the watchdog by which engine resets are 
>>>> triggered. As per the rest of the submission process, the hang 
>>>> detection and recovery is split between i915 and GuC.
>>>
>>> I meant that "stopped heartbeat on engine XXX" can only do a full 
>>> GPU reset on GuC.
>> I mean that there is no 'stopped heartbeat on engine XXX' when i915 
>> is not handling the recovery part of the process.
>
> Hmmmm?
>
> static void
> reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
> {
>     if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
>         show_heartbeat(rq, engine);
>
>     if (intel_engine_uses_guc(engine))
>         /*
>          * GuC itself is toast or GuC's hang detection
>          * is disabled. Either way, need to find the
>          * hang culprit manually.
>          */
>         intel_guc_find_hung_context(engine);
>
>     intel_gt_handle_error(engine->gt, engine->mask,
>                   I915_ERROR_CAPTURE,
>                   "stopped heartbeat on %s",
>                   engine->name);
> }
>
> How is there no "stopped heartbeat" in GuC mode? From this code it 
> certainly looks like there is.
Only when the GuC is toast and it is no longer an engine reset but a 
full GT reset that is required. So technically, it is not a 'stopped 
heartbeat on engine XXX' it is 'stopped heartbeat on GT#'.

>
> You say below heartbeats are going in GuC mode. Now I totally don't 
> understand how they are going but there is allegedly no "stopped 
> heartbeat".
Because if GuC is handling the detection and recovery then i915 will not 
reach that point. GuC will do the engine reset and start scheduling the 
next context before the heartbeat period expires. So the notification 
will be a G2H about a specific context being reset rather than the i915 
notification about a stopped heartbeat.

>
>>>
>>>     intel_gt_handle_error(engine->gt, engine->mask,
>>>                   I915_ERROR_CAPTURE,
>>>                   "stopped heartbeat on %s",
>>>                   engine->name);
>>>
>>> intel_gt_handle_error:
>>>
>>>     /*
>>>      * Try engine reset when available. We fall back to full reset if
>>>      * single reset fails.
>>>      */
>>>     if (!intel_uc_uses_guc_submission(&gt->uc) &&
>>>         intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
>>>         local_bh_disable();
>>>         for_each_engine_masked(engine, gt, engine_mask, tmp) {
>>>
>>> You said "However, the heartbeat is still present and is still the 
>>> watchdog by which engine resets are triggered", now I don't know 
>>> what you meant by this. It actually triggers a single engine reset 
>>> in GuC mode? Where in code does that happen if this block above 
>>> shows it not taking the engine reset path?
>> i915 sends down the per engine pulse.
>> GuC schedules the pulse
>> GuC attempts to pre-empt the currently active context
>> GuC detects the pre-emption timeout
>> GuC resets the engine
>>
>> The fundamental process is exactly the same as in execlist mode. It's 
>> just that the above blocks of code (calls to intel_gt_handle_error 
>> and such) are now inside the GuC not i915.
>>
>> Without the heartbeat going ping, there would be no context switching 
>> and thus no pre-emption, no pre-emption timeout and so no hang and 
>> reset recovery. And GuC cannot send pulses by itself - it has no 
>> ability to generate context workloads. So we need i915 to send the 
>> pings and to gradually raise their priority. But the back half of the 
>> heartbeat code is now inside the GuC. It will simply never reach the 
>> i915 side timeout if GuC is doing the recovery (unless the 
>> heartbeat's final period is too short). We should only reach the i915 
>> side timeout if GuC itself is toast. At which point we need the full 
>> GT reset to recover the GuC.
>
> If the workload is not preempting and the reset does not work, i.e. the 
> engine is truly stuck, does the current code hit "stopped heartbeat" or not in 
> GuC mode?
Hang on, where did 'reset does not work' come into this?

If GuC is alive and the hardware is not broken then no, it won't. That's 
the whole point. GuC does the detection and recovery. The KMD will never 
reach 'stopped heartbeat'.

If the hardware is broken and the reset does not work then GuC will send 
a 'failed reset' notification to the KMD. The KMD treats that as a major 
error and immediately does a full GT reset. So there is still no 
'stopped heartbeat'.

If GuC has died (or a KMD bug has caused sufficient confusion to make it 
think the GuC has died) then yes, you will reach that code. But in that 
case it is not an engine reset that is required, it is a full GT reset 
including a reset of the GuC.
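
So, to summarise the three cases:

	context hung, GuC healthy:   GuC detects it, GuC resets the engine,
	                             KMD gets a G2H context-reset notification
	engine reset fails (bad HW): GuC sends a 'failed reset' G2H,
	                             KMD does an immediate full GT reset
	GuC dead / out of sync:      i915 heartbeat times out,
	                             'stopped heartbeat' -> full GT reset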

>
> Regards,
>
> Tvrtko
Tvrtko Ursulin Feb. 25, 2022, 6:29 p.m. UTC | #9
On 25/02/2022 18:01, John Harrison wrote:
> On 2/25/2022 09:39, Tvrtko Ursulin wrote:
>> On 25/02/2022 17:11, John Harrison wrote:
>>> On 2/25/2022 08:36, Tvrtko Ursulin wrote:
>>>> On 24/02/2022 20:02, John Harrison wrote:
>>>>> On 2/23/2022 04:00, Tvrtko Ursulin wrote:
>>>>>> On 23/02/2022 02:22, John Harrison wrote:
>>>>>>> On 2/22/2022 01:53, Tvrtko Ursulin wrote:
>>>>>>>> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>>>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>
>>>>>>>>> Compute workloads are inherently not pre-emptible on current 
>>>>>>>>> hardware.
>>>>>>>>> Thus the pre-emption timeout was disabled as a workaround to 
>>>>>>>>> prevent
>>>>>>>>> unwanted resets. Instead, the hang detection was left to the 
>>>>>>>>> heartbeat
>>>>>>>>> and its (longer) timeout. This is undesirable with GuC 
>>>>>>>>> submission as
>>>>>>>>> the heartbeat is a full GT reset rather than a per engine reset 
>>>>>>>>> and so
>>>>>>>>> is much more destructive. Instead, just bump the pre-emption 
>>>>>>>>> timeout
>>>>>>>>
>>>>>>>> Can we have a feature request to allow asking GuC for an engine 
>>>>>>>> reset?
>>>>>>> For what purpose?
>>>>>>
>>>>>> To allow "stopped heartbeat" to reset the engine, however..
>>>>>>
>>>>>>> GuC manages the scheduling of contexts across engines. With 
>>>>>>> virtual engines, the KMD has no knowledge of which engine a 
>>>>>>> context might be executing on. Even without virtual engines, the 
>>>>>>> KMD still has no knowledge of which context is currently 
>>>>>>> executing on any given engine at any given time.
>>>>>>>
>>>>>>> There is a reason why hang detection should be left to the entity 
>>>>>>> that is doing the scheduling. Any other entity is second guessing 
>>>>>>> at best.
>>>>>>>
>>>>>>> The reason for keeping the heartbeat around even when GuC 
>>>>>>> submission is enabled is for the case where the KMD/GuC have got 
>>>>>>> out of sync with either other somehow or GuC itself has just 
>>>>>>> crashed. I.e. when no submission at all is working and we need to 
>>>>>>> reset the GuC itself and start over.
>>>>>>
>>>>>> .. I wasn't really up to speed to know/remember heartbeats are 
>>>>>> nerfed already in GuC mode.
>>>>> Not sure what you mean by that claim. Engine resets are handled by 
>>>>> GuC because GuC handles the scheduling. You can't do the former if 
>>>>> you aren't doing the latter. However, the heartbeat is still 
>>>>> present and is still the watchdog by which engine resets are 
>>>>> triggered. As per the rest of the submission process, the hang 
>>>>> detection and recovery is split between i915 and GuC.
>>>>
>>>> I meant that "stopped heartbeat on engine XXX" can only do a full 
>>>> GPU reset on GuC.
>>> I mean that there is no 'stopped heartbeat on engine XXX' when i915 
>>> is not handling the recovery part of the process.
>>
>> Hmmmm?
>>
>> static void
>> reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
>> {
>>     if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
>>         show_heartbeat(rq, engine);
>>
>>     if (intel_engine_uses_guc(engine))
>>         /*
>>          * GuC itself is toast or GuC's hang detection
>>          * is disabled. Either way, need to find the
>>          * hang culprit manually.
>>          */
>>         intel_guc_find_hung_context(engine);
>>
>>     intel_gt_handle_error(engine->gt, engine->mask,
>>                   I915_ERROR_CAPTURE,
>>                   "stopped heartbeat on %s",
>>                   engine->name);
>> }
>>
>> How is there no "stopped heartbeat" in GuC mode? From this code it 
>> certainly looks like there is.
> Only when the GuC is toast and it is no longer an engine reset but a 
> full GT reset that is required. So technically, it is not a 'stopped 
> heartbeat on engine XXX' it is 'stopped heartbeat on GT#'.
> 
>>
>> You say below heartbeats are going in GuC mode. Now I totally don't 
>> understand how they are going but there is allegedly no "stopped 
>> heartbeat".
> Because if GuC is handling the detection and recovery then i915 will not 
> reach that point. GuC will do the engine reset and start scheduling the 
> next context before the heartbeat period expires. So the notification 
> will be a G2H about a specific context being reset rather than the i915 
> notification about a stopped heartbeat.
> 
>>
>>>>
>>>>     intel_gt_handle_error(engine->gt, engine->mask,
>>>>                   I915_ERROR_CAPTURE,
>>>>                   "stopped heartbeat on %s",
>>>>                   engine->name);
>>>>
>>>> intel_gt_handle_error:
>>>>
>>>>     /*
>>>>      * Try engine reset when available. We fall back to full reset if
>>>>      * single reset fails.
>>>>      */
>>>>     if (!intel_uc_uses_guc_submission(&gt->uc) &&
>>>>         intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
>>>>         local_bh_disable();
>>>>         for_each_engine_masked(engine, gt, engine_mask, tmp) {
>>>>
>>>> You said "However, the heartbeat is still present and is still the 
>>>> watchdog by which engine resets are triggered", now I don't know 
>>>> what you meant by this. It actually triggers a single engine reset 
>>>> in GuC mode? Where in code does that happen if this block above 
>>>> shows it not taking the engine reset path?
>>> i915 sends down the per engine pulse.
>>> GuC schedules the pulse
>>> GuC attempts to pre-empt the currently active context
>>> GuC detects the pre-emption timeout
>>> GuC resets the engine
>>>
>>> The fundamental process is exactly the same as in execlist mode. It's 
>>> just that the above blocks of code (calls to intel_gt_handle_error 
>>> and such) are now inside the GuC not i915.
>>>
>>> Without the heartbeat going ping, there would be no context switching 
>>> and thus no pre-emption, no pre-emption timeout and so no hang and 
>>> reset recovery. And GuC cannot sent pulses by itself - it has no 
>>> ability to generate context workloads. So we need i915 to send the 
>>> pings and to gradually raise their priority. But the back half of the 
>>> heartbeat code is now inside the GuC. It will simply never reach the 
>>> i915 side timeout if GuC is doing the recovery (unless the 
>>> heartbeat's final period is too short). We should only reach the i915 
>>> side timeout if GuC itself is toast. At which point we need the full 
>>> GT reset to recover the GuC.
>>
>> If the workload is not preempting and the reset does not work, i.e. the 
>> engine is truly stuck, does the current code hit "stopped heartbeat" or not in 
>> GuC mode?
> Hang on, where did 'reset does not work' come into this?
> 
> If GuC is alive and the hardware is not broken then no, it won't. That's 
> the whole point. GuC does the detection and recovery. The KMD will never 
> reach 'stopped heartbeat'.
> 
> If the hardware is broken and the reset does not work then GuC will send 
> a 'failed reset' notification to the KMD. The KMD treats that as a major 
> error and immediately does a full GT reset. So there is still no 
> 'stopped heartbeat'.
> 
> If GuC has died (or a KMD bug has caused sufficient confusion to make it 
> think the GuC has died) then yes, you will reach that code. But in that 
> case it is not an engine reset that is required, it is a full GT reset 
> including a reset of the GuC.

Got it, so what is actually wrong is calling intel_gt_handle_error with 
an engine->mask in GuC mode. intel_engine_heartbeat.c/reset_engine should 
fork into two (in some way), depending on backend, so in the case of GuC 
it can call a variant of intel_gt_handle_error which would be explicitly 
about a full GPU reset only, instead of a sprinkle of 
intel_uc_uses_guc_submission in that function. Possibly even offload 
the reset to a single per-GT worker, so that if multiple active engines 
trigger the reset roughly simultaneously, there is only one full GPU 
reset. And it gets correctly labeled as "dead GuC" or something.
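
Roughly like this (guc_dead_work being a hypothetical per-GT worker):

	static void
	reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
	{
		if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
			show_heartbeat(rq, engine);

		if (intel_engine_uses_guc(engine)) {
			/*
			 * GuC owns per-engine recovery; getting here means
			 * GuC itself is dead. Funnel all engines into one
			 * per-GT worker so that simultaneous reports
			 * coalesce into a single full GT reset, correctly
			 * labelled as a dead GuC.
			 */
			queue_work(system_highpri_wq,
				   &engine->gt->uc.guc_dead_work);
		} else {
			intel_gt_handle_error(engine->gt, engine->mask,
					      I915_ERROR_CAPTURE,
					      "stopped heartbeat on %s",
					      engine->name);
		}
	}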

Regards,

Tvrtko
John Harrison Feb. 25, 2022, 7:03 p.m. UTC | #10
On 2/25/2022 10:29, Tvrtko Ursulin wrote:
> On 25/02/2022 18:01, John Harrison wrote:
>> On 2/25/2022 09:39, Tvrtko Ursulin wrote:
>>> On 25/02/2022 17:11, John Harrison wrote:
>>>> On 2/25/2022 08:36, Tvrtko Ursulin wrote:
>>>>> On 24/02/2022 20:02, John Harrison wrote:
>>>>>> On 2/23/2022 04:00, Tvrtko Ursulin wrote:
>>>>>>> On 23/02/2022 02:22, John Harrison wrote:
>>>>>>>> On 2/22/2022 01:53, Tvrtko Ursulin wrote:
>>>>>>>>> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>>>>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>>
>>>>>>>>>> Compute workloads are inherently not pre-emptible on current 
>>>>>>>>>> hardware.
>>>>>>>>>> Thus the pre-emption timeout was disabled as a workaround to 
>>>>>>>>>> prevent
>>>>>>>>>> unwanted resets. Instead, the hang detection was left to the 
>>>>>>>>>> heartbeat
>>>>>>>>>> and its (longer) timeout. This is undesirable with GuC 
>>>>>>>>>> submission as
>>>>>>>>>> the heartbeat is a full GT reset rather than a per engine 
>>>>>>>>>> reset and so
>>>>>>>>>> is much more destructive. Instead, just bump the pre-emption 
>>>>>>>>>> timeout
>>>>>>>>>
>>>>>>>>> Can we have a feature request to allow asking GuC for an 
>>>>>>>>> engine reset?
>>>>>>>> For what purpose?
>>>>>>>
>>>>>>> To allow "stopped heartbeat" to reset the engine, however..
>>>>>>>
>>>>>>>> GuC manages the scheduling of contexts across engines. With 
>>>>>>>> virtual engines, the KMD has no knowledge of which engine a 
>>>>>>>> context might be executing on. Even without virtual engines, 
>>>>>>>> the KMD still has no knowledge of which context is currently 
>>>>>>>> executing on any given engine at any given time.
>>>>>>>>
>>>>>>>> There is a reason why hang detection should be left to the 
>>>>>>>> entity that is doing the scheduling. Any other entity is second 
>>>>>>>> guessing at best.
>>>>>>>>
>>>>>>>> The reason for keeping the heartbeat around even when GuC 
>>>>>>>> submission is enabled is for the case where the KMD/GuC have 
>>>>>>>> got out of sync with each other somehow or GuC itself has 
>>>>>>>> just crashed. I.e. when no submission at all is working and we 
>>>>>>>> need to reset the GuC itself and start over.
>>>>>>>
>>>>>>> .. I wasn't really up to speed to know/remember heartbeats are 
>>>>>>> nerfed already in GuC mode.
>>>>>> Not sure what you mean by that claim. Engine resets are handled 
>>>>>> by GuC because GuC handles the scheduling. You can't do the 
>>>>>> former if you aren't doing the latter. However, the heartbeat is 
>>>>>> still present and is still the watchdog by which engine resets 
>>>>>> are triggered. As per the rest of the submission process, the 
>>>>>> hang detection and recovery is split between i915 and GuC.
>>>>>
>>>>> I meant that "stopped heartbeat on engine XXX" can only do a full 
>>>>> GPU reset on GuC.
>>>> I mean that there is no 'stopped heartbeat on engine XXX' when i915 
>>>> is not handling the recovery part of the process.
>>>
>>> Hmmmm?
>>>
>>> static void
>>> reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
>>> {
>>>     if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
>>>         show_heartbeat(rq, engine);
>>>
>>>     if (intel_engine_uses_guc(engine))
>>>         /*
>>>          * GuC itself is toast or GuC's hang detection
>>>          * is disabled. Either way, need to find the
>>>          * hang culprit manually.
>>>          */
>>>         intel_guc_find_hung_context(engine);
>>>
>>>     intel_gt_handle_error(engine->gt, engine->mask,
>>>                   I915_ERROR_CAPTURE,
>>>                   "stopped heartbeat on %s",
>>>                   engine->name);
>>> }
>>>
>>> How is there no "stopped heartbeat" in GuC mode? From this code it 
>>> certainly looks like there is.
>> Only when the GuC is toast and it is no longer an engine reset but a 
>> full GT reset that is required. So technically, it is not a 'stopped 
>> heartbeat on engine XXX' it is 'stopped heartbeat on GT#'.
>>
>>>
>>> You say below heartbeats are going in GuC mode. Now I totally don't 
>>> understand how they are going but there is allegedly no "stopped 
>>> heartbeat".
>> Because if GuC is handling the detection and recovery then i915 will 
>> not reach that point. GuC will do the engine reset and start 
>> scheduling the next context before the heartbeat period expires. So 
>> the notification will be a G2H about a specific context being reset 
>> rather than the i915 notification about a stopped heartbeat.
>>
>>>
>>>>>
>>>>>     intel_gt_handle_error(engine->gt, engine->mask,
>>>>>                   I915_ERROR_CAPTURE,
>>>>>                   "stopped heartbeat on %s",
>>>>>                   engine->name);
>>>>>
>>>>> intel_gt_handle_error:
>>>>>
>>>>>     /*
>>>>>      * Try engine reset when available. We fall back to full reset if
>>>>>      * single reset fails.
>>>>>      */
>>>>>     if (!intel_uc_uses_guc_submission(&gt->uc) &&
>>>>>         intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
>>>>>         local_bh_disable();
>>>>>         for_each_engine_masked(engine, gt, engine_mask, tmp) {
>>>>>
>>>>> You said "However, the heartbeat is still present and is still the 
>>>>> watchdog by which engine resets are triggered", now I don't know 
>>>>> what you meant by this. It actually triggers a single engine reset 
>>>>> in GuC mode? Where in code does that happen if this block above 
>>>>> shows it not taking the engine reset path?
>>>> i915 sends down the per engine pulse.
>>>> GuC schedules the pulse
>>>> GuC attempts to pre-empt the currently active context
>>>> GuC detects the pre-emption timeout
>>>> GuC resets the engine
>>>>
>>>> The fundamental process is exactly the same as in execlist mode. 
>>>> It's just that the above blocks of code (calls to 
>>>> intel_gt_handle_error and such) are now inside the GuC not i915.
>>>>
>>>> Without the heartbeat going ping, there would be no context 
>>>> switching and thus no pre-emption, no pre-emption timeout and so no 
>>>> hang and reset recovery. And GuC cannot send pulses by itself - it 
>>>> has no ability to generate context workloads. So we need i915 to 
>>>> send the pings and to gradually raise their priority. But the back 
>>>> half of the heartbeat code is now inside the GuC. It will simply 
>>>> never reach the i915 side timeout if GuC is doing the recovery 
>>>> (unless the heartbeat's final period is too short). We should only 
>>>> reach the i915 side timeout if GuC itself is toast. At which point 
>>>> we need the full GT reset to recover the GuC.
>>>
>>> If the workload is not preempting and the reset does not work, i.e. the 
>>> engine is truly stuck, does the current code hit "stopped heartbeat" or not 
>>> in GuC mode?
>> Hang on, where did 'reset does not work' come into this?
>>
>> If GuC is alive and the hardware is not broken then no, it won't. 
>> That's the whole point. GuC does the detection and recovery. The KMD 
>> will never reach 'stopped heartbeat'.
>>
>> If the hardware is broken and the reset does not work then GuC will 
>> send a 'failed reset' notification to the KMD. The KMD treats that as 
>> a major error and immediately does a full GT reset. So there is still 
>> no 'stopped heartbeat'.
>>
>> If GuC has died (or a KMD bug has caused sufficient confusion to make 
>> it think the GuC has died) then yes, you will reach that code. But in 
>> that case it is not an engine reset that is required, it is a full GT 
>> reset including a reset of the GuC.
>
> Got it, so what is actually wrong is calling intel_gt_handle_error 
> with an engine->mask in GuC mode. intel_engine_heartbeat.c/reset_engine 
> should fork into two (in some way), depending on backend, so in the 
> case of GuC it can call a variant of intel_gt_handle_error which would 
> be explicitly about a full GPU reset only, instead of a sprinkle of 
> intel_uc_uses_guc_submission in that function. Possibly even offload 
> the reset to a single per-GT worker, so that if multiple active 
> engines trigger the reset roughly simultaneously, there is only one 
> full GPU reset. And it gets correctly labeled as "dead GuC" or something.
>
Sure. Feel free to re-write the reset code yet again. That's another 
exceedingly fragile area of the driver.

However, that is unrelated to this patch set.

John.

> Regards,
>
> Tvrtko
Tvrtko Ursulin Feb. 28, 2022, 3:32 p.m. UTC | #11
On 25/02/2022 19:03, John Harrison wrote:
> On 2/25/2022 10:29, Tvrtko Ursulin wrote:
>> On 25/02/2022 18:01, John Harrison wrote:
>>> On 2/25/2022 09:39, Tvrtko Ursulin wrote:
>>>> On 25/02/2022 17:11, John Harrison wrote:
>>>>> On 2/25/2022 08:36, Tvrtko Ursulin wrote:
>>>>>> On 24/02/2022 20:02, John Harrison wrote:
>>>>>>> On 2/23/2022 04:00, Tvrtko Ursulin wrote:
>>>>>>>> On 23/02/2022 02:22, John Harrison wrote:
>>>>>>>>> On 2/22/2022 01:53, Tvrtko Ursulin wrote:
>>>>>>>>>> On 18/02/2022 21:33, John.C.Harrison@Intel.com wrote:
>>>>>>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>>>>>>
>>>>>>>>>>> Compute workloads are inherently not pre-emptible on current 
>>>>>>>>>>> hardware.
>>>>>>>>>>> Thus the pre-emption timeout was disabled as a workaround to 
>>>>>>>>>>> prevent
>>>>>>>>>>> unwanted resets. Instead, the hang detection was left to the 
>>>>>>>>>>> heartbeat
>>>>>>>>>>> and its (longer) timeout. This is undesirable with GuC 
>>>>>>>>>>> submission as
>>>>>>>>>>> the heartbeat is a full GT reset rather than a per engine 
>>>>>>>>>>> reset and so
>>>>>>>>>>> is much more destructive. Instead, just bump the pre-emption 
>>>>>>>>>>> timeout
>>>>>>>>>>
>>>>>>>>>> Can we have a feature request to allow asking GuC for an 
>>>>>>>>>> engine reset?
>>>>>>>>> For what purpose?
>>>>>>>>
>>>>>>>> To allow "stopped heartbeat" to reset the engine, however..
>>>>>>>>
>>>>>>>>> GuC manages the scheduling of contexts across engines. With 
>>>>>>>>> virtual engines, the KMD has no knowledge of which engine a 
>>>>>>>>> context might be executing on. Even without virtual engines, 
>>>>>>>>> the KMD still has no knowledge of which context is currently 
>>>>>>>>> executing on any given engine at any given time.
>>>>>>>>>
>>>>>>>>> There is a reason why hang detection should be left to the 
>>>>>>>>> entity that is doing the scheduling. Any other entity is second 
>>>>>>>>> guessing at best.
>>>>>>>>>
>>>>>>>>> The reason for keeping the heartbeat around even when GuC 
>>>>>>>>> submission is enabled is for the case where the KMD/GuC have 
>>>>>>>>> got out of sync with each other somehow or GuC itself has 
>>>>>>>>> just crashed. I.e. when no submission at all is working and we 
>>>>>>>>> need to reset the GuC itself and start over.
>>>>>>>>
>>>>>>>> .. I wasn't really up to speed to know/remember heartbeats are 
>>>>>>>> nerfed already in GuC mode.
>>>>>>> Not sure what you mean by that claim. Engine resets are handled 
>>>>>>> by GuC because GuC handles the scheduling. You can't do the 
>>>>>>> former if you aren't doing the latter. However, the heartbeat is 
>>>>>>> still present and is still the watchdog by which engine resets 
>>>>>>> are triggered. As per the rest of the submission process, the 
>>>>>>> hang detection and recovery is split between i915 and GuC.
>>>>>>
>>>>>> I meant that "stopped heartbeat on engine XXX" can only do a full 
>>>>>> GPU reset on GuC.
>>>>> I mean that there is no 'stopped heartbeat on engine XXX' when i915 
>>>>> is not handling the recovery part of the process.
>>>>
>>>> Hmmmm?
>>>>
>>>> static void
>>>> reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
>>>> {
>>>>     if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
>>>>         show_heartbeat(rq, engine);
>>>>
>>>>     if (intel_engine_uses_guc(engine))
>>>>         /*
>>>>          * GuC itself is toast or GuC's hang detection
>>>>          * is disabled. Either way, need to find the
>>>>          * hang culprit manually.
>>>>          */
>>>>         intel_guc_find_hung_context(engine);
>>>>
>>>>     intel_gt_handle_error(engine->gt, engine->mask,
>>>>                   I915_ERROR_CAPTURE,
>>>>                   "stopped heartbeat on %s",
>>>>                   engine->name);
>>>> }
>>>>
>>>> How is there no "stopped heartbeat" in GuC mode? From this code it 
>>>> certainly looks like there is.
>>> Only when the GuC is toast and it is no longer an engine reset but a 
>>> full GT reset that is required. So technically, it is not a 'stopped 
>>> heartbeat on engine XXX' it is 'stopped heartbeat on GT#'.
>>>
>>>>
>>>> You say below heartbeats are going in GuC mode. Now I totally don't 
>>>> understand how they are going but there is allegedly no "stopped 
>>>> heartbeat".
>>> Because if GuC is handling the detection and recovery then i915 will 
>>> not reach that point. GuC will do the engine reset and start 
>>> scheduling the next context before the heartbeat period expires. So 
>>> the notification will be a G2H about a specific context being reset 
>>> rather than the i915 notification about a stopped heartbeat.
>>>
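For illustration, the G2H path described there might be handled roughly as 
follows. This is a minimal sketch only, loosely modelled on the GuC 
submission code; the handler name is invented, and error capture, banning 
and the message-decoding details are omitted.

static void guc_context_reset_notification(struct intel_guc *guc,
                                           struct intel_context *ce)
{
    /*
     * GuC has already reset the engine and moved on to the next
     * context; i915 just records the victim so repeat offenders can
     * be banned and userspace informed. No i915 heartbeat timeout
     * is involved on this path.
     */
    trace_intel_context_reset(ce);
    if (likely(!intel_context_is_banned(ce)))
        guc_context_replay(ce);    /* resubmit, per the real backend */
}
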
>>>>
>>>>>>
>>>>>>     intel_gt_handle_error(engine->gt, engine->mask,
>>>>>>                   I915_ERROR_CAPTURE,
>>>>>>                   "stopped heartbeat on %s",
>>>>>>                   engine->name);
>>>>>>
>>>>>> intel_gt_handle_error:
>>>>>>
>>>>>>     /*
>>>>>>      * Try engine reset when available. We fall back to full reset if
>>>>>>      * single reset fails.
>>>>>>      */
>>>>>>     if (!intel_uc_uses_guc_submission(&gt->uc) &&
>>>>>>         intel_has_reset_engine(gt) && !intel_gt_is_wedged(gt)) {
>>>>>>         local_bh_disable();
>>>>>>         for_each_engine_masked(engine, gt, engine_mask, tmp) {
>>>>>>
>>>>>> You said "However, the heartbeat is still present and is still the 
>>>>>> watchdog by which engine resets are triggered", but now I don't know 
>>>>>> what you meant by this. Does it actually trigger a single engine 
>>>>>> reset in GuC mode? Where in the code does that happen, if this block 
>>>>>> above shows it not taking the engine reset path?
>>>>> i915 sends down the per engine pulse.
>>>>> GuC schedules the pulse.
>>>>> GuC attempts to pre-empt the currently active context.
>>>>> GuC detects the pre-emption timeout.
>>>>> GuC resets the engine.
>>>>>
>>>>> The fundamental process is exactly the same as in execlist mode. 
>>>>> It's just that the above blocks of code (calls to 
>>>>> intel_gt_handle_error and such) are now inside the GuC, not i915.
>>>>>
>>>>> Without the heartbeat going ping, there would be no context 
>>>>> switching and thus no pre-emption, no pre-emption timeout and so no 
>>>>> hang and reset recovery. And GuC cannot send pulses by itself - it 
>>>>> has no ability to generate context workloads. So we need i915 to 
>>>>> send the pings and to gradually raise their priority. But the back 
>>>>> half of the heartbeat code is now inside the GuC. It will simply 
>>>>> never reach the i915 side timeout if GuC is doing the recovery 
>>>>> (unless the heartbeat's final period is too short). We should only 
>>>>> reach the i915 side timeout if GuC itself is toast. At which point 
>>>>> we need the full GT reset to recover the GuC.
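As a rough sketch of that split (illustrative only; loosely based on the 
flow in intel_engine_heartbeat.c, with the two helpers marked below 
invented for brevity):

static void heartbeat_pulse(struct intel_engine_cs *engine)
{
    struct i915_request *rq = engine->heartbeat.systole;

    if (rq && !i915_request_completed(rq)) {
        if (rq->sched.attr.priority < I915_PRIORITY_BARRIER) {
            /*
             * The previous pulse has not run yet: raise its
             * priority. Once it out-ranks the running context,
             * the scheduler (GuC, in GuC mode) attempts a
             * pre-emption; if that times out, GuC resets the
             * engine itself and i915 only sees a G2H message.
             */
            raise_pulse_priority(rq);    /* invented helper */
        } else {
            /*
             * Already at maximum priority and still stuck: in
             * GuC mode this means GuC itself is toast, so fall
             * through to the full GT reset path quoted earlier.
             */
            reset_engine(engine, rq);
        }
        return;
    }

    send_next_pulse(engine);    /* invented helper */
}
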
>>>>
>>>> If the workload is not preempting and reset does not work, like the 
>>>> engine is truly stuck, does the current code hit "stopped heartbeat" 
>>>> or not in GuC mode?
>>> Hang on, where did 'reset does not work' come into this?
>>>
>>> If GuC is alive and the hardware is not broken then no, it won't. 
>>> That's the whole point. GuC does the detection and recovery. The KMD 
>>> will never reach 'stopped heartbeat'.
>>>
>>> If the hardware is broken and the reset does not work then GuC will 
>>> send a 'failed reset' notification to the KMD. The KMD treats that as 
>>> a major error and immediately does a full GT reset. So there is still 
>>> no 'stopped heartbeat'.
>>>
>>> If GuC has died (or a KMD bug has caused sufficient confusion to make 
>>> it think the GuC has died) then yes, you will reach that code. But in 
>>> that case it is not an engine reset that is required, it is a full GT 
>>> reset including a reset of the GuC.
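A sketch of that escalation path (again illustrative only; the handler 
name is invented, though intel_gt_handle_error and ALL_ENGINES are the 
interfaces quoted above):

static void guc_engine_reset_failed(struct intel_engine_cs *engine)
{
    /*
     * GuC reported that it could not reset the engine: treat the
     * hardware as broken and escalate straight to a full GT reset,
     * which also reloads the GuC. The i915 heartbeat timeout never
     * fires on this path either.
     */
    intel_gt_handle_error(engine->gt, ALL_ENGINES, 0,
                          "GuC failed to reset %s", engine->name);
}
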
>>
>> Got it, so what is actually wrong is calling intel_gt_handle_error 
>> with an engine->mask in GuC mode. intel_engine_heartbeat.c/reset_engine 
>> should fork into two (in some way), depending on the backend, so in the 
>> case of GuC it can call a variant of intel_gt_handle_error which would 
>> be explicitly about a full GPU reset only, instead of a sprinkle of 
>> intel_uc_uses_guc_submission in that function. Possibly even offload 
>> the reset to a single per-gt worker, so that if multiple active 
>> engines trigger the reset roughly simultaneously, there is only one 
>> full GPU reset. And it gets correctly labeled as "dead GuC" or something.
>>
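Concretely, the fork being suggested might look something like this: a 
sketch of the idea only, reusing the functions quoted above, with the 
per-gt worker coalescing left out for brevity:

static void
reset_engine(struct intel_engine_cs *engine, struct i915_request *rq)
{
    if (IS_ENABLED(CONFIG_DRM_I915_DEBUG_GEM))
        show_heartbeat(rq, engine);

    if (intel_engine_uses_guc(engine))
        /*
         * Reaching here in GuC mode means GuC itself is toast:
         * ask explicitly for a full GT reset and label it as
         * such, rather than passing an engine mask that
         * intel_gt_handle_error silently widens.
         */
        intel_gt_handle_error(engine->gt, ALL_ENGINES,
                              I915_ERROR_CAPTURE,
                              "dead GuC, stopped heartbeat on %s",
                              engine->name);
    else
        intel_gt_handle_error(engine->gt, engine->mask,
                              I915_ERROR_CAPTURE,
                              "stopped heartbeat on %s",
                              engine->name);
}
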
> Sure. Feel free to re-write the reset code yet again. That's another 
> exceedingly fragile area of the driver.

> However, that is unrelated to this patch set.

It's still okay to talk about improving things in my opinion. Especially 
since I think it is the same issue I raised in late 2019 during internal 
review.

And I don't think it is fair to say that I should go and rewrite it, 
since I do not own the GuC backend area. Especially given the above.

If there is no motivation to improve it now then please at least track 
this, if it isn't already, in that internal Jira which was tracking all 
the areas of the GuC backend which could be improved.

I am also pretty sure if the design was cleaner it would have been more 
obvious to me, or anyone who happens to stumble there, what the purpose 
of intel_engine_heartbeat.c/reset_engine() is in GuC mode. So we 
wouldn't have to spend this long explaining things.

Regards,

Tvrtko
John Harrison Feb. 28, 2022, 7:17 p.m. UTC | #12
On 2/28/2022 07:32, Tvrtko Ursulin wrote:
> [snip]
>
> It's still okay to talk about improving things in my opinion. 
> Especially since I think it is the same issue I raised in late 2019 
> during internal review.
>
> And I don't think it is fair to say that I should go and rewrite it, 
> since I do not own the GuC backend area. Especially given the above.
>
> If there is no motivation to improve it now then please at least track 
> this, if it isn't already, in that internal Jira which was tracking 
> all the areas of the GuC backend which could be improved.
>
> I am also pretty sure if the design was cleaner it would have been 
> more obvious to me, or anyone who happens to stumble there, what the 
> purpose of intel_engine_heartbeat.c/reset_engine() is in GuC mode. So 
> we wouldn't have to spend this long explaining things.
My point is that redesigning it to be cleaner is not just a GuC task. It 
means redesigning the entire reset sequence to be more compatible with 
externally handled resets. That's a big task. Sure, it can be added to 
the todo list, but it is totally not in the scope of this patch set.

This is purely about enabling per engine resets again so that end users 
have a better experience when GPU hangs occur. Or at least, so they do 
not have a greatly worse experience on our newest platforms than they 
did on all the previous ones because we have totally hacked that feature 
out. It is also about getting that feature enabled before we ship too 
many products and the maintainers/architects decide we are no longer 
allowed to turn it on because it is now a behaviour change that end 
users are not expecting, leaving ADL-P/DG2 stuck on full GT only resets 
forever more.

John.


Tvrtko Ursulin March 2, 2022, 11:21 a.m. UTC | #13
On 28/02/2022 19:17, John Harrison wrote:
> [snip]
>
> My point is that redesigning it to be cleaner is not just a GuC task. It 
> means redesigning the entire reset sequence to be more compatible with 
> externally handled resets. That's a big task. Sure, it can be added to 
> the todo list, but it is totally not in the scope of this patch set.

My point was that this is something which was raised years ago ("don't 
just shoe-horn, redesign, refactor"). Anyway, I will stop flogging this 
dead horse and move on to what is below.
> This is purely about enabling per engine resets again so that end users 
> have a better experience when GPU hangs occur. Or at least, so they do 
> not have a greatly worse experience on our newest platforms than they 
> did on all the previous ones because we have totally hacked that feature 
> out. It is also about getting that feature enabled before we ship too 
> many products and the maintainers/architects decide we are no longer 
> allowed to turn it on because it is now a behaviour change that end 
> users are not expecting, leaving ADL-P/DG2 stuck on full GT only resets 
> forever more.

Is any platform with GuC outside force probe already? Either way, 
blocking re-addition of engine resets will not be a thing from the 
maintainers' point of view. Whether the failure to have them was a 
conscious or an accidental miss, we certainly want them back ASAP.

Regards,

Tvrtko
John Harrison March 2, 2022, 5:40 p.m. UTC | #14
On 3/2/2022 03:21, Tvrtko Ursulin wrote:
> [snip]
>
> Is any platform with GuC outside force probe already? Either way, 
> blocking re-addition of engine resets will not be a thing from the 
> maintainers' point of view. Whether the failure to have them was a 
> conscious or an accidental miss, we certainly want them back ASAP.
>
ADL-P is released and running GuC submission.

John.
