[v5,0/4] Improve anti-pre-emption w/a for compute workloads

Message ID 20221006213813.1563435-1-John.C.Harrison@Intel.com (mailing list archive)

Message

John Harrison Oct. 6, 2022, 9:38 p.m. UTC
From: John Harrison <John.C.Harrison@Intel.com>

Compute workloads are inherently not pre-emptible on current hardware.
Thus the pre-emption timeout was disabled as a workaround to prevent
unwanted resets. Instead, the hang detection was left to the heartbeat
and its (longer) timeout. This is undesirable with GuC submission as
a heartbeat timeout triggers a full GT reset rather than a per-engine
reset, which is much more destructive. Instead, just bump the
pre-emption timeout
to a big value. Also, update the heartbeat to allow such a long
pre-emption delay in the final heartbeat period.
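
As a rough illustration of the last point (a sketch only, not code from
this series): the final heartbeat period has to be at least as long as
the pre-emption timeout, otherwise the heartbeat would escalate to a full
GT reset before a per-engine reset had a chance to happen. The helper
name and the factor-of-two margin below are assumptions made for the
example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the series: pick the length of the
 * final heartbeat period so that a pending pre-emption timeout can expire
 * (and trigger a per-engine reset) before the heartbeat gives up.
 * The factor of two is an illustrative safety margin.
 */
static uint64_t sketch_final_heartbeat_period_ms(uint64_t heartbeat_interval_ms,
						 uint64_t preempt_timeout_ms)
{
	uint64_t longer;

	/* Stand-in for a range check; inputs are assumed to be clamped already. */
	assert(preempt_timeout_ms <= UINT64_MAX / 2);

	longer = preempt_timeout_ms * 2;

	/* Never shorten the normal interval, only extend it when needed. */
	return longer > heartbeat_interval_ms ? longer : heartbeat_interval_ms;
}
```

With example numbers only: a 2500 ms heartbeat interval and a 7500 ms
pre-emption timeout give a final period of 15000 ms, while a 640 ms
pre-emption timeout leaves the final period at 2500 ms.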

v2: Add clamping helpers.
v3: Remove long timeout algorithm and replace with hard coded value
(review feedback from Tvrtko). Also, fix execlist selftest failure and
fix bug in compute enabling patch related to pre-emption timeouts.
v4: Add multiple BUG_ONs to re-check already range checked values (Tvrtko)
v5: Add FIXMEs and drm_notices about setting non-default heartbeat
periods (Tvrtko)
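
The clamping (v2) and re-checking (v4) ideas can be sketched roughly as
follows. This is an illustration under stated assumptions, not the code
in the patches: the names are hypothetical, the limit is assumed to come
from converting milliseconds into a 32-bit microsecond field, and
assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical ceiling: the largest millisecond value whose microsecond
 * equivalent still fits in 32 bits.
 */
#define SKETCH_MAX_TIMEOUT_MS (UINT32_MAX / 1000u)

/* Clamp a requested timeout into the representable range. */
static uint64_t sketch_clamp_timeout_ms(uint64_t requested_ms)
{
	return requested_ms > SKETCH_MAX_TIMEOUT_MS ?
		SKETCH_MAX_TIMEOUT_MS : requested_ms;
}

/* Convert to microseconds, re-checking the already clamped value. */
static uint32_t sketch_timeout_ms_to_us(uint64_t clamped_ms)
{
	assert(clamped_ms <= SKETCH_MAX_TIMEOUT_MS);
	return (uint32_t)(clamped_ms * 1000u);
}
```

Clamping at the point where the value is set, and asserting again at the
point of conversion, means an out-of-range value is caught even if a new
caller forgets to clamp.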

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (4):
  drm/i915/guc: Limit scheduling properties to avoid overflow
  drm/i915: Fix compute pre-emption w/a to apply to compute engines
  drm/i915: Make the heartbeat play nice with long pre-emption timeouts
  drm/i915: Improve long running compute w/a for GuC submission

 drivers/gpu/drm/i915/Kconfig.profile          |  26 ++++-
 drivers/gpu/drm/i915/gt/intel_engine.h        |   6 ++
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 102 +++++++++++++++---
 .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |  39 +++++++
 drivers/gpu/drm/i915/gt/sysfs_engines.c       |  25 +++--
 drivers/gpu/drm/i915/gt/uc/intel_guc_fwif.h   |  21 ++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   8 ++
 7 files changed, 199 insertions(+), 28 deletions(-)

Comments

John Harrison Oct. 10, 2022, 7:44 p.m. UTC | #1
On 10/6/2022 15:20, Patchwork wrote:
> Project List - Patchwork *Patch Details*
> *Series:* 	Improve anti-pre-emption w/a for compute workloads (rev8)
> *URL:* 	https://patchwork.freedesktop.org/series/100428/
> *State:* 	failure
> *Details:* 
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/index.html
>
>
>   CI Bug Log - changes from CI_DRM_12223 -> Patchwork_100428v8
>
>
>     Summary
>
> *FAILURE*
>
> Serious unknown changes coming with Patchwork_100428v8 absolutely need 
> to be
> verified manually.
>
> If you think the reported changes have nothing to do with the changes
> introduced in Patchwork_100428v8, please notify your bug team to allow 
> them
> to document this new failure mode, which will reduce false positives 
> in CI.
>
> External URL: 
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/index.html
>
>
>     Participating hosts (42 -> 40)
>
> Missing (2): fi-ctg-p8600 fi-hsw-4200u
>
>
>     Possible new issues
>
> Here are the unknown changes that may have been introduced in 
> Patchwork_100428v8:
>
>
>       IGT changes
>
>
>         Possible regressions
>
>   * igt@i915_selftest@live@migrate:
>       o fi-apl-guc: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12223/fi-apl-guc/igt@i915_selftest@live@migrate.html>
>         -> INCOMPLETE
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/fi-apl-guc/igt@i915_selftest@live@migrate.html>
>
The logs seem to suggest the test just stopped (with the actual dmesg0 
link being corrupted at the end). It seems likely there was a kernel panic 
followed by a reboot. Given that this patch set only affects hang 
detection and recovery, and the migrate test is not supposed to hit any 
hangs, it seems very unlikely this failure is related. Certainly, none of 
the previous revisions of this patch set had any problems with the 
live@migrate test.

John.

>  *
>
>
>     Known issues
>
> Here are the changes found in Patchwork_100428v8 that come from known 
> issues:
>
>
>       IGT changes
>
>
>         Issues hit
>
>  *
>
>     igt@gem_exec_suspend@basic-s3@smem:
>
>       o fi-rkl-11600: NOTRUN -> FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/fi-rkl-11600/igt@gem_exec_suspend@basic-s3@smem.html>
>         (fdo#103375 <https://bugs.freedesktop.org/show_bug.cgi?id=103375>)
>  *
>
>     igt@i915_selftest@live@gt_heartbeat:
>
>       o fi-apl-guc: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12223/fi-apl-guc/igt@i915_selftest@live@gt_heartbeat.html>
>         -> DMESG-FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/fi-apl-guc/igt@i915_selftest@live@gt_heartbeat.html>
>         (i915#5334 <https://gitlab.freedesktop.org/drm/intel/issues/5334>)
>  *
>
>     igt@i915_selftest@live@hangcheck:
>
>       o fi-hsw-g3258: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12223/fi-hsw-g3258/igt@i915_selftest@live@hangcheck.html>
>         -> INCOMPLETE
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/fi-hsw-g3258/igt@i915_selftest@live@hangcheck.html>
>         (i915#3303
>         <https://gitlab.freedesktop.org/drm/intel/issues/3303> /
>         i915#4785 <https://gitlab.freedesktop.org/drm/intel/issues/4785>)
>  *
>
>     igt@kms_chamelium@common-hpd-after-suspend:
>
>      o
>
>         fi-rkl-11600: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/fi-rkl-11600/igt@kms_chamelium@common-hpd-after-suspend.html>
>         (fdo#111827 <https://bugs.freedesktop.org/show_bug.cgi?id=111827>)
>
>      o
>
>         bat-dg1-5: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/bat-dg1-5/igt@kms_chamelium@common-hpd-after-suspend.html>
>         (fdo#111827 <https://bugs.freedesktop.org/show_bug.cgi?id=111827>)
>
>  *
>
>     igt@kms_pipe_crc_basic@suspend-read-crc:
>
>       o bat-dg1-5: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/bat-dg1-5/igt@kms_pipe_crc_basic@suspend-read-crc.html>
>         (i915#4078 <https://gitlab.freedesktop.org/drm/intel/issues/4078>)
>  *
>
>     igt@runner@aborted:
>
>       o fi-hsw-g3258: NOTRUN -> FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/fi-hsw-g3258/igt@runner@aborted.html>
>         (fdo#109271
>         <https://bugs.freedesktop.org/show_bug.cgi?id=109271> /
>         i915#4312 <https://gitlab.freedesktop.org/drm/intel/issues/4312>)
>
>
>         Possible fixes
>
>  *
>
>     igt@i915_pm_rpm@module-reload:
>
>       o {bat-rpls-2}: DMESG-WARN
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12223/bat-rpls-2/igt@i915_pm_rpm@module-reload.html>
>         (i915#5537
>         <https://gitlab.freedesktop.org/drm/intel/issues/5537>) ->
>         PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/bat-rpls-2/igt@i915_pm_rpm@module-reload.html>
>  *
>
>     igt@i915_selftest@live@gt_engines:
>
>       o bat-dg1-5: INCOMPLETE
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12223/bat-dg1-5/igt@i915_selftest@live@gt_engines.html>
>         (i915#4418
>         <https://gitlab.freedesktop.org/drm/intel/issues/4418>) ->
>         PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/bat-dg1-5/igt@i915_selftest@live@gt_engines.html>
>  *
>
>     igt@i915_selftest@live@gt_pm:
>
>       o {bat-adln-1}: DMESG-FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12223/bat-adln-1/igt@i915_selftest@live@gt_pm.html>
>         (i915#4258
>         <https://gitlab.freedesktop.org/drm/intel/issues/4258>) ->
>         PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/bat-adln-1/igt@i915_selftest@live@gt_pm.html>
>  *
>
>     igt@i915_selftest@live@requests:
>
>       o {bat-rpls-1}: INCOMPLETE
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12223/bat-rpls-1/igt@i915_selftest@live@requests.html>
>         (i915#4983
>         <https://gitlab.freedesktop.org/drm/intel/issues/4983> /
>         i915#6257
>         <https://gitlab.freedesktop.org/drm/intel/issues/6257>) ->
>         PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/bat-rpls-1/igt@i915_selftest@live@requests.html>
>  *
>
>     igt@i915_suspend@basic-s3-without-i915:
>
>      o
>
>         fi-rkl-11600: INCOMPLETE
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12223/fi-rkl-11600/igt@i915_suspend@basic-s3-without-i915.html>
>         (i915#5982
>         <https://gitlab.freedesktop.org/drm/intel/issues/5982>) ->
>         PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/fi-rkl-11600/igt@i915_suspend@basic-s3-without-i915.html>
>
>      o
>
>         {bat-rpls-2}: FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12223/bat-rpls-2/igt@i915_suspend@basic-s3-without-i915.html>
>         (i915#6559
>         <https://gitlab.freedesktop.org/drm/intel/issues/6559>) ->
>         PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_100428v8/bat-rpls-2/igt@i915_suspend@basic-s3-without-i915.html>
>
> {name}: This element is suppressed. This means it is ignored when 
> computing
> the status of the difference (SUCCESS, WARNING, or FAILURE).
>
>
>     Build changes
>
>   * Linux: CI_DRM_12223 -> Patchwork_100428v8
>
> CI-20190529: 20190529
> CI_DRM_12223: c53a5e48e0405a63cda64682304cd8b391025be3 @ 
> git://anongit.freedesktop.org/gfx-ci/linux
> IGT_7002: 523844c74e7da6b39d856596c28a92f04172035f @ 
> https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
> Patchwork_100428v8: c53a5e48e0405a63cda64682304cd8b391025be3 @ 
> git://anongit.freedesktop.org/gfx-ci/linux
>
>
>       Linux commits
>
> e774225914cc drm/i915: Improve long running compute w/a for GuC submission
> bfcef6af116f drm/i915: Make the heartbeat play nice with long 
> pre-emption timeouts
> 4141fa7b0427 drm/i915: Fix compute pre-emption w/a to apply to compute 
> engines
> eff9e2347441 drm/i915/guc: Limit scheduling properties to avoid overflow
>
Tvrtko Ursulin Oct. 12, 2022, 1:54 p.m. UTC | #2
On 10/10/2022 20:44, John Harrison wrote:
> The logs seem to suggest the test just stopped (with the actual dmesg0 
> link being corrupted at the end). It seems likely there was a kernel panic 
> followed by a reboot. Given that this patch set only affects hang 
> detection and recovery, and the migrate test is not supposed to hit any 
> hangs, it seems very unlikely this failure is related. Certainly, none of 
> the previous revisions of this patch set had any problems with the 
> live@migrate test.

It happened in the past: 
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12221/fi-apl-guc/igt@i915_selftest@live@migrate.html

So I think you can ignore it.

Regards,

Tvrtko