mbox series

[v5,0/1] Delay disabling GuC scheduling of an idle context

Message ID 20221006225121.826257-1-alan.previn.teres.alexis@intel.com (mailing list archive)
Headers show
Series Delay disabling GuC scheduling of an idle context | expand

Message

Alan Previn Oct. 6, 2022, 10:51 p.m. UTC
This series adds a delay before disabling scheduling of the guc-context
when a context has become idle to avoid costly re-registration that may
occur immediately after. The 2nd patch should explain it quite well.

The origin of this series was posted by Matthew Brost back in Oct 2021
(https://patchwork.freedesktop.org/series/96167/). However no real
world workload performance impact was available until recently proving
it's intended results.

This series is a redo of a prior patch that was reverted:
2ccddb758079d0c62ce03e69ee8929bb212f7799 drm/i915/guc: Add delay to
disable scheduling after pin count goes to zero

The cause for the reversion is now fixed here (was not caught due to
issues with CI reporting at that time). Two additional changes included
in this redo and restarting as new series / revs:
 - Resolve race between guc_request_alloc and guc_context_close in
completing the delayed disable-guc-scheduling worker.
 - GT Reset flow properly cancelling delayed disable-sched worker and
   closing contexts that were were still awaiting that delayed task.

Changes from prior revs:
   v4: - Remove intel_context.guc_state.number_committed_requests
         completely (John Harrison).
   v3: - Removed the now unused context_has_committed_requests (John
         Harrison)
       - Renamed the worker from sched_disable_delay to
         sched_disable_delay_work
   v2: - Removed unused function prototype.
   v1: - Changed the added guc's sched_disable_foo debugfs tunable knobs
         to unsigned int type (Tvrtko Ursulin)
       - Added more comments in the race-resolution code change
         between guc_request_alloc and context-close (Tvrtko Ursulin)
       - Increased the timeout on the race-resolution code change
         between guc_request_alloc and context-close (Daniele Ceraolo Spurio)
       - As part of guc reset preparation flow, instead of creating a new
         function (taking a whole round of locks) to deal with the contexts 
         that are in the midst of awaiting the delayed-disable-sched worker
         move that code inside scrub_guc_desc_for_outstanding_g2h before
         we check for 'pending_disable' contexts.

Alan Previn (1):
  drm/i915/guc: Remove intel_context:number_committed_requests counter

Matthew Brost (1):
  drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis

 drivers/gpu/drm/i915/gem/i915_gem_context.c   |   2 +-
 drivers/gpu/drm/i915/gt/intel_context.h       |   8 +
 drivers/gpu/drm/i915/gt/intel_context_types.h |   9 +-
 drivers/gpu/drm/i915/gt/uc/intel_guc.h        |  16 ++
 .../gpu/drm/i915/gt/uc/intel_guc_debugfs.c    |  61 +++++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 228 ++++++++++++++----
 drivers/gpu/drm/i915/i915_selftest.h          |   2 +
 7 files changed, 274 insertions(+), 52 deletions(-)


base-commit: c53a5e48e0405a63cda64682304cd8b391025be3

Comments

Alan Previn Oct. 12, 2022, 4:05 a.m. UTC | #1
On Thu, 2022-10-06 at 23:56 +0000, Patchwork wrote:
> Patch Details
> Series: Delay disabling GuC scheduling of an idle context URL: https://patchwork.freedesktop.org/series/109466/ State:
> failure Details: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109466v1/index.html 
> CI Bug Log - changes from CI_DRM_12223 -> Patchwork_109466v1SummaryFAILURE
> Serious unknown changes coming with Patchwork_109466v1 absolutely need to be
>  verified manually.
> If you think the reported changes have nothing to do with the changes
>  introduced in Patchwork_109466v1, please notify your bug team to allow them
>  to document this new failure mode, which will reduce false positives in CI.
> External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109466v1/index.html
> Participating hosts (42 -> 40)Missing (2): fi-ctg-p8600 fi-hsw-4200u 
> Possible new issuesHere are the unknown changes that may have been introduced in Patchwork_109466v1:
> IGT changesPossible regressions * igt@i915_selftest@live@migrate:fi-bdw-gvtdvm: PASS -> DMESG-WARN

I believe this is unrelated failure since the series was only changing guc code paths and this is a BDW test.
Additionally, the warning reported was about a memory-free while a lock was held but the mutex being reported
was nothing to do with GuC while the series being tested is only every adding code that use or depend on
guc->context_lookup or ce->guc_state.lock .
Alan Previn Oct. 26, 2022, 7:02 p.m. UTC | #2
I verified that 2 of the 3 the errors reported below are on platforms that don't support GuC and the dmesgs confirm guc was disabled.
The remaining ICL one, we know ICL doesnt support GuC .. also, additionally, the error was on a display IGT where CRCs were failing.
That said these errors I believe are unrelated.
...alan


On Wed, 2022-10-26 at 18:02 +0000, Patchwork wrote:
Patch Details
Series: Delay disabling GuC scheduling of an idle context (rev2)
URL:    https://patchwork.freedesktop.org/series/109466/
State:  failure
Details:        https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109466v2/index.html
CI Bug Log - changes from CI_DRM_12295_full -> Patchwork_109466v2_full
Summary

FAILURE

Serious unknown changes coming with Patchwork_109466v2_full absolutely need to be
verified manually.

If you think the reported changes have nothing to do with the changes
introduced in Patchwork_109466v2_full, please notify your bug team to allow them
to document this new failure mode, which will reduce false positives in CI.

Participating hosts (9 -> 11)

Additional (2): shard-rkl shard-dg1

Possible new issues

Here are the unknown changes that may have been introduced in Patchwork_109466v2_full:

IGT changes
Possible regressions

  *   igt@i915_selftest@live@hangcheck:

     *   shard-skl: NOTRUN -> INCOMPLETE<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109466v2/shard-skl7/igt@i915_selftest@live@hangcheck.html> +1 similar issue
  *   igt@kms_flip@flip-vs-suspend-interruptible@c-edp1:

     *   shard-skl: PASS<https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12295/shard-skl1/igt@kms_flip@flip-vs-suspend-interruptible@c-edp1.html> -> INCOMPLETE<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109466v2/shard-skl3/igt@kms_flip@flip-vs-suspend-interruptible@c-edp1.html>
  *   igt@kms_frontbuffer_tracking@psr-1p-primscrn-cur-indfb-draw-render:

     *   shard-iclb: PASS<https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12295/shard-iclb6/igt@kms_frontbuffer_tracking@psr-1p-primscrn-cur-indfb-draw-render.html> -> FAIL<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109466v2/shard-iclb1/igt@kms_frontbuffer_tracking@psr-1p-primscrn-cur-indfb-draw-render.html>

Suppressed

The following results come from untrusted machines, tests, or statuses.
They do not affect the overall result.

  *   igt@sysfs_preempt_timeout@idempotent@rcs0:
     *   {shard-dg1}: NOTRUN -> FAIL<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109466v2/shard-dg1-18/igt@sysfs_preempt_timeout@idempotent@rcs0.html> +4 similar issues

Known issues