
[v9,0/2] Resolve suspend-resume racing with GuC destroy-context-worker

Message ID: 20231228045558.536585-1-alan.previn.teres.alexis@intel.com

Message

Teres Alexis, Alan Previn Dec. 28, 2023, 4:55 a.m. UTC
This series is the result of debugging issues root-caused to
races between the GuC's destroyed_worker_func being triggered
and repeated suspend-resume cycles with concurrent, delayed
fence signals that free engines.

The reproduction steps require that an app is launched right
before the start of the suspend cycle, where it creates a
new gem context and submits a tiny workload that completes
in the middle of the suspend cycle. However, this app uses
dma-buf sharing or dma-fence with non-GPU objects or signals
that eventually trigger a FENCE_FREE via __i915_sw_fence_notify,
which connects to engines_notify -> free_engines_rcu ->
intel_context_put -> kref_put(&ce->ref..) and queues the worker
after the GuC's CTB has been disabled (i.e. after i915-gem's
suspend-late).
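
For illustration, here is a minimal, hand-written sketch of the tail
end of that callstack. The field and helper names approximate
intel_guc_submission.c but are simplified, so treat them as
assumptions rather than the exact driver code:

static void guc_context_destroy(struct kref *kref)
{
	struct intel_context *ce = container_of(kref, typeof(*ce), ref);
	struct intel_guc *guc = ce_to_guc(ce);

	/* ... move ce onto guc->submission_state.destroyed_contexts ... */

	/*
	 * If suspend-late has already disabled the CTB, this queue_work
	 * happens too late: the worker can no longer send the
	 * context-deregistration H2G, and nothing flushes it.
	 */
	queue_work(system_unbound_wq,
		   &guc->submission_state.destroyed_worker);
}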

This sequence is a corner case and required repeating this
app->suspend->resume cycle ~1500 times across 4 identical
systems to see it once. That said, based on the above callstack,
it is clear that merely flushing the context-destruction worker,
which is obviously missing and needed, isn't sufficient.

Because of that, this series adds additional patches besides
the obvious (Patch #1) flushing of the worker during the
suspend flows. It also (Patch #2) closes a race between
sending the context-deregistration H2G and the CTB getting
disabled in the midst of it (by detecting the failure and
unrolling the guc-lrc-unpin flow), and adds an additional
rcu_barrier in the gem-suspend flow to purge outstanding
RCU-deferred tasks that may include context destruction.
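
Roughly, the shape of those two fixes looks like the sketch below.
This is illustrative only - the helper names and call sites are
assumptions based on this cover letter, not the exact patch contents:

/* Patch #1: flush the destroy worker before the CTB goes down. */
static void intel_guc_submission_flush_work(struct intel_guc *guc)
{
	flush_work(&guc->submission_state.destroyed_worker);
}

/*
 * Patch #2 (gem-suspend side): make sure RCU-deferred frees, which
 * can drop the final intel_context_put, have run before we wait
 * for idle.
 */
static void wait_for_suspend(struct intel_gt *gt)
{
	rcu_barrier();	/* flush free_engines_rcu and friends */

	/* ... existing wait-for-idle / wedge-on-timeout logic ... */
}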

This series was tested and confirmed to be reliably working
after running ~1500 suspend-resume cycles on 4 machines
concurrently.

Changes from prior revs:
   v8: - Rebase again to resolve conflicts (Rodrigo).
   v7: - Rebase on latest drm-tip.
   v6: - Don't hold the spinlock while calling deregister_context,
         which can take longer. (Rodrigo)
       - Fix / improve comments. (Rodrigo)
       - Release the ce->guc_state.lock before calling
         deregister_context and retake it if we fail (John/Daniele);
         see the sketch after this list.
   v5: - Remove Patch #3, which doesn't solve this exact bug
         but can be a separate patch. (Tvrtko)
   v4: - In Patch #2, change the position of the calls into
         rcu_barrier based on latest testing data. (Alan/Anshuman).
       - In Patch #3, fix the timeout value selection for the
         final gt-pm idle-wait that was incorrectly using a 'ns'
         #define as a millisecond timeout.
   v3: - In Patch #3, when deregister_context fails, instead
         of calling intel_gt_pm_put (which might sleep), call
         __intel_wakeref_put (without the ASYNC flag). (Rodrigo/Anshuman)
       - In wait_for_suspend, add an rcu_barrier before we
         proceed to wait for idle. (Anshuman)
   v2: - Patch #2: Restructure code in guc_lrc_desc_unpin so
         it's more readable to differentiate (1) direct guc-id
         cleanup vs. (2) sending the H2G ctx-destroy action
         vs. (3) the unrolling steps if the H2G fails.
       - Patch #2 Add a check to close the race sooner by checking
         for intel_guc_is_ready from destroyed_worker_func.
       - Patch #2: When guc_submission_send_busy_loop gets a
         failure from intel_guc_send_busy_loop, we need to undo,
         i.e. decrement, the outstanding_submission_g2h.
       - Patch #3 In wait_for_suspend, fix checking of return from
         intel_gt_pm_wait_timeout_for_idle to now use -ETIMEDOUT
         and add documentation for intel_wakeref_wait_for_idle.
         (Rodrigo).
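
For reference, a hand-written sketch of the v6 locking shape called
out above (assumed structure only; the actual error-path bookkeeping
in Patch #2 is more involved):

/* Sketch of the guc_lrc_desc_unpin flow after v6 - illustrative only. */
static void guc_lrc_desc_unpin(struct intel_context *ce)
{
	struct intel_guc *guc = ce_to_guc(ce);
	unsigned long flags;

	spin_lock_irqsave(&ce->guc_state.lock, flags);
	/* ... mark the context as being destroyed ... */
	spin_unlock_irqrestore(&ce->guc_state.lock, flags);

	/* Send the H2G without holding the lock - it can take a while. */
	if (deregister_context(ce, ce->guc_id.id)) {
		/*
		 * The CTB is gone (e.g. suspend raced us): retake the
		 * lock and unroll the state so a later cleanup can
		 * finish the job.
		 */
		spin_lock_irqsave(&ce->guc_state.lock, flags);
		/* ... undo the destroyed marking, keep the guc_id ... */
		spin_unlock_irqrestore(&ce->guc_state.lock, flags);
	}
}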

Alan Previn (2):
  drm/i915/guc: Flush context destruction worker at suspend
  drm/i915/guc: Close deregister-context race against CT-loss

 drivers/gpu/drm/i915/gem/i915_gem_pm.c        | 10 +++
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 78 +++++++++++++++++--
 .../gpu/drm/i915/gt/uc/intel_guc_submission.h |  2 +
 drivers/gpu/drm/i915/gt/uc/intel_uc.c         |  2 +
 4 files changed, 87 insertions(+), 5 deletions(-)


base-commit: 56c351e9f8c3ca9389a924ff2a46b34b50fb37dd

Comments

Teres Alexis, Alan Previn Jan. 2, 2024, 5:40 p.m. UTC | #1
On Wed, 2023-12-27 at 20:55 -0800, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered
> vs repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
> 
alan:snip.

alan: I did not receive the CI-premerge email where the following was reported:
      IGT changes
      Possible regressions
      igt@i915_selftest@live@gt_pm:
      shard-rkl: PASS -> DMESG-FAIL
After going through the error in dmesg and the code, I am confident this failure is not
related to the series. This selftest calls rdmsrl functions (which don't do
any requests / GuC submissions) but gets back a power reading of zero (the reported bug).
So this is unrelated.


Hi @"Vivi, Rodrigo" <rodrigo.vivi@intel.com>, just an FYI note that after the last
requested rebase, BAT passed twice in a row now so i am confident failures on rev7
and prior was unrelated and that this series is ready for merging. Thanks again
for all your help and patiences - this was a long one  :)
Teres Alexis, Alan Previn Jan. 4, 2024, 5:39 p.m. UTC | #2
On Thu, 2024-01-04 at 10:57 +0000, Patchwork wrote:
> Patch Details
> Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev13)
> URL:    https://patchwork.freedesktop.org/series/121916/
> State:  failure
> Details:        https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/index.html
> CI Bug Log - changes from CI_DRM_14076_full -> Patchwork_121916v13_full
> Summary
> 
> FAILURE
alan:snip


> Here are the unknown changes that may have been introduced in Patchwork_121916v13_full:
> 
> IGT changes
> Possible regressions
> 
>   *   igt@gem_eio@wait-wedge-immediate:
>      *   shard-mtlp: PASS<https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14076/shard-mtlp-3/igt@gem_eio@wait-wedge-immediate.html> -> ABORT<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/shard-mtlp-4/igt@gem_eio@wait-wedge-immediate.html>
> 
alan: From the code and dmesg, this is unrelated to GuC context destruction flows.
It's reading an MCR register that times out. Additionally, I believe this error is occurring
during post-reset-init flows, so it's definitely not doing any context destruction at this
point (a reset would have happened sooner).
> Known issues
>
Matt Roper Jan. 9, 2024, 5:41 p.m. UTC | #3
On Thu, Jan 04, 2024 at 05:39:16PM +0000, Teres Alexis, Alan Previn wrote:
> On Thu, 2024-01-04 at 10:57 +0000, Patchwork wrote:
> > Patch Details
> > Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev13)
> > URL:    https://patchwork.freedesktop.org/series/121916/
> > State:  failure
> > Details:        https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/index.html
> > CI Bug Log - changes from CI_DRM_14076_full -> Patchwork_121916v13_full
> > Summary
> > 
> > FAILURE
> alan:snip
> 
> 
> > Here are the unknown changes that may have been introduced in Patchwork_121916v13_full:
> > 
> > IGT changes
> > Possible regressions
> > 
> >   *   igt@gem_eio@wait-wedge-immediate:
> >      *   shard-mtlp: PASS<https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14076/shard-mtlp-3/igt@gem_eio@wait-wedge-immediate.html> -> ABORT<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/shard-mtlp-4/igt@gem_eio@wait-wedge-immediate.html>
> > 
> alan: From the code and dmesg, this is unrelated to GuC context destruction flows.
> It's reading an MCR register that times out. Additionally, I believe this error is occurring
> during post-reset-init flows, so it's definitely not doing any context destruction at this
> point (a reset would have happened sooner).

Yeah, the MCR timeouts are due to these CI machines running an outdated
IFWI, so they're missing an important workaround in the firmware.

Series applies to drm-intel-gt-next.  Thanks for the patches and
reviews.


Matt

> > Known issues
> > 
>
Rodrigo Vivi Jan. 9, 2024, 10:42 p.m. UTC | #4
On Thu, Jan 04, 2024 at 05:39:16PM +0000, Teres Alexis, Alan Previn wrote:
> On Thu, 2024-01-04 at 10:57 +0000, Patchwork wrote:
> > Patch Details
> > Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev13)
> > URL:    https://patchwork.freedesktop.org/series/121916/
> > State:  failure
> > Details:        https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/index.html
> > CI Bug Log - changes from CI_DRM_14076_full -> Patchwork_121916v13_full
> > Summary
> > 
> > FAILURE
> alan:snip
> 
> 
> > Here are the unknown changes that may have been introduced in Patchwork_121916v13_full:
> > 
> > IGT changes
> > Possible regressions
> > 
> >   *   igt@gem_eio@wait-wedge-immediate:
> >      *   shard-mtlp: PASS<https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14076/shard-mtlp-3/igt@gem_eio@wait-wedge-immediate.html> -> ABORT<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/shard-mtlp-4/igt@gem_eio@wait-wedge-immediate.html>
> > 
> alan: From the code and dmesg, this is unrelated to GuC context destruction flows.
> It's reading an MCR register that times out. Additionally, I believe this error is occurring
> during post-reset-init flows, so it's definitely not doing any context destruction at this
> point (a reset would have happened sooner).

Yep, it is indeed happening once in a while:

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_7659/shard-mtlp-4/igt@gem_eio@wait-wedge-immediate.html

I was going to merge the series now, but then I noticed that Matt had taken care of that.
Thank you all.

> > Known issues
> > 
>