Message ID | 20231228045558.536585-1-alan.previn.teres.alexis@intel.com (mailing list archive) |
---|---|
Series | Resolve suspend-resume racing with GuC destroy-context-worker |
On Wed, 2023-12-27 at 20:55 -0800, Teres Alexis, Alan Previn wrote:
> This series is the result of debugging issues root caused to
> races between the GuC's destroyed_worker_func being triggered
> vs repeating suspend-resume cycles with concurrent delayed
> fence signals for engine-freeing.
>
alan:snip.

alan: I did not receive the CI-premerge email where the following was reported:

IGT changes
Possible regressions
igt@i915_selftest@live@gt_pm:
shard-rkl: PASS -> DMESG-FAIL

After going through the error in dmesg and the code, I am confident this failure is not related to the series. This selftest calls rdmsrl functions (that don't do any requests / GuC submissions) but gets back a power reading of zero (the bug reported). So this is unrelated.

Hi @"Vivi, Rodrigo" <rodrigo.vivi@intel.com>, just an FYI note that after the last requested rebase, BAT has now passed twice in a row, so I am confident the failures on rev7 and prior were unrelated and that this series is ready for merging. Thanks again for all your help and patience - this was a long one :)
On Thu, 2024-01-04 at 10:57 +0000, Patchwork wrote:
> Patch Details
> Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev13)
> URL: https://patchwork.freedesktop.org/series/121916/
> State: failure
> Details: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/index.html
> CI Bug Log - changes from CI_DRM_14076_full -> Patchwork_121916v13_full
> Summary
>
> FAILURE
alan:snip
> Here are the unknown changes that may have been introduced in Patchwork_121916v13_full:
>
> IGT changes
> Possible regressions
>
> * igt@gem_eio@wait-wedge-immediate:
> * shard-mtlp: PASS<https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14076/shard-mtlp-3/igt@gem_eio@wait-wedge-immediate.html> -> ABORT<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/shard-mtlp-4/igt@gem_eio@wait-wedge-immediate.html>
>
alan: from the code and dmesg, this is unrelated to GuC context destruction flows. It's reading an MCR register that times out. Additionally, I believe this error is occurring during post-reset-init flows. So it's definitely not doing any context destruction at this point (as reset would have happened sooner).

> Known issues
>
On Thu, Jan 04, 2024 at 05:39:16PM +0000, Teres Alexis, Alan Previn wrote:
> On Thu, 2024-01-04 at 10:57 +0000, Patchwork wrote:
> > Patch Details
> > Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev13)
> > URL: https://patchwork.freedesktop.org/series/121916/
> > State: failure
> > Details: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/index.html
> > CI Bug Log - changes from CI_DRM_14076_full -> Patchwork_121916v13_full
> > Summary
> >
> > FAILURE
> alan:snip
> >
> > Here are the unknown changes that may have been introduced in Patchwork_121916v13_full:
> >
> > IGT changes
> > Possible regressions
> >
> > * igt@gem_eio@wait-wedge-immediate:
> > * shard-mtlp: PASS<https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14076/shard-mtlp-3/igt@gem_eio@wait-wedge-immediate.html> -> ABORT<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/shard-mtlp-4/igt@gem_eio@wait-wedge-immediate.html>
> >
> alan: from the code and dmesg, this is unrelated to GuC context destruction flows.
> It's reading an MCR register that times out. Additionally, I believe this error is occurring during post-reset-init flows.
> So it's definitely not doing any context destruction at this point (as reset would have happened sooner).

Yeah, the MCR timeouts are due to these CI machines running an outdated IFWI, so they're missing an important workaround in the firmware.

Series applied to drm-intel-gt-next. Thanks for the patches and reviews.

Matt

> > Known issues
> >
>
On Thu, Jan 04, 2024 at 05:39:16PM +0000, Teres Alexis, Alan Previn wrote:
> On Thu, 2024-01-04 at 10:57 +0000, Patchwork wrote:
> > Patch Details
> > Series: Resolve suspend-resume racing with GuC destroy-context-worker (rev13)
> > URL: https://patchwork.freedesktop.org/series/121916/
> > State: failure
> > Details: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/index.html
> > CI Bug Log - changes from CI_DRM_14076_full -> Patchwork_121916v13_full
> > Summary
> >
> > FAILURE
> alan:snip
> >
> > Here are the unknown changes that may have been introduced in Patchwork_121916v13_full:
> >
> > IGT changes
> > Possible regressions
> >
> > * igt@gem_eio@wait-wedge-immediate:
> > * shard-mtlp: PASS<https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_14076/shard-mtlp-3/igt@gem_eio@wait-wedge-immediate.html> -> ABORT<https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_121916v13/shard-mtlp-4/igt@gem_eio@wait-wedge-immediate.html>
> >
> alan: from the code and dmesg, this is unrelated to GuC context destruction flows.
> It's reading an MCR register that times out. Additionally, I believe this error is occurring during post-reset-init flows.
> So it's definitely not doing any context destruction at this point (as reset would have happened sooner).

Yep, it is indeed happening once in a while:
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_7659/shard-mtlp-4/igt@gem_eio@wait-wedge-immediate.html

I was going to merge the series now, but then I noticed that Matt had taken care of that.

Thank you all.

> > Known issues
> >
>