Message ID | 20241015145710.2478599-1-nitin.r.gote@intel.com (mailing list archive) |
---|---|
State | New |
Series | [v4] drm/i915/gt: Retry RING_HEAD reset until it get sticks |
On Tue, Oct 15, 2024 at 08:27:10PM +0530, Nitin Gote wrote:
> we see an issue where resets fails because the engine resumes
> from an incorrect RING_HEAD. Since the RING_HEAD doesn't point
> to the remaining requests to re-run, but may instead point into
> the uninitialised portion of the ring, the GPU may be then fed
> invalid instructions from a privileged context, oft pushing the
> GPU into an unrecoverable hang.
>
> If at first the write doesn't succeed, try, try again.
>
> v2: Avoid unnecessary timeout macro (Andi)
>
> v3: Correct comment format (Andi)
>
> v4: Make it generic for all platform as it won't impact (Chris)
>
> Link: https://gitlab.freedesktop.org/drm/intel/-/issues/5432
> Testcase: igt/i915_selftest/hangcheck

The referenced HSW-specific gitlab issue was closed in 2022 and hadn't
been active for a while before that. This patch from Chris was
originally posted as an attachment on that gitlab issue asking if it
helped, but nobody responded that it did/didn't improve the situation so
it may or may not have been relevant to what was originally reported in
that ticket.

Looking in cibuglog, the most similar failures I see today are the ones
getting associated with issue #12310. I.e.,

        <3> [220.415493] i915 0000:00:02.0: [drm] *ERROR* failed to set rcs0 head to zero ctl 00000000 head 00001db8 tail 00000000 start 7fffa000

Are you trying to solve that CI issue or is there a different
user-submitted report somewhere that this patch is trying to address?


Matt

> Signed-off-by: Chris Wilson <chris.p.wilson@linux.intel.com>
> Signed-off-by: Nitin Gote <nitin.r.gote@intel.com>
> ---
>  .../gpu/drm/i915/gt/intel_ring_submission.c   | 31 ++++++++++++++++---
>  1 file changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> index 72277bc8322e..b6b25fe22cb8 100644
> --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> @@ -192,6 +192,7 @@ static bool stop_ring(struct intel_engine_cs *engine)
>  static int xcs_resume(struct intel_engine_cs *engine)
>  {
>  	struct intel_ring *ring = engine->legacy.ring;
> +	ktime_t kt;
>  
>  	ENGINE_TRACE(engine, "ring:{HEAD:%04x, TAIL:%04x}\n",
>  		     ring->head, ring->tail);
> @@ -230,9 +231,27 @@ static int xcs_resume(struct intel_engine_cs *engine)
>  	set_pp_dir(engine);
>  
>  	/* First wake the ring up to an empty/idle ring */
> -	ENGINE_WRITE_FW(engine, RING_HEAD, ring->head);
> +	for ((kt) = ktime_get() + (2 * NSEC_PER_MSEC);
> +			ktime_before(ktime_get(), (kt)); cpu_relax()) {
> +		/*
> +		 * In case of resets fails because engine resumes from
> +		 * incorrect RING_HEAD and then GPU may be then fed
> +		 * to invalid instrcutions, which may lead to unrecoverable
> +		 * hang. So at first write doesn't succeed then try again.
> +		 */
> +		ENGINE_WRITE_FW(engine, RING_HEAD, ring->head);
> +		if (ENGINE_READ_FW(engine, RING_HEAD) == ring->head)
> +			break;
> +	}
> +
>  	ENGINE_WRITE_FW(engine, RING_TAIL, ring->head);
> -	ENGINE_POSTING_READ(engine, RING_TAIL);
> +	if (ENGINE_READ_FW(engine, RING_HEAD) != ENGINE_READ_FW(engine, RING_TAIL)) {
> +		ENGINE_TRACE(engine, "failed to reset empty ring: [%x, %x]: %x\n",
> +			     ENGINE_READ_FW(engine, RING_HEAD),
> +			     ENGINE_READ_FW(engine, RING_TAIL),
> +			     ring->head);
> +		goto err;
> +	}
>  
>  	ENGINE_WRITE_FW(engine, RING_CTL,
>  			RING_CTL_SIZE(ring->size) | RING_VALID);
> @@ -241,12 +260,16 @@ static int xcs_resume(struct intel_engine_cs *engine)
>  	if (__intel_wait_for_register_fw(engine->uncore,
>  					 RING_CTL(engine->mmio_base),
>  					 RING_VALID, RING_VALID,
> -					 5000, 0, NULL))
> +					 5000, 0, NULL)) {
> +		ENGINE_TRACE(engine, "failed to restart\n");
>  		goto err;
> +	}
>  
> -	if (GRAPHICS_VER(engine->i915) > 2)
> +	if (GRAPHICS_VER(engine->i915) > 2) {
>  		ENGINE_WRITE_FW(engine,
>  				RING_MI_MODE, _MASKED_BIT_DISABLE(STOP_RING));
> +		ENGINE_POSTING_READ(engine, RING_MI_MODE);
> +	}
>  
>  	/* Now awake, let it get started */
>  	if (ring->tail != ring->head) {
> --
> 2.25.1
>
Hi Matt,

> -----Original Message-----
> From: Roper, Matthew D <matthew.d.roper@intel.com>
> Sent: Wednesday, October 16, 2024 5:23 AM
> To: Gote, Nitin R <nitin.r.gote@intel.com>
> Cc: intel-gfx@lists.freedesktop.org; Shyti, Andi <andi.shyti@intel.com>; Wilson,
> Chris P <chris.p.wilson@intel.com>; Das, Nirmoy <nirmoy.das@intel.com>;
> Chris Wilson <chris.p.wilson@linux.intel.com>
> Subject: Re: [PATCH v4] drm/i915/gt: Retry RING_HEAD reset until it get sticks
>
> On Tue, Oct 15, 2024 at 08:27:10PM +0530, Nitin Gote wrote:
> > we see an issue where resets fails because the engine resumes from an
> > incorrect RING_HEAD. Since the RING_HEAD doesn't point to the
> > remaining requests to re-run, but may instead point into the
> > uninitialised portion of the ring, the GPU may be then fed invalid
> > instructions from a privileged context, oft pushing the GPU into an
> > unrecoverable hang.
> >
> > If at first the write doesn't succeed, try, try again.
> >
> > v2: Avoid unnecessary timeout macro (Andi)
> >
> > v3: Correct comment format (Andi)
> >
> > v4: Make it generic for all platform as it won't impact (Chris)
> >
> > Link: https://gitlab.freedesktop.org/drm/intel/-/issues/5432
> > Testcase: igt/i915_selftest/hangcheck
>
> The referenced HSW-specific gitlab issue was closed in 2022 and hadn't been
> active for a while before that. This patch from Chris was originally posted as an
> attachment on that gitlab issue asking if it helped, but nobody responded that it
> did/didn't improve the situation so it may or may not have been relevant to
> what was originally reported in that ticket.
>
> Looking in cibuglog, the most similar failures I see today are the ones getting
> associated with issue #12310. I.e.,
>
> <3> [220.415493] i915 0000:00:02.0: [drm] *ERROR* failed to set rcs0
> head to zero ctl 00000000 head 00001db8 tail 00000000 start 7fffa000
>
> Are you trying to solve that CI issue or is there a different user-submitted report
> somewhere that this patch is trying to address?
>
>
> Matt
>

Yes. This patch is for https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12310
I will update the link.

- Nitin

>
> > Signed-off-by: Chris Wilson <chris.p.wilson@linux.intel.com>
> > Signed-off-by: Nitin Gote <nitin.r.gote@intel.com>
> > ---
> >  .../gpu/drm/i915/gt/intel_ring_submission.c   | 31 ++++++++++++++++---
> >  1 file changed, 27 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > index 72277bc8322e..b6b25fe22cb8 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
> > @@ -192,6 +192,7 @@ static bool stop_ring(struct intel_engine_cs
> > *engine)  static int xcs_resume(struct intel_engine_cs *engine)  {
> >  	struct intel_ring *ring = engine->legacy.ring;
> > +	ktime_t kt;
> >
> >  	ENGINE_TRACE(engine, "ring:{HEAD:%04x, TAIL:%04x}\n",
> >  		     ring->head, ring->tail);
> > @@ -230,9 +231,27 @@ static int xcs_resume(struct intel_engine_cs
> *engine)
> >  	set_pp_dir(engine);
> >
> >  	/* First wake the ring up to an empty/idle ring */
> > -	ENGINE_WRITE_FW(engine, RING_HEAD, ring->head);
> > +	for ((kt) = ktime_get() + (2 * NSEC_PER_MSEC);
> > +			ktime_before(ktime_get(), (kt)); cpu_relax()) {
> > +		/*
> > +		 * In case of resets fails because engine resumes from
> > +		 * incorrect RING_HEAD and then GPU may be then fed
> > +		 * to invalid instrcutions, which may lead to unrecoverable
> > +		 * hang. So at first write doesn't succeed then try again.
> > +		 */
> > +		ENGINE_WRITE_FW(engine, RING_HEAD, ring->head);
> > +		if (ENGINE_READ_FW(engine, RING_HEAD) == ring->head)
> > +			break;
> > +	}
> > +
> >  	ENGINE_WRITE_FW(engine, RING_TAIL, ring->head);
> > -	ENGINE_POSTING_READ(engine, RING_TAIL);
> > +	if (ENGINE_READ_FW(engine, RING_HEAD) !=
> ENGINE_READ_FW(engine, RING_TAIL)) {
> > +		ENGINE_TRACE(engine, "failed to reset empty ring: [%x, %x]:
> %x\n",
> > +			     ENGINE_READ_FW(engine, RING_HEAD),
> > +			     ENGINE_READ_FW(engine, RING_TAIL),
> > +			     ring->head);
> > +		goto err;
> > +	}
> >
> >  	ENGINE_WRITE_FW(engine, RING_CTL,
> >  			RING_CTL_SIZE(ring->size) | RING_VALID); @@ -241,12
> +260,16 @@
> > static int xcs_resume(struct intel_engine_cs *engine)
> >  	if (__intel_wait_for_register_fw(engine->uncore,
> >  					 RING_CTL(engine->mmio_base),
> >  					 RING_VALID, RING_VALID,
> > -					 5000, 0, NULL))
> > +					 5000, 0, NULL)) {
> > +		ENGINE_TRACE(engine, "failed to restart\n");
> >  		goto err;
> > +	}
> >
> > -	if (GRAPHICS_VER(engine->i915) > 2)
> > +	if (GRAPHICS_VER(engine->i915) > 2) {
> >  		ENGINE_WRITE_FW(engine,
> >  				RING_MI_MODE,
> _MASKED_BIT_DISABLE(STOP_RING));
> > +		ENGINE_POSTING_READ(engine, RING_MI_MODE);
> > +	}
> >
> >  	/* Now awake, let it get started */
> >  	if (ring->tail != ring->head) {
> > --
> > 2.25.1
> >
>
> --
> Matt Roper
> Graphics Software Engineer
> Linux GPU Platform Enablement
> Intel Corporation
Hi Nitin,

> > > we see an issue where resets fails because the engine resumes from an
> > > incorrect RING_HEAD. Since the RING_HEAD doesn't point to the
> > > remaining requests to re-run, but may instead point into the
> > > uninitialised portion of the ring, the GPU may be then fed invalid
> > > instructions from a privileged context, oft pushing the GPU into an
> > > unrecoverable hang.
> > >
> > > If at first the write doesn't succeed, try, try again.
> > >
> > > v2: Avoid unnecessary timeout macro (Andi)
> > >
> > > v3: Correct comment format (Andi)
> > >
> > > v4: Make it generic for all platform as it won't impact (Chris)
> > >
> > > Link: https://gitlab.freedesktop.org/drm/intel/-/issues/5432
> > > Testcase: igt/i915_selftest/hangcheck
> >
> > The referenced HSW-specific gitlab issue was closed in 2022 and hadn't been
> > active for a while before that. This patch from Chris was originally posted as an
> > attachment on that gitlab issue asking if it helped, but nobody responded that it
> > did/didn't improve the situation so it may or may not have been relevant to
> > what was originally reported in that ticket.
> >
> > Looking in cibuglog, the most similar failures I see today are the ones getting
> > associated with issue #12310. I.e.,
> >
> > <3> [220.415493] i915 0000:00:02.0: [drm] *ERROR* failed to set rcs0
> > head to zero ctl 00000000 head 00001db8 tail 00000000 start 7fffa000
> >
> > Are you trying to solve that CI issue or is there a different user-submitted report
> > somewhere that this patch is trying to address?
> >
> >
> > Matt
> >
>
> Yes. This patch is for https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12310
> I will update the link.

No worries, I can update the link here.

Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Thanks,
Andi
diff --git a/drivers/gpu/drm/i915/gt/intel_ring_submission.c b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
index 72277bc8322e..b6b25fe22cb8 100644
--- a/drivers/gpu/drm/i915/gt/intel_ring_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_ring_submission.c
@@ -192,6 +192,7 @@ static bool stop_ring(struct intel_engine_cs *engine)
 static int xcs_resume(struct intel_engine_cs *engine)
 {
 	struct intel_ring *ring = engine->legacy.ring;
+	ktime_t kt;
 
 	ENGINE_TRACE(engine, "ring:{HEAD:%04x, TAIL:%04x}\n",
 		     ring->head, ring->tail);
@@ -230,9 +231,27 @@ static int xcs_resume(struct intel_engine_cs *engine)
 	set_pp_dir(engine);
 
 	/* First wake the ring up to an empty/idle ring */
-	ENGINE_WRITE_FW(engine, RING_HEAD, ring->head);
+	for ((kt) = ktime_get() + (2 * NSEC_PER_MSEC);
+			ktime_before(ktime_get(), (kt)); cpu_relax()) {
+		/*
+		 * In case of resets fails because engine resumes from
+		 * incorrect RING_HEAD and then GPU may be then fed
+		 * to invalid instrcutions, which may lead to unrecoverable
+		 * hang. So at first write doesn't succeed then try again.
+		 */
+		ENGINE_WRITE_FW(engine, RING_HEAD, ring->head);
+		if (ENGINE_READ_FW(engine, RING_HEAD) == ring->head)
+			break;
+	}
+
 	ENGINE_WRITE_FW(engine, RING_TAIL, ring->head);
-	ENGINE_POSTING_READ(engine, RING_TAIL);
+	if (ENGINE_READ_FW(engine, RING_HEAD) != ENGINE_READ_FW(engine, RING_TAIL)) {
+		ENGINE_TRACE(engine, "failed to reset empty ring: [%x, %x]: %x\n",
+			     ENGINE_READ_FW(engine, RING_HEAD),
+			     ENGINE_READ_FW(engine, RING_TAIL),
+			     ring->head);
+		goto err;
+	}
 
 	ENGINE_WRITE_FW(engine, RING_CTL,
 			RING_CTL_SIZE(ring->size) | RING_VALID);
@@ -241,12 +260,16 @@ static int xcs_resume(struct intel_engine_cs *engine)
 	if (__intel_wait_for_register_fw(engine->uncore,
 					 RING_CTL(engine->mmio_base),
 					 RING_VALID, RING_VALID,
-					 5000, 0, NULL))
+					 5000, 0, NULL)) {
+		ENGINE_TRACE(engine, "failed to restart\n");
 		goto err;
+	}
 
-	if (GRAPHICS_VER(engine->i915) > 2)
+	if (GRAPHICS_VER(engine->i915) > 2) {
 		ENGINE_WRITE_FW(engine,
 				RING_MI_MODE, _MASKED_BIT_DISABLE(STOP_RING));
+		ENGINE_POSTING_READ(engine, RING_MI_MODE);
+	}
 
 	/* Now awake, let it get started */
 	if (ring->tail != ring->head) {
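
For readers without i915 hardware at hand, the write/read-back retry in the
RING_HEAD hunk can be approximated in user space. What follows is only a
rough sketch under stated assumptions, not the driver code: fake_reg,
fake_write(), fake_read(), now_ns(), write_until_sticks() and
reset_to_empty() are hypothetical names invented for this illustration, the
"register" is a plain struct that ignores a configurable number of writes,
and clock_gettime(CLOCK_MONOTONIC) stands in for ktime_get(). Unlike the
patch, whose for loop tests the deadline before every attempt, this sketch
always makes at least one attempt and reports whether the value stuck.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for an MMIO register that may ignore early writes. */
struct fake_reg {
	uint32_t value;		/* what a read currently returns */
	unsigned int drops;	/* number of upcoming writes to ignore */
};

static void fake_write(struct fake_reg *reg, uint32_t v)
{
	if (reg->drops > 0) {
		reg->drops--;	/* simulate a write that did not stick */
		return;
	}
	reg->value = v;
}

static uint32_t fake_read(const struct fake_reg *reg)
{
	return reg->value;
}

/* Monotonic clock in nanoseconds; user-space analogue of ktime_get(). */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Write `want`, read it back, and repeat until the value sticks or the
 * time budget runs out. Returns true if the read-back matched.
 */
static bool write_until_sticks(struct fake_reg *reg, uint32_t want,
			       int64_t budget_ns)
{
	int64_t deadline;

	assert(budget_ns >= 0);
	deadline = now_ns() + budget_ns;

	do {
		fake_write(reg, want);
		if (fake_read(reg) == want)
			return true;
	} while (now_ns() < deadline);

	return false;
}

/*
 * Mirror of the patch's sequence: retry HEAD for up to 2 ms, write TAIL
 * once, then treat HEAD != TAIL as a failed reset to an empty ring.
 */
static bool reset_to_empty(struct fake_reg *head, struct fake_reg *tail,
			   uint32_t pos)
{
	write_until_sticks(head, pos, 2 * 1000000LL);
	fake_write(tail, pos);
	return fake_read(head) == fake_read(tail);
}
```

With head starting at 0x1db8 and never accepting a write, reset_to_empty()
on position 0 returns false, which is the same head/tail mismatch as in the
CI log line quoted earlier in the thread (head 00001db8, tail 00000000). If
the register only drops the first few writes, the retry makes the reset
succeed.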