Message ID | 20180425105923.13430-1-chris@chris-wilson.co.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Chris Wilson <chris@chris-wilson.co.uk> writes: > When WaIdleLiteRestore isn't enough. > > Fixes an odd hang on gen8 (both bsw and bdw) during gem_ctx_switch, Do you have a testcase name? (testcase tag would be nice too) > where by all intents and purposes if we trigger a lite-restore as it is > processing the pipecontrol flushes, the RING is restored to the oword s/RING/RING_HEAD ? > following the command and tries to execute the destination address for > the pipecontrol rather than a valid command. With the theory being that > it doesn't like RING_HEAD being within a cacheline of the restored > RING_TAIL, we can evade that issue by not triggering a lite-restore if > we know we are inside the last request. > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > --- > drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++++++++ > 1 file changed, 13 insertions(+) > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c > index 029901a8fa38..5c50263e45d3 100644 > --- a/drivers/gpu/drm/i915/intel_lrc.c > +++ b/drivers/gpu/drm/i915/intel_lrc.c > @@ -639,6 +639,19 @@ static void execlists_dequeue(struct intel_engine_cs *engine) > if (port_count(&port[1])) > goto unlock; > > + /* > + * Skip invoking a lite-restore if we know we have already > + * started processing the last request queued to HW. This > + * prevents a mystery *unrecoverable* hang on gen8, maybe > + * related to updating TAIL within a cacheline of HEAD? (As Did you try with WA_TAIL_DWORDS 16? -Mika > + * there is still a delay between submitting the ESLP update > + * and HW responding, we may still encounter whatever condition > + * trips up, just less often.) > + */ > + if (i915_seqno_passed(intel_engine_get_seqno(engine), > + last->global_seqno - 1)) > + goto unlock; > + > /* > * WaIdleLiteRestore:bdw,skl > * Apply the wa NOOPs to prevent > -- > 2.17.0 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
Quoting Mika Kuoppala (2018-04-25 12:19:08) > Chris Wilson <chris@chris-wilson.co.uk> writes: > > > When WaIdleLiteRestore isn't enough. > > > > Fixes an odd hang on gen8 (both bsw and bdw) during gem_ctx_switch, > > Do you have a testcase name? (testcase tag would be nice too) Just keep running gem_ctx_switch. Switching between the static set of contexts, no rebind pressure whatsoever. > > where by all intents and purposes if we trigger a lite-restore as it is > > processing the pipecontrol flushes, the RING is restored to the oword > > s/RING/RING_HEAD ? > > > following the command and tries to execute the destination address for > > the pipecontrol rather than a valid command. With the theory being that > > it doesn't like RING_HEAD being within a cacheline of the restored > > RING_TAIL, we can evade that issue by not triggering a lite-restore if > > we know we are inside the last request. > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > --- > > drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++++++++ > > 1 file changed, 13 insertions(+) > > > > diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c > > index 029901a8fa38..5c50263e45d3 100644 > > --- a/drivers/gpu/drm/i915/intel_lrc.c > > +++ b/drivers/gpu/drm/i915/intel_lrc.c > > @@ -639,6 +639,19 @@ static void execlists_dequeue(struct intel_engine_cs *engine) > > if (port_count(&port[1])) > > goto unlock; > > > > + /* > > + * Skip invoking a lite-restore if we know we have already > > + * started processing the last request queued to HW. This > > + * prevents a mystery *unrecoverable* hang on gen8, maybe > > + * related to updating TAIL within a cacheline of HEAD? (As > > Did you try with WA_TAIL_DWORDS 16? Sure can try, but the error state doesn't indicate TAIL==HEAD as would be the issue with WaIdleLiteRestore (restoring to an idle ring wouldn't generate the arbitration event, so no more context switches). -Chris
Quoting Chris Wilson (2018-04-25 12:23:30) > Quoting Mika Kuoppala (2018-04-25 12:19:08) > > Did you try with WA_TAIL_DWORDS 16? > > Sure can try, but the error state doesn't indicate TAIL==HEAD as would > be the issue with WaIdleLiteRestore (restoring to an idle ring wouldn't > generate the arbitration event, so no more context switches). Watch https://patchwork.freedesktop.org/series/42284/ Actually surviving on my bdw -Chris
Quoting Chris Wilson (2018-04-25 12:31:29) > Quoting Chris Wilson (2018-04-25 12:23:30) > > Quoting Mika Kuoppala (2018-04-25 12:19:08) > > > Did you try with WA_TAIL_DWORDS 16? > > > > Sure can try, but the error state doesn't indicate TAIL==HEAD as would > > be the issue with WaIdleLiteRestore (restoring to an idle ring wouldn't > > generate the arbitration event, so no more context switches). > > Watch https://patchwork.freedesktop.org/series/42284/ > > Actually surviving on my bdw And what appears to work just as well is @@ -1387,7 +1387,7 @@ static int execlists_request_alloc(struct i915_request *request) */ request->reserved_space += EXECLISTS_REQUEST_SIZE; - ret = intel_ring_wait_for_space(request->ring, request->reserved_space); + ret = intel_ring_cacheline_align(request); if (ret) return ret; If that doesn't hint towards some weirdness in HW, I don't know what would ;) -Chris
Quoting Chris Wilson (2018-04-25 13:45:45) > Quoting Chris Wilson (2018-04-25 12:31:29) > > Quoting Chris Wilson (2018-04-25 12:23:30) > > > Quoting Mika Kuoppala (2018-04-25 12:19:08) > > > > Did you try with WA_TAIL_DWORDS 16? > > > > > > Sure can try, but the error state doesn't indicate TAIL==HEAD as would > > > be the issue with WaIdleLiteRestore (restoring to an idle ring wouldn't > > > generate the arbitration event, so no more context switches). > > > > Watch https://patchwork.freedesktop.org/series/42284/ > > > > Actually surviving on my bdw > > And what appears to work just as well is > > @@ -1387,7 +1387,7 @@ static int execlists_request_alloc(struct i915_request *request) > */ > request->reserved_space += EXECLISTS_REQUEST_SIZE; > > - ret = intel_ring_wait_for_space(request->ring, request->reserved_space); > + ret = intel_ring_cacheline_align(request); > if (ret) > return ret; > > If that doesn't hint towards some weirdness in HW, I don't know what > would ;) Fwiw, I've now hit GPU hangs with WA_TAIL_DWORDS set to 16 and with the cacheline_align; scratch that theory. -Chris
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index 029901a8fa38..5c50263e45d3 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -639,6 +639,19 @@ static void execlists_dequeue(struct intel_engine_cs *engine) if (port_count(&port[1])) goto unlock; + /* + * Skip invoking a lite-restore if we know we have already + * started processing the last request queued to HW. This + * prevents a mystery *unrecoverable* hang on gen8, maybe + * related to updating TAIL within a cacheline of HEAD? (As + * there is still a delay between submitting the ESLP update + * and HW responding, we may still encounter whatever condition + * trips up, just less often.) + */ + if (i915_seqno_passed(intel_engine_get_seqno(engine), + last->global_seqno - 1)) + goto unlock; + /* * WaIdleLiteRestore:bdw,skl * Apply the wa NOOPs to prevent
When WaIdleLiteRestore isn't enough. Fixes an odd hang on gen8 (both bsw and bdw) during gem_ctx_switch, where by all intents and purposes if we trigger a lite-restore as it is processing the pipecontrol flushes, the RING is restored to the oword following the command and tries to execute the destination address for the pipecontrol rather than a valid command. With the theory being that it doesn't like RING_HEAD being within a cacheline of the restored RING_TAIL, we can evade that issue by not triggering a lite-restore if we know we are inside the last request. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> --- drivers/gpu/drm/i915/intel_lrc.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)