Message ID | 20190921095542.23205-4-chris@chris-wilson.co.uk
---|---
State | New, archived
Series | [1/5] drm/i915/execlists: Refactor -EIO markup of hung requests
On 21/09/2019 10:55, Chris Wilson wrote:
> Due to the nature of preempt-to-busy the execlists active tracking and
> the schedule queue may become temporarily desync'ed (between resubmission
> to HW and its ack from HW). This means that we may have unwound a
> request and passed it back to the virtual engine, but it is still
> inflight on the HW and may even result in a GPU hang. If we detect that
> GPU hang and try to reset, the hanging request->engine will no longer
> match the current engine, which means that the request is not on the
> execlists active list and we should not try to find an older incomplete
> request. Given that we have deduced this must be a request on a virtual
> engine, it is the single active request in the context and so must be
> guilty (as the context is still inflight, it is prevented from being
> executed on another engine as we process the reset).
>
> Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_lrc.c   | 7 +++++--
>  drivers/gpu/drm/i915/gt/intel_reset.c | 4 +---
>  2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index 1b2bacc60300..3eadd294bcd7 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -2326,11 +2326,14 @@ static void reset_csb_pointers(struct intel_engine_cs *engine)
>
>  static struct i915_request *active_request(struct i915_request *rq)
>  {
> -	const struct list_head * const list =
> -		&i915_request_active_timeline(rq)->requests;
>  	const struct intel_context * const ce = rq->hw_context;
>  	struct i915_request *active = NULL;
> +	struct list_head *list;
>
> +	if (!i915_request_is_active(rq)) /* unwound, but incomplete! */

Is it time to store the veng pointer separately in the request so we can
add the assert here? Like
GEM_BUG_ON(!...engine_is_virtual(rq->orig_engine)) ?

> +		return rq;
> +
> +	list = &i915_request_active_timeline(rq)->requests;
>  	list_for_each_entry_from_reverse(rq, list, link) {
>  		if (i915_request_completed(rq))
>  			break;
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index 89048ca8924c..ae68c3786f63 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -42,11 +42,10 @@ static void engine_skip_context(struct i915_request *rq)
>  	struct intel_engine_cs *engine = rq->engine;
>  	struct i915_gem_context *hung_ctx = rq->gem_context;
>
> -	lockdep_assert_held(&engine->active.lock);
> -
>  	if (!i915_request_is_active(rq))
>  		return;
>
> +	lockdep_assert_held(&engine->active.lock);
>  	list_for_each_entry_continue(rq, &engine->active.requests, sched.link)
>  		if (rq->gem_context == hung_ctx)
>  			i915_request_skip(rq, -EIO);
> @@ -123,7 +122,6 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
>  		  rq->fence.seqno,
>  		  yesno(guilty));
>
> -	lockdep_assert_held(&rq->engine->active.lock);
>  	GEM_BUG_ON(i915_request_completed(rq));
>
>  	if (guilty) {

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
Quoting Tvrtko Ursulin (2019-09-23 14:46:27)
>
> On 21/09/2019 10:55, Chris Wilson wrote:
> > Due to the nature of preempt-to-busy the execlists active tracking and
> > the schedule queue may become temporarily desync'ed (between resubmission
> > to HW and its ack from HW). This means that we may have unwound a
> > request and passed it back to the virtual engine, but it is still
> > inflight on the HW and may even result in a GPU hang. If we detect that
> > GPU hang and try to reset, the hanging request->engine will no longer
> > match the current engine, which means that the request is not on the
> > execlists active list and we should not try to find an older incomplete
> > request. Given that we have deduced this must be a request on a virtual
> > engine, it is the single active request in the context and so must be
> > guilty (as the context is still inflight, it is prevented from being
> > executed on another engine as we process the reset).
> >
> > Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >  drivers/gpu/drm/i915/gt/intel_lrc.c   | 7 +++++--
> >  drivers/gpu/drm/i915/gt/intel_reset.c | 4 +---
> >  2 files changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > index 1b2bacc60300..3eadd294bcd7 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > @@ -2326,11 +2326,14 @@ static void reset_csb_pointers(struct intel_engine_cs *engine)
> >
> >  static struct i915_request *active_request(struct i915_request *rq)
> >  {
> > -	const struct list_head * const list =
> > -		&i915_request_active_timeline(rq)->requests;
> >  	const struct intel_context * const ce = rq->hw_context;
> >  	struct i915_request *active = NULL;
> > +	struct list_head *list;
> >
> > +	if (!i915_request_is_active(rq)) /* unwound, but incomplete! */
>
> Is it time to store the veng pointer separately in the request so we can
> add the assert here? Like
> GEM_BUG_ON(!...engine_is_virtual(rq->orig_engine)) ?

intel_engine_is_virtual(rq->hw_context->engine). But this is currently
agnostic, and here we are only asking the question: is this request on
the engine->active.list? That applies to both normal and virtual setups.

Hmm, this bug does apply to normal reset; I only spotted it because we
added the lockdep warnings that picked up that rq->engine was different.
There may be some use for that assertion, but I don't think this is one.
-Chris
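The distinction Chris draws — the engine-agnostic `i915_request_is_active()` test versus Tvrtko's proposed virtual-engine assertion — can be modelled with a minimal sketch. All types and field names below are illustrative stand-ins, not the driver's real structures:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, flattened stand-ins for the i915 types. */
struct engine { bool is_virtual; };
struct context { struct engine *engine; };
struct request {
	struct context *hw_context;
	bool on_active_list;	/* models i915_request_is_active() */
};

/* The agnostic question the patch actually asks: is the request
 * still on the physical engine's execlists active list? */
static bool request_is_active(const struct request *rq)
{
	return rq->on_active_list;
}

/* The stronger check, answered via the context back-pointer as
 * Chris suggests (intel_engine_is_virtual-style), without storing
 * a separate veng pointer in the request. */
static bool request_engine_is_virtual(const struct request *rq)
{
	return rq->hw_context->engine->is_virtual;
}
```

In this model an unwound-but-incomplete request on a virtual engine answers false to the first question and true to the second, while the patch itself only needs the first.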
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 1b2bacc60300..3eadd294bcd7 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2326,11 +2326,14 @@ static void reset_csb_pointers(struct intel_engine_cs *engine)
 
 static struct i915_request *active_request(struct i915_request *rq)
 {
-	const struct list_head * const list =
-		&i915_request_active_timeline(rq)->requests;
 	const struct intel_context * const ce = rq->hw_context;
 	struct i915_request *active = NULL;
+	struct list_head *list;
 
+	if (!i915_request_is_active(rq)) /* unwound, but incomplete! */
+		return rq;
+
+	list = &i915_request_active_timeline(rq)->requests;
 	list_for_each_entry_from_reverse(rq, list, link) {
 		if (i915_request_completed(rq))
 			break;
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 89048ca8924c..ae68c3786f63 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -42,11 +42,10 @@ static void engine_skip_context(struct i915_request *rq)
 	struct intel_engine_cs *engine = rq->engine;
 	struct i915_gem_context *hung_ctx = rq->gem_context;
 
-	lockdep_assert_held(&engine->active.lock);
-
 	if (!i915_request_is_active(rq))
 		return;
 
+	lockdep_assert_held(&engine->active.lock);
 	list_for_each_entry_continue(rq, &engine->active.requests, sched.link)
 		if (rq->gem_context == hung_ctx)
 			i915_request_skip(rq, -EIO);
@@ -123,7 +122,6 @@ void __i915_request_reset(struct i915_request *rq, bool guilty)
 		  rq->fence.seqno,
 		  yesno(guilty));
 
-	lockdep_assert_held(&rq->engine->active.lock);
 	GEM_BUG_ON(i915_request_completed(rq));
 
 	if (guilty) {
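The patched `active_request()` logic can be sketched in a self-contained model: if the request was unwound from the engine (not on the active list), it is the single inflight request of its context and is returned directly; otherwise we walk backwards through the timeline to find the oldest incomplete request of the same context. `struct req` and its fields are simplified stand-ins for the driver's real structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model; the real code walks a struct list_head with
 * list_for_each_entry_from_reverse() and compares hw_context. */
struct req {
	int ctx;		/* hardware context id */
	bool active;		/* on the engine's execlists active list? */
	bool completed;
	struct req *prev;	/* next-older request on the same timeline */
};

static struct req *active_request(struct req *rq)
{
	int ctx = rq->ctx;
	struct req *found = NULL;

	if (!rq->active)	/* unwound, but incomplete! */
		return rq;	/* the one inflight request: it is guilty */

	/* walk towards older requests until the first completed one */
	for (; rq; rq = rq->prev) {
		if (rq->completed)
			break;
		if (rq->ctx == ctx)
			found = rq;
	}
	return found;
}
```

The early return is the whole fix: before it, an unwound request would fall through to the list walk even though it is no longer linked there.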
Due to the nature of preempt-to-busy the execlists active tracking and
the schedule queue may become temporarily desync'ed (between resubmission
to HW and its ack from HW). This means that we may have unwound a
request and passed it back to the virtual engine, but it is still
inflight on the HW and may even result in a GPU hang. If we detect that
GPU hang and try to reset, the hanging request->engine will no longer
match the current engine, which means that the request is not on the
execlists active list and we should not try to find an older incomplete
request. Given that we have deduced this must be a request on a virtual
engine, it is the single active request in the context and so must be
guilty (as the context is still inflight, it is prevented from being
executed on another engine as we process the reset).

Fixes: 22b7a426bbe1 ("drm/i915/execlists: Preempt-to-busy")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c   | 7 +++++--
 drivers/gpu/drm/i915/gt/intel_reset.c | 4 +---
 2 files changed, 6 insertions(+), 5 deletions(-)
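The intel_reset.c half of the patch touches `engine_skip_context()`, whose job is to cancel every later request from the hung context with -EIO. A minimal sketch of that skipping behaviour, using a flat array in place of the driver's `engine->active.requests` list (all names here are illustrative):

```c
#include <assert.h>
#include <errno.h>	/* EIO */

/* Hypothetical flattened model: requests in submission order. */
struct request {
	int gem_context;	/* owning context id */
	int error;		/* 0, or -EIO once skipped */
};

/* Mark every request queued *after* the hung one by the same context
 * as -EIO, mirroring list_for_each_entry_continue() starting past rq.
 * The hung request itself is handled separately by the reset path. */
static void engine_skip_context(struct request *reqs, int nreqs, int hung)
{
	int hung_ctx = reqs[hung].gem_context;
	int i;

	for (i = hung + 1; i < nreqs; i++)
		if (reqs[i].gem_context == hung_ctx)
			reqs[i].error = -EIO;
}
```

The patch's change to the real function is purely about ordering: the `lockdep_assert_held()` moves below the `i915_request_is_active()` early return, so the lock is only asserted once we know the request is actually on this engine's list.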