Message ID: 20200121100927.114886-1-chris@chris-wilson.co.uk (mailing list archive)
State: New, archived
Series: drm/i915/execlists: Reclaim hanging virtual request
On 21/01/2020 10:09, Chris Wilson wrote:
> If we encounter a hang on a virtual engine, as we process the hang the
> request may already have been moved back to the virtual engine (we are
> processing the hang on the physical engine). We need to reclaim the
> request from the virtual engine so that the locking is consistent and
> local to the real engine on which we will hold the request for error
> state capturing.
>
> Fixes: 748317386afb ("drm/i915/execlists: Offline error capture")
> Testcase: igt/gem_exec_balancer/hang
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_lrc.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index 3a30767ff0c4..a0acf1898c1e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -2582,6 +2582,26 @@ static void execlists_capture(struct intel_engine_cs *engine)
>  	cap->rq = active_request(cap->rq->context->timeline, cap->rq);
>  	GEM_BUG_ON(!cap->rq);
>
> +	if (cap->rq->engine != engine) { /* preempted virtual engine */
> +		struct virtual_engine *ve = to_virtual_engine(cap->rq->engine);
> +		unsigned long flags;
> +
> +		/*
> +		 * An unsubmitted request along a virtual engine will
> +		 * remain on the active (this) engine until we are able
> +		 * to process the context switch away (and so mark the
> +		 * context as no longer in flight). That cannot have happened
> +		 * yet, otherwise we would not be hanging!
> +		 */
> +		spin_lock_irqsave(&ve->base.active.lock, flags);
> +		GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine);
> +		GEM_BUG_ON(ve->request != cap->rq);
> +		ve->request = NULL;
> +		spin_unlock_irqrestore(&ve->base.active.lock, flags);
> +
> +		cap->rq->engine = engine;
> +	}

Conceptually I think this belongs in execlists_hold, not capture, since
hold has the responsibility to hold correctly.

Then also, this is all about a race with __unwind_incomplete_requests,
yes? If so it would need to be done under the engine->active.lock, or
this can still happen between now and execlists_hold.

Regards,

Tvrtko

> +
>  	/*
>  	 * Remove the request from the execlists queue, and take ownership
>  	 * of the request. We pass it to our worker who will _slowly_ compress
Quoting Tvrtko Ursulin (2020-01-21 11:20:36)
> On 21/01/2020 10:09, Chris Wilson wrote:
> > If we encounter a hang on a virtual engine, as we process the hang the
> > request may already have been moved back to the virtual engine (we are
> > processing the hang on the physical engine). We need to reclaim the
> > request from the virtual engine so that the locking is consistent and
> > local to the real engine on which we will hold the request for error
> > state capturing.
[snip]
> Conceptually I think this belongs in execlists_hold, not capture, since
> hold has the responsibility to hold correctly.

Yeah, I didn't put it into execlists_hold() as I thought that would be
clearer without it, and that this was more a side-effect of
process_csb() vs __unwind_incomplete_requests(), and didn't want to mix
it up with the question of engine->active.lock.

Agreed that execlists_capture() does not describe it well, so it
probably shouldn't be called directly from here.

> Then also, this is all about a race with __unwind_incomplete_requests,
> yes? If so it would need to be done under the engine->active.lock, or
> this can still happen between now and execlists_hold.

The serialisation here is on process_csb(): since we have prevented the
tasklet being called and own the reset on the engine, we know that
execlists_schedule_out() cannot be called on this engine, and so we own
the virtual request. It is not protected by the engine->active.lock.

So happy with moving it to before the lock in execlists_hold()?
-Chris
Quoting Chris Wilson (2020-01-21 11:33:36)
> Quoting Tvrtko Ursulin (2020-01-21 11:20:36)
[snip]
> > Conceptually I think this belongs in execlists_hold, not capture,
> > since hold has the responsibility to hold correctly.
>
> Yeah, I didn't put it into execlists_hold() as I thought that would be
> clearer without it, and that this was more a side-effect of
> process_csb() vs __unwind_incomplete_requests(), and didn't want to mix
> it up with the question of engine->active.lock.
>
> Agreed that execlists_capture() does not describe it well, so it
> probably shouldn't be called directly from here.
>
> > Then also, this is all about a race with __unwind_incomplete_requests,
> > yes? If so it would need to be done under the engine->active.lock, or
> > this can still happen between now and execlists_hold.
>
> The serialisation here is on process_csb(): since we have prevented the
> tasklet being called and own the reset on the engine, we know that
> execlists_schedule_out() cannot be called on this engine, and so we own
> the virtual request. It is not protected by the engine->active.lock.
>
> So happy with moving it to before the lock in execlists_hold()?

Bleugh. Putting it in execlists_hold() suggests adding something like:

static void execlists_hold(struct intel_engine_cs *engine,
			   struct i915_request *rq)
{
	if (rq->engine != engine) { /* preempted virtual engine */
		struct virtual_engine *ve = to_virtual_engine(rq->engine);
		unsigned long flags;

		/* foo */
		GEM_BUG_ON(!test_bit(I915_RESET_ENGINE + engine->id,
				     &engine->gt->reset.flags));

		/*
		 * An unsubmitted request along a virtual engine will
		 * remain on the active (this) engine until we are able
		 * to process the context switch away (and so mark the
		 * context as no longer in flight). That cannot have happened
		 * yet, otherwise we would not be hanging!
		 */
		spin_lock_irqsave(&ve->base.active.lock, flags);
		GEM_BUG_ON(intel_context_inflight(rq->context) != engine);
		GEM_BUG_ON(ve->request != rq);
		ve->request = NULL;
		spin_unlock_irqrestore(&ve->base.active.lock, flags);

		rq->engine = engine;
	}

	spin_lock_irq(&engine->active.lock);

	/*
	 * Transfer this request onto the hold queue to prevent it
	 * being resubmitted to HW (and potentially completed) before we
	 * have released it. Since we may have already submitted following
	 * requests, we need to remove those as well.
	 */
	GEM_BUG_ON(i915_request_on_hold(rq));
	GEM_BUG_ON(rq->engine != engine);
	__execlists_hold(rq);

	spin_unlock_irq(&engine->active.lock);
}

Writing down the serialisation rules does suggest that execlists_hold()
is not as generic as before. (On the other hand, doing so will help with
writing the selftest for this bug, and engine->suspend() is a completely
different kettle of fish that plays games with engine->active.hold
directly.)
-Chris
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 3a30767ff0c4..a0acf1898c1e 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2582,6 +2582,26 @@ static void execlists_capture(struct intel_engine_cs *engine)
 	cap->rq = active_request(cap->rq->context->timeline, cap->rq);
 	GEM_BUG_ON(!cap->rq);
 
+	if (cap->rq->engine != engine) { /* preempted virtual engine */
+		struct virtual_engine *ve = to_virtual_engine(cap->rq->engine);
+		unsigned long flags;
+
+		/*
+		 * An unsubmitted request along a virtual engine will
+		 * remain on the active (this) engine until we are able
+		 * to process the context switch away (and so mark the
+		 * context as no longer in flight). That cannot have happened
+		 * yet, otherwise we would not be hanging!
+		 */
+		spin_lock_irqsave(&ve->base.active.lock, flags);
+		GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine);
+		GEM_BUG_ON(ve->request != cap->rq);
+		ve->request = NULL;
+		spin_unlock_irqrestore(&ve->base.active.lock, flags);
+
+		cap->rq->engine = engine;
+	}
+
 	/*
 	 * Remove the request from the execlists queue, and take ownership
 	 * of the request. We pass it to our worker who will _slowly_ compress
If we encounter a hang on a virtual engine, as we process the hang the
request may already have been moved back to the virtual engine (we are
processing the hang on the physical engine). We need to reclaim the
request from the virtual engine so that the locking is consistent and
local to the real engine on which we will hold the request for error
state capturing.

Fixes: 748317386afb ("drm/i915/execlists: Offline error capture")
Testcase: igt/gem_exec_balancer/hang
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)