drm/i915/execlists: Reclaim hanging virtual request
diff mbox series

Message ID 20200121100927.114886-1-chris@chris-wilson.co.uk
State New
Headers show
Series
  • drm/i915/execlists: Reclaim hanging virtual request
Related show

Commit Message

Chris Wilson Jan. 21, 2020, 10:09 a.m. UTC
If we encounter a hang on a virtual engine, as we process the hang the
request may already have been moved back to the virtual engine (we are
processing the hang on the physical engine). We need to reclaim the
request from the virtual engine so that the locking is consistent and
local to the real engine on which we will hold the request for error
state capturing.

Fixes: 748317386afb ("drm/i915/execlists: Offline error capture")
Testcase: igt/gem_exec_balancer/hang
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Comments

Tvrtko Ursulin Jan. 21, 2020, 11:20 a.m. UTC | #1
On 21/01/2020 10:09, Chris Wilson wrote:
> If we encounter a hang on a virtual engine, as we process the hang the
> request may already have been moved back to the virtual engine (we are
> processing the hang on the physical engine). We need to reclaim the
> request from the virtual engine so that the locking is consistent and
> local to the real engine on which we will hold the request for error
> state capturing.
> 
> Fixes: 748317386afb ("drm/i915/execlists: Offline error capture")
> Testcase: igt/gem_exec_balancer/hang
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_lrc.c | 20 ++++++++++++++++++++
>   1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index 3a30767ff0c4..a0acf1898c1e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -2582,6 +2582,26 @@ static void execlists_capture(struct intel_engine_cs *engine)
>   	cap->rq = active_request(cap->rq->context->timeline, cap->rq);
>   	GEM_BUG_ON(!cap->rq);
>   
> +	if (cap->rq->engine != engine) { /* preempted virtual engine */
> +		struct virtual_engine *ve = to_virtual_engine(cap->rq->engine);
> +		unsigned long flags;
> +
> +		/*
> +		 * An unsubmitted request along a virtual engine will
> +		 * remain on the active (this) engine until we are able
> +		 * to process the context switch away (and so mark the
> +		 * context as no longer in flight). That cannot have happened
> +		 * yet, otherwise we would not be hanging!
> +		 */
> +		spin_lock_irqsave(&ve->base.active.lock, flags);
> +		GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine);
> +		GEM_BUG_ON(ve->request != cap->rq);
> +		ve->request = NULL;
> +		spin_unlock_irqrestore(&ve->base.active.lock, flags);
> +
> +		cap->rq->engine = engine;
> +	}

Conceptually this I think belongs in execlists_hold, not capture. Since 
hold has the responsibility to hold correctly.

Then also, this is all about a race with __unwind_incomplete_requests, 
yes? If so would need to be done under the engine->active.lock or this 
can still happen between now and execlists_hold.

Regards,

Tvrtko

> +
>   	/*
>   	 * Remove the request from the execlists queue, and take ownership
>   	 * of the request. We pass it to our worker who will _slowly_ compress
>
Chris Wilson Jan. 21, 2020, 11:33 a.m. UTC | #2
Quoting Tvrtko Ursulin (2020-01-21 11:20:36)
> 
> On 21/01/2020 10:09, Chris Wilson wrote:
> > If we encounter a hang on a virtual engine, as we process the hang the
> > request may already have been moved back to the virtual engine (we are
> > processing the hang on the physical engine). We need to reclaim the
> > request from the virtual engine so that the locking is consistent and
> > local to the real engine on which we will hold the request for error
> > state capturing.
> > 
> > Fixes: 748317386afb ("drm/i915/execlists: Offline error capture")
> > Testcase: igt/gem_exec_balancer/hang
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_lrc.c | 20 ++++++++++++++++++++
> >   1 file changed, 20 insertions(+)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > index 3a30767ff0c4..a0acf1898c1e 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > @@ -2582,6 +2582,26 @@ static void execlists_capture(struct intel_engine_cs *engine)
> >       cap->rq = active_request(cap->rq->context->timeline, cap->rq);
> >       GEM_BUG_ON(!cap->rq);
> >   
> > +     if (cap->rq->engine != engine) { /* preempted virtual engine */
> > +             struct virtual_engine *ve = to_virtual_engine(cap->rq->engine);
> > +             unsigned long flags;
> > +
> > +             /*
> > +              * An unsubmitted request along a virtual engine will
> > +              * remain on the active (this) engine until we are able
> > +              * to process the context switch away (and so mark the
> > +              * context as no longer in flight). That cannot have happened
> > +              * yet, otherwise we would not be hanging!
> > +              */
> > +             spin_lock_irqsave(&ve->base.active.lock, flags);
> > +             GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine);
> > +             GEM_BUG_ON(ve->request != cap->rq);
> > +             ve->request = NULL;
> > +             spin_unlock_irqrestore(&ve->base.active.lock, flags);
> > +
> > +             cap->rq->engine = engine;
> > +     }
> 
> Conceptually this I think belongs in execlists_hold, not capture. Since 
> hold has the responsibility to hold correctly.

Yeah, I didn't put it into execlists_hold() as thought that would be
clearer without it, and that this was more a side-effect of
process_csb() vs __unwind_incomplete_requests(), and didn't want to mix
up with the question of engine->active.lock

Agreed that execlists_capture() does not describe it well, so probably
shouldn't be called directly from here.

> Then also, this is all about a race with __unwind_incomplete_requests, 
> yes? If so would need to be done under the engine->active.lock or this 
> can still happen between now and execlists_hold.

The serialisation here is on process_csb(), since we have prevented the
tasklet being called and own the reset on the engine, we know that
execlists_schedule_out() can't be called on this engine, and so we own
the virtual request. It's not protected by the engine->active.lock.

So happy with moving it to before the lock in execlists_hold()?
-Chris
Chris Wilson Jan. 21, 2020, 11:44 a.m. UTC | #3
Quoting Chris Wilson (2020-01-21 11:33:36)
> Quoting Tvrtko Ursulin (2020-01-21 11:20:36)
> > 
> > On 21/01/2020 10:09, Chris Wilson wrote:
> > > If we encounter a hang on a virtual engine, as we process the hang the
> > > request may already have been moved back to the virtual engine (we are
> > > processing the hang on the physical engine). We need to reclaim the
> > > request from the virtual engine so that the locking is consistent and
> > > local to the real engine on which we will hold the request for error
> > > state capturing.
> > > 
> > > Fixes: 748317386afb ("drm/i915/execlists: Offline error capture")
> > > Testcase: igt/gem_exec_balancer/hang
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > ---
> > >   drivers/gpu/drm/i915/gt/intel_lrc.c | 20 ++++++++++++++++++++
> > >   1 file changed, 20 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > index 3a30767ff0c4..a0acf1898c1e 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > > @@ -2582,6 +2582,26 @@ static void execlists_capture(struct intel_engine_cs *engine)
> > >       cap->rq = active_request(cap->rq->context->timeline, cap->rq);
> > >       GEM_BUG_ON(!cap->rq);
> > >   
> > > +     if (cap->rq->engine != engine) { /* preempted virtual engine */
> > > +             struct virtual_engine *ve = to_virtual_engine(cap->rq->engine);
> > > +             unsigned long flags;
> > > +
> > > +             /*
> > > +              * An unsubmitted request along a virtual engine will
> > > +              * remain on the active (this) engine until we are able
> > > +              * to process the context switch away (and so mark the
> > > +              * context as no longer in flight). That cannot have happened
> > > +              * yet, otherwise we would not be hanging!
> > > +              */
> > > +             spin_lock_irqsave(&ve->base.active.lock, flags);
> > > +             GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine);
> > > +             GEM_BUG_ON(ve->request != cap->rq);
> > > +             ve->request = NULL;
> > > +             spin_unlock_irqrestore(&ve->base.active.lock, flags);
> > > +
> > > +             cap->rq->engine = engine;
> > > +     }
> > 
> > Conceptually this I think belongs in execlists_hold, not capture. Since 
> > hold has the responsibility to hold correctly.
> 
> Yeah, I didn't put it into execlists_hold() as thought that would be
> clearer without it, and that this was more a side-effect of
> process_csb() vs __unwind_incomplete_requests(), and didn't want to mix
> up with the question of engine->active.lock
> 
> Agreed that execlists_capture() does not describe it well, so probably
> shouldn't be called directly from here.
> 
> > Then also, this is all about a race with __unwind_incomplete_requests, 
> > yes? If so would need to be done under the engine->active.lock or this 
> > can still happen between now and execlists_hold.
> 
> The serialisation here is on process_csb(), since we have prevented the
> tasklet being called and own the reset on the engine, we know that
> execlists_schedule_out() can't be called on this engine, and so we own
> the virtual request. It's not protected by the engine->active.lock.
> 
> So happy with moving it to before the lock in execlists_hold()?

Bleugh. Put it in execlists_hold() suggests adding something like

static void execlists_hold(struct intel_engine_cs *engine,
                           struct i915_request *rq)
{
        if (cap->rq->engine != engine) { /* preempted virtual engine */
                struct virtual_engine *ve = to_virtual_engine(cap->rq->engine);
                unsigned long flags;

		/* foo */
                GEM_BUG_ON(!test_bit(I915_RESET_ENGINE + engine->id,
                                     &engine->gt->reset.flags);

                /*
                 * An unsubmitted request along a virtual engine will
                 * remain on the active (this) engine until we are able
                 * to process the context switch away (and so mark the
                 * context as no longer in flight). That cannot have happened
                 * yet, otherwise we would not be hanging!
                 */
                spin_lock_irqsave(&ve->base.active.lock, flags);
                GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine);
                GEM_BUG_ON(ve->request != cap->rq);
                ve->request = NULL;
                spin_unlock_irqrestore(&ve->base.active.lock, flags);

                cap->rq->engine = engine;
        }

        spin_lock_irq(&engine->active.lock);

        /*
         * Transfer this request onto the hold queue to prevent it
         * being resumbitted to HW (and potentially completed) before we have
         * released it. Since we may have already submitted following
         * requests, we need to remove those as well.
         */
        GEM_BUG_ON(i915_request_on_hold(rq));
        GEM_BUG_ON(rq->engine != engine);
        __execlists_hold(rq);

        spin_unlock_irq(&engine->active.lock);
}

Writing down the serialisation rules does suggest that execlists_hold()
is not as generic as before. (On the other hand, doing so will help with
writing the selftest for this bug, and engine->suspend() is a completely
different kettle of fish that plays games with engine->active.hold
directly.)
-Chris

Patch
diff mbox series

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 3a30767ff0c4..a0acf1898c1e 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2582,6 +2582,26 @@  static void execlists_capture(struct intel_engine_cs *engine)
 	cap->rq = active_request(cap->rq->context->timeline, cap->rq);
 	GEM_BUG_ON(!cap->rq);
 
+	if (cap->rq->engine != engine) { /* preempted virtual engine */
+		struct virtual_engine *ve = to_virtual_engine(cap->rq->engine);
+		unsigned long flags;
+
+		/*
+		 * An unsubmitted request along a virtual engine will
+		 * remain on the active (this) engine until we are able
+		 * to process the context switch away (and so mark the
+		 * context as no longer in flight). That cannot have happened
+		 * yet, otherwise we would not be hanging!
+		 */
+		spin_lock_irqsave(&ve->base.active.lock, flags);
+		GEM_BUG_ON(intel_context_inflight(cap->rq->context) != engine);
+		GEM_BUG_ON(ve->request != cap->rq);
+		ve->request = NULL;
+		spin_unlock_irqrestore(&ve->base.active.lock, flags);
+
+		cap->rq->engine = engine;
+	}
+
 	/*
 	 * Remove the request from the execlists queue, and take ownership
 	 * of the request. We pass it to our worker who will _slowly_ compress