drm/i915: Flush pending interrupt following a GPU reset

Message ID	20180321150023.29252-1-chris@chris-wilson.co.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Chris Wilson <chris@chris-wilson.co.uk> To: intel-gfx@lists.freedesktop.org Date: Wed, 21 Mar 2018 15:00:23 +0000 Message-Id: <20180321150023.29252-1-chris@chris-wilson.co.uk> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH] drm/i915: Flush pending interrupt following a GPU reset Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Message ID

20180321150023.29252-1-chris@chris-wilson.co.uk (mailing list archive)

State

New, archived

Headers

From: Chris Wilson <chris@chris-wilson.co.uk>
To: intel-gfx@lists.freedesktop.org
Date: Wed, 21 Mar 2018 15:00:23 +0000
Message-Id: <20180321150023.29252-1-chris@chris-wilson.co.uk>
MIME-Version: 1.0
Subject: [Intel-gfx] [PATCH] drm/i915: Flush pending interrupt following a
	GPU reset
Precedence: list
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Commit Message

Chris Wilson March 21, 2018, 3 p.m. UTC

After resetting the GPU (or subset of engines), call synchronize_irq()
to flush any pending irq before proceeding with the cleanup. For a
device level reset, we disable the interupts around the reset, but when
resetting just one engine, we have to avoid such global disabling. This
leaves us open to an interrupt arriving for the engine as we try to
reset it. We already do try to flush the IIR following the reset, but we
have to ensure that the in-flight interrupt does not land after we start
cleaning up after the reset; enter synchronize_irq().

As it current stands, we very rarely, but fatally, see sequences such as:

    2.... 57964564us : execlists_reset_prepare: rcs0
    2.... 57964613us : execlists_reset: rcs0 seqno=424
    0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1
    2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- global_seqno 1060
    2.... 57964703us : execlists_reset_finish: rcs0
    0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, active=0, irq-posted?=1

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Cc: Michel Thierry <michel.thierry@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
---
 drivers/gpu/drm/i915/intel_uncore.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

jeff.mcgee@intel.com March 21, 2018, 3:55 p.m. UTC | #1

On Wed, Mar 21, 2018 at 03:00:23PM +0000, Chris Wilson wrote:
> After resetting the GPU (or subset of engines), call synchronize_irq()
> to flush any pending irq before proceeding with the cleanup. For a
> device level reset, we disable the interupts around the reset, but when
> resetting just one engine, we have to avoid such global disabling. This
> leaves us open to an interrupt arriving for the engine as we try to
> reset it. We already do try to flush the IIR following the reset, but we
> have to ensure that the in-flight interrupt does not land after we start
> cleaning up after the reset; enter synchronize_irq().
> 
> As it current stands, we very rarely, but fatally, see sequences such as:
> 
>     2.... 57964564us : execlists_reset_prepare: rcs0
>     2.... 57964613us : execlists_reset: rcs0 seqno=424
>     0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1
>     2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- global_seqno 1060
>     2.... 57964703us : execlists_reset_finish: rcs0
>     0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, active=0, irq-posted?=1
> 
I can repro this sequence easily with force preemption IGT. Just tried this
patch and the issue is still there. For the moment I can just mitigate with
https://patchwork.freedesktop.org/patch/211086/

> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Cc: Michel Thierry <michel.thierry@intel.com>
> Cc: Michał Winiarski <michal.winiarski@intel.com>
> ---
>  drivers/gpu/drm/i915/intel_uncore.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
> index 4c616d074a97..04830d6125d6 100644
> --- a/drivers/gpu/drm/i915/intel_uncore.c
> +++ b/drivers/gpu/drm/i915/intel_uncore.c
> @@ -2116,11 +2116,14 @@ int intel_gpu_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
>  		i915_stop_engines(dev_priv, engine_mask);
>  
>  		ret = -ENODEV;
> -		if (reset)
> +		if (reset) {
> +			GEM_TRACE("engine_mask=%x\n", engine_mask);
>  			ret = reset(dev_priv, engine_mask);
> +		}
>  		if (ret != -ETIMEDOUT)
>  			break;
>  
> +		synchronize_irq(dev_priv->drm.irq);
>  		cond_resched();
>  	}
>  	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);
> -- 
> 2.16.2
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Chris Wilson March 21, 2018, 4:41 p.m. UTC | #2

Quoting Jeff McGee (2018-03-21 15:55:16)
> On Wed, Mar 21, 2018 at 03:00:23PM +0000, Chris Wilson wrote:
> > After resetting the GPU (or subset of engines), call synchronize_irq()
> > to flush any pending irq before proceeding with the cleanup. For a
> > device level reset, we disable the interupts around the reset, but when
> > resetting just one engine, we have to avoid such global disabling. This
> > leaves us open to an interrupt arriving for the engine as we try to
> > reset it. We already do try to flush the IIR following the reset, but we
> > have to ensure that the in-flight interrupt does not land after we start
> > cleaning up after the reset; enter synchronize_irq().
> > 
> > As it current stands, we very rarely, but fatally, see sequences such as:
> > 
> >     2.... 57964564us : execlists_reset_prepare: rcs0
> >     2.... 57964613us : execlists_reset: rcs0 seqno=424
> >     0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1
> >     2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- global_seqno 1060
> >     2.... 57964703us : execlists_reset_finish: rcs0
> >     0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, active=0, irq-posted?=1
> > 
> I can repro this sequence easily with force preemption IGT. Just tried this
> patch and the issue is still there. For the moment I can just mitigate with
> https://patchwork.freedesktop.org/patch/211086/

These patch is purely to make sure that the irq_handler is not called
after execlists_reset, as that is what we rely on.

We *should* never be in a situation where CSB tail is invalid, that
implies either we screwed up or the hw is incoherent.
-Chris

Chris Wilson March 21, 2018, 4:42 p.m. UTC | #3

Quoting Jeff McGee (2018-03-21 15:55:16)
> On Wed, Mar 21, 2018 at 03:00:23PM +0000, Chris Wilson wrote:
> > After resetting the GPU (or subset of engines), call synchronize_irq()
> > to flush any pending irq before proceeding with the cleanup. For a
> > device level reset, we disable the interupts around the reset, but when
> > resetting just one engine, we have to avoid such global disabling. This
> > leaves us open to an interrupt arriving for the engine as we try to
> > reset it. We already do try to flush the IIR following the reset, but we
> > have to ensure that the in-flight interrupt does not land after we start
> > cleaning up after the reset; enter synchronize_irq().
> > 
> > As it current stands, we very rarely, but fatally, see sequences such as:
> > 
> >     2.... 57964564us : execlists_reset_prepare: rcs0
> >     2.... 57964613us : execlists_reset: rcs0 seqno=424
> >     0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1
> >     2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- global_seqno 1060
> >     2.... 57964703us : execlists_reset_finish: rcs0
> >     0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, active=0, irq-posted?=1
> > 
> I can repro this sequence easily with force preemption IGT.

With the sequence I suggested?
-Chris

jeff.mcgee@intel.com March 21, 2018, 5:10 p.m. UTC | #4

On Wed, Mar 21, 2018 at 04:42:32PM +0000, Chris Wilson wrote:
> Quoting Jeff McGee (2018-03-21 15:55:16)
> > On Wed, Mar 21, 2018 at 03:00:23PM +0000, Chris Wilson wrote:
> > > After resetting the GPU (or subset of engines), call synchronize_irq()
> > > to flush any pending irq before proceeding with the cleanup. For a
> > > device level reset, we disable the interupts around the reset, but when
> > > resetting just one engine, we have to avoid such global disabling. This
> > > leaves us open to an interrupt arriving for the engine as we try to
> > > reset it. We already do try to flush the IIR following the reset, but we
> > > have to ensure that the in-flight interrupt does not land after we start
> > > cleaning up after the reset; enter synchronize_irq().
> > > 
> > > As it current stands, we very rarely, but fatally, see sequences such as:
> > > 
> > >     2.... 57964564us : execlists_reset_prepare: rcs0
> > >     2.... 57964613us : execlists_reset: rcs0 seqno=424
> > >     0d.h1 57964615us : gen8_cs_irq_handler: rcs0 CS active=1
> > >     2d..1 57964617us : __i915_request_unsubmit: rcs0 fence 29:1056 <- global_seqno 1060
> > >     2.... 57964703us : execlists_reset_finish: rcs0
> > >     0..s. 57964705us : execlists_submission_tasklet: rcs0 awake?=1, active=0, irq-posted?=1
> > > 
> > I can repro this sequence easily with force preemption IGT.
> 
> With the sequence I suggested?
> -Chris

Yes. Your approach to protecting port[1] context is working well. This is
the only issue I'm still hitting. I'll post my updated RFC set in a sec.
-Jeff

diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index 4c616d074a97..04830d6125d6 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -2116,11 +2116,14 @@  int intel_gpu_reset(struct drm_i915_private *dev_priv, unsigned engine_mask)
 		i915_stop_engines(dev_priv, engine_mask);
 
 		ret = -ENODEV;
-		if (reset)
+		if (reset) {
+			GEM_TRACE("engine_mask=%x\n", engine_mask);
 			ret = reset(dev_priv, engine_mask);
+		}
 		if (ret != -ETIMEDOUT)
 			break;
 
+		synchronize_irq(dev_priv->drm.irq);
 		cond_resched();
 	}
 	intel_uncore_forcewake_put(dev_priv, FORCEWAKE_ALL);

drm/i915: Flush pending interrupt following a GPU reset

Commit Message

Comments

Patch