[v2] drm/i915: Fix premature LRC unpin in GuC mode

Message ID 1453301447-15233-1-git-send-email-tvrtko.ursulin@linux.intel.com
State New

Commit Message

Tvrtko Ursulin Jan. 20, 2016, 2:50 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

In GuC mode the LRC pinning lifetime depends exclusively on the
request lifetime. Since the request is terminated by the seqno
update, that opens up a race between the GPU finishing writing
out the context image and the driver unpinning the LRC.

To extend the LRC lifetime we will employ a similar approach
to what legacy ringbuffer submission does.

We will start tracking the last submitted context per engine
and keep it pinned until it is replaced by another one.

Note that the driver unload path is a bit fragile and could
benefit greatly from efforts to unify the legacy and execlist
submission code paths.

At the moment i915_gem_context_fini has special casing for the
two which is potentially not needed, and it also depends on
i915_gem_cleanup_ringbuffer running before it.

v2:
 * Move pinning into engine->emit_request and actually fix
   the reference/unreference logic. (Chris Wilson)

 * ring->dev can be NULL on driver unload so use a different
   route towards it.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Issue: VIZ-4277
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Nick Hoath <nicholas.hoath@intel.com>
---
I cannot test this with GuC but it passes BAT with execlists
and some real world smoke tests.
---
 drivers/gpu/drm/i915/i915_gem_context.c | 4 +++-
 drivers/gpu/drm/i915/intel_lrc.c        | 9 ++++++++-
 2 files changed, 11 insertions(+), 2 deletions(-)

Comments

Chris Wilson Jan. 20, 2016, 3:18 p.m. UTC | #1
On Wed, Jan 20, 2016 at 02:50:47PM +0000, Tvrtko Ursulin wrote:
> [snip]
> diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
> index c25083c78ba7..0b419e165836 100644
> --- a/drivers/gpu/drm/i915/i915_gem_context.c
> +++ b/drivers/gpu/drm/i915/i915_gem_context.c
> @@ -438,7 +438,9 @@ void i915_gem_context_fini(struct drm_device *dev)
>  	for (i = 0; i < I915_NUM_RINGS; i++) {
>  		struct intel_engine_cs *ring = &dev_priv->ring[i];
>  

This is the nasty part where the GPU still has access to the backing
pages as we release them. A hard-to-hit module-unload vs active-GPU
race for sure, but it is something that we can prevent.

The context-fini vs engine-fini ordering is also apparently tied to the
use of intel_gpu_reset() here (as it clobbers the GPU without dropping
the GEM state, causing a foul up when tearing down the engine).

If we had actually called i915_reset (and so i915_gem_reset) instead, we
should expect last_context to be NULL... Oops, Tvrtko, you need to
inspect i915_gem_context_reset() as well for resetting last_context back
to NULL after the GPU reset.

As for here, I would just pencil in a plan to replace this chunk
entirely with a call to i915_reset(), or a slightly trimmed down
version and tidy up the active GEM cleanup in the process.
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index c25083c78ba7..0b419e165836 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -438,7 +438,9 @@ void i915_gem_context_fini(struct drm_device *dev)
 	for (i = 0; i < I915_NUM_RINGS; i++) {
 		struct intel_engine_cs *ring = &dev_priv->ring[i];
 
-		if (ring->last_context)
+		if (ring->last_context && i915.enable_execlists)
+			intel_lr_context_unpin(ring->last_context, ring);
+		else if (ring->last_context)
 			i915_gem_context_unreference(ring->last_context);
 
 		ring->default_context = NULL;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index e7c273acac1e..72ed176e091d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1120,7 +1120,7 @@ void intel_lr_context_unpin(struct intel_context *ctx,
 {
 	struct drm_i915_gem_object *ctx_obj = ctx->engine[engine->id].state;
 
-	WARN_ON(!mutex_is_locked(&engine->dev->struct_mutex));
+	WARN_ON(!mutex_is_locked(&ctx->i915->dev->struct_mutex));
 
 	if (WARN_ON_ONCE(!ctx_obj))
 		return;
@@ -1859,6 +1859,13 @@ static int gen8_emit_request(struct drm_i915_gem_request *request)
 	u32 cmd;
 	int ret;
 
+	if (ring->last_context != request->ctx) {
+		if (ring->last_context)
+			intel_lr_context_unpin(ring->last_context, ring);
+		intel_lr_context_pin(request->ctx, ring);
+		ring->last_context = request->ctx;
+	}
+
 	/*
 	 * Reserve space for 2 NOOPs at the end of each request to be
 	 * used as a workaround for not being allowed to do lite