Message ID: 1453378450-33327-4-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive)
State: New, archived
On Thu, Jan 21, 2016 at 12:14:10PM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> In GuC mode LRC pinning lifetime depends exclusively on the
> request lifetime. Since that is terminated by the seqno update,
> that opens up a race condition between the GPU finishing writing
> out the context image and the driver unpinning the LRC.
>
> To extend the LRC lifetime we will employ a similar approach
> to what legacy ringbuffer submission does.
>
> We will start tracking the last submitted context per engine
> and keep it pinned until it is replaced by another one.
>
> Note that the driver unload path is a bit fragile and could
> benefit greatly from efforts to unify the legacy and exec
> list submission code paths.
>
> At the moment i915_gem_context_fini has special casing for the
> two which is potentially not needed, and also depends on
> i915_gem_cleanup_ringbuffer running before itself.
>
> v2:
> * Move pinning into engine->emit_request and actually fix
>   the reference/unreference logic. (Chris Wilson)
>
> * ring->dev can be NULL on driver unload so use a different
>   route towards it.
>
> v3:
> * Rebase.
> * Handle the reset path. (Chris Wilson)
> * Exclude default context from the pinning - it is impossible
>   to get it right before default context special casing in
>   general is eliminated.
>
> v4:
> * Rebased & moved context tracking to
>   intel_logical_ring_advance_and_submit.
>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Issue: VIZ-4277
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Nick Hoath <nicholas.hoath@intel.com>

Whilst it saddens me to see yet another (impossible) special case added
that will just have to be deleted again, the series is
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

I wonder if it is possible to poison the context objects before and
after, then do a deferred check for stray writes, and use that mode for
igt/gem_ctx_* (with some tests targeting active->idle vs
context-close). It would still be susceptible to timing, as we need to
hit the interval between the seqno being complete and the delayed
context save, but that seems like the most reliable way to detect the
error?
-Chris
On 21/01/16 12:32, Chris Wilson wrote:
> On Thu, Jan 21, 2016 at 12:14:10PM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> In GuC mode LRC pinning lifetime depends exclusively on the
>> request lifetime. Since that is terminated by the seqno update,
>> that opens up a race condition between the GPU finishing writing
>> out the context image and the driver unpinning the LRC.
>>
>> To extend the LRC lifetime we will employ a similar approach
>> to what legacy ringbuffer submission does.
>>
>> We will start tracking the last submitted context per engine
>> and keep it pinned until it is replaced by another one.
>>
>> Note that the driver unload path is a bit fragile and could
>> benefit greatly from efforts to unify the legacy and exec
>> list submission code paths.
>>
>> At the moment i915_gem_context_fini has special casing for the
>> two which is potentially not needed, and also depends on
>> i915_gem_cleanup_ringbuffer running before itself.
>>
>> v2:
>> * Move pinning into engine->emit_request and actually fix
>>   the reference/unreference logic. (Chris Wilson)
>>
>> * ring->dev can be NULL on driver unload so use a different
>>   route towards it.
>>
>> v3:
>> * Rebase.
>> * Handle the reset path. (Chris Wilson)
>> * Exclude default context from the pinning - it is impossible
>>   to get it right before default context special casing in
>>   general is eliminated.
>>
>> v4:
>> * Rebased & moved context tracking to
>>   intel_logical_ring_advance_and_submit.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Issue: VIZ-4277
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Nick Hoath <nicholas.hoath@intel.com>
>
> Whilst it saddens me to see yet another (impossible) special case added
> that will just have to be deleted again, the series is
> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

Thanks and sorry, hopefully it will get cleaned up soon. There seems to
be a growing number of people who want it done.

And I still need to get back to your VMA rewrite, and breadcrumbs would
be nice as well.

> I wonder if it is possible to poison the context objects before and
> after, then do a deferred check for stray writes, and use that mode for
> igt/gem_ctx_* (with some tests targeting active->idle vs
> context-close). Would still be susceptible to timing as we need to
> hit the interval between the seqno being complete and the delayed
> context save, but that seems like the most reliable way to detect the
> error?

First it needs to be tested with GuC to check that it actually fixes the
issue. And pass CI of course.

But I can't really figure out where you would put this poisoning. You
could put something in exec list mode after the context completes and
check it before it is used next time, but I did not think we can hit
this in exec list mode, only in GuC. You think it is possible?

And in GuC mode I have no idea at which point you would put the
"poisoning" in.

Regards,
Tvrtko
On Thu, Jan 21, 2016 at 01:51:30PM +0000, Tvrtko Ursulin wrote:
> But I can't really figure out where you would put this poisoning. You
> could put something in exec list mode after the context completes and
> check it before it is used next time,

I was thinking just in context-free. Move the pages to a poisoned list
and check at the end of the test. The issue is that the GPU may write to
the pages as we release them, so instead of releasing them we just
poison them (or CRC them).

> but I did not think we can hit
> this in exec list mode, only in GuC. You think it is possible?

With the current code, no, since the last context unreference is done
from i915_gem_request_free(), and the request reference is only dropped
after we see the context-switch/active->idle state change (i.e. the
context save should always be flushed by the time we unpin and free the
context).

However, that ordering imposes the struct_mutex upon request-free, which
leads to fairly severe issues that can be alleviated by moving the
context-unreference into request-retire - which opens up the
context-close race to normal execlists, as the context can be
unreferenced before the next context-switch interrupt, now fixed in this
patch. And reducing the context-pin lifetime even further should help
mitigate context thrashing in certain userspace stress tests (OpenGL
microbenchmarks).
-Chris
diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 1a67e07b9e6a..7e6f8c7b6d01 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -324,9 +324,13 @@ err_destroy:
 static void i915_gem_context_unpin(struct intel_context *ctx,
				    struct intel_engine_cs *engine)
 {
-	if (engine->id == RCS && ctx->legacy_hw_ctx.rcs_state)
-		i915_gem_object_ggtt_unpin(ctx->legacy_hw_ctx.rcs_state);
-	i915_gem_context_unreference(ctx);
+	if (i915.enable_execlists) {
+		intel_lr_context_unpin(ctx, engine);
+	} else {
+		if (engine->id == RCS && ctx->legacy_hw_ctx.rcs_state)
+			i915_gem_object_ggtt_unpin(ctx->legacy_hw_ctx.rcs_state);
+		i915_gem_context_unreference(ctx);
+	}
 }
 
 void i915_gem_context_reset(struct drm_device *dev)
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 8eb6e364fefc..0e215ea3f8ab 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -766,6 +766,7 @@ intel_logical_ring_advance_and_submit(struct drm_i915_gem_request *request)
 {
	struct intel_ringbuffer *ringbuf = request->ringbuf;
	struct drm_i915_private *dev_priv = request->i915;
+	struct intel_engine_cs *engine = request->ring;
 
	intel_logical_ring_advance(ringbuf);
	request->tail = ringbuf->tail;
@@ -780,9 +781,20 @@ intel_logical_ring_advance_and_submit(struct drm_i915_gem_request *request)
	intel_logical_ring_emit(ringbuf, MI_NOOP);
	intel_logical_ring_advance(ringbuf);
 
-	if (intel_ring_stopped(request->ring))
+	if (intel_ring_stopped(engine))
		return 0;
 
+	if (engine->last_context != request->ctx) {
+		if (engine->last_context)
+			intel_lr_context_unpin(engine->last_context, engine);
+		if (request->ctx != request->i915->kernel_context) {
+			intel_lr_context_pin(request->ctx, engine);
+			engine->last_context = request->ctx;
+		} else {
+			engine->last_context = NULL;
+		}
+	}
+
	if (dev_priv->guc.execbuf_client)
		i915_guc_submit(dev_priv->guc.execbuf_client, request);
	else
@@ -1129,7 +1141,7 @@ void intel_lr_context_unpin(struct intel_context *ctx,
 {
	struct drm_i915_gem_object *ctx_obj = ctx->engine[engine->id].state;
 
-	WARN_ON(!mutex_is_locked(&engine->dev->struct_mutex));
+	WARN_ON(!mutex_is_locked(&ctx->i915->dev->struct_mutex));
 
	if (WARN_ON_ONCE(!ctx_obj))
		return;