From patchwork Tue Oct 22 22:38:26 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Chris Wilson X-Patchwork-Id: 11205461 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 87195139A for ; Tue, 22 Oct 2019 22:39:19 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id 6F76F207FC for ; Tue, 22 Oct 2019 22:39:19 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6F76F207FC Authentication-Results: mail.kernel.org; dmarc=none (p=none dis=none) header.from=chris-wilson.co.uk Authentication-Results: mail.kernel.org; spf=none smtp.mailfrom=intel-gfx-bounces@lists.freedesktop.org Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 69FCD6E91F; Tue, 22 Oct 2019 22:39:17 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from fireflyinternet.com (mail.fireflyinternet.com [109.228.58.192]) by gabe.freedesktop.org (Postfix) with ESMTPS id F29DC6E920 for ; Tue, 22 Oct 2019 22:39:04 +0000 (UTC) X-Default-Received-SPF: pass (skip=forwardok (res=PASS)) x-ip-name=78.156.65.138; Received: from haswell.alporthouse.com (unverified [78.156.65.138]) by fireflyinternet.com (Firefly Internet (M1)) with ESMTP id 18930838-1500050 for multiple; Tue, 22 Oct 2019 23:38:54 +0100 From: Chris Wilson To: intel-gfx@lists.freedesktop.org Date: Tue, 22 Oct 2019 23:38:26 +0100 Message-Id: <20191022223831.22677-7-chris@chris-wilson.co.uk> X-Mailer: git-send-email 2.24.0.rc0 In-Reply-To: <20191022223831.22677-1-chris@chris-wilson.co.uk> References: <20191022223831.22677-1-chris@chris-wilson.co.uk> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 07/12] drm/i915/gem: Cancel contexts when hangchecking is disabled X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" Normally, we rely on our hangcheck to prevent persistent batches from hogging the GPU. However, if the user disables hangcheck, this mechanism breaks down. Despite our insistence that this is unsafe, the users are equally insistent that they want to use endless batches and will disable the hangcheck mechanism. We are looking at replacing hangcheck, in the next patch, with a softer mechanism, that sends a pulse down the engine to check if it is well. We can use the same preemptive pulse to flush an active context off the GPU upon context close, preventing resources being lost and unkillable requests remaining on the GPU after process termination. Signed-off-by: Chris Wilson Cc: Joonas Lahtinen Cc: MichaƂ Winiarski Cc: Jon Bloomfield Reviewed-by: Jon Bloomfield Reviewed-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/gem/i915_gem_context.c | 141 ++++++++++++++++++++ 1 file changed, 141 insertions(+) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_context.c b/drivers/gpu/drm/i915/gem/i915_gem_context.c index 7b01f4605f21..b2f042d87be0 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_context.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_context.c @@ -70,6 +70,7 @@ #include #include "gt/intel_lrc_reg.h" +#include "gt/intel_engine_heartbeat.h" #include "gt/intel_engine_user.h" #include "i915_gem_context.h" @@ -276,6 +277,135 @@ void i915_gem_context_release(struct kref *ref) schedule_work(&gc->free_work); } +static inline struct i915_gem_engines * +__context_engines_static(const struct i915_gem_context *ctx) +{ + return rcu_dereference_protected(ctx->engines, true); +} + +static bool __reset_engine(struct intel_engine_cs *engine) +{ + struct intel_gt *gt = engine->gt; + bool success = false; + + if (!intel_has_reset_engine(gt)) + return false; + + if (!test_and_set_bit(I915_RESET_ENGINE + engine->id, + >->reset.flags)) { + success = intel_engine_reset(engine, NULL) == 0; + clear_and_wake_up_bit(I915_RESET_ENGINE + engine->id, + >->reset.flags); + } + + return success; +} + +static void __reset_context(struct i915_gem_context *ctx, + struct intel_engine_cs *engine) +{ + intel_gt_handle_error(engine->gt, engine->mask, 0, + "context closure in %s", ctx->name); +} + +static bool __cancel_engine(struct intel_engine_cs *engine) +{ + /* + * Send a "high priority pulse" down the engine to cause the + * current request to be momentarily preempted. (If it fails to + * be preempted, it will be reset). As we have marked our context + * as banned, any incomplete request, including any running, will + * be skipped following the preemption. + * + * If there is no hangchecking (one of the reasons why we try to + * cancel the context) and no forced preemption, there may be no + * means by which we reset the GPU and evict the persistent hog. + * Ergo if we are unable to inject a preemptive pulse that can + * kill the banned context, we fallback to doing a local reset + * instead. + */ + if (CONFIG_DRM_I915_PREEMPT_TIMEOUT && !intel_engine_pulse(engine)) + return true; + + /* If we are unable to send a pulse, try resetting this engine. */ + return __reset_engine(engine); +} + +static struct intel_engine_cs * +active_engine(struct dma_fence *fence, struct intel_context *ce) +{ + struct i915_request *rq = to_request(fence); + struct intel_engine_cs *engine, *locked; + + /* + * Serialise with __i915_request_submit() so that it sees + * is-banned?, or we know the request is already inflight. + */ + locked = READ_ONCE(rq->engine); + spin_lock_irq(&locked->active.lock); + while (unlikely(locked != (engine = READ_ONCE(rq->engine)))) { + spin_unlock(&locked->active.lock); + spin_lock(&engine->active.lock); + locked = engine; + } + + engine = NULL; + if (i915_request_is_active(rq) && !rq->fence.error) + engine = rq->engine; + + spin_unlock_irq(&locked->active.lock); + + return engine; +} + +static void kill_context(struct i915_gem_context *ctx) +{ + struct i915_gem_engines_iter it; + struct intel_context *ce; + + /* + * If we are already banned, it was due to a guilty request causing + * a reset and the entire context being evicted from the GPU. + */ + if (i915_gem_context_is_banned(ctx)) + return; + + i915_gem_context_set_banned(ctx); + + /* + * Map the user's engine back to the actual engines; one virtual + * engine will be mapped to multiple engines, and using ctx->engine[] + * the same engine may be have multiple instances in the user's map. + * However, we only care about pending requests, so only include + * engines on which there are incomplete requests. + */ + for_each_gem_engine(ce, __context_engines_static(ctx), it) { + struct intel_engine_cs *engine; + struct dma_fence *fence; + + if (!ce->timeline) + continue; + + fence = i915_active_fence_get(&ce->timeline->last_request); + if (!fence) + continue; + + /* Check with the backend if the request is still inflight */ + engine = active_engine(fence, ce); + + /* First attempt to gracefully cancel the context */ + if (engine && !__cancel_engine(engine)) + /* + * If we are unable to send a preemptive pulse to bump + * the context from the GPU, we have to resort to a full + * reset. We hope the collateral damage is worth it. + */ + __reset_context(ctx, engine); + + dma_fence_put(fence); + } +} + static void context_close(struct i915_gem_context *ctx) { struct i915_address_space *vm; @@ -298,6 +428,17 @@ static void context_close(struct i915_gem_context *ctx) lut_close(ctx); mutex_unlock(&ctx->mutex); + + /* + * If the user has disabled hangchecking, we can not be sure that + * the batches will ever complete after the context is closed, + * keeping the context and all resources pinned forever. So in this + * case we opt to forcibly kill off all remaining requests on + * context close. + */ + if (!i915_modparams.enable_hangcheck) + kill_context(ctx); + i915_gem_context_put(ctx); }