From patchwork Thu Jul 25 22:44:03 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Wilson X-Patchwork-Id: 11059915 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2E35E6C5 for ; Thu, 25 Jul 2019 22:44:24 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 1D96628897 for ; Thu, 25 Jul 2019 22:44:24 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 0ED9B28A73; Thu, 25 Jul 2019 22:44:24 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 608FA28897 for ; Thu, 25 Jul 2019 22:44:23 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id E47EA89259; Thu, 25 Jul 2019 22:44:21 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from fireflyinternet.com (mail.fireflyinternet.com [109.228.58.192]) by gabe.freedesktop.org (Postfix) with ESMTPS id EF4DD6E838 for ; Thu, 25 Jul 2019 22:44:19 +0000 (UTC) X-Default-Received-SPF: pass (skip=forwardok (res=PASS)) x-ip-name=78.156.65.138; Received: from haswell.alporthouse.com (unverified [78.156.65.138]) by fireflyinternet.com (Firefly Internet (M1)) with ESMTP id 17610920-1500050 for multiple; Thu, 25 Jul 2019 23:44:09 +0100 From: Chris Wilson To: intel-gfx@lists.freedesktop.org Date: Thu, 25 Jul 2019 23:44:03 +0100 Message-Id: <20190725224407.4206-1-chris@chris-wilson.co.uk> X-Mailer: git-send-email 2.22.0 MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 1/5] drm/i915: Capture vma contents outside of spinlock X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP Currently we use the engine->active.lock to ensure that the request is not retired as we capture the data. However, we only need to ensure that the vma are not removed prior to use acquiring their contents, and since we have already relinquished our stop-machine protection, we assume that the user will not be overwriting the contents before we are able to record them. In order to capture the vma outside of the spinlock, we acquire a reference and mark the vma as active to prevent it from being unbound. However, since it is tricky allocate an entry in the fence tree (doing so would require taking a mutex) while inside the engine spinlock, we use an atomic bit and special case the handling for i915_active_wait. The core benefit is that we can use some non-atomic methods for mapping the device pages, we can remove the slow compression phase out of atomic context (i.e. stop antagonising the nmi-watchdog), and no we longer need large reserves of atomic pages. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111215 Signed-off-by: Chris Wilson Reviewed-by: Matthew Auld --- drivers/gpu/drm/i915/i915_active.c | 34 ++++++- drivers/gpu/drm/i915/i915_active.h | 3 + drivers/gpu/drm/i915/i915_active_types.h | 3 + drivers/gpu/drm/i915/i915_gpu_error.c | 112 ++++++++++++++++------- 4 files changed, 117 insertions(+), 35 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c index 13f304a29fc8..22341c62c204 100644 --- a/drivers/gpu/drm/i915/i915_active.c +++ b/drivers/gpu/drm/i915/i915_active.c @@ -196,6 +196,7 @@ void __i915_active_init(struct drm_i915_private *i915, debug_active_init(ref); ref->i915 = i915; + ref->flags = 0; ref->active = active; ref->retire = retire; ref->tree = RB_ROOT; @@ -262,6 +263,34 @@ void i915_active_release(struct i915_active *ref) active_retire(ref); } +static void __active_ungrab(struct i915_active *ref) +{ + clear_and_wake_up_bit(I915_ACTIVE_GRAB_BIT, &ref->flags); +} + +bool i915_active_trygrab(struct i915_active *ref) +{ + debug_active_assert(ref); + + if (test_and_set_bit(I915_ACTIVE_GRAB_BIT, &ref->flags)) + return false; + + if (!atomic_add_unless(&ref->count, 1, 0)) { + __active_ungrab(ref); + return false; + } + + return true; +} + +void i915_active_ungrab(struct i915_active *ref) +{ + GEM_BUG_ON(!test_bit(I915_ACTIVE_GRAB_BIT, &ref->flags)); + + active_retire(ref); + __active_ungrab(ref); +} + int i915_active_wait(struct i915_active *ref) { struct active_node *it, *n; @@ -270,7 +299,7 @@ int i915_active_wait(struct i915_active *ref) might_sleep(); might_lock(&ref->mutex); - if (RB_EMPTY_ROOT(&ref->tree)) + if (i915_active_is_idle(ref)) return 0; err = mutex_lock_interruptible(&ref->mutex); @@ -292,6 +321,9 @@ int i915_active_wait(struct i915_active *ref) if (err) return err; + if (wait_on_bit(&ref->flags, I915_ACTIVE_GRAB_BIT, TASK_KILLABLE)) + return -EINTR; + if (!i915_active_is_idle(ref)) return -EBUSY; diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h index 134166d31251..ba68b077ec6c 100644 --- a/drivers/gpu/drm/i915/i915_active.h +++ b/drivers/gpu/drm/i915/i915_active.h @@ -395,6 +395,9 @@ int i915_active_acquire(struct i915_active *ref); void i915_active_release(struct i915_active *ref); void __i915_active_release_nested(struct i915_active *ref, int subclass); +bool i915_active_trygrab(struct i915_active *ref); +void i915_active_ungrab(struct i915_active *ref); + static inline bool i915_active_is_idle(const struct i915_active *ref) { diff --git a/drivers/gpu/drm/i915/i915_active_types.h b/drivers/gpu/drm/i915/i915_active_types.h index 5b0a3024ce24..74743dd0d5f0 100644 --- a/drivers/gpu/drm/i915/i915_active_types.h +++ b/drivers/gpu/drm/i915/i915_active_types.h @@ -36,6 +36,9 @@ struct i915_active { struct mutex mutex; atomic_t count; + unsigned long flags; +#define I915_ACTIVE_GRAB_BIT 0 + int (*active)(struct i915_active *ref); void (*retire)(struct i915_active *ref); diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 56dfc2650836..674d341a23f6 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -298,7 +298,7 @@ static void *compress_next_page(struct compress *c, if (dst->page_count >= dst->num_pages) return ERR_PTR(-ENOSPC); - page = pool_alloc(&c->pool, ATOMIC_MAYFAIL); + page = pool_alloc(&c->pool, ALLOW_FAIL); if (!page) return ERR_PTR(-ENOMEM); @@ -327,8 +327,6 @@ static int compress_page(struct compress *c, if (zlib_deflate(zstream, Z_NO_FLUSH) != Z_OK) return -EIO; - - touch_nmi_watchdog(); } while (zstream->avail_in); /* Fallback to uncompressed if we increase size? */ @@ -407,7 +405,7 @@ static int compress_page(struct compress *c, { void *ptr; - ptr = pool_alloc(&c->pool, ATOMIC_MAYFAIL); + ptr = pool_alloc(&c->pool, ALLOW_FAIL); if (!ptr) return -ENOMEM; @@ -1001,12 +999,14 @@ i915_error_object_create(struct drm_i915_private *i915, dma_addr_t dma; int ret; + might_sleep(); + if (!vma || !vma->pages) return NULL; num_pages = min_t(u64, vma->size, vma->obj->base.size) >> PAGE_SHIFT; num_pages = DIV_ROUND_UP(10 * num_pages, 8); /* worstcase zlib growth */ - dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ATOMIC_MAYFAIL); + dst = kmalloc(sizeof(*dst) + num_pages * sizeof(u32 *), ALLOW_FAIL); if (!dst) return NULL; @@ -1027,9 +1027,9 @@ i915_error_object_create(struct drm_i915_private *i915, ggtt->vm.insert_page(&ggtt->vm, dma, slot, I915_CACHE_NONE, 0); - s = io_mapping_map_atomic_wc(&ggtt->iomap, slot); + s = io_mapping_map_wc(&ggtt->iomap, slot, PAGE_SIZE); ret = compress_page(compress, (void __force *)s, dst); - io_mapping_unmap_atomic(s); + io_mapping_unmap(s); if (ret) break; } @@ -1302,10 +1302,42 @@ static void record_context(struct drm_i915_error_context *e, e->active = atomic_read(&ctx->active_count); } -static void +struct capture_vma { + struct capture_vma *next; + void **slot; +}; + +static struct capture_vma * +capture_vma(struct capture_vma *next, + struct i915_vma *vma, + struct drm_i915_error_object **out) +{ + struct capture_vma *c; + + *out = NULL; + if (!vma) + return next; + + c = kmalloc(sizeof(*c), ATOMIC_MAYFAIL); + if (!c) + return next; + + if (!i915_active_trygrab(&vma->active)) { + kfree(c); + return next; + } + + c->slot = (void **)out; + *c->slot = i915_vma_get(vma); + + c->next = next; + return c; +} + +static struct capture_vma * request_record_user_bo(struct i915_request *request, struct drm_i915_error_engine *ee, - struct compress *compress) + struct capture_vma *capture) { struct i915_capture_list *c; struct drm_i915_error_object **bo; @@ -1315,7 +1347,7 @@ request_record_user_bo(struct i915_request *request, for (c = request->capture_list; c; c = c->next) max++; if (!max) - return; + return capture; bo = kmalloc_array(max, sizeof(*bo), ATOMIC_MAYFAIL); if (!bo) { @@ -1324,21 +1356,19 @@ request_record_user_bo(struct i915_request *request, bo = kmalloc_array(max, sizeof(*bo), ATOMIC_MAYFAIL); } if (!bo) - return; + return capture; count = 0; for (c = request->capture_list; c; c = c->next) { - bo[count] = i915_error_object_create(request->i915, - c->vma, - compress); - if (!bo[count]) - break; + capture = capture_vma(capture, c->vma, &bo[count]); if (++count == max) break; } ee->user_bo = bo; ee->user_bo_count = count; + + return capture; } static struct drm_i915_error_object * @@ -1369,6 +1399,7 @@ gem_record_rings(struct i915_gpu_state *error, struct compress *compress) for (i = 0; i < I915_NUM_ENGINES; i++) { struct intel_engine_cs *engine = i915->engine[i]; struct drm_i915_error_engine *ee = &error->engine[i]; + struct capture_vma *capture = NULL; struct i915_request *request; unsigned long flags; @@ -1393,26 +1424,29 @@ gem_record_rings(struct i915_gpu_state *error, struct compress *compress) record_context(&ee->context, ctx); - /* We need to copy these to an anonymous buffer + /* + * We need to copy these to an anonymous buffer * as the simplest method to avoid being overwritten * by userspace. */ - ee->batchbuffer = - i915_error_object_create(i915, - request->batch, - compress); + capture = capture_vma(capture, + request->batch, + &ee->batchbuffer); if (HAS_BROKEN_CS_TLB(i915)) - ee->wa_batchbuffer = - i915_error_object_create(i915, - engine->gt->scratch, - compress); - request_record_user_bo(request, ee, compress); + capture = capture_vma(capture, + engine->gt->scratch, + &ee->wa_batchbuffer); + + capture = request_record_user_bo(request, ee, capture); - ee->ctx = - i915_error_object_create(i915, - request->hw_context->state, - compress); + capture = capture_vma(capture, + request->hw_context->state, + &ee->ctx); + + capture = capture_vma(capture, + ring->vma, + &ee->ringbuffer); error->simulated |= i915_gem_context_no_error_capture(ctx); @@ -1423,15 +1457,25 @@ gem_record_rings(struct i915_gpu_state *error, struct compress *compress) ee->cpu_ring_head = ring->head; ee->cpu_ring_tail = ring->tail; - ee->ringbuffer = - i915_error_object_create(i915, - ring->vma, - compress); engine_record_requests(engine, request, ee); } spin_unlock_irqrestore(&engine->active.lock, flags); + while (capture) { + struct capture_vma *this = capture; + struct i915_vma *vma = *this->slot; + + *this->slot = + i915_error_object_create(i915, vma, compress); + + i915_active_ungrab(&vma->active); + i915_vma_put(vma); + + capture = this->next; + kfree(this); + } + ee->hws_page = i915_error_object_create(i915, engine->status_page.vma, From patchwork Thu Jul 25 22:44:04 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Wilson X-Patchwork-Id: 11059917 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0C2566C5 for ; Thu, 25 Jul 2019 22:44:26 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id F065328A71 for ; Thu, 25 Jul 2019 22:44:25 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id E496928A74; Thu, 25 Jul 2019 22:44:25 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 8513428A71 for ; Thu, 25 Jul 2019 22:44:25 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id CD8AB6E839; Thu, 25 Jul 2019 22:44:23 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from fireflyinternet.com (mail.fireflyinternet.com [109.228.58.192]) by gabe.freedesktop.org (Postfix) with ESMTPS id 916E3891D6 for ; Thu, 25 Jul 2019 22:44:21 +0000 (UTC) X-Default-Received-SPF: pass (skip=forwardok (res=PASS)) x-ip-name=78.156.65.138; Received: from haswell.alporthouse.com (unverified [78.156.65.138]) by fireflyinternet.com (Firefly Internet (M1)) with ESMTP id 17610921-1500050 for multiple; Thu, 25 Jul 2019 23:44:09 +0100 From: Chris Wilson To: intel-gfx@lists.freedesktop.org Date: Thu, 25 Jul 2019 23:44:04 +0100 Message-Id: <20190725224407.4206-2-chris@chris-wilson.co.uk> X-Mailer: git-send-email 2.22.0 In-Reply-To: <20190725224407.4206-1-chris@chris-wilson.co.uk> References: <20190725224407.4206-1-chris@chris-wilson.co.uk> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 2/5] drm/i915/gt: Add to timeline requires the timeline mutex X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP Modifying a remote context requires careful serialisation with requests on that context, and that serialisation requires us to take their timeline->mutex. Make it so. Note that while struct_mutex rules, we can't create more than one request in parallel, but that age is soon coming to an end. v2: Though it doesn't affect the current users, contexts may share timelines so check if we already hold the right mutex. Signed-off-by: Chris Wilson Cc: Tvrtko Ursulin --- drivers/gpu/drm/i915/gt/intel_context.c | 23 ++++++++++++++++++----- 1 file changed, 18 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c index 9292b6ca5e9c..d64b45f7ec6d 100644 --- a/drivers/gpu/drm/i915/gt/intel_context.c +++ b/drivers/gpu/drm/i915/gt/intel_context.c @@ -254,10 +254,18 @@ int intel_context_prepare_remote_request(struct intel_context *ce, /* Only suitable for use in remotely modifying this context */ GEM_BUG_ON(rq->hw_context == ce); - /* Queue this switch after all other activity by this context. */ - err = i915_active_request_set(&tl->last_request, rq); - if (err) - return err; + if (rq->timeline != tl) { /* beware timeline sharing */ + err = mutex_lock_interruptible_nested(&tl->mutex, + SINGLE_DEPTH_NESTING); + if (err) + return err; + + /* Queue this switch after current activity by this context. */ + err = i915_active_request_set(&tl->last_request, rq); + if (err) + goto unlock; + } + lockdep_assert_held(&tl->mutex); /* * Guarantee context image and the timeline remains pinned until the @@ -267,7 +275,12 @@ int intel_context_prepare_remote_request(struct intel_context *ce, * words transfer the pinned ce object to tracked active request. */ GEM_BUG_ON(i915_active_is_idle(&ce->active)); - return i915_active_ref(&ce->active, rq->fence.context, rq); + err = i915_active_ref(&ce->active, rq->fence.context, rq); + +unlock: + if (rq->timeline != tl) + mutex_unlock(&tl->mutex); + return err; } struct i915_request *intel_context_create_request(struct intel_context *ce) From patchwork Thu Jul 25 22:44:05 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Chris Wilson X-Patchwork-Id: 11059921 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8A3636C5 for ; Thu, 25 Jul 2019 22:44:27 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 78DC628897 for ; Thu, 25 Jul 2019 22:44:27 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6D72728A73; Thu, 25 Jul 2019 22:44:27 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 853CE28897 for ; Thu, 25 Jul 2019 22:44:26 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 805D96E83C; Thu, 25 Jul 2019 22:44:24 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from fireflyinternet.com (mail.fireflyinternet.com [109.228.58.192]) by gabe.freedesktop.org (Postfix) with ESMTPS id D478B89219 for ; Thu, 25 Jul 2019 22:44:21 +0000 (UTC) X-Default-Received-SPF: pass (skip=forwardok (res=PASS)) x-ip-name=78.156.65.138; Received: from haswell.alporthouse.com (unverified [78.156.65.138]) by fireflyinternet.com (Firefly Internet (M1)) with ESMTP id 17610922-1500050 for multiple; Thu, 25 Jul 2019 23:44:09 +0100 From: Chris Wilson To: intel-gfx@lists.freedesktop.org Date: Thu, 25 Jul 2019 23:44:05 +0100 Message-Id: <20190725224407.4206-3-chris@chris-wilson.co.uk> X-Mailer: git-send-email 2.22.0 In-Reply-To: <20190725224407.4206-1-chris@chris-wilson.co.uk> References: <20190725224407.4206-1-chris@chris-wilson.co.uk> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 3/5] drm/i915: Unshare the idle-barrier from other kernel requests X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP Under some circumstances (see intel_context_prepare_remote_request), we may use a request along a kernel context to modify the logical state of another. To keep the target context in place while the request executes, we take an active reference on it using the kernel timeline. This is the same timeline as we use for the idle-barrier, and so we end up reusing the same active node. Except that the idle barrier is special and cannot be reused in this manner! Give the idle-barrier a reserved timeline index (0) so that it will always be unique (give or take we may issue multiple idle barriers across multiple engines). Reported-by: Lionel Landwerlin Fixes: ce476c80b8bf ("drm/i915: Keep contexts pinned until after the next kernel context switch") Fixes: a9877da2d629 ("drm/i915/oa: Reconfigure contexts on the fly") Signed-off-by: Chris Wilson Cc: Lionel Landwerlin Cc: Tvrtko Ursulin Tested-by: Lionel Landwerlin --- drivers/gpu/drm/i915/gt/intel_context.c | 40 ++- drivers/gpu/drm/i915/gt/intel_context.h | 13 +- drivers/gpu/drm/i915/gt/selftest_context.c | 310 ++++++++++++++++++ drivers/gpu/drm/i915/i915_active.c | 69 +++- .../drm/i915/selftests/i915_live_selftests.h | 3 +- 5 files changed, 398 insertions(+), 37 deletions(-) create mode 100644 drivers/gpu/drm/i915/gt/selftest_context.c diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c index d64b45f7ec6d..211ac6568a5d 100644 --- a/drivers/gpu/drm/i915/gt/intel_context.c +++ b/drivers/gpu/drm/i915/gt/intel_context.c @@ -162,23 +162,41 @@ static int __intel_context_active(struct i915_active *active) if (err) goto err_ring; + return 0; + +err_ring: + intel_ring_unpin(ce->ring); +err_put: + intel_context_put(ce); + return err; +} + +int intel_context_active_acquire(struct intel_context *ce) +{ + int err; + + err = i915_active_acquire(&ce->active); + if (err) + return err; + /* Preallocate tracking nodes */ if (!i915_gem_context_is_kernel(ce->gem_context)) { err = i915_active_acquire_preallocate_barrier(&ce->active, ce->engine); - if (err) - goto err_state; + if (err) { + i915_active_release(&ce->active); + return err; + } } return 0; +} -err_state: - __context_unpin_state(ce->state); -err_ring: - intel_ring_unpin(ce->ring); -err_put: - intel_context_put(ce); - return err; +void intel_context_active_release(struct intel_context *ce) +{ + /* Nodes preallocated in intel_context_active() */ + i915_active_acquire_barrier(&ce->active); + i915_active_release(&ce->active); } void @@ -297,3 +315,7 @@ struct i915_request *intel_context_create_request(struct intel_context *ce) return rq; } + +#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST) +#include "selftest_context.c" +#endif diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h index 23c7e4c0ce7c..07f9924de48f 100644 --- a/drivers/gpu/drm/i915/gt/intel_context.h +++ b/drivers/gpu/drm/i915/gt/intel_context.h @@ -104,17 +104,8 @@ static inline void intel_context_exit(struct intel_context *ce) ce->ops->exit(ce); } -static inline int intel_context_active_acquire(struct intel_context *ce) -{ - return i915_active_acquire(&ce->active); -} - -static inline void intel_context_active_release(struct intel_context *ce) -{ - /* Nodes preallocated in intel_context_active() */ - i915_active_acquire_barrier(&ce->active); - i915_active_release(&ce->active); -} +int intel_context_active_acquire(struct intel_context *ce); +void intel_context_active_release(struct intel_context *ce); static inline struct intel_context *intel_context_get(struct intel_context *ce) { diff --git a/drivers/gpu/drm/i915/gt/selftest_context.c b/drivers/gpu/drm/i915/gt/selftest_context.c new file mode 100644 index 000000000000..e3b45fe747ae --- /dev/null +++ b/drivers/gpu/drm/i915/gt/selftest_context.c @@ -0,0 +1,310 @@ +/* + * SPDX-License-Identifier: GPL-2.0 + * + * Copyright © 2019 Intel Corporation + */ + +#include "i915_selftest.h" +#include "intel_gt.h" + +#include "gem/selftests/mock_context.h" +#include "selftests/igt_flush_test.h" +#include "selftests/mock_drm.h" + +static int request_sync(struct i915_request *rq) +{ + long timeout; + int err = 0; + + i915_request_get(rq); + + i915_request_add(rq); + timeout = i915_request_wait(rq, 0, HZ / 10); + if (timeout < 0) + err = timeout; + else + i915_request_retire_upto(rq); + + i915_request_put(rq); + + return err; +} + +static int context_sync(struct intel_context *ce) +{ + struct intel_timeline *tl = ce->ring->timeline; + int err = 0; + + do { + struct i915_request *rq; + long timeout; + + rcu_read_lock(); + rq = rcu_dereference(tl->last_request.request); + if (rq) + rq = i915_request_get_rcu(rq); + rcu_read_unlock(); + if (!rq) + break; + + timeout = i915_request_wait(rq, 0, HZ / 10); + if (timeout < 0) + err = timeout; + else + i915_request_retire_upto(rq); + + i915_request_put(rq); + } while (!err); + + return err; +} + +static int __live_active_context(struct intel_engine_cs *engine, + struct i915_gem_context *fixme) +{ + struct intel_context *ce; + int pass; + int err; + + /* + * We keep active contexts alive until after a subsequent context + * switch as the final write from the context-save will be after + * we retire the final request. We track when we unpin the context, + * under the presumption that the final pin is from the last request, + * and instead of immediately unpinning the context, we add a task + * to unpin the context from the next idle-barrier. + * + * This test makes sure that the context is kept alive until a + * subsequent idle-barrier (emitted when the engine wakeref hits 0 + * with no more outstanding requests). + */ + + if (intel_engine_pm_is_awake(engine)) { + pr_err("%s is awake before starting %s!\n", + engine->name, __func__); + return -EINVAL; + } + + ce = intel_context_create(fixme, engine); + if (!ce) + return -ENOMEM; + + for (pass = 0; pass <= 2; pass++) { + struct i915_request *rq; + + rq = intel_context_create_request(ce); + if (IS_ERR(rq)) { + err = PTR_ERR(rq); + goto err; + } + + err = request_sync(rq); + if (err) + goto err; + + /* Context will be kept active until after an idle-barrier. */ + if (i915_active_is_idle(&ce->active)) { + pr_err("context is not active; expected idle-barrier (%s pass %d)\n", + engine->name, pass); + err = -EINVAL; + goto err; + } + + if (!intel_engine_pm_is_awake(engine)) { + pr_err("%s is asleep before idle-barrier\n", + engine->name); + err = -EINVAL; + goto err; + } + } + + /* Now make sure our idle-barriers are flushed */ + err = context_sync(engine->kernel_context); + if (err) + goto err; + + if (!i915_active_is_idle(&ce->active)) { + pr_err("context is still active!"); + err = -EINVAL; + } + + if (intel_engine_pm_is_awake(engine)) { + struct drm_printer p = drm_debug_printer(__func__); + + intel_engine_dump(engine, &p, + "%s is still awake after idle-barriers\n", + engine->name); + GEM_TRACE_DUMP(); + + err = -EINVAL; + goto err; + } + +err: + intel_context_put(ce); + return err; +} + +static int live_active_context(void *arg) +{ + struct intel_gt *gt = arg; + struct intel_engine_cs *engine; + struct i915_gem_context *fixme; + enum intel_engine_id id; + struct drm_file *file; + int err = 0; + + file = mock_file(gt->i915); + if (IS_ERR(file)) + return PTR_ERR(file); + + mutex_lock(>->i915->drm.struct_mutex); + + fixme = live_context(gt->i915, file); + if (!fixme) { + err = -ENOMEM; + goto unlock; + } + + for_each_engine(engine, gt->i915, id) { + err = __live_active_context(engine, fixme); + if (err) + break; + + err = igt_flush_test(gt->i915, I915_WAIT_LOCKED); + if (err) + break; + } + +unlock: + mutex_unlock(>->i915->drm.struct_mutex); + mock_file_free(gt->i915, file); + return err; +} + +static int __remote_sync(struct intel_context *ce, struct intel_context *remote) +{ + struct i915_request *rq; + int err; + + err = intel_context_pin(remote); + if (err) + return err; + + rq = intel_context_create_request(ce); + if (IS_ERR(rq)) { + err = PTR_ERR(rq); + goto unpin; + } + + err = intel_context_prepare_remote_request(remote, rq); + if (err) { + i915_request_add(rq); + goto unpin; + } + + err = request_sync(rq); + +unpin: + intel_context_unpin(remote); + return err; +} + +static int __live_remote_context(struct intel_engine_cs *engine, + struct i915_gem_context *fixme) +{ + struct intel_context *local, *remote; + int pass; + int err; + + /* + * Check that our idle barriers do not interfere with normal + * activity tracking. In particular, check that operating + * on the context image remotely (intel_context_prepare_remote_request) + * which inserts foriegn fences into intel_context.active are not + * clobbered. + */ + + remote = intel_context_create(fixme, engine); + if (!remote) + return -ENOMEM; + + local = intel_context_create(fixme, engine); + if (!local) { + err = -ENOMEM; + goto err_remote; + } + + for (pass = 0; pass <= 2; pass++) { + err = __remote_sync(local, remote); + if (err) + break; + + err = __remote_sync(engine->kernel_context, remote); + if (err) + break; + + if (i915_active_is_idle(&remote->active)) { + pr_err("remote context is not active; expected idle-barrier (%s pass %d)\n", + engine->name, pass); + err = -EINVAL; + break; + } + } + + intel_context_put(local); +err_remote: + intel_context_put(remote); + return err; +} + +static int live_remote_context(void *arg) +{ + struct intel_gt *gt = arg; + struct intel_engine_cs *engine; + struct i915_gem_context *fixme; + enum intel_engine_id id; + struct drm_file *file; + int err = 0; + + file = mock_file(gt->i915); + if (IS_ERR(file)) + return PTR_ERR(file); + + mutex_lock(>->i915->drm.struct_mutex); + + fixme = live_context(gt->i915, file); + if (!fixme) { + err = -ENOMEM; + goto unlock; + } + + for_each_engine(engine, gt->i915, id) { + err = __live_remote_context(engine, fixme); + if (err) + break; + + err = igt_flush_test(gt->i915, I915_WAIT_LOCKED); + if (err) + break; + } + +unlock: + mutex_unlock(>->i915->drm.struct_mutex); + mock_file_free(gt->i915, file); + return err; +} + +int intel_context_live_selftests(struct drm_i915_private *i915) +{ + static const struct i915_subtest tests[] = { + SUBTEST(live_active_context), + SUBTEST(live_remote_context), + }; + struct intel_gt *gt = &i915->gt; + + if (intel_gt_is_wedged(gt)) + return 0; + + return intel_gt_live_subtests(tests, gt); +} diff --git a/drivers/gpu/drm/i915/i915_active.c b/drivers/gpu/drm/i915/i915_active.c index 22341c62c204..bbb44b072d1a 100644 --- a/drivers/gpu/drm/i915/i915_active.c +++ b/drivers/gpu/drm/i915/i915_active.c @@ -184,6 +184,7 @@ active_instance(struct i915_active *ref, u64 idx) ref->cache = node; mutex_unlock(&ref->mutex); + BUILD_BUG_ON(offsetof(typeof(*node), base)); return &node->base; } @@ -213,6 +214,8 @@ int i915_active_ref(struct i915_active *ref, struct i915_active_request *active; int err; + GEM_BUG_ON(!timeline); /* reserved for idle-barrier */ + /* Prevent reaping in case we malloc/wait while building the tree */ err = i915_active_acquire(ref); if (err) @@ -223,6 +226,7 @@ int i915_active_ref(struct i915_active *ref, err = -ENOMEM; goto out; } + GEM_BUG_ON(IS_ERR(active->request)); if (!i915_active_request_isset(active)) atomic_inc(&ref->count); @@ -374,6 +378,34 @@ void i915_active_fini(struct i915_active *ref) } #endif +static struct active_node *idle_barrier(struct i915_active *ref) +{ + struct active_node *idle = NULL; + struct rb_node *rb; + + if (RB_EMPTY_ROOT(&ref->tree)) + return NULL; + + mutex_lock(&ref->mutex); + for (rb = rb_first(&ref->tree); rb; rb = rb_next(rb)) { + struct active_node *node; + + node = rb_entry(rb, typeof(*node), node); + if (node->timeline) + break; + + if (!i915_active_request_isset(&node->base)) { + GEM_BUG_ON(!list_empty(&node->base.link)); + rb_erase(rb, &ref->tree); + idle = node; + break; + } + } + mutex_unlock(&ref->mutex); + + return idle; +} + int i915_active_acquire_preallocate_barrier(struct i915_active *ref, struct intel_engine_cs *engine) { @@ -383,23 +415,32 @@ int i915_active_acquire_preallocate_barrier(struct i915_active *ref, int err; GEM_BUG_ON(!engine->mask); + GEM_BUG_ON(!llist_empty(&ref->barriers)); + for_each_engine_masked(engine, i915, engine->mask, tmp) { - struct intel_context *kctx = engine->kernel_context; struct active_node *node; - node = kmem_cache_alloc(global.slab_cache, GFP_KERNEL); - if (unlikely(!node)) { - err = -ENOMEM; - goto unwind; + node = idle_barrier(ref); + if (!node) { + node = kmem_cache_alloc(global.slab_cache, + GFP_KERNEL | + __GFP_RETRY_MAYFAIL | + __GFP_NOWARN); + if (unlikely(!node)) { + err = -ENOMEM; + goto unwind; + } + + node->ref = ref; + node->timeline = 0; + node->base.retire = node_retire; } - i915_active_request_init(&node->base, - (void *)engine, node_retire); - node->timeline = kctx->ring->timeline->fence_context; - node->ref = ref; + intel_engine_pm_get(engine); + + RCU_INIT_POINTER(node->base.request, (void *)engine); atomic_inc(&ref->count); - intel_engine_pm_get(engine); llist_add((struct llist_node *)&node->base.link, &ref->barriers); } @@ -434,6 +475,7 @@ void i915_active_acquire_barrier(struct i915_active *ref) node = container_of((struct list_head *)pos, typeof(*node), base.link); + GEM_BUG_ON(node->timeline); engine = (void *)rcu_access_pointer(node->base.request); RCU_INIT_POINTER(node->base.request, ERR_PTR(-EAGAIN)); @@ -442,12 +484,7 @@ void i915_active_acquire_barrier(struct i915_active *ref) p = &ref->tree.rb_node; while (*p) { parent = *p; - if (rb_entry(parent, - struct active_node, - node)->timeline < node->timeline) - p = &parent->rb_right; - else - p = &parent->rb_left; + p = &parent->rb_left; } rb_link_node(&node->node, parent, p); rb_insert_color(&node->node, &ref->tree); diff --git a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h index 2b31a4ee0b4c..a841d3f9bedc 100644 --- a/drivers/gpu/drm/i915/selftests/i915_live_selftests.h +++ b/drivers/gpu/drm/i915/selftests/i915_live_selftests.h @@ -15,6 +15,7 @@ selftest(workarounds, intel_workarounds_live_selftests) selftest(timelines, intel_timeline_live_selftests) selftest(requests, i915_request_live_selftests) selftest(active, i915_active_live_selftests) +selftest(gt_contexts, intel_context_live_selftests) selftest(objects, i915_gem_object_live_selftests) selftest(mman, i915_gem_mman_live_selftests) selftest(dmabuf, i915_gem_dmabuf_live_selftests) @@ -24,7 +25,7 @@ selftest(gtt, i915_gem_gtt_live_selftests) selftest(gem, i915_gem_live_selftests) selftest(evict, i915_gem_evict_live_selftests) selftest(hugepages, i915_gem_huge_page_live_selftests) -selftest(contexts, i915_gem_context_live_selftests) +selftest(gem_contexts, i915_gem_context_live_selftests) selftest(blt, i915_gem_object_blt_live_selftests) selftest(client, i915_gem_client_blt_live_selftests) selftest(reset, intel_reset_live_selftests) From patchwork Thu Jul 25 22:44:06 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Wilson X-Patchwork-Id: 11059923 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id C5E376C5 for ; Thu, 25 Jul 2019 22:44:34 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id B4A9628A71 for ; Thu, 25 Jul 2019 22:44:34 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id A8BC628A74; Thu, 25 Jul 2019 22:44:34 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 3EC9F28A71 for ; Thu, 25 Jul 2019 22:44:34 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id C49876E83B; Thu, 25 Jul 2019 22:44:33 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from fireflyinternet.com (mail.fireflyinternet.com [109.228.58.192]) by gabe.freedesktop.org (Postfix) with ESMTPS id F30016E83A for ; Thu, 25 Jul 2019 22:44:28 +0000 (UTC) X-Default-Received-SPF: pass (skip=forwardok (res=PASS)) x-ip-name=78.156.65.138; Received: from haswell.alporthouse.com (unverified [78.156.65.138]) by fireflyinternet.com (Firefly Internet (M1)) with ESMTP id 17610923-1500050 for multiple; Thu, 25 Jul 2019 23:44:09 +0100 From: Chris Wilson To: intel-gfx@lists.freedesktop.org Date: Thu, 25 Jul 2019 23:44:06 +0100 Message-Id: <20190725224407.4206-4-chris@chris-wilson.co.uk> X-Mailer: git-send-email 2.22.0 In-Reply-To: <20190725224407.4206-1-chris@chris-wilson.co.uk> References: <20190725224407.4206-1-chris@chris-wilson.co.uk> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 4/5] drm/i915/execlists: Force preemption X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP If the preempted context takes too long to relinquish control, e.g. it is stuck inside a shader with arbitration disabled, evict that context with an engine reset. This ensures that preemptions are reasonably responsive, providing a tighter QoS for the more important context at the cost of flagging unresponsive contexts more frequently (i.e. instead of using an ~10s hangcheck, we now evict at ~100ms). The challenge of lies in picking a timeout that can be reasonably serviced by HW for typical workloads, balancing the existing clients against the needs for responsiveness. Signed-off-by: Chris Wilson Cc: Mika Kuoppala Cc: Tvrtko Ursulin Reviewed-by: Mika Kuoppala --- drivers/gpu/drm/i915/Kconfig.profile | 12 ++++++ drivers/gpu/drm/i915/gt/intel_lrc.c | 62 ++++++++++++++++++++++++++-- 2 files changed, 71 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile index 48df8889a88a..3184e8491333 100644 --- a/drivers/gpu/drm/i915/Kconfig.profile +++ b/drivers/gpu/drm/i915/Kconfig.profile @@ -25,3 +25,15 @@ config DRM_I915_SPIN_REQUEST May be 0 to disable the initial spin. In practice, we estimate the cost of enabling the interrupt (if currently disabled) to be a few microseconds. + +config DRM_I915_PREEMPT_TIMEOUT + int "Preempt timeout (ms)" + default 100 # milliseconds + help + How long to wait (in milliseconds) for a preemption event to occur + when submitting a new context via execlists. If the current context + does not hit an arbitration point and yield to HW before the timer + expires, the HW will be reset to allow the more important context + to execute. + + May be 0 to disable the timeout. diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c index 884dfc1cb033..b85ee12c2451 100644 --- a/drivers/gpu/drm/i915/gt/intel_lrc.c +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c @@ -945,6 +945,21 @@ static void record_preemption(struct intel_engine_execlists *execlists) (void)I915_SELFTEST_ONLY(execlists->preempt_hang.count++); } +static unsigned long preempt_expires(void) +{ + const unsigned long timeout = + msecs_to_jiffies_timeout(CONFIG_DRM_I915_PREEMPT_TIMEOUT); + + /* + * Paranoia to make sure the compiler computes the timeout before + * loading 'jiffies' as jiffies is volatile and may be updated in + * the background by a timer tick. All to reduce the complexity + * of the addition and reduce the risk of losing a jiffie. + */ + barrier(); + return jiffies + timeout; +} + static void execlists_dequeue(struct intel_engine_cs *engine) { struct intel_engine_execlists * const execlists = &engine->execlists; @@ -1283,6 +1298,8 @@ static void execlists_dequeue(struct intel_engine_cs *engine) *port = execlists_schedule_in(last, port - execlists->pending); memset(port + 1, 0, (last_port - port) * sizeof(*port)); execlists_submit_ports(engine); + if (CONFIG_DRM_I915_PREEMPT_TIMEOUT) + mod_timer(&execlists->timer, preempt_expires()); } else { ring_set_paused(engine, 0); } @@ -1467,13 +1484,45 @@ static void process_csb(struct intel_engine_cs *engine) invalidate_csb_entries(&buf[0], &buf[num_entries - 1]); } -static void __execlists_submission_tasklet(struct intel_engine_cs *const engine) +static bool __execlists_submission_tasklet(struct intel_engine_cs *const engine) { lockdep_assert_held(&engine->active.lock); process_csb(engine); - if (!engine->execlists.pending[0]) + if (!engine->execlists.pending[0]) { execlists_dequeue(engine); + return true; + } + + return false; +} + +static void preempt_reset(struct intel_engine_cs *engine) +{ + const unsigned int bit = I915_RESET_ENGINE + engine->id; + unsigned long *lock = &engine->gt->reset.flags; + + if (test_and_set_bit(bit, lock)) + return; + + /* Mark this tasklet as disabled to avoid waiting for it to complete */ + tasklet_disable_nosync(&engine->execlists.tasklet); + + intel_engine_reset(engine, "preemption time out"); + + tasklet_enable(&engine->execlists.tasklet); + clear_and_wake_up_bit(bit, lock); +} + +static bool preempt_timeout(struct intel_engine_cs *const engine) +{ + if (!CONFIG_DRM_I915_PREEMPT_TIMEOUT) + return false; + + if (!intel_engine_has_preemption(engine)) + return false; + + return !timer_pending(&engine->execlists.timer); } /* @@ -1484,10 +1533,17 @@ static void execlists_submission_tasklet(unsigned long data) { struct intel_engine_cs * const engine = (struct intel_engine_cs *)data; unsigned long flags; + bool reset = false; spin_lock_irqsave(&engine->active.lock, flags); - __execlists_submission_tasklet(engine); + + if (!__execlists_submission_tasklet(engine) && preempt_timeout(engine)) + reset = true; + spin_unlock_irqrestore(&engine->active.lock, flags); + + if (reset) + preempt_reset(engine); } static void execlists_submission_timer(struct timer_list *timer) From patchwork Thu Jul 25 22:44:07 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Chris Wilson X-Patchwork-Id: 11059919 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 80A181580 for ; Thu, 25 Jul 2019 22:44:26 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 6D62B28897 for ; Thu, 25 Jul 2019 22:44:26 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 6183228A73; Thu, 25 Jul 2019 22:44:26 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 8B45B28897 for ; Thu, 25 Jul 2019 22:44:24 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id C2ACA6E838; Thu, 25 Jul 2019 22:44:23 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from fireflyinternet.com (mail.fireflyinternet.com [109.228.58.192]) by gabe.freedesktop.org (Postfix) with ESMTPS id C0AAA890B6 for ; Thu, 25 Jul 2019 22:44:20 +0000 (UTC) X-Default-Received-SPF: pass (skip=forwardok (res=PASS)) x-ip-name=78.156.65.138; Received: from haswell.alporthouse.com (unverified [78.156.65.138]) by fireflyinternet.com (Firefly Internet (M1)) with ESMTP id 17610924-1500050 for multiple; Thu, 25 Jul 2019 23:44:10 +0100 From: Chris Wilson To: intel-gfx@lists.freedesktop.org Date: Thu, 25 Jul 2019 23:44:07 +0100 Message-Id: <20190725224407.4206-5-chris@chris-wilson.co.uk> X-Mailer: git-send-email 2.22.0 In-Reply-To: <20190725224407.4206-1-chris@chris-wilson.co.uk> References: <20190725224407.4206-1-chris@chris-wilson.co.uk> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH 5/5] drm/i915: Replace hangcheck by heartbeats X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP Replace sampling the engine state every so often with a periodic heartbeat request to measure the health of an engine. This is coupled with the forced-preemption to allow long running requests to survive so long as they do not block other users. Signed-off-by: Chris Wilson Cc: Joonas Lahtinen Cc: Tvrtko Ursulin Cc: Jon Bloomfield --- Note, strictly this still requires struct_mutex-free retirement for the corner case where the idle-worker is ineffective and we get a backlog of requests on the kernel ring. Or if the legacy ringbuffer is full... When we are able to retire from outside of struct_mutex we can do the full idle-barrier and idle-work from here. --- drivers/gpu/drm/i915/Kconfig.profile | 11 + drivers/gpu/drm/i915/Makefile | 2 +- drivers/gpu/drm/i915/gt/intel_engine.h | 32 -- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 10 +- .../gpu/drm/i915/gt/intel_engine_heartbeat.c | 96 +++++ .../gpu/drm/i915/gt/intel_engine_heartbeat.h | 17 + drivers/gpu/drm/i915/gt/intel_engine_pm.c | 5 +- drivers/gpu/drm/i915/gt/intel_engine_types.h | 14 +- drivers/gpu/drm/i915/gt/intel_gt.c | 1 - drivers/gpu/drm/i915/gt/intel_gt.h | 4 - drivers/gpu/drm/i915/gt/intel_gt_pm.c | 2 - drivers/gpu/drm/i915/gt/intel_hangcheck.c | 360 ------------------ drivers/gpu/drm/i915/gt/intel_reset.c | 3 +- drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 3 - drivers/gpu/drm/i915/i915_debugfs.c | 86 ----- drivers/gpu/drm/i915/i915_drv.c | 3 +- drivers/gpu/drm/i915/i915_drv.h | 1 - drivers/gpu/drm/i915/i915_gpu_error.c | 37 +- drivers/gpu/drm/i915/i915_gpu_error.h | 2 - drivers/gpu/drm/i915/i915_params.c | 5 - drivers/gpu/drm/i915/i915_params.h | 1 - 21 files changed, 148 insertions(+), 547 deletions(-) create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c create mode 100644 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h delete mode 100644 drivers/gpu/drm/i915/gt/intel_hangcheck.c diff --git a/drivers/gpu/drm/i915/Kconfig.profile b/drivers/gpu/drm/i915/Kconfig.profile index 3184e8491333..aafb57f84169 100644 --- a/drivers/gpu/drm/i915/Kconfig.profile +++ b/drivers/gpu/drm/i915/Kconfig.profile @@ -37,3 +37,14 @@ config DRM_I915_PREEMPT_TIMEOUT to execute. May be 0 to disable the timeout. + +config DRM_I915_HEARTBEAT_INTERVAL + int "Interval between heartbeat pulses (ms)" + default 2500 # microseconds + help + While active the driver uses a periodic request, a heartbeat, to + check the wellness of the GPU and to regularly flush state changes + (idle barriers). + + May be 0 to disable heartbeats and therefore disable automatic GPU + hang detection. diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile index 524516251a40..18201852d68d 100644 --- a/drivers/gpu/drm/i915/Makefile +++ b/drivers/gpu/drm/i915/Makefile @@ -73,10 +73,10 @@ gt-y += \ gt/intel_breadcrumbs.o \ gt/intel_context.o \ gt/intel_engine_cs.o \ + gt/intel_engine_heartbeat.o \ gt/intel_engine_pm.o \ gt/intel_gt.o \ gt/intel_gt_pm.o \ - gt/intel_hangcheck.o \ gt/intel_lrc.o \ gt/intel_renderstate.o \ gt/intel_reset.o \ diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h index db5c73ce86ee..38eeb4320c97 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine.h +++ b/drivers/gpu/drm/i915/gt/intel_engine.h @@ -90,38 +90,6 @@ struct drm_printer; /* seqno size is actually only a uint32, but since we plan to use MI_FLUSH_DW to * do the writes, and that must have qw aligned offsets, simply pretend it's 8b. */ -enum intel_engine_hangcheck_action { - ENGINE_IDLE = 0, - ENGINE_WAIT, - ENGINE_ACTIVE_SEQNO, - ENGINE_ACTIVE_HEAD, - ENGINE_ACTIVE_SUBUNITS, - ENGINE_WAIT_KICK, - ENGINE_DEAD, -}; - -static inline const char * -hangcheck_action_to_str(const enum intel_engine_hangcheck_action a) -{ - switch (a) { - case ENGINE_IDLE: - return "idle"; - case ENGINE_WAIT: - return "wait"; - case ENGINE_ACTIVE_SEQNO: - return "active seqno"; - case ENGINE_ACTIVE_HEAD: - return "active head"; - case ENGINE_ACTIVE_SUBUNITS: - return "active subunits"; - case ENGINE_WAIT_KICK: - return "wait kick"; - case ENGINE_DEAD: - return "dead"; - } - - return "unknown"; -} void intel_engines_set_scheduler_caps(struct drm_i915_private *i915); diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 65cbf1d9118d..89f4a3825b77 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -621,7 +621,6 @@ static int intel_engine_setup_common(struct intel_engine_cs *engine) intel_engine_init_active(engine, ENGINE_PHYSICAL); intel_engine_init_breadcrumbs(engine); intel_engine_init_execlists(engine); - intel_engine_init_hangcheck(engine); intel_engine_init_batch_pool(engine); intel_engine_init_cmd_parser(engine); intel_engine_init__pm(engine); @@ -1453,8 +1452,13 @@ void intel_engine_dump(struct intel_engine_cs *engine, drm_printf(m, "*** WEDGED ***\n"); drm_printf(m, "\tAwake? %d\n", atomic_read(&engine->wakeref.count)); - drm_printf(m, "\tHangcheck: %d ms ago\n", - jiffies_to_msecs(jiffies - engine->hangcheck.action_timestamp)); + + rcu_read_lock(); + rq = READ_ONCE(engine->last_heartbeat); + if (rq) + drm_printf(m, "\tHeartbeat: %d ms ago\n", + jiffies_to_msecs(jiffies - rq->emitted_jiffies)); + rcu_read_unlock(); drm_printf(m, "\tReset count: %d (global %d)\n", i915_reset_engine_count(error, engine), i915_reset_count(error)); diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c new file mode 100644 index 000000000000..4a955835c0fa --- /dev/null +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c @@ -0,0 +1,96 @@ +/* + * SPDX-License-Identifier: MIT + * + * Copyright © 2019 Intel Corporation + */ + +#include "i915_request.h" + +#include "intel_context.h" +#include "intel_engine_heartbeat.h" +#include "intel_engine_pm.h" +#include "intel_gt.h" +#include "intel_reset.h" + +static long delay(void) +{ + const long t = msecs_to_jiffies(CONFIG_DRM_I915_HEARTBEAT_INTERVAL); + + return round_jiffies_up_relative(t); +} + +static void heartbeat(struct work_struct *wrk) +{ + struct intel_engine_cs *engine = + container_of(wrk, typeof(*engine), heartbeat.work); + struct intel_context *ce = engine->kernel_context; + struct i915_request *rq; + + if (!intel_engine_pm_get_if_awake(engine)) + return; + + rq = engine->last_heartbeat; + if (rq && i915_request_completed(rq)) { + i915_request_put(rq); + engine->last_heartbeat = NULL; + } + + if (intel_gt_is_wedged(engine->gt)) + goto out; + + if (engine->last_heartbeat) { + intel_gt_handle_error(engine->gt, engine->mask, + I915_ERROR_CAPTURE, + "stopped heartbeat on %s", engine->name); + goto out; + } + + if (engine->wakeref_serial == engine->serial) + goto out; + + if (intel_context_timeline_lock(ce)) + goto out; + + intel_context_enter(ce); + rq = __i915_request_create(ce, GFP_NOWAIT | __GFP_NOWARN); + intel_context_exit(ce); + if (IS_ERR(rq)) + goto unlock; + + engine->last_heartbeat = i915_request_get(rq); + engine->wakeref_serial = engine->serial + 1; + + i915_request_add_barriers(rq); + __i915_request_commit(rq); + + if (engine->schedule) { + struct i915_sched_attr attr = { .priority = INT_MAX }; + + engine->schedule(rq, &attr); + } + +unlock: + intel_context_timeline_unlock(ce); +out: + schedule_delayed_work(&engine->heartbeat, delay()); + intel_engine_pm_put(engine); +} + +void intel_engine_unpark_heartbeat(struct intel_engine_cs *engine) +{ + if (!CONFIG_DRM_I915_HEARTBEAT_INTERVAL) + return; + + schedule_delayed_work(&engine->heartbeat, delay()); +} + +void intel_engine_park_heartbeat(struct intel_engine_cs *engine) +{ + cancel_delayed_work_sync(&engine->heartbeat); + i915_request_put(fetch_and_zero(&engine->last_heartbeat)); +} + +void intel_engine_init_heartbeat(struct intel_engine_cs *engine) +{ + INIT_DELAYED_WORK(&engine->heartbeat, heartbeat); +} diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h new file mode 100644 index 000000000000..e04f9c7aa85d --- /dev/null +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.h @@ -0,0 +1,17 @@ +/* + * SPDX-License-Identifier: MIT + * + * Copyright © 2019 Intel Corporation + */ + +#ifndef INTEL_ENGINE_HEARTBEAT_H +#define INTEL_ENGINE_HEARTBEAT_H + +struct intel_engine_cs; + +void intel_engine_init_heartbeat(struct intel_engine_cs *engine); + +void intel_engine_park_heartbeat(struct intel_engine_cs *engine); +void intel_engine_unpark_heartbeat(struct intel_engine_cs *engine); + +#endif /* INTEL_ENGINE_HEARTBEAT_H */ diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c index e74fbf04a68d..6349bb038c72 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c @@ -7,6 +7,7 @@ #include "i915_drv.h" #include "intel_engine.h" +#include "intel_engine_heartbeat.h" #include "intel_engine_pm.h" #include "intel_gt.h" #include "intel_gt_pm.h" @@ -32,7 +33,7 @@ static int __engine_unpark(struct intel_wakeref *wf) if (engine->unpark) engine->unpark(engine); - intel_engine_init_hangcheck(engine); + intel_engine_unpark_heartbeat(engine); return 0; } @@ -115,6 +116,7 @@ static int __engine_park(struct intel_wakeref *wf) GEM_TRACE("%s\n", engine->name); + intel_engine_park_heartbeat(engine); intel_engine_disarm_breadcrumbs(engine); /* Must be reset upon idling, or we may miss the busy wakeup. */ @@ -142,4 +144,5 @@ void intel_engine_pm_put(struct intel_engine_cs *engine) void intel_engine_init__pm(struct intel_engine_cs *engine) { intel_wakeref_init(&engine->wakeref); + intel_engine_init_heartbeat(engine); } diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h index 8be63019d707..9a3ead556743 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h @@ -14,6 +14,7 @@ #include #include #include +#include #include "i915_gem.h" #include "i915_gem_batch_pool.h" @@ -55,14 +56,6 @@ struct intel_instdone { u32 row[I915_MAX_SLICES][I915_MAX_SUBSLICES]; }; -struct intel_engine_hangcheck { - u64 acthd; - u32 last_ring; - u32 last_head; - unsigned long action_timestamp; - struct intel_instdone instdone; -}; - struct intel_ring { struct kref ref; struct i915_vma *vma; @@ -298,6 +291,9 @@ struct intel_engine_cs { struct drm_i915_gem_object *default_state; void *pinned_default_state; + struct delayed_work heartbeat; + struct i915_request *last_heartbeat; + /* Rather than have every client wait upon all user interrupts, * with the herd waking after every interrupt and each doing the * heavyweight seqno dance, we delegate the task (of being the @@ -437,8 +433,6 @@ struct intel_engine_cs { /* status_notifier: list of callbacks for context-switch changes */ struct atomic_notifier_head context_status_notifier; - struct intel_engine_hangcheck hangcheck; - #define I915_ENGINE_NEEDS_CMD_PARSER BIT(0) #define I915_ENGINE_SUPPORTS_STATS BIT(1) #define I915_ENGINE_HAS_PREEMPTION BIT(2) diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c index f7e69db4019d..feaf916c9abd 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.c +++ b/drivers/gpu/drm/i915/gt/intel_gt.c @@ -19,7 +19,6 @@ void intel_gt_init_early(struct intel_gt *gt, struct drm_i915_private *i915) spin_lock_init(>->closed_lock); - intel_gt_init_hangcheck(gt); intel_gt_init_reset(gt); intel_gt_pm_init_early(gt); } diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h b/drivers/gpu/drm/i915/gt/intel_gt.h index 640bb0531f5b..54d22913ede3 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt.h +++ b/drivers/gpu/drm/i915/gt/intel_gt.h @@ -39,8 +39,6 @@ void intel_gt_clear_error_registers(struct intel_gt *gt, void intel_gt_flush_ggtt_writes(struct intel_gt *gt); void intel_gt_chipset_flush(struct intel_gt *gt); -void intel_gt_init_hangcheck(struct intel_gt *gt); - int intel_gt_init_scratch(struct intel_gt *gt, unsigned int size); void intel_gt_fini_scratch(struct intel_gt *gt); @@ -55,6 +53,4 @@ static inline bool intel_gt_is_wedged(struct intel_gt *gt) return __intel_reset_failed(>->reset); } -void intel_gt_queue_hangcheck(struct intel_gt *gt); - #endif /* __INTEL_GT_H__ */ diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c index 65c0d0c9d543..d223b07ee324 100644 --- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c +++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c @@ -46,8 +46,6 @@ static int intel_gt_unpark(struct intel_wakeref *wf) i915_pmu_gt_unparked(i915); - intel_gt_queue_hangcheck(gt); - pm_notify(i915, INTEL_GT_UNPARK); return 0; diff --git a/drivers/gpu/drm/i915/gt/intel_hangcheck.c b/drivers/gpu/drm/i915/gt/intel_hangcheck.c deleted file mode 100644 index 05d042cdefe2..000000000000 --- a/drivers/gpu/drm/i915/gt/intel_hangcheck.c +++ /dev/null @@ -1,360 +0,0 @@ -/* - * Copyright © 2016 Intel Corporation - * - * Permission is hereby granted, free of charge, to any person obtaining a - * copy of this software and associated documentation files (the "Software"), - * to deal in the Software without restriction, including without limitation - * the rights to use, copy, modify, merge, publish, distribute, sublicense, - * and/or sell copies of the Software, and to permit persons to whom the - * Software is furnished to do so, subject to the following conditions: - * - * The above copyright notice and this permission notice (including the next - * paragraph) shall be included in all copies or substantial portions of the - * Software. - * - * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR - * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL - * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER - * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING - * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS - * IN THE SOFTWARE. - * - */ - -#include "i915_drv.h" -#include "intel_engine.h" -#include "intel_gt.h" -#include "intel_reset.h" - -struct hangcheck { - u64 acthd; - u32 ring; - u32 head; - enum intel_engine_hangcheck_action action; - unsigned long action_timestamp; - int deadlock; - struct intel_instdone instdone; - bool wedged:1; - bool stalled:1; -}; - -static bool instdone_unchanged(u32 current_instdone, u32 *old_instdone) -{ - u32 tmp = current_instdone | *old_instdone; - bool unchanged; - - unchanged = tmp == *old_instdone; - *old_instdone |= tmp; - - return unchanged; -} - -static bool subunits_stuck(struct intel_engine_cs *engine) -{ - struct drm_i915_private *dev_priv = engine->i915; - struct intel_instdone instdone; - struct intel_instdone *accu_instdone = &engine->hangcheck.instdone; - bool stuck; - int slice; - int subslice; - - intel_engine_get_instdone(engine, &instdone); - - /* There might be unstable subunit states even when - * actual head is not moving. Filter out the unstable ones by - * accumulating the undone -> done transitions and only - * consider those as progress. - */ - stuck = instdone_unchanged(instdone.instdone, - &accu_instdone->instdone); - stuck &= instdone_unchanged(instdone.slice_common, - &accu_instdone->slice_common); - - for_each_instdone_slice_subslice(dev_priv, slice, subslice) { - stuck &= instdone_unchanged(instdone.sampler[slice][subslice], - &accu_instdone->sampler[slice][subslice]); - stuck &= instdone_unchanged(instdone.row[slice][subslice], - &accu_instdone->row[slice][subslice]); - } - - return stuck; -} - -static enum intel_engine_hangcheck_action -head_stuck(struct intel_engine_cs *engine, u64 acthd) -{ - if (acthd != engine->hangcheck.acthd) { - - /* Clear subunit states on head movement */ - memset(&engine->hangcheck.instdone, 0, - sizeof(engine->hangcheck.instdone)); - - return ENGINE_ACTIVE_HEAD; - } - - if (!subunits_stuck(engine)) - return ENGINE_ACTIVE_SUBUNITS; - - return ENGINE_DEAD; -} - -static enum intel_engine_hangcheck_action -engine_stuck(struct intel_engine_cs *engine, u64 acthd) -{ - enum intel_engine_hangcheck_action ha; - u32 tmp; - - ha = head_stuck(engine, acthd); - if (ha != ENGINE_DEAD) - return ha; - - if (IS_GEN(engine->i915, 2)) - return ENGINE_DEAD; - - /* Is the chip hanging on a WAIT_FOR_EVENT? - * If so we can simply poke the RB_WAIT bit - * and break the hang. This should work on - * all but the second generation chipsets. - */ - tmp = ENGINE_READ(engine, RING_CTL); - if (tmp & RING_WAIT) { - intel_gt_handle_error(engine->gt, engine->mask, 0, - "stuck wait on %s", engine->name); - ENGINE_WRITE(engine, RING_CTL, tmp); - return ENGINE_WAIT_KICK; - } - - return ENGINE_DEAD; -} - -static void hangcheck_load_sample(struct intel_engine_cs *engine, - struct hangcheck *hc) -{ - hc->acthd = intel_engine_get_active_head(engine); - hc->ring = ENGINE_READ(engine, RING_START); - hc->head = ENGINE_READ(engine, RING_HEAD); -} - -static void hangcheck_store_sample(struct intel_engine_cs *engine, - const struct hangcheck *hc) -{ - engine->hangcheck.acthd = hc->acthd; - engine->hangcheck.last_ring = hc->ring; - engine->hangcheck.last_head = hc->head; -} - -static enum intel_engine_hangcheck_action -hangcheck_get_action(struct intel_engine_cs *engine, - const struct hangcheck *hc) -{ - if (intel_engine_is_idle(engine)) - return ENGINE_IDLE; - - if (engine->hangcheck.last_ring != hc->ring) - return ENGINE_ACTIVE_SEQNO; - - if (engine->hangcheck.last_head != hc->head) - return ENGINE_ACTIVE_SEQNO; - - return engine_stuck(engine, hc->acthd); -} - -static void hangcheck_accumulate_sample(struct intel_engine_cs *engine, - struct hangcheck *hc) -{ - unsigned long timeout = I915_ENGINE_DEAD_TIMEOUT; - - hc->action = hangcheck_get_action(engine, hc); - - /* We always increment the progress - * if the engine is busy and still processing - * the same request, so that no single request - * can run indefinitely (such as a chain of - * batches). The only time we do not increment - * the hangcheck score on this ring, if this - * engine is in a legitimate wait for another - * engine. In that case the waiting engine is a - * victim and we want to be sure we catch the - * right culprit. Then every time we do kick - * the ring, make it as a progress as the seqno - * advancement might ensure and if not, it - * will catch the hanging engine. - */ - - switch (hc->action) { - case ENGINE_IDLE: - case ENGINE_ACTIVE_SEQNO: - /* Clear head and subunit states on seqno movement */ - hc->acthd = 0; - - memset(&engine->hangcheck.instdone, 0, - sizeof(engine->hangcheck.instdone)); - - /* Intentional fall through */ - case ENGINE_WAIT_KICK: - case ENGINE_WAIT: - engine->hangcheck.action_timestamp = jiffies; - break; - - case ENGINE_ACTIVE_HEAD: - case ENGINE_ACTIVE_SUBUNITS: - /* - * Seqno stuck with still active engine gets leeway, - * in hopes that it is just a long shader. - */ - timeout = I915_SEQNO_DEAD_TIMEOUT; - break; - - case ENGINE_DEAD: - break; - - default: - MISSING_CASE(hc->action); - } - - hc->stalled = time_after(jiffies, - engine->hangcheck.action_timestamp + timeout); - hc->wedged = time_after(jiffies, - engine->hangcheck.action_timestamp + - I915_ENGINE_WEDGED_TIMEOUT); -} - -static void hangcheck_declare_hang(struct intel_gt *gt, - intel_engine_mask_t hung, - intel_engine_mask_t stuck) -{ - struct intel_engine_cs *engine; - intel_engine_mask_t tmp; - char msg[80]; - int len; - - /* If some rings hung but others were still busy, only - * blame the hanging rings in the synopsis. - */ - if (stuck != hung) - hung &= ~stuck; - len = scnprintf(msg, sizeof(msg), - "%s on ", stuck == hung ? "no progress" : "hang"); - for_each_engine_masked(engine, gt->i915, hung, tmp) - len += scnprintf(msg + len, sizeof(msg) - len, - "%s, ", engine->name); - msg[len-2] = '\0'; - - return intel_gt_handle_error(gt, hung, I915_ERROR_CAPTURE, "%s", msg); -} - -/* - * This is called when the chip hasn't reported back with completed - * batchbuffers in a long time. We keep track per ring seqno progress and - * if there are no progress, hangcheck score for that ring is increased. - * Further, acthd is inspected to see if the ring is stuck. On stuck case - * we kick the ring. If we see no progress on three subsequent calls - * we assume chip is wedged and try to fix it by resetting the chip. - */ -static void hangcheck_elapsed(struct work_struct *work) -{ - struct intel_gt *gt = - container_of(work, typeof(*gt), hangcheck.work.work); - intel_engine_mask_t hung = 0, stuck = 0, wedged = 0; - struct intel_engine_cs *engine; - enum intel_engine_id id; - intel_wakeref_t wakeref; - - if (!i915_modparams.enable_hangcheck) - return; - - if (!READ_ONCE(gt->awake)) - return; - - if (intel_gt_is_wedged(gt)) - return; - - wakeref = intel_runtime_pm_get_if_in_use(>->i915->runtime_pm); - if (!wakeref) - return; - - /* As enabling the GPU requires fairly extensive mmio access, - * periodically arm the mmio checker to see if we are triggering - * any invalid access. - */ - intel_uncore_arm_unclaimed_mmio_detection(gt->uncore); - - for_each_engine(engine, gt->i915, id) { - struct hangcheck hc; - - intel_engine_signal_breadcrumbs(engine); - - hangcheck_load_sample(engine, &hc); - hangcheck_accumulate_sample(engine, &hc); - hangcheck_store_sample(engine, &hc); - - if (hc.stalled) { - hung |= engine->mask; - if (hc.action != ENGINE_DEAD) - stuck |= engine->mask; - } - - if (hc.wedged) - wedged |= engine->mask; - } - - if (GEM_SHOW_DEBUG() && (hung | stuck)) { - struct drm_printer p = drm_debug_printer("hangcheck"); - - for_each_engine(engine, gt->i915, id) { - if (intel_engine_is_idle(engine)) - continue; - - intel_engine_dump(engine, &p, "%s\n", engine->name); - } - } - - if (wedged) { - dev_err(gt->i915->drm.dev, - "GPU recovery timed out," - " cancelling all in-flight rendering.\n"); - GEM_TRACE_DUMP(); - intel_gt_set_wedged(gt); - } - - if (hung) - hangcheck_declare_hang(gt, hung, stuck); - - intel_runtime_pm_put(>->i915->runtime_pm, wakeref); - - /* Reset timer in case GPU hangs without another request being added */ - intel_gt_queue_hangcheck(gt); -} - -void intel_gt_queue_hangcheck(struct intel_gt *gt) -{ - unsigned long delay; - - if (unlikely(!i915_modparams.enable_hangcheck)) - return; - - /* - * Don't continually defer the hangcheck so that it is always run at - * least once after work has been scheduled on any ring. Otherwise, - * we will ignore a hung ring if a second ring is kept busy. - */ - - delay = round_jiffies_up_relative(DRM_I915_HANGCHECK_JIFFIES); - queue_delayed_work(system_long_wq, >->hangcheck.work, delay); -} - -void intel_engine_init_hangcheck(struct intel_engine_cs *engine) -{ - memset(&engine->hangcheck, 0, sizeof(engine->hangcheck)); - engine->hangcheck.action_timestamp = jiffies; -} - -void intel_gt_init_hangcheck(struct intel_gt *gt) -{ - INIT_DELAYED_WORK(>->hangcheck.work, hangcheck_elapsed); -} - -#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST) -#include "selftest_hangcheck.c" -#endif diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c index 98c071fe532b..0cb457538e62 100644 --- a/drivers/gpu/drm/i915/gt/intel_reset.c +++ b/drivers/gpu/drm/i915/gt/intel_reset.c @@ -978,8 +978,6 @@ void intel_gt_reset(struct intel_gt *gt, if (ret) goto taint; - intel_gt_queue_hangcheck(gt); - finish: reset_finish(gt, awake); unlock: @@ -1305,4 +1303,5 @@ void __intel_fini_wedge(struct intel_wedge_me *w) #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST) #include "selftest_reset.c" +#include "selftest_hangcheck.c" #endif diff --git a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c index e2fa38a1ff0f..0badee567276 100644 --- a/drivers/gpu/drm/i915/gt/selftest_hangcheck.c +++ b/drivers/gpu/drm/i915/gt/selftest_hangcheck.c @@ -1732,7 +1732,6 @@ int intel_hangcheck_live_selftests(struct drm_i915_private *i915) }; struct intel_gt *gt = &i915->gt; intel_wakeref_t wakeref; - bool saved_hangcheck; int err; if (!intel_has_gpu_reset(gt->i915)) @@ -1742,7 +1741,6 @@ int intel_hangcheck_live_selftests(struct drm_i915_private *i915) return -EIO; /* we're long past hope of a successful reset */ wakeref = intel_runtime_pm_get(>->i915->runtime_pm); - saved_hangcheck = fetch_and_zero(&i915_modparams.enable_hangcheck); drain_delayed_work(>->hangcheck.work); /* flush param */ err = intel_gt_live_subtests(tests, gt); @@ -1751,7 +1749,6 @@ int intel_hangcheck_live_selftests(struct drm_i915_private *i915) igt_flush_test(gt->i915, I915_WAIT_LOCKED); mutex_unlock(>->i915->drm.struct_mutex); - i915_modparams.enable_hangcheck = saved_hangcheck; intel_runtime_pm_put(>->i915->runtime_pm, wakeref); return err; diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c index 24787bb48c9f..1f0408ff345b 100644 --- a/drivers/gpu/drm/i915/i915_debugfs.c +++ b/drivers/gpu/drm/i915/i915_debugfs.c @@ -1044,91 +1044,6 @@ static int i915_frequency_info(struct seq_file *m, void *unused) return ret; } -static void i915_instdone_info(struct drm_i915_private *dev_priv, - struct seq_file *m, - struct intel_instdone *instdone) -{ - int slice; - int subslice; - - seq_printf(m, "\t\tINSTDONE: 0x%08x\n", - instdone->instdone); - - if (INTEL_GEN(dev_priv) <= 3) - return; - - seq_printf(m, "\t\tSC_INSTDONE: 0x%08x\n", - instdone->slice_common); - - if (INTEL_GEN(dev_priv) <= 6) - return; - - for_each_instdone_slice_subslice(dev_priv, slice, subslice) - seq_printf(m, "\t\tSAMPLER_INSTDONE[%d][%d]: 0x%08x\n", - slice, subslice, instdone->sampler[slice][subslice]); - - for_each_instdone_slice_subslice(dev_priv, slice, subslice) - seq_printf(m, "\t\tROW_INSTDONE[%d][%d]: 0x%08x\n", - slice, subslice, instdone->row[slice][subslice]); -} - -static int i915_hangcheck_info(struct seq_file *m, void *unused) -{ - struct drm_i915_private *i915 = node_to_i915(m->private); - struct intel_gt *gt = &i915->gt; - struct intel_engine_cs *engine; - intel_wakeref_t wakeref; - enum intel_engine_id id; - - seq_printf(m, "Reset flags: %lx\n", gt->reset.flags); - if (test_bit(I915_WEDGED, >->reset.flags)) - seq_puts(m, "\tWedged\n"); - if (test_bit(I915_RESET_BACKOFF, >->reset.flags)) - seq_puts(m, "\tDevice (global) reset in progress\n"); - - if (!i915_modparams.enable_hangcheck) { - seq_puts(m, "Hangcheck disabled\n"); - return 0; - } - - if (timer_pending(>->hangcheck.work.timer)) - seq_printf(m, "Hangcheck active, timer fires in %dms\n", - jiffies_to_msecs(gt->hangcheck.work.timer.expires - - jiffies)); - else if (delayed_work_pending(>->hangcheck.work)) - seq_puts(m, "Hangcheck active, work pending\n"); - else - seq_puts(m, "Hangcheck inactive\n"); - - seq_printf(m, "GT active? %s\n", yesno(gt->awake)); - - with_intel_runtime_pm(&i915->runtime_pm, wakeref) { - for_each_engine(engine, i915, id) { - struct intel_instdone instdone; - - seq_printf(m, "%s: %d ms ago\n", - engine->name, - jiffies_to_msecs(jiffies - - engine->hangcheck.action_timestamp)); - - seq_printf(m, "\tACTHD = 0x%08llx [current 0x%08llx]\n", - (long long)engine->hangcheck.acthd, - intel_engine_get_active_head(engine)); - - intel_engine_get_instdone(engine, &instdone); - - seq_puts(m, "\tinstdone read =\n"); - i915_instdone_info(i915, m, &instdone); - - seq_puts(m, "\tinstdone accu =\n"); - i915_instdone_info(i915, m, - &engine->hangcheck.instdone); - } - } - - return 0; -} - static int ironlake_drpc_info(struct seq_file *m) { struct drm_i915_private *i915 = node_to_i915(m->private); @@ -4387,7 +4302,6 @@ static const struct drm_info_list i915_debugfs_list[] = { {"i915_guc_stage_pool", i915_guc_stage_pool, 0}, {"i915_huc_load_status", i915_huc_load_status_info, 0}, {"i915_frequency_info", i915_frequency_info, 0}, - {"i915_hangcheck_info", i915_hangcheck_info, 0}, {"i915_drpc_info", i915_drpc_info, 0}, {"i915_emon_status", i915_emon_status, 0}, {"i915_ring_freq_table", i915_ring_freq_table, 0}, diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index f2d3d754af37..76993eba3f89 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -411,8 +411,7 @@ static int i915_getparam_ioctl(struct drm_device *dev, void *data, return -ENODEV; break; case I915_PARAM_HAS_GPU_RESET: - value = i915_modparams.enable_hangcheck && - intel_has_gpu_reset(dev_priv); + value = intel_has_gpu_reset(dev_priv); if (value && intel_has_reset_engine(dev_priv)) value = 2; break; diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 59d4a1146039..0e715b1266f0 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -2375,7 +2375,6 @@ extern const struct dev_pm_ops i915_pm_ops; int i915_driver_probe(struct pci_dev *pdev, const struct pci_device_id *ent); void i915_driver_remove(struct drm_device *dev); -void intel_engine_init_hangcheck(struct intel_engine_cs *engine); int vlv_force_gfx_clock(struct drm_i915_private *dev_priv, bool on); static inline bool intel_gvt_active(struct drm_i915_private *dev_priv) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 674d341a23f6..eff10ca5a788 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -554,10 +554,6 @@ static void error_print_engine(struct drm_i915_error_state_buf *m, } err_printf(m, " ring->head: 0x%08x\n", ee->cpu_ring_head); err_printf(m, " ring->tail: 0x%08x\n", ee->cpu_ring_tail); - err_printf(m, " hangcheck timestamp: %dms (%lu%s)\n", - jiffies_to_msecs(ee->hangcheck_timestamp - epoch), - ee->hangcheck_timestamp, - ee->hangcheck_timestamp == epoch ? "; epoch" : ""); err_printf(m, " engine reset count: %u\n", ee->reset_count); for (n = 0; n < ee->num_ports; n++) { @@ -695,11 +691,8 @@ static void __err_print_to_sgl(struct drm_i915_error_state_buf *m, ts = ktime_to_timespec64(error->uptime); err_printf(m, "Uptime: %lld s %ld us\n", (s64)ts.tv_sec, ts.tv_nsec / NSEC_PER_USEC); - err_printf(m, "Epoch: %lu jiffies (%u HZ)\n", error->epoch, HZ); - err_printf(m, "Capture: %lu jiffies; %d ms ago, %d ms after epoch\n", - error->capture, - jiffies_to_msecs(jiffies - error->capture), - jiffies_to_msecs(error->capture - error->epoch)); + err_printf(m, "Capture: %lu jiffies; %d ms ago\n", + error->capture, jiffies_to_msecs(jiffies - error->capture)); for (i = 0; i < ARRAY_SIZE(error->engine); i++) { if (!error->engine[i].context.pid) @@ -760,7 +753,9 @@ static void __err_print_to_sgl(struct drm_i915_error_state_buf *m, for (i = 0; i < ARRAY_SIZE(error->engine); i++) { if (error->engine[i].engine_id != -1) - error_print_engine(m, &error->engine[i], error->epoch); + error_print_engine(m, + &error->engine[i], + error->capture); } for (i = 0; i < ARRAY_SIZE(error->engine); i++) { @@ -790,7 +785,7 @@ static void __err_print_to_sgl(struct drm_i915_error_state_buf *m, for (j = 0; j < ee->num_requests; j++) error_print_request(m, " ", &ee->requests[j], - error->epoch); + error->capture); } print_error_obj(m, m->i915->engine[i], @@ -1171,8 +1166,6 @@ static void error_record_engine_registers(struct i915_gpu_state *error, } ee->idle = intel_engine_is_idle(engine); - if (!ee->idle) - ee->hangcheck_timestamp = engine->hangcheck.action_timestamp; ee->reset_count = i915_reset_engine_count(&dev_priv->gpu_error, engine); @@ -1673,22 +1666,6 @@ static void capture_params(struct i915_gpu_state *error) i915_params_copy(&error->params, &i915_modparams); } -static unsigned long capture_find_epoch(const struct i915_gpu_state *error) -{ - unsigned long epoch = error->capture; - int i; - - for (i = 0; i < ARRAY_SIZE(error->engine); i++) { - const struct drm_i915_error_engine *ee = &error->engine[i]; - - if (ee->hangcheck_timestamp && - time_before(ee->hangcheck_timestamp, epoch)) - epoch = ee->hangcheck_timestamp; - } - - return epoch; -} - static void capture_finish(struct i915_gpu_state *error) { struct i915_ggtt *ggtt = &error->i915->ggtt; @@ -1740,8 +1717,6 @@ i915_capture_gpu_state(struct drm_i915_private *i915) error->overlay = intel_overlay_capture_error_state(i915); error->display = intel_display_capture_error_state(i915); - error->epoch = capture_find_epoch(error); - capture_finish(error); compress_fini(&compress); diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h index a24c35107d16..c4ba297ca862 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.h +++ b/drivers/gpu/drm/i915/i915_gpu_error.h @@ -34,7 +34,6 @@ struct i915_gpu_state { ktime_t boottime; ktime_t uptime; unsigned long capture; - unsigned long epoch; struct drm_i915_private *i915; @@ -84,7 +83,6 @@ struct i915_gpu_state { int engine_id; /* Software tracked state */ bool idle; - unsigned long hangcheck_timestamp; int num_requests; u32 reset_count; diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c index 296452f9efe4..e4cd58e90e40 100644 --- a/drivers/gpu/drm/i915/i915_params.c +++ b/drivers/gpu/drm/i915/i915_params.c @@ -77,11 +77,6 @@ i915_param_named(error_capture, bool, 0600, "triaging and debugging hangs."); #endif -i915_param_named_unsafe(enable_hangcheck, bool, 0600, - "Periodically check GPU activity for detecting hangs. " - "WARNING: Disabling this can cause system wide hangs. " - "(default: true)"); - i915_param_named_unsafe(enable_psr, int, 0600, "Enable PSR " "(0=disabled, 1=enabled) " diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h index d29ade3b7de6..9a3359c6f6f3 100644 --- a/drivers/gpu/drm/i915/i915_params.h +++ b/drivers/gpu/drm/i915/i915_params.h @@ -68,7 +68,6 @@ struct drm_printer; param(char *, force_probe, CONFIG_DRM_I915_FORCE_PROBE) \ /* leave bools at the end to not create holes */ \ param(bool, alpha_support, IS_ENABLED(CONFIG_DRM_I915_ALPHA_SUPPORT)) \ - param(bool, enable_hangcheck, true) \ param(bool, prefault_disable, false) \ param(bool, load_detect_test, false) \ param(bool, force_reset_modeset_test, false) \