From patchwork Mon Jan 13 10:44:39 2020
X-Patchwork-Submitter: Chris Wilson
X-Patchwork-Id: 11329927
From: Chris Wilson
To: intel-gfx@lists.freedesktop.org
Date: Mon, 13 Jan 2020 10:44:39 +0000
Message-Id: <20200113104442.1753973-1-chris@chris-wilson.co.uk>
Subject: [Intel-gfx] [PATCH 1/4] drm/i915/gt: Always reset the timeslice after a context switch

Currently, we reset the timer after a preemption event. This has the
side-effect that the timeslice runs into the second context after the
first has completed. To be more fair, we want to reset the clock after
promotion as well.
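In other words, the desired policy is to re-arm the slice whenever the
context executing on the hardware changes, not only on a preemption
event. A standalone sketch of that policy (hypothetical names, not the
driver code):

/*
 * Simplified model of timeslice accounting. With a reset only on
 * preemption, a context promoted after its predecessor completes would
 * run on the unexpired remainder of the old slice.
 */
struct slice_state {
	int active_ctx;		/* context currently on the hardware, 0 = idle */
	unsigned long expires;	/* absolute time at which the slice ends */
};

/* Called once the context-status buffer has been fully drained. */
static void update_slice(struct slice_state *s, int new_ctx,
			 unsigned long now, unsigned long slice)
{
	/* Re-arm on *any* switch of the executing context. */
	if (new_ctx && new_ctx != s->active_ctx)
		s->expires = now + slice;
	s->active_ctx = new_ctx;
}

This is exactly why the patch below moves set_timeslice() out of the
preemption branch and applies it after the CSB has been drained,
whenever a context is active.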
Signed-off-by: Chris Wilson
Cc: Tvrtko Ursulin
Reviewed-by: Tvrtko Ursulin
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 9af1b2b493f4..a6ac37dece0a 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2285,7 +2285,6 @@ static void process_csb(struct intel_engine_cs *engine)
 			/* Point active to the new ELSP; prevent overwriting */
 			WRITE_ONCE(execlists->active, execlists->pending);

-			set_timeslice(engine);

 			if (!inject_preempt_hang(execlists))
 				ring_set_paused(engine, 0);
@@ -2325,6 +2324,9 @@
 		}
 	} while (head != tail);

+	if (execlists_active(execlists))
+		set_timeslice(engine);
+
 	execlists->csb_head = head;

 	/*

From patchwork Mon Jan 13 10:44:40 2020
X-Patchwork-Submitter: Chris Wilson
X-Patchwork-Id: 11329923
From: Chris Wilson
To: intel-gfx@lists.freedesktop.org
Date: Mon, 13 Jan 2020 10:44:40 +0000
Message-Id: <20200113104442.1753973-2-chris@chris-wilson.co.uk>
In-Reply-To: <20200113104442.1753973-1-chris@chris-wilson.co.uk>
Subject: [Intel-gfx] [PATCH 2/4] drm/i915: Use common priotree lists for virtual engine

Since commit 422d7df4f090 ("drm/i915: Replace engine->timeline with a
plain list"), we have used the default embedded priotree slot for the
virtual engine request queue, which means we can also use the same
solitary slot with the scheduler.
However, the priolist is expected to be guarded by the
engine->active.lock, which is not true for the virtual engine.

References: 422d7df4f090 ("drm/i915: Replace engine->timeline with a plain list")
Signed-off-by: Chris Wilson
Cc: Mika Kuoppala
Cc: Tvrtko Ursulin
---
 drivers/gpu/drm/i915/gt/intel_lrc.c   |  3 +++
 drivers/gpu/drm/i915/i915_request.c   |  4 +++-
 drivers/gpu/drm/i915/i915_request.h   | 16 ++++++++++++++++
 drivers/gpu/drm/i915/i915_scheduler.c |  3 +--
 4 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index a6ac37dece0a..685659f079a2 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -985,6 +985,8 @@ __unwind_incomplete_requests(struct intel_engine_cs *engine)
 			GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));

 			list_move(&rq->sched.link, pl);
+			set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+
 			active = rq;
 		} else {
 			struct intel_engine_cs *owner = rq->context->engine;
@@ -2473,6 +2475,7 @@ static void execlists_submit_request(struct i915_request *request)
 	spin_lock_irqsave(&engine->active.lock, flags);

 	queue_request(engine, &request->sched, rq_prio(request));
+	set_bit(I915_FENCE_FLAG_PQUEUE, &request->fence.flags);

 	GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
 	GEM_BUG_ON(list_empty(&request->sched.link));
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index be185886e4fc..9ed0d3bc7249 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -408,8 +408,10 @@ bool __i915_request_submit(struct i915_request *request)
 xfer:	/* We may be recursing from the signal callback of another i915 fence */
 	spin_lock_nested(&request->lock, SINGLE_DEPTH_NESTING);

-	if (!test_and_set_bit(I915_FENCE_FLAG_ACTIVE, &request->fence.flags))
+	if (!test_and_set_bit(I915_FENCE_FLAG_ACTIVE, &request->fence.flags)) {
 		list_move_tail(&request->sched.link, &engine->active.requests);
+		clear_bit(I915_FENCE_FLAG_PQUEUE, &request->fence.flags);
+	}

 	if (test_bit(DMA_FENCE_FLAG_ENABLE_SIGNAL_BIT, &request->fence.flags) &&
 	    !test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &request->fence.flags) &&
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index 031433691a06..f3e50ec989b8 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -70,6 +70,17 @@ enum {
 	 */
 	I915_FENCE_FLAG_ACTIVE = DMA_FENCE_FLAG_USER_BITS,

+	/*
+	 * I915_FENCE_FLAG_PQUEUE - this request is ready for execution
+	 *
+	 * Using the scheduler, when a request is ready for execution it is
+	 * put into the priority queue. We want to track its membership within
+	 * that queue so that we can easily check before rescheduling.
+	 *
+	 * See i915_request_in_priority_queue()
+	 */
+	I915_FENCE_FLAG_PQUEUE,
+
 	/*
 	 * I915_FENCE_FLAG_SIGNAL - this request is currently on signal_list
 	 *
@@ -361,6 +372,11 @@ static inline bool i915_request_is_active(const struct i915_request *rq)
 	return test_bit(I915_FENCE_FLAG_ACTIVE, &rq->fence.flags);
 }

+static inline bool i915_request_in_priority_queue(const struct i915_request *rq)
+{
+	return test_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+}
+
 /**
  * Returns true if seq1 is later than seq2.
 */
diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
index bf87c70bfdd9..4f6e4d6c590a 100644
--- a/drivers/gpu/drm/i915/i915_scheduler.c
+++ b/drivers/gpu/drm/i915/i915_scheduler.c
@@ -338,8 +338,7 @@ static void __i915_schedule(struct i915_sched_node *node,
 			continue;
 		}

-		if (!intel_engine_is_virtual(engine) &&
-		    !i915_request_is_active(node_to_request(node))) {
+		if (i915_request_in_priority_queue(node_to_request(node))) {
 			if (!cache.priolist)
 				cache.priolist =
 					i915_sched_lookup_priolist(engine,

From patchwork Mon Jan 13 10:44:41 2020
X-Patchwork-Submitter: Chris Wilson
X-Patchwork-Id: 11329929
From: Chris Wilson
To: intel-gfx@lists.freedesktop.org
Date: Mon, 13 Jan 2020 10:44:41 +0000
Message-Id: <20200113104442.1753973-3-chris@chris-wilson.co.uk>
In-Reply-To: <20200113104442.1753973-1-chris@chris-wilson.co.uk>
Subject: [Intel-gfx] [PATCH 3/4] drm/i915/gt: Allow temporary suspension of inflight requests

In order to support out-of-line error capture, we need to remove the
active request from HW and put it to one side while a worker compresses
and stores all the details associated with that request. (As that
compression may take an arbitrary user-controlled amount of time, we
want to let the engine continue running on other workloads while the
hanging request is dumped.) Not only do we need to remove the active
request, but we also have to remove its context and all requests that
were dependent on it (whether in flight, queued or future submissions).
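Conceptually, that suspension is a depth-first walk over the request's
waiters, holding every dependent that would execute on the same engine.
A minimal standalone sketch of the idea (hypothetical names and flags;
the real code moves requests between lists under engine->active.lock):

/* Simplified model of hold propagation -- not i915 code. */
struct sketch_req {
	int engine_id;			/* engine this request runs on */
	int on_hold;			/* set while suspended for capture */
	struct sketch_req **waiters;	/* requests that depend on this one */
	int nr_waiters;
};

static void hold_subtree(struct sketch_req *rq, int engine_id)
{
	int i;

	if (rq->on_hold)
		return;
	rq->on_hold = 1;	/* i.e. move it onto the engine's hold list */

	/*
	 * Every dependent on the same engine must be held as well, or it
	 * could be submitted and jump past the suspended request. Waiters
	 * on other engines are left alone: they simply keep waiting (e.g.
	 * spinning on a semaphore) until the hold is released.
	 */
	for (i = 0; i < rq->nr_waiters; i++)
		if (rq->waiters[i]->engine_id == engine_id)
			hold_subtree(rq->waiters[i], engine_id);
}

Releasing the hold is the mirror-image walk, returning each request to
the priority queue, as __execlists_unhold() below does.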
Finally, once the capture is complete, we need to be able to resubmit
the request and its dependents and allow them to execute.

References: https://gitlab.freedesktop.org/drm/intel/issues/738
Signed-off-by: Chris Wilson
Cc: Tvrtko Ursulin
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c    |   1 +
 drivers/gpu/drm/i915/gt/intel_engine_types.h |   1 +
 drivers/gpu/drm/i915/gt/intel_lrc.c          | 109 ++++++++++++++++++-
 drivers/gpu/drm/i915/gt/selftest_lrc.c       | 103 ++++++++++++++++++
 drivers/gpu/drm/i915/i915_request.h          |  22 ++++
 5 files changed, 231 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index f451ef376548..c296aaf381e7 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -671,6 +671,7 @@ void intel_engine_init_active(struct intel_engine_cs *engine,
			       unsigned int subclass)
 {
 	INIT_LIST_HEAD(&engine->active.requests);
+	INIT_LIST_HEAD(&engine->active.hold);

 	spin_lock_init(&engine->active.lock);
 	lockdep_set_subclass(&engine->active.lock, subclass);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 00287515e7af..6f4f9912a364 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -295,6 +295,7 @@ struct intel_engine_cs {
 	struct {
 		spinlock_t lock;
 		struct list_head requests;
+		struct list_head hold;
 	} active;

 	struct llist_head barrier_tasks;
diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index 685659f079a2..e0347e05e577 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2355,6 +2355,77 @@ static void __execlists_submission_tasklet(struct intel_engine_cs *const engine)
 	}
 }

+static void __execlists_hold(struct i915_request *rq)
+{
+	struct i915_dependency *p;
+
+	if (!i915_request_completed(rq)) {
+		RQ_TRACE(rq, "on hold\n");
+		if (i915_request_is_active(rq))
+			__i915_request_unsubmit(rq);
+		clear_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+		list_move_tail(&rq->sched.link, &rq->engine->active.hold);
+		i915_request_set_hold(rq);
+	}
+
+	list_for_each_entry(p, &rq->sched.waiters_list, wait_link) {
+		struct i915_request *w =
+			container_of(p->waiter, typeof(*w), sched);
+
+		/* Leave semaphores spinning on the other engines */
+		if (w->engine != rq->engine)
+			continue;
+
+		__execlists_hold(w);
+	}
+}
+
+__maybe_unused
+static void execlists_hold(struct intel_engine_cs *engine,
+			   struct i915_request *rq)
+{
+	spin_lock_irq(&engine->active.lock);
+	__execlists_hold(rq);
+	spin_unlock_irq(&engine->active.lock);
+}
+
+static void __execlists_unhold(struct i915_request *rq)
+{
+	struct i915_dependency *p;
+
+	if (!i915_request_has_hold(rq))
+		return;
+
+	i915_request_clear_hold(rq);
+	list_move_tail(&rq->sched.link,
+		       i915_sched_lookup_priolist(rq->engine, rq_prio(rq)));
+	set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
+	RQ_TRACE(rq, "hold release\n");
+
+	list_for_each_entry(p, &rq->sched.waiters_list, wait_link) {
+		struct i915_request *w =
+			container_of(p->waiter, typeof(*w), sched);
+
+		if (w->engine != rq->engine)
+			continue;
+
+		__execlists_unhold(w);
+	}
+}
+
+__maybe_unused
+static void execlists_unhold(struct intel_engine_cs *engine,
+			     struct i915_request *rq)
+{
+	spin_lock_irq(&engine->active.lock);
+	__execlists_unhold(rq);
+	if (rq_prio(rq) > engine->execlists.queue_priority_hint) {
+		engine->execlists.queue_priority_hint = rq_prio(rq);
+		tasklet_hi_schedule(&engine->execlists.tasklet);
+	}
+	spin_unlock_irq(&engine->active.lock);
+}
+
 static noinline void preempt_reset(struct intel_engine_cs *engine)
 {
 	const unsigned int bit = I915_RESET_ENGINE + engine->id;
@@ -2466,6 +2537,29 @@ static void submit_queue(struct intel_engine_cs *engine,
 	__submit_queue_imm(engine);
 }

+static bool hold_request(const struct i915_request *rq)
+{
+	struct i915_dependency *p;
+
+	list_for_each_entry(p, &rq->sched.signalers_list, signal_link) {
+		const struct i915_request *s =
+			container_of(p->signaler, typeof(*s), sched);
+
+		if (s->engine != rq->engine)
+			continue;
+
+		return i915_request_has_hold(s);
+	}
+
+	return false;
+}
+
+static bool on_hold(const struct intel_engine_cs *engine,
+		    const struct i915_request *rq)
+{
+	return !list_empty(&engine->active.hold) && hold_request(rq);
+}
+
 static void execlists_submit_request(struct i915_request *request)
 {
 	struct intel_engine_cs *engine = request->engine;
@@ -2474,13 +2568,18 @@ static void execlists_submit_request(struct i915_request *request)
 	/* Will be called from irq-context when using foreign fences. */
 	spin_lock_irqsave(&engine->active.lock, flags);

-	queue_request(engine, &request->sched, rq_prio(request));
-	set_bit(I915_FENCE_FLAG_PQUEUE, &request->fence.flags);
+	if (unlikely(on_hold(engine, request))) {
+		i915_request_set_hold(request);
+		list_add_tail(&request->sched.link, &engine->active.hold);
+	} else {
+		queue_request(engine, &request->sched, rq_prio(request));
+		set_bit(I915_FENCE_FLAG_PQUEUE, &request->fence.flags);

-	GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
-	GEM_BUG_ON(list_empty(&request->sched.link));
+		GEM_BUG_ON(RB_EMPTY_ROOT(&engine->execlists.queue.rb_root));
+		GEM_BUG_ON(list_empty(&request->sched.link));

-	submit_queue(engine, request);
+		submit_queue(engine, request);
+	}

 	spin_unlock_irqrestore(&engine->active.lock, flags);
 }
diff --git a/drivers/gpu/drm/i915/gt/selftest_lrc.c b/drivers/gpu/drm/i915/gt/selftest_lrc.c
index 15cda024e3e4..78501d79c0ea 100644
--- a/drivers/gpu/drm/i915/gt/selftest_lrc.c
+++ b/drivers/gpu/drm/i915/gt/selftest_lrc.c
@@ -285,6 +285,108 @@ static int live_unlite_preempt(void *arg)
 	return live_unlite_restore(arg, I915_USER_PRIORITY(I915_PRIORITY_MAX));
 }

+static int live_hold_reset(void *arg)
+{
+	struct intel_gt *gt = arg;
+	struct intel_engine_cs *engine;
+	enum intel_engine_id id;
+	struct igt_spinner spin;
+	int err = 0;
+
+	/*
+	 * In order to support offline error capture for fast preempt reset,
+	 * we need to decouple the guilty request and ensure that it and its
+	 * descendants are not executed while the capture is in progress.
+	 */
+
+	if (!intel_has_reset_engine(gt))
+		return 0;
+
+	if (igt_spinner_init(&spin, gt))
+		return -ENOMEM;
+
+	for_each_engine(engine, gt, id) {
+		struct intel_context *ce;
+		unsigned long heartbeat;
+		struct i915_request *rq;
+
+		ce = intel_context_create(engine);
+		if (IS_ERR(ce)) {
+			err = PTR_ERR(ce);
+			break;
+		}
+
+		engine_heartbeat_disable(engine, &heartbeat);
+
+		rq = igt_spinner_create_request(&spin, ce, MI_ARB_CHECK);
+		if (IS_ERR(rq)) {
+			err = PTR_ERR(rq);
+			goto out;
+		}
+		i915_request_add(rq);
+
+		if (!igt_wait_for_spinner(&spin, rq)) {
+			intel_gt_set_wedged(gt);
+			err = -ETIME;
+			goto out;
+		}
+
+		/* We have our request executing, now remove it and reset */
+
+		if (test_and_set_bit(I915_RESET_ENGINE + id,
+				     &gt->reset.flags)) {
+			spin_unlock_irq(&engine->active.lock);
+			intel_gt_set_wedged(gt);
+			err = -EBUSY;
+			goto out;
+		}
+		tasklet_disable(&engine->execlists.tasklet);
+
+		engine->execlists.tasklet.func(engine->execlists.tasklet.data);
+		GEM_BUG_ON(execlists_active(&engine->execlists) != rq);
+
+		execlists_hold(engine, rq);
+		GEM_BUG_ON(!i915_request_has_hold(rq));
+
+		intel_engine_reset(engine, NULL);
+		GEM_BUG_ON(rq->fence.error != -EIO);
+
+		tasklet_enable(&engine->execlists.tasklet);
+		clear_and_wake_up_bit(I915_RESET_ENGINE + id,
+				      &gt->reset.flags);
+
+		/* Check that we do not resubmit the held request */
+		i915_request_get(rq);
+		if (!i915_request_wait(rq, 0, HZ / 5)) {
+			pr_err("%s: on hold request completed!\n",
+			       engine->name);
+			i915_request_put(rq);
+			err = -EIO;
+			goto out;
+		}
+		GEM_BUG_ON(!i915_request_has_hold(rq));
+
+		/* But is resubmitted on release */
+		execlists_unhold(engine, rq);
+		if (i915_request_wait(rq, 0, HZ / 5) < 0) {
+			pr_err("%s: held request did not complete!\n",
+			       engine->name);
+			intel_gt_set_wedged(gt);
+			err = -ETIME;
+		}
+		i915_request_put(rq);
+
+out:
+		engine_heartbeat_enable(engine, heartbeat);
+		intel_context_put(ce);
+		if (err)
+			break;
+	}
+
+	igt_spinner_fini(&spin);
+	return err;
+}
+
 static int emit_semaphore_chain(struct i915_request *rq,
 				struct i915_vma *vma, int idx)
 {
@@ -3315,6 +3417,7 @@ int intel_execlists_live_selftests(struct drm_i915_private *i915)
 		SUBTEST(live_sanitycheck),
 		SUBTEST(live_unlite_switch),
 		SUBTEST(live_unlite_preempt),
+		SUBTEST(live_hold_reset),
 		SUBTEST(live_timeslice_preempt),
 		SUBTEST(live_timeslice_queue),
 		SUBTEST(live_busywait_preempt),
diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h
index f3e50ec989b8..2da8cfb3f3a1 100644
--- a/drivers/gpu/drm/i915/i915_request.h
+++ b/drivers/gpu/drm/i915/i915_request.h
@@ -89,6 +89,13 @@ enum {
 	 */
 	I915_FENCE_FLAG_SIGNAL,

+	/*
+	 * I915_FENCE_FLAG_HOLD - this request is currently on hold
+	 *
+	 * This request has been suspended, pending an ongoing investigation.
+	 */
+	I915_FENCE_FLAG_HOLD,
+
 	/*
 	 * I915_FENCE_FLAG_NOPREEMPT - this request should not be preempted
 	 *
@@ -499,6 +506,21 @@ static inline bool i915_request_has_sentinel(const struct i915_request *rq)
 	return unlikely(test_bit(I915_FENCE_FLAG_SENTINEL, &rq->fence.flags));
 }

+static inline bool i915_request_has_hold(const struct i915_request *rq)
+{
+	return unlikely(test_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags));
+}
+
+static inline void i915_request_set_hold(struct i915_request *rq)
+{
+	set_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
+}
+
+static inline void i915_request_clear_hold(struct i915_request *rq)
+{
+	clear_bit(I915_FENCE_FLAG_HOLD, &rq->fence.flags);
+}
+
 static inline struct intel_timeline *
 i915_request_timeline(struct i915_request *rq)
 {

From patchwork Mon Jan 13 10:44:42 2020
X-Patchwork-Submitter: Chris Wilson
X-Patchwork-Id: 11329925
From: Chris Wilson
To: intel-gfx@lists.freedesktop.org
Date: Mon, 13 Jan 2020 10:44:42 +0000
Message-Id: <20200113104442.1753973-4-chris@chris-wilson.co.uk>
In-Reply-To: <20200113104442.1753973-1-chris@chris-wilson.co.uk>
Subject: [Intel-gfx] [PATCH 4/4] drm/i915/execlists: Offline error capture

Currently, we skip error capture upon forced preemption. We apply
forced preemption when there is a higher priority request that should
be running but is being blocked, and we skip inline error capture so
that the preemption request is not further delayed by a user-controlled
capture -- extending the denial of service. However, preemption reset
is also used for heartbeats and regular GPU hangs.
By skipping the error capture, we lose the ability to debug those GPU
hangs.

In order to capture the error without delaying the preemption request
further, we can do an out-of-line capture by removing the guilty
request from the execution queue and scheduling a worker to dump that
request. When removing a request, we need to remove the entire context
and all descendants from the execution queue, so that they do not jump
past it.

Closes: https://gitlab.freedesktop.org/drm/intel/issues/738
Signed-off-by: Chris Wilson
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 99 ++++++++++++++++++++++++++++-
 1 file changed, 97 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index e0347e05e577..ad474b4bc2f3 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2380,7 +2380,6 @@ static void __execlists_hold(struct i915_request *rq)
 	}
 }

-__maybe_unused
 static void execlists_hold(struct intel_engine_cs *engine,
 			   struct i915_request *rq)
 {
@@ -2413,7 +2412,12 @@ static void __execlists_unhold(struct i915_request *rq)
 	}
 }

-__maybe_unused
+struct execlists_capture {
+	struct work_struct work;
+	struct i915_request *rq;
+	struct i915_gpu_coredump *error;
+};
+
 static void execlists_unhold(struct intel_engine_cs *engine,
 			     struct i915_request *rq)
 {
@@ -2426,6 +2430,96 @@ static void execlists_unhold(struct intel_engine_cs *engine,
 	spin_unlock_irq(&engine->active.lock);
 }

+static void execlists_capture_work(struct work_struct *work)
+{
+	struct execlists_capture *cap = container_of(work, typeof(*cap), work);
+	const gfp_t gfp = GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN;
+	struct intel_engine_cs *engine = cap->rq->engine;
+	struct intel_gt_coredump *gt = cap->error->gt;
+	struct intel_engine_capture_vma *vma;
+
+	vma = intel_engine_coredump_add_request(gt->engine, cap->rq, gfp);
+	if (vma) {
+		struct i915_vma_compress *comp = i915_vma_capture_prepare(gt);
+
+		intel_engine_coredump_add_vma(gt->engine, vma, comp);
+		i915_vma_capture_finish(gt, comp);
+	}
+
+	gt->simulated = gt->engine->simulated;
+	cap->error->simulated = gt->simulated;
+
+	i915_error_state_store(cap->error);
+	i915_gpu_coredump_put(cap->error);
+
+	execlists_unhold(engine, cap->rq);
+
+	kfree(cap);
+}
+
+static struct i915_gpu_coredump *capture_regs(struct intel_engine_cs *engine)
+{
+	const gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
+	struct i915_gpu_coredump *e;
+
+	e = i915_gpu_coredump_alloc(engine->i915, gfp);
+	if (!e)
+		return NULL;
+
+	e->gt = intel_gt_coredump_alloc(engine->gt, gfp);
+	if (!e->gt)
+		goto err;
+
+	e->gt->engine = intel_engine_coredump_alloc(engine, gfp);
+	if (!e->gt->engine)
+		goto err_gt;
+
+	return e;
+
+err_gt:
+	kfree(e->gt);
+err:
+	kfree(e);
+	return NULL;
+}
+
+static void execlists_capture(struct intel_engine_cs *engine)
+{
+	struct execlists_capture *cap;
+
+	if (!IS_ENABLED(CONFIG_DRM_I915_CAPTURE_ERROR))
+		return;
+
+	cap = kmalloc(sizeof(*cap), GFP_ATOMIC);
+	if (!cap)
+		return;
+
+	cap->rq = execlists_active(&engine->execlists);
+	GEM_BUG_ON(!cap->rq);
+
+	/*
+	 * We need to _quickly_ capture the engine state before we reset.
+	 * We are inside an atomic section (softirq) here and we are delaying
+	 * the forced preemption event.
+	 */
+	cap->error = capture_regs(engine);
+	if (!cap->error) {
+		kfree(cap);
+		return;
+	}
+
+	/*
+	 * Remove the request from the execlists queue, and take ownership
+	 * of the request.
+	 * We pass it to our worker who will _slowly_ compress all the pages
+	 * the _user_ requested for debugging their batch, after which we
+	 * return it to the queue for signaling.
+	 */
+	execlists_hold(engine, cap->rq);
+
+	INIT_WORK(&cap->work, execlists_capture_work);
+	schedule_work(&cap->work);
+}
+
 static noinline void preempt_reset(struct intel_engine_cs *engine)
 {
 	const unsigned int bit = I915_RESET_ENGINE + engine->id;
@@ -2443,6 +2537,7 @@ static noinline void preempt_reset(struct intel_engine_cs *engine)
 	ENGINE_TRACE(engine, "preempt timeout %lu+%ums\n",
 		     READ_ONCE(engine->props.preempt_timeout_ms),
 		     jiffies_to_msecs(jiffies - engine->execlists.preempt.expires));

+	execlists_capture(engine);
 	intel_engine_reset(engine, "preemption time out");

 	tasklet_enable(&engine->execlists.tasklet);
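Taken together, the series implements a classic split between atomic
and sleepable work: snapshot the registers cheaply in the softirq, park
the guilty request and its dependents, and defer the expensive
compression to process context. A kernel-style sketch of that pattern
(the request/snapshot types and the snapshot/compress/hold helpers are
hypothetical stand-ins; only the workqueue and allocator calls are real
API):

#include <linux/workqueue.h>
#include <linux/slab.h>

struct my_request;	/* stand-in for i915_request */
struct my_snapshot;	/* stand-in for i915_gpu_coredump */

/* Hypothetical helpers modelling the capture/hold machinery above. */
extern struct my_snapshot *snapshot_registers_atomic(struct my_request *rq);
extern void compress_and_store(struct my_snapshot *snap);
extern void hold(struct my_request *rq);	/* cf. execlists_hold() */
extern void unhold(struct my_request *rq);	/* cf. execlists_unhold() */

struct capture {
	struct work_struct work;
	struct my_request *rq;
	struct my_snapshot *snap;
};

static void capture_worker(struct work_struct *work)
{
	struct capture *cap = container_of(work, typeof(*cap), work);

	/* The slow, sleepable part runs outside the softirq. */
	compress_and_store(cap->snap);

	/* Only now may the guilty request (and its dependents) run again. */
	unhold(cap->rq);
	kfree(cap);
}

/* Called from the reset path in softirq context: no sleeping allowed. */
static void start_capture(struct my_request *rq)
{
	struct capture *cap = kmalloc(sizeof(*cap), GFP_ATOMIC);

	if (!cap)
		return;		/* capture is best-effort */

	cap->snap = snapshot_registers_atomic(rq);	/* fast, GFP_ATOMIC */
	if (!cap->snap) {
		kfree(cap);
		return;
	}

	cap->rq = rq;
	hold(rq);	/* park rq before the engine reset */

	INIT_WORK(&cap->work, capture_worker);
	schedule_work(&cap->work);
}

The key property, as in execlists_capture() above, is that the softirq
path only performs atomic allocations and a register snapshot; the
arbitrary, user-controlled cost of compressing the captured pages is
paid by the worker, so the engine reset and subsequent execution are
not held hostage to it.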