From patchwork Fri Jan 20 23:28:25 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13110688 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id B746BC05027 for ; Fri, 20 Jan 2023 23:29:24 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id B229610EB2C; Fri, 20 Jan 2023 23:28:53 +0000 (UTC) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by gabe.freedesktop.org (Postfix) with ESMTPS id D7CE910EB20; Fri, 20 Jan 2023 23:28:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674257329; x=1705793329; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=T9AYqIYZKt7e7W+7KvP3P+HNdzseaLMw4tUBhGGSJJI=; b=Ziuc7cSvzTx3vSF+N/k2IGB5iSKBPibXuSu2EVBXRPtnpvhpnQ5f2XBG P0MSx693zRZDLscwenVUG2sqflDCt3gaGwcSA5zGqoiB/9VDQQPW77iBw 8kG1E9q4+7huttzV4vFNW3oq5VOqHqgDzjS4d/7mEwhMsRGzg7JA2HKZA Fa3so6RceLuYlzFyMURIJpQy5ZtfbF3DeiKHZZlpsAiZkm2dmGsVb3Bvb e9rIrf9XrLualYEyRkVIq8EAvo75J1NIOa5C64V0qQNKZ3M42Nh5G40TM eG4z8KWHTbLsfoh+v2mw25g/Dyf7m+CPBZf2XOid/28wEN3PufVVfjlR1 g==; X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="324413560" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="324413560" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Jan 2023 15:28:49 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="693021608" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="693021608" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by orsmga001.jf.intel.com with ESMTP; 20 Jan 2023 15:28:48 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Date: Fri, 20 Jan 2023 15:28:25 -0800 Message-Id: <20230120232831.28177-2-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230120232831.28177-1-John.C.Harrison@Intel.com> References: <20230120232831.28177-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ Subject: [Intel-gfx] [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Andy Shevchenko , Michael Cheng , Alan Previn , intel-gfx@lists.freedesktop.org, Lucas De Marchi , DRI-Devel@Lists.FreeDesktop.Org, Andrzej Hajda , Rodrigo Vivi , Tejas Upadhyay , Matthew Auld Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" From: John Harrison When GuC support was added to error capture, the locking around the request object was broken. Fix it up. The context based search manages the spinlocking around the search internally. So it needs to grab the reference count internally as well. The execlist only request based search relies on external locking, so it needs an external reference count but within the spinlock not outside it. The only other caller of the context based search is the code for dumping engine state to debugfs. That code wasn't previously getting an explicit reference at all as it does everything while holding the execlist specific spinlock. So, that needs updaing as well as that spinlock doesn't help when using GuC submission. Rather than trying to conditionally get/put depending on submission model, just change it to always do the get/put. In addition, intel_guc_find_hung_context() was not acquiring the correct spinlock before searching the request list. So fix that up too. While at it, add some extra whitespace padding for readability. v2: Explicitly document adding an extra blank line in some dense code (Andy Shevchenko). Fix multiple potential null pointer derefs in case of no request found (some spotted by Tvrtko, but there was more!). Also fix a leaked request in case of !started and another in __guc_reset_context now that intel_context_find_active_request is actually reference counting the returned request. v3: Add a _get suffix to intel_context_find_active_request now that it grabs a reference (Daniele). Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC") Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset") Cc: Matthew Brost Cc: John Harrison Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Daniele Ceraolo Spurio Cc: Andrzej Hajda Cc: Matthew Auld Cc: Matt Roper Cc: Umesh Nerlige Ramappa Cc: Michael Cheng Cc: Lucas De Marchi Cc: Tejas Upadhyay Cc: Andy Shevchenko Cc: Aravind Iddamsetty Cc: Alan Previn Cc: Bruce Chang Cc: intel-gfx@lists.freedesktop.org Signed-off-by: John Harrison Reviewed-by: Daniele Ceraolo Spurio --- drivers/gpu/drm/i915/gt/intel_context.c | 4 +++- drivers/gpu/drm/i915/gt/intel_context.h | 3 +-- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 6 +++++- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 14 +++++++++++++- drivers/gpu/drm/i915/i915_gpu_error.c | 13 ++++++------- 5 files changed, 28 insertions(+), 12 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c index e94365b08f1ef..4285c1c71fa12 100644 --- a/drivers/gpu/drm/i915/gt/intel_context.c +++ b/drivers/gpu/drm/i915/gt/intel_context.c @@ -528,7 +528,7 @@ struct i915_request *intel_context_create_request(struct intel_context *ce) return rq; } -struct i915_request *intel_context_find_active_request(struct intel_context *ce) +struct i915_request *intel_context_find_active_request_get(struct intel_context *ce) { struct intel_context *parent = intel_context_to_parent(ce); struct i915_request *rq, *active = NULL; @@ -552,6 +552,8 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce) active = rq; } + if (active) + active = i915_request_get_rcu(active); spin_unlock_irqrestore(&parent->guc_state.lock, flags); return active; diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h index fb62b7b8cbcda..ccc80c6607ca8 100644 --- a/drivers/gpu/drm/i915/gt/intel_context.h +++ b/drivers/gpu/drm/i915/gt/intel_context.h @@ -268,8 +268,7 @@ int intel_context_prepare_remote_request(struct intel_context *ce, struct i915_request *intel_context_create_request(struct intel_context *ce); -struct i915_request * -intel_context_find_active_request(struct intel_context *ce); +struct i915_request *intel_context_find_active_request_get(struct intel_context *ce); static inline bool intel_context_is_barrier(const struct intel_context *ce) { diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 922f1bb22dc68..fbc0a81617e89 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -2237,9 +2237,11 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d if (guc) { ce = intel_engine_get_hung_context(engine); if (ce) - hung_rq = intel_context_find_active_request(ce); + hung_rq = intel_context_find_active_request_get(ce); } else { hung_rq = intel_engine_execlist_find_hung_request(engine); + if (hung_rq) + hung_rq = i915_request_get_rcu(hung_rq); } if (hung_rq) @@ -2250,6 +2252,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d else intel_engine_dump_active_requests(&engine->sched_engine->requests, hung_rq, m); + if (hung_rq) + i915_request_put(hung_rq); } void intel_engine_dump(struct intel_engine_cs *engine, diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index b436dd7f12e42..ad4b2848b0f83 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1702,7 +1702,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st goto next_context; guilty = false; - rq = intel_context_find_active_request(ce); + rq = intel_context_find_active_request_get(ce); if (!rq) { head = ce->ring->tail; goto out_replay; @@ -1715,6 +1715,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st head = intel_ring_wrap(ce->ring, rq->head); __i915_request_reset(rq, guilty); + i915_request_put(rq); out_replay: guc_reset_state(ce, head, guilty); next_context: @@ -4820,6 +4821,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine) xa_lock_irqsave(&guc->context_lookup, flags); xa_for_each(&guc->context_lookup, index, ce) { + bool found; + if (!kref_get_unless_zero(&ce->ref)) continue; @@ -4836,10 +4839,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine) goto next; } + found = false; + spin_lock(&ce->guc_state.lock); list_for_each_entry(rq, &ce->guc_state.requests, sched.link) { if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE) continue; + found = true; + break; + } + spin_unlock(&ce->guc_state.lock); + + if (found) { intel_engine_set_hung_context(engine, ce); /* Can only cope with one hang at a time... */ @@ -4847,6 +4858,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine) xa_lock(&guc->context_lookup); goto done; } + next: intel_context_put(ce); xa_lock(&guc->context_lookup); diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 9d5d5a397b64e..5c73dfa2fb3f6 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1607,7 +1607,7 @@ capture_engine(struct intel_engine_cs *engine, ce = intel_engine_get_hung_context(engine); if (ce) { intel_engine_clear_hung_context(engine); - rq = intel_context_find_active_request(ce); + rq = intel_context_find_active_request_get(ce); if (!rq || !i915_request_started(rq)) goto no_request_capture; } else { @@ -1618,21 +1618,18 @@ capture_engine(struct intel_engine_cs *engine, if (!intel_uc_uses_guc_submission(&engine->gt->uc)) { spin_lock_irqsave(&engine->sched_engine->lock, flags); rq = intel_engine_execlist_find_hung_request(engine); + if (rq) + rq = i915_request_get_rcu(rq); spin_unlock_irqrestore(&engine->sched_engine->lock, flags); } } - if (rq) - rq = i915_request_get_rcu(rq); - if (!rq) goto no_request_capture; capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); - if (!capture) { - i915_request_put(rq); + if (!capture) goto no_request_capture; - } if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) intel_guc_capture_get_matching_node(engine->gt, ee, ce); @@ -1642,6 +1639,8 @@ capture_engine(struct intel_engine_cs *engine, return ee; no_request_capture: + if (rq) + i915_request_put(rq); kfree(ee); return NULL; } From patchwork Fri Jan 20 23:28:26 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13110683 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 5265DC05027 for ; Fri, 20 Jan 2023 23:28:53 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id AB3BB10EB18; Fri, 20 Jan 2023 23:28:51 +0000 (UTC) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by gabe.freedesktop.org (Postfix) with ESMTPS id 0C7C410EB18; Fri, 20 Jan 2023 23:28:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674257330; x=1705793330; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=fOYwcAJI38yRwYDU7TNYSF63Z2R+8sQqnDD9n74UuyQ=; b=j8xMn3XCou12vGoLSyp8VRQaxy5RY+rw1wYKou5OIQjWapYWWEjn0h6a Xbs+7I9tCunCxcItLQT/YzBhWuE6OOlRiVugXXgC9dnS04vpr3kdCdwZ2 qlTuJSMk15HpDzewdLbO1OZBsR9VoW55/c+RALMGc6XXIlrkqU4C+Gslt UhBgeTifj06b+6LCN1g5qKm/kcD4dPZCQbiatT2BAcsy+j+1+xtKyh4j0 ZrLDD0dbZKPGavU0jTEoySoiWxyLmljAs/HfLqxYsB4JEM03hg+La2D3z qVH0Vb+Bg8LHjFd7QnVI1n3+nxBJqEONIwNWLr56hwhECPJAoQFASGkQJ A==; X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="324413563" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="324413563" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Jan 2023 15:28:49 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="693021611" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="693021611" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by orsmga001.jf.intel.com with ESMTP; 20 Jan 2023 15:28:48 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Date: Fri, 20 Jan 2023 15:28:26 -0800 Message-Id: <20230120232831.28177-3-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230120232831.28177-1-John.C.Harrison@Intel.com> References: <20230120232831.28177-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ Subject: [Intel-gfx] [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: DRI-Devel@Lists.FreeDesktop.Org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" From: John Harrison The debugfs dump of requests was confused about what state requires the execlist lock versus the GuC lock. There was also a bunch of duplicated messy code between it and the error capture code. So refactor the hung request search into a re-usable function. And reduce the span of the execlist state lock to only the execlist specific code paths. In order to do that, also move the report of hold count (which is an execlist only concept) from the top level dump function to the lower level execlist specific function. Also, move the execlist specific code into the execlist source file. v2: Rename some functions and move to more appropriate files (Daniele). Signed-off-by: John Harrison Reviewed-by: Daniele Ceraolo Spurio --- drivers/gpu/drm/i915/gt/intel_engine.h | 4 +- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 74 +++++++++---------- .../drm/i915/gt/intel_execlists_submission.c | 27 +++++++ .../drm/i915/gt/intel_execlists_submission.h | 4 + drivers/gpu/drm/i915/i915_gpu_error.c | 26 +------ 5 files changed, 73 insertions(+), 62 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h index 0e24af5efee9c..b58c30ac8ef02 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine.h +++ b/drivers/gpu/drm/i915/gt/intel_engine.h @@ -250,8 +250,8 @@ void intel_engine_dump_active_requests(struct list_head *requests, ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now); -struct i915_request * -intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine); +void intel_engine_get_hung_entity(struct intel_engine_cs *engine, + struct intel_context **ce, struct i915_request **rq); u32 intel_engine_context_size(struct intel_gt *gt, u8 class); struct intel_context * diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index fbc0a81617e89..1d77e27801bce 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -2114,17 +2114,6 @@ static void print_request_ring(struct drm_printer *m, struct i915_request *rq) } } -static unsigned long list_count(struct list_head *list) -{ - struct list_head *pos; - unsigned long count = 0; - - list_for_each(pos, list) - count++; - - return count; -} - static unsigned long read_ul(void *p, size_t x) { return *(unsigned long *)(p + x); @@ -2216,11 +2205,11 @@ void intel_engine_dump_active_requests(struct list_head *requests, } } -static void engine_dump_active_requests(struct intel_engine_cs *engine, struct drm_printer *m) +static void engine_dump_active_requests(struct intel_engine_cs *engine, + struct drm_printer *m) { + struct intel_context *hung_ce = NULL; struct i915_request *hung_rq = NULL; - struct intel_context *ce; - bool guc; /* * No need for an engine->irq_seqno_barrier() before the seqno reads. @@ -2229,29 +2218,20 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d * But the intention here is just to report an instantaneous snapshot * so that's fine. */ - lockdep_assert_held(&engine->sched_engine->lock); + intel_engine_get_hung_entity(engine, &hung_ce, &hung_rq); drm_printf(m, "\tRequests:\n"); - guc = intel_uc_uses_guc_submission(&engine->gt->uc); - if (guc) { - ce = intel_engine_get_hung_context(engine); - if (ce) - hung_rq = intel_context_find_active_request_get(ce); - } else { - hung_rq = intel_engine_execlist_find_hung_request(engine); - if (hung_rq) - hung_rq = i915_request_get_rcu(hung_rq); - } - if (hung_rq) engine_dump_request(hung_rq, m, "\t\thung"); + else if (hung_ce) + drm_printf(m, "\t\tGot hung ce but no hung rq!\n"); - if (guc) + if (intel_uc_uses_guc_submission(&engine->gt->uc)) intel_guc_dump_active_requests(engine, hung_rq, m); else - intel_engine_dump_active_requests(&engine->sched_engine->requests, - hung_rq, m); + intel_execlist_dump_active_requests(engine, hung_rq, m); + if (hung_rq) i915_request_put(hung_rq); } @@ -2263,7 +2243,6 @@ void intel_engine_dump(struct intel_engine_cs *engine, struct i915_gpu_error * const error = &engine->i915->gpu_error; struct i915_request *rq; intel_wakeref_t wakeref; - unsigned long flags; ktime_t dummy; if (header) { @@ -2300,13 +2279,8 @@ void intel_engine_dump(struct intel_engine_cs *engine, i915_reset_count(error)); print_properties(engine, m); - spin_lock_irqsave(&engine->sched_engine->lock, flags); engine_dump_active_requests(engine, m); - drm_printf(m, "\tOn hold?: %lu\n", - list_count(&engine->sched_engine->hold)); - spin_unlock_irqrestore(&engine->sched_engine->lock, flags); - drm_printf(m, "\tMMIO base: 0x%08x\n", engine->mmio_base); wakeref = intel_runtime_pm_get_if_in_use(engine->uncore->rpm); if (wakeref) { @@ -2352,8 +2326,7 @@ intel_engine_create_virtual(struct intel_engine_cs **siblings, return siblings[0]->cops->create_virtual(siblings, count, flags); } -struct i915_request * -intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine) +static struct i915_request *engine_execlist_find_hung_request(struct intel_engine_cs *engine) { struct i915_request *request, *active = NULL; @@ -2405,6 +2378,33 @@ intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine) return active; } +void intel_engine_get_hung_entity(struct intel_engine_cs *engine, + struct intel_context **ce, struct i915_request **rq) +{ + unsigned long flags; + + *ce = intel_engine_get_hung_context(engine); + if (*ce) { + intel_engine_clear_hung_context(engine); + + *rq = intel_context_find_active_request_get(*ce); + return; + } + + /* + * Getting here with GuC enabled means it is a forced error capture + * with no actual hang. So, no need to attempt the execlist search. + */ + if (intel_uc_uses_guc_submission(&engine->gt->uc)) + return; + + spin_lock_irqsave(&engine->sched_engine->lock, flags); + *rq = engine_execlist_find_hung_request(engine); + if (*rq) + *rq = i915_request_get_rcu(*rq); + spin_unlock_irqrestore(&engine->sched_engine->lock, flags); +} + void xehp_enable_ccs_engines(struct intel_engine_cs *engine) { /* diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c index 18ffe55282e59..05995c8577bef 100644 --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c @@ -4150,6 +4150,33 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine, spin_unlock_irqrestore(&sched_engine->lock, flags); } +static unsigned long list_count(struct list_head *list) +{ + struct list_head *pos; + unsigned long count = 0; + + list_for_each(pos, list) + count++; + + return count; +} + +void intel_execlist_dump_active_requests(struct intel_engine_cs *engine, + struct i915_request *hung_rq, + struct drm_printer *m) +{ + unsigned long flags; + + spin_lock_irqsave(&engine->sched_engine->lock, flags); + + intel_engine_dump_active_requests(&engine->sched_engine->requests, hung_rq, m); + + drm_printf(m, "\tOn hold?: %lu\n", + list_count(&engine->sched_engine->hold)); + + spin_unlock_irqrestore(&engine->sched_engine->lock, flags); +} + #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST) #include "selftest_execlists.c" #endif diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h index a1aa92c983a51..cb07488a03764 100644 --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h @@ -32,6 +32,10 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine, int indent), unsigned int max); +void intel_execlist_dump_active_requests(struct intel_engine_cs *engine, + struct i915_request *hung_rq, + struct drm_printer *m); + bool intel_engine_in_execlists_submission_mode(const struct intel_engine_cs *engine); diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 5c73dfa2fb3f6..b20bd6365615b 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1596,35 +1596,15 @@ capture_engine(struct intel_engine_cs *engine, { struct intel_engine_capture_vma *capture = NULL; struct intel_engine_coredump *ee; - struct intel_context *ce; + struct intel_context *ce = NULL; struct i915_request *rq = NULL; - unsigned long flags; ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL, dump_flags); if (!ee) return NULL; - ce = intel_engine_get_hung_context(engine); - if (ce) { - intel_engine_clear_hung_context(engine); - rq = intel_context_find_active_request_get(ce); - if (!rq || !i915_request_started(rq)) - goto no_request_capture; - } else { - /* - * Getting here with GuC enabled means it is a forced error capture - * with no actual hang. So, no need to attempt the execlist search. - */ - if (!intel_uc_uses_guc_submission(&engine->gt->uc)) { - spin_lock_irqsave(&engine->sched_engine->lock, flags); - rq = intel_engine_execlist_find_hung_request(engine); - if (rq) - rq = i915_request_get_rcu(rq); - spin_unlock_irqrestore(&engine->sched_engine->lock, - flags); - } - } - if (!rq) + intel_engine_get_hung_entity(engine, &ce, &rq); + if (!rq || !i915_request_started(rq)) goto no_request_capture; capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); From patchwork Fri Jan 20 23:28:27 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13110687 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 48888C25B50 for ; Fri, 20 Jan 2023 23:29:14 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 86D4110EB29; Fri, 20 Jan 2023 23:28:53 +0000 (UTC) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by gabe.freedesktop.org (Postfix) with ESMTPS id 2C6D210EB20; Fri, 20 Jan 2023 23:28:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674257330; x=1705793330; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=qaT2rgcAcf9E5p8ozoegNOTvPs0ayIedab+ccSu9xZc=; b=nuux2WFKQP8FUvnAatpxFkOXdktlr2e/uCxt9FHJr/W+Q/wufvB2k1s8 IbX+Gdea1mb2vsGcLFirkx902sXsxx0ImHN269kkul1NeoUch3tbfQ1lh Ypl6VSQH4LF5nqwmf3lZafVrO0IcSU/f7x/2sgGEdU0xuAZtPhwu0r2Qh 4o2AojPQFvKxVknozDpWk7vEgTkvNeUljmNwTV68CAkqgQnF8qSmVOtHx RmU4A/m/AzyVXwyNDmCUF3UnRPeCgGJoegipF/5ptEFZD5DybG5UIZK8u uTBdfqNPDSMxW4+oryQTep1FOWwSY5T6zTaCh6I6ZpBjDkq2xtJyYCdtP A==; X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="324413564" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="324413564" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Jan 2023 15:28:49 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="693021614" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="693021614" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by orsmga001.jf.intel.com with ESMTP; 20 Jan 2023 15:28:48 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Date: Fri, 20 Jan 2023 15:28:27 -0800 Message-Id: <20230120232831.28177-4-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230120232831.28177-1-John.C.Harrison@Intel.com> References: <20230120232831.28177-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ Subject: [Intel-gfx] [PATCH v4 3/7] drm/i915: Allow error capture without a request X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: DRI-Devel@Lists.FreeDesktop.Org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" From: John Harrison There was a report of error captures occurring without any hung context being indicated despite the capture being initiated by a 'hung context notification' from GuC. The problem was not reproducible. However, it is possible to happen if the context in question has no active requests. For example, if the hang was in the context switch itself then the breadcrumb write would have occurred and the KMD would see an idle context. In the interests of attempting to provide as much information as possible about a hang, it seems wise to include the engine info regardless of whether a request was found or not. As opposed to just prentending there was no hang at all. So update the error capture code to always record engine information if a context is given. Which means updating record_context() to take a context instead of a request (which it only ever used to find the context anyway). And split the request agnostic parts of intel_engine_coredump_add_request() out into a seaprate function. v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null pointer. v3: Tidy up request locking code flow (Tvrtko) v4: Pull in improved info message from next patch and fix up potential leak of GuC register state (Daniele) Signed-off-by: John Harrison Reviewed-by: Umesh Nerlige Ramappa Acked-by: Tvrtko Ursulin Reviewed-by: Daniele Ceraolo Spurio --- drivers/gpu/drm/i915/i915_gpu_error.c | 74 ++++++++++++++++++--------- 1 file changed, 50 insertions(+), 24 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index b20bd6365615b..225f1b11a6b93 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee) } static bool record_context(struct i915_gem_context_coredump *e, - const struct i915_request *rq) + struct intel_context *ce) { struct i915_gem_context *ctx; struct task_struct *task; bool simulated; rcu_read_lock(); - ctx = rcu_dereference(rq->context->gem_context); + ctx = rcu_dereference(ce->gem_context); if (ctx && !kref_get_unless_zero(&ctx->ref)) ctx = NULL; rcu_read_unlock(); @@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e, e->guilty = atomic_read(&ctx->guilty_count); e->active = atomic_read(&ctx->active_count); - e->total_runtime = intel_context_get_total_runtime_ns(rq->context); - e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context); + e->total_runtime = intel_context_get_total_runtime_ns(ce); + e->avg_runtime = intel_context_get_avg_runtime_ns(ce); simulated = i915_gem_context_no_error_capture(ctx); @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_ return ee; } +static struct intel_engine_capture_vma * +engine_coredump_add_context(struct intel_engine_coredump *ee, + struct intel_context *ce, + gfp_t gfp) +{ + struct intel_engine_capture_vma *vma = NULL; + + ee->simulated |= record_context(&ee->context, ce); + if (ee->simulated) + return NULL; + + /* + * We need to copy these to an anonymous buffer + * as the simplest method to avoid being overwritten + * by userspace. + */ + vma = capture_vma(vma, ce->ring->vma, "ring", gfp); + vma = capture_vma(vma, ce->state, "HW context", gfp); + + return vma; +} + struct intel_engine_capture_vma * intel_engine_coredump_add_request(struct intel_engine_coredump *ee, struct i915_request *rq, gfp_t gfp) { - struct intel_engine_capture_vma *vma = NULL; + struct intel_engine_capture_vma *vma; - ee->simulated |= record_context(&ee->context, rq); - if (ee->simulated) + vma = engine_coredump_add_context(ee, rq->context, gfp); + if (!vma) return NULL; /* @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee, */ vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch"); vma = capture_user(vma, rq, gfp); - vma = capture_vma(vma, rq->ring->vma, "ring", gfp); - vma = capture_vma(vma, rq->context->state, "HW context", gfp); ee->rq_head = rq->head; ee->rq_post = rq->postfix; @@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine, return NULL; intel_engine_get_hung_entity(engine, &ce, &rq); - if (!rq || !i915_request_started(rq)) - goto no_request_capture; + if (rq && !i915_request_started(rq)) { + drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n", + engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id); + i915_request_put(rq); + rq = NULL; + } - capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); - if (!capture) - goto no_request_capture; - if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) - intel_guc_capture_get_matching_node(engine->gt, ee, ce); + if (rq) { + capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); + i915_request_put(rq); + } else if (ce) { + capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL); + } - intel_engine_coredump_add_vma(ee, capture, compress); - i915_request_put(rq); + if (capture) { + intel_engine_coredump_add_vma(ee, capture, compress); - return ee; + if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) + intel_guc_capture_get_matching_node(engine->gt, ee, ce); + } else { + kfree(ee); + ee = NULL; + } -no_request_capture: - if (rq) - i915_request_put(rq); - kfree(ee); - return NULL; + return ee; } static void From patchwork Fri Jan 20 23:28:28 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13110685 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id E232AC25B50 for ; Fri, 20 Jan 2023 23:29:01 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id B947710EB24; Fri, 20 Jan 2023 23:28:52 +0000 (UTC) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by gabe.freedesktop.org (Postfix) with ESMTPS id 55E8710EB18; Fri, 20 Jan 2023 23:28:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674257330; x=1705793330; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=NVEvf8IZip4KrKwtRVXqza9Zc/m5o6AsWQTg/2GrPHw=; b=jggMEiB8Pyz/zyjBfo0tA/DCyLLB2/BkzD4ZiljLqZ25LcMYA1bKWN66 k3e/su6098qIBkJoRoueAjhVHLovbCvpt+NRDMr/h9yP8wKrv4uzQWef5 ZMeX2K3efXhQsKEZfN8VnZmBMGX+R9lYXXcZSJzs51WsYvbEhMQAf0oss R16D01k0KTmHNhYy7i/8IOavz1E8uAvDLKU+l+NkbzLqJmKrCIjB1UrLA gOpfewbwtuuApDFrcSXaFZ0xXixfgtZTlL8rrCmk+pkxxBT1hAWurtpFH VPv+erqqNIaMJp+igDMnTgMTLFe9lAljiczHqrXrhOydV2/xaLkuNtiDu A==; X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="324413565" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="324413565" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Jan 2023 15:28:50 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="693021618" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="693021618" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by orsmga001.jf.intel.com with ESMTP; 20 Jan 2023 15:28:48 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Date: Fri, 20 Jan 2023 15:28:28 -0800 Message-Id: <20230120232831.28177-5-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230120232831.28177-1-John.C.Harrison@Intel.com> References: <20230120232831.28177-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ Subject: [Intel-gfx] [PATCH v4 4/7] drm/i915: Allow error capture of a pending request X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: DRI-Devel@Lists.FreeDesktop.Org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" From: John Harrison A hang situation has been observed where the only requests on the context were either completed or not yet started according to the breaadcrumbs. However, the register state claimed a batch was (maybe) in progress. So, allow capture of the pending request on the grounds that this might be better than nothing. v2: Reword 'not started' warning message (Tvrtko) Signed-off-by: John Harrison Reviewed-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/i915_gpu_error.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 225f1b11a6b93..904f21e1380cd 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1624,12 +1624,9 @@ capture_engine(struct intel_engine_cs *engine, return NULL; intel_engine_get_hung_entity(engine, &ce, &rq); - if (rq && !i915_request_started(rq)) { + if (rq && !i915_request_started(rq)) drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n", engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id); - i915_request_put(rq); - rq = NULL; - } if (rq) { capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); From patchwork Fri Jan 20 23:28:29 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13110690 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 5FC8EC25B50 for ; Fri, 20 Jan 2023 23:29:26 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 74AD710EB30; Fri, 20 Jan 2023 23:29:02 +0000 (UTC) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by gabe.freedesktop.org (Postfix) with ESMTPS id 7549810EB22; Fri, 20 Jan 2023 23:28:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674257330; x=1705793330; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=MXHjO/r51sJWf//8G91MF+USmcK0WCiA3qVITFrftH4=; b=YRxw8ohM+pJ67VXgzqjSExp1fqzW46JFfY5y3LeUG7tEyk5HeU1bABsj Ri2ZchsRIEBfdvCQXzvwzzdnn4mUlfzwidv/gp5sWfId4pGeSjnyFTtCJ iA/SL4sv2WfUxHcZ5f+EsVezwp8hMubwkkqmQYYScn/prTfR02vHtB097 0ABgbxkTiVS+z9Ed2MVhAKsEnYhtL9R9hVqD/VJjkqDMcrioFohJoptuO btEFjsxAf3GuWxZdbf1jOicKAvgaSVFx9bHgpo8pzGO48iYPdkXtCCq7Y fm8X1sV4CUFOaFnOFKxAwc0Ad4CLHZ+fs7vx0o11oVR7FReCcty/PVrUz A==; X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="324413566" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="324413566" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Jan 2023 15:28:50 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="693021621" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="693021621" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by orsmga001.jf.intel.com with ESMTP; 20 Jan 2023 15:28:49 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Date: Fri, 20 Jan 2023 15:28:29 -0800 Message-Id: <20230120232831.28177-6-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230120232831.28177-1-John.C.Harrison@Intel.com> References: <20230120232831.28177-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ Subject: [Intel-gfx] [PATCH v4 5/7] drm/i915/guc: Look for a guilty context when an engine reset fails X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: DRI-Devel@Lists.FreeDesktop.Org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" From: John Harrison Engine resets are supposed to never fail. But in the case when one does (due to unknown reasons that normally come down to a missing w/a), it is useful to get as much information out of the system as possible. Given that the GuC intentionally dies on such a situation, it is not possible to get a guilty context notification back. So do a manual search instead. Given that GuC is dead, this is safe because GuC won't be changing the engine state asynchronously. v2: Change comment to be less alarming (Tvrtko) Signed-off-by: John Harrison Acked-by: Tvrtko Ursulin Reviewed-by: Daniele Ceraolo Spurio --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index ad4b2848b0f83..b6b4061e4b633 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -4755,11 +4755,24 @@ static void reset_fail_worker_func(struct work_struct *w) guc->submission_state.reset_fail_mask = 0; spin_unlock_irqrestore(&guc->submission_state.lock, flags); - if (likely(reset_fail_mask)) + if (likely(reset_fail_mask)) { + struct intel_engine_cs *engine; + enum intel_engine_id id; + + /* + * GuC is toast at this point - it dead loops after sending the failed + * reset notification. So need to manually determine the guilty context. + * Note that it should be reliable to do this here because the GuC is + * toast and will not be scheduling behind the KMD's back. + */ + for_each_engine_masked(engine, gt, reset_fail_mask, id) + intel_guc_find_hung_context(engine); + intel_gt_handle_error(gt, reset_fail_mask, I915_ERROR_CAPTURE, - "GuC failed to reset engine mask=0x%x\n", + "GuC failed to reset engine mask=0x%x", reset_fail_mask); + } } int intel_guc_engine_failure_process_msg(struct intel_guc *guc, From patchwork Fri Jan 20 23:28:30 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13110689 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 9F234C27C76 for ; Fri, 20 Jan 2023 23:29:25 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id C99E110EB2B; Fri, 20 Jan 2023 23:29:00 +0000 (UTC) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by gabe.freedesktop.org (Postfix) with ESMTPS id B8D1E10EB23; Fri, 20 Jan 2023 23:28:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674257330; x=1705793330; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=NZ6L9RYHczTAmmUkpMQxggbG7ytl0rIzwE0gQkol4Vw=; b=euL5rn0zTjkKYLiXfunQmRjuPJWgqwd+w+oO7vrpcjUjSyuSknKUwrmd QlUcUf+mihDyDj5J7VaNwh9YfepcNx2NvmbEYB/WHvSMA9qWl9YhYali3 TI8SEi9PXPxndwCjnZp7FY1ZbB1jPDmkP3w9OHh90j1MjvB20hjzcM92U lggCRFq/qEpqZ8wQg6cthbrn4h7kcnye/7W5Im7IVNKqxF7oUUUz2ht35 q19psK4j7ClDwQQ8Xy+3ACzpcuZxI1S65ue5XZp5mNPpYXzMNPgnFlGqw wbSMJPi0DQ5NwzesAyb2XlRLRN/NXOnfnqpFPEyBfdRShQSqX/SOWO6mv Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="324413568" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="324413568" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Jan 2023 15:28:50 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="693021625" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="693021625" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by orsmga001.jf.intel.com with ESMTP; 20 Jan 2023 15:28:49 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Date: Fri, 20 Jan 2023 15:28:30 -0800 Message-Id: <20230120232831.28177-7-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230120232831.28177-1-John.C.Harrison@Intel.com> References: <20230120232831.28177-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ Subject: [Intel-gfx] [PATCH v4 6/7] drm/i915/guc: Add a debug print on GuC triggered reset X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: DRI-Devel@Lists.FreeDesktop.Org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" From: John Harrison For understanding bug reports, it can be useful to have an explicit dmesg print when a reset notification is received from GuC. As opposed to simply inferring that this happened from other messages. Signed-off-by: John Harrison Reviewed-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index b6b4061e4b633..b11d98092ffd1 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -4666,6 +4666,10 @@ static void guc_handle_context_reset(struct intel_guc *guc, { trace_intel_context_reset(ce); + drm_dbg(&guc_to_gt(guc)->i915->drm, "Got GuC reset of 0x%04X, exiting = %d, banned = %d\n", + ce->guc_id.id, test_bit(CONTEXT_EXITING, &ce->flags), + test_bit(CONTEXT_BANNED, &ce->flags)); + if (likely(intel_context_is_schedulable(ce))) { capture_error_state(guc, ce); guc_context_replay(ce); From patchwork Fri Jan 20 23:28:31 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13110686 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id C6889C05027 for ; Fri, 20 Jan 2023 23:29:07 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 124BA10EB26; Fri, 20 Jan 2023 23:28:52 +0000 (UTC) Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by gabe.freedesktop.org (Postfix) with ESMTPS id 952CA10EB20; Fri, 20 Jan 2023 23:28:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674257330; x=1705793330; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Iu0P2ZIfIuIe5Zvs5goqSilamyFGDM6epKp/Bep3TsE=; b=ZjaTql/4al0FrJuExYiMHGRiY1dgRaCiO7btwAvqczZWhn+f/arF0FPe D3ekzNwy8lTDxSVIbrMDquWbZqBgSVL7k1gGoVEIKxtutTTf8okXIoL8q nKaqwSaH/1+dk+7r2EH2mfl1h+hSOqRbFHz8tSmGScYR70dDOuppdzK0+ tkXA/Ce7w48OQ7EqTTuDv7UDTzBL/kJeJvfX6kds4gbGdYSbfkpUXM3Ok 5QQ6iIlnDxwMjF0MmDBYXTyE3VC5i0I7mnzqisY7wdOENzIVS1xs9zqEO pbqZwFB4haaTIqSEP/Najtcy4AfC7UGBVXIACmN/Ns/XvHa5Htv94OJS9 g==; X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="324413567" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="324413567" Received: from orsmga001.jf.intel.com ([10.7.209.18]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 20 Jan 2023 15:28:50 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10596"; a="693021628" X-IronPort-AV: E=Sophos;i="5.97,233,1669104000"; d="scan'208";a="693021628" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by orsmga001.jf.intel.com with ESMTP; 20 Jan 2023 15:28:49 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Date: Fri, 20 Jan 2023 15:28:31 -0800 Message-Id: <20230120232831.28177-8-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.0 In-Reply-To: <20230120232831.28177-1-John.C.Harrison@Intel.com> References: <20230120232831.28177-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ Subject: [Intel-gfx] [PATCH v4 7/7] drm/i915/guc: Rename GuC register state capture node to be more obvious X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: DRI-Devel@Lists.FreeDesktop.Org Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" From: John Harrison The GuC specific register state entry in the error capture object was just called 'capture'. Although the companion 'node' entry was called 'guc_capture_node'. Rename the base entry to be 'guc_capture' instead so that it is a) more consistent and b) more obvious what it is. Signed-off-by: John Harrison Reviewed-by: Daniele Ceraolo Spurio --- drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c | 8 ++++---- drivers/gpu/drm/i915/i915_gpu_error.h | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c index 1c1b85073b4bd..fc3b994626a4f 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c @@ -1506,7 +1506,7 @@ int intel_guc_capture_print_engine_node(struct drm_i915_error_state_buf *ebuf, if (!ebuf || !ee) return -EINVAL; - cap = ee->capture; + cap = ee->guc_capture; if (!cap || !ee->engine) return -ENODEV; @@ -1576,8 +1576,8 @@ void intel_guc_capture_free_node(struct intel_engine_coredump *ee) if (!ee || !ee->guc_capture_node) return; - guc_capture_add_node_to_cachelist(ee->capture, ee->guc_capture_node); - ee->capture = NULL; + guc_capture_add_node_to_cachelist(ee->guc_capture, ee->guc_capture_node); + ee->guc_capture = NULL; ee->guc_capture_node = NULL; } @@ -1611,7 +1611,7 @@ void intel_guc_capture_get_matching_node(struct intel_gt *gt, (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK)) { list_del(&n->link); ee->guc_capture_node = n; - ee->capture = guc->capture; + ee->guc_capture = guc->capture; return; } } diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h index efc75cc2ffdb9..56027ffbce51f 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.h +++ b/drivers/gpu/drm/i915/i915_gpu_error.h @@ -94,7 +94,7 @@ struct intel_engine_coredump { struct intel_instdone instdone; /* GuC matched capture-lists info */ - struct intel_guc_state_capture *capture; + struct intel_guc_state_capture *guc_capture; struct __guc_capture_parsed_output *guc_capture_node; struct i915_gem_context_coredump {