From patchwork Thu Jan 26 00:54:13 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13116471 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 62664C54E94 for ; Thu, 26 Jan 2023 00:54:48 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id A39A010E8D6; Thu, 26 Jan 2023 00:54:39 +0000 (UTC) Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by gabe.freedesktop.org (Postfix) with ESMTPS id 20AB910E129; Thu, 26 Jan 2023 00:54:31 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674694471; x=1706230471; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=GNyw7ajAmmWsgmDUsWUB0opvick3hwfEdGdwN6hpZHA=; b=TjbcbvdAn7kGfUXasdmTMZOLPL64Dex+okiwtRVDrNQflL6kba6Np17R 0NAWonLDm1fF8NBCHAt47UY9znKvps6qmG6gZWexIBc6V6Y8UWEfOUr34 RpgFXkksoDss3j2STs06lXpY9l+qDjxB6fcYYQifZVYqloOCbDknrfzeF NjGfFB3b647sH985l/k/mAMXZ0ChBn0hY/LpeHNoYemmPX4wi22l4dv9c n9kYS2T4j91L7dZZgsqQ7V/qHcqzV+eDoQYYyRltwOUuCisxvcdO5SfZ1 iVXqmKtdFju55zukzz1KL5IDLauwRTqV/t1fjcOH5ry06jpy8s+NooSYB A==; X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="389064422" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="389064422" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Jan 2023 16:54:30 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="751404271" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="751404271" 
Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by FMSMGA003.fm.intel.com with ESMTP; 25 Jan 2023 16:54:29 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v5 1/8] drm/i915/guc: Fix locking when searching for a hung request Date: Wed, 25 Jan 2023 16:54:13 -0800 Message-Id: <20230126005420.160070-2-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230126005420.160070-1-John.C.Harrison@Intel.com> References: <20230126005420.160070-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Matthew Brost , Tvrtko Ursulin , Chris Wilson , Michael Cheng , Alan Previn , Umesh Nerlige Ramappa , Matthew Auld , Lucas De Marchi , Daniele Ceraolo Spurio , DRI-Devel@Lists.FreeDesktop.Org, Rodrigo Vivi , Tejas Upadhyay , intel-gfx@lists.freedesktop.org, John Harrison , Bruce Chang Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison intel_guc_find_hung_context() was not acquiring the correct spinlock before searching the request list. So fix that up. While at it, add some extra whitespace padding for readability. 
Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC") Cc: John Harrison Cc: Matthew Brost Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Daniele Ceraolo Spurio Cc: Matt Roper Cc: Umesh Nerlige Ramappa Cc: Michael Cheng Cc: Lucas De Marchi Cc: Tejas Upadhyay Cc: Chris Wilson Cc: Bruce Chang Cc: Alan Previn Cc: Matthew Auld Cc: intel-gfx@lists.freedesktop.org Signed-off-by: John Harrison Reviewed-by: Daniele Ceraolo Spurio --- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index b436dd7f12e42..3b34a82d692be 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine) xa_lock_irqsave(&guc->context_lookup, flags); xa_for_each(&guc->context_lookup, index, ce) { + bool found; + if (!kref_get_unless_zero(&ce->ref)) continue; @@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine) goto next; } + found = false; + spin_lock(&ce->guc_state.lock); list_for_each_entry(rq, &ce->guc_state.requests, sched.link) { if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE) continue; + found = true; + break; + } + spin_unlock(&ce->guc_state.lock); + + if (found) { intel_engine_set_hung_context(engine, ce); /* Can only cope with one hang at a time... 
*/ @@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine) xa_lock(&guc->context_lookup); goto done; } + next: intel_context_put(ce); xa_lock(&guc->context_lookup); From patchwork Thu Jan 26 00:54:14 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13116470 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 4C384C27C76 for ; Thu, 26 Jan 2023 00:54:46 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 64CD810E8D5; Thu, 26 Jan 2023 00:54:39 +0000 (UTC) Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by gabe.freedesktop.org (Postfix) with ESMTPS id A6A2610E8CF; Thu, 26 Jan 2023 00:54:32 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674694472; x=1706230472; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=f0f+KaFxcVQ5FLgvFUvHD7g59KXi06cG+dUy+Ao/QXI=; b=c6XcDzmf1i7ZIB+DnDX8cj1Pc9Oo3fu9aMeRg5XlJF/+agBJSD6Uy38s E3CMPa70NVcOPLAiUcutDjoTzHwGJcVjMLsj9zcIQunbg8RMV8a6QZAFB HiPvAfxArY7GbKwC7CrStMo2eZgzQCuG0bNgAqzxDyZEaBrEZxqLde9pl S9iFLZtCRjczVcKv+mH96ByqRAzUPTHuDZDcf1JEFl7A0uuJVIAxX9xy8 k/mtgvOa7uC2uHIb0FdAkCRrnkfV6ZRP1BfYKsVwnboa0kZD2WH8toLY+ ufStsQgIMACUJnQfo0EAt7Pzzm2osydNQBudT5jfSLsDPJ/jvMSaKcyuH A==; X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="389064425" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="389064425" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga104.jf.intel.com with 
ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Jan 2023 16:54:31 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="751404280" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="751404280" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by FMSMGA003.fm.intel.com with ESMTP; 25 Jan 2023 16:54:30 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v5 2/8] drm/i915: Fix request locking during error capture & debugfs dump Date: Wed, 25 Jan 2023 16:54:14 -0800 Message-Id: <20230126005420.160070-3-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230126005420.160070-1-John.C.Harrison@Intel.com> References: <20230126005420.160070-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Matthew Brost , Tvrtko Ursulin , Andy Shevchenko , Michael Cheng , Aravind Iddamsetty , Alan Previn , Umesh Nerlige Ramappa , intel-gfx@lists.freedesktop.org, Lucas De Marchi , Bruce Chang , Daniele Ceraolo Spurio , DRI-Devel@Lists.FreeDesktop.Org, Andrzej Hajda , Rodrigo Vivi , Tejas Upadhyay , John Harrison , Matthew Auld Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison When GuC support was added to error capture, the locking around the request object was broken. Fix it up. The context based search manages the spinlocking around the search internally. So it needs to grab the reference count internally as well. The execlist only request based search relies on external locking, so it needs an external reference count but within the spinlock not outside it. The only other caller of the context based search is the code for dumping engine state to debugfs. 
That code wasn't previously getting an explicit reference at all as it does everything while holding the execlist specific spinlock. So, that needs updating as well, as that spinlock doesn't help when using GuC submission. Rather than trying to conditionally get/put depending on submission model, just change it to always do the get/put. v2: Explicitly document adding an extra blank line in some dense code (Andy Shevchenko). Fix multiple potential null pointer derefs in case of no request found (some spotted by Tvrtko, but there was more!). Also fix a leaked request in case of !started and another in __guc_reset_context now that intel_context_find_active_request is actually reference counting the returned request. v3: Add a _get suffix to intel_context_find_active_request now that it grabs a reference (Daniele). v4: Split the intel_guc_find_hung_context change to a separate patch and rename intel_context_find_active_request_get to intel_context_get_active_request (Tvrtko). Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC") Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset") Cc: Matthew Brost Cc: John Harrison Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Daniele Ceraolo Spurio Cc: Andrzej Hajda Cc: Matthew Auld Cc: Matt Roper Cc: Umesh Nerlige Ramappa Cc: Michael Cheng Cc: Lucas De Marchi Cc: Tejas Upadhyay Cc: Andy Shevchenko Cc: Aravind Iddamsetty Cc: Alan Previn Cc: Bruce Chang Cc: intel-gfx@lists.freedesktop.org Signed-off-by: John Harrison Reviewed-by: Daniele Ceraolo Spurio Acked-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/gt/intel_context.c | 4 +++- drivers/gpu/drm/i915/gt/intel_context.h | 3 +-- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 6 +++++- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++- drivers/gpu/drm/i915/i915_gpu_error.c | 13 ++++++------- 5 files changed, 17 insertions(+), 12 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
b/drivers/gpu/drm/i915/gt/intel_context.c index e94365b08f1ef..2aa63ec521b89 100644 --- a/drivers/gpu/drm/i915/gt/intel_context.c +++ b/drivers/gpu/drm/i915/gt/intel_context.c @@ -528,7 +528,7 @@ struct i915_request *intel_context_create_request(struct intel_context *ce) return rq; } -struct i915_request *intel_context_find_active_request(struct intel_context *ce) +struct i915_request *intel_context_get_active_request(struct intel_context *ce) { struct intel_context *parent = intel_context_to_parent(ce); struct i915_request *rq, *active = NULL; @@ -552,6 +552,8 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce) active = rq; } + if (active) + active = i915_request_get_rcu(active); spin_unlock_irqrestore(&parent->guc_state.lock, flags); return active; diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h index fb62b7b8cbcda..0a8d553da3f43 100644 --- a/drivers/gpu/drm/i915/gt/intel_context.h +++ b/drivers/gpu/drm/i915/gt/intel_context.h @@ -268,8 +268,7 @@ int intel_context_prepare_remote_request(struct intel_context *ce, struct i915_request *intel_context_create_request(struct intel_context *ce); -struct i915_request * -intel_context_find_active_request(struct intel_context *ce); +struct i915_request *intel_context_get_active_request(struct intel_context *ce); static inline bool intel_context_is_barrier(const struct intel_context *ce) { diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 922f1bb22dc68..a86bdbee7a6be 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -2237,9 +2237,11 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d if (guc) { ce = intel_engine_get_hung_context(engine); if (ce) - hung_rq = intel_context_find_active_request(ce); + hung_rq = intel_context_get_active_request(ce); } else { hung_rq = 
intel_engine_execlist_find_hung_request(engine); + if (hung_rq) + hung_rq = i915_request_get_rcu(hung_rq); } if (hung_rq) @@ -2250,6 +2252,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d else intel_engine_dump_active_requests(&engine->sched_engine->requests, hung_rq, m); + if (hung_rq) + i915_request_put(hung_rq); } void intel_engine_dump(struct intel_engine_cs *engine, diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 3b34a82d692be..a2b263e5fd667 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -1702,7 +1702,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st goto next_context; guilty = false; - rq = intel_context_find_active_request(ce); + rq = intel_context_get_active_request(ce); if (!rq) { head = ce->ring->tail; goto out_replay; @@ -1715,6 +1715,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st head = intel_ring_wrap(ce->ring, rq->head); __i915_request_reset(rq, guilty); + i915_request_put(rq); out_replay: guc_reset_state(ce, head, guilty); next_context: diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 9d5d5a397b64e..9e2d17785a9a8 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1607,7 +1607,7 @@ capture_engine(struct intel_engine_cs *engine, ce = intel_engine_get_hung_context(engine); if (ce) { intel_engine_clear_hung_context(engine); - rq = intel_context_find_active_request(ce); + rq = intel_context_get_active_request(ce); if (!rq || !i915_request_started(rq)) goto no_request_capture; } else { @@ -1618,21 +1618,18 @@ capture_engine(struct intel_engine_cs *engine, if (!intel_uc_uses_guc_submission(&engine->gt->uc)) { spin_lock_irqsave(&engine->sched_engine->lock, flags); rq = 
intel_engine_execlist_find_hung_request(engine); + if (rq) + rq = i915_request_get_rcu(rq); spin_unlock_irqrestore(&engine->sched_engine->lock, flags); } } - if (rq) - rq = i915_request_get_rcu(rq); - if (!rq) goto no_request_capture; capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); - if (!capture) { - i915_request_put(rq); + if (!capture) goto no_request_capture; - } if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) intel_guc_capture_get_matching_node(engine->gt, ee, ce); @@ -1642,6 +1639,8 @@ capture_engine(struct intel_engine_cs *engine, return ee; no_request_capture: + if (rq) + i915_request_put(rq); kfree(ee); return NULL; } From patchwork Thu Jan 26 00:54:15 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13116476 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 4996CC27C76 for ; Thu, 26 Jan 2023 00:55:00 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 7A40910E8DF; Thu, 26 Jan 2023 00:54:49 +0000 (UTC) Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by gabe.freedesktop.org (Postfix) with ESMTPS id 1324110E8D1; Thu, 26 Jan 2023 00:54:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674694473; x=1706230473; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=I2QGYGUHU2bsMx7P4C4vWRT7MkAIYM5yq1q7G5IDyzE=; b=fpgTnhHkN+n4z2aJPtDCND1prhk4bVU+9SDsSxdn0slVTN7BNC2HRady abC3RsvSQQJVMG3QS3eWjJjnhqafZ15LLEGlb1n7ZYxQz4PV5BqdIMTlD 
LSvKyX5tJiweVa1ySACCsP3RSfnSJwVnDuqfisiTmKmsbPjK/RTZHWAAQ SRQiR3yPkpjwBGVwVvZ7MVBG2/G8Lvk5UQ7SXnmEnwoa+E6x1ZwgtC46e rysA9J70upFbq0yGaSBtKdqvVWWrQMT6tjv6Pfl+8bs9//+lHPzIV1J2R b76sryY6Kbemi4hMal6tUo/o7M1T2/yfem21ApvuCGg6dpQoaT/YGx49P Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="389064427" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="389064427" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Jan 2023 16:54:32 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="751404284" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="751404284" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by FMSMGA003.fm.intel.com with ESMTP; 25 Jan 2023 16:54:31 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v5 3/8] drm/i915: Fix up locking around dumping requests lists Date: Wed, 25 Jan 2023 16:54:15 -0800 Message-Id: <20230126005420.160070-4-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230126005420.160070-1-John.C.Harrison@Intel.com> References: <20230126005420.160070-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Matthew Brost , Tvrtko Ursulin , Michael Cheng , Alan Previn , Umesh Nerlige Ramappa , Matthew Auld , Lucas De Marchi , Daniele Ceraolo Spurio , DRI-Devel@Lists.FreeDesktop.Org, Rodrigo Vivi , John Harrison , Bruce Chang Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison The debugfs dump of requests was confused about what state requires the execlist lock versus the GuC lock. 
There was also a bunch of duplicated messy code between it and the error capture code. So refactor the hung request search into a re-usable function. And reduce the span of the execlist state lock to only the execlist specific code paths. In order to do that, also move the report of hold count (which is an execlist only concept) from the top level dump function to the lower level execlist specific function. Also, move the execlist specific code into the execlist source file. v2: Rename some functions and move to more appropriate files (Daniele). v3: Rename new execlist dump function (Daniele) Signed-off-by: John Harrison Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC") Cc: John Harrison Cc: Matthew Brost Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: Daniele Ceraolo Spurio Cc: Matt Roper Cc: Umesh Nerlige Ramappa Cc: Michael Cheng Cc: Lucas De Marchi Cc: Bruce Chang Cc: Alan Previn Cc: Matthew Auld --- drivers/gpu/drm/i915/gt/intel_engine.h | 4 +- drivers/gpu/drm/i915/gt/intel_engine_cs.c | 74 +++++++++---------- .../drm/i915/gt/intel_execlists_submission.c | 27 +++++++ .../drm/i915/gt/intel_execlists_submission.h | 4 + drivers/gpu/drm/i915/i915_gpu_error.c | 26 +------ 5 files changed, 73 insertions(+), 62 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h index 0e24af5efee9c..b58c30ac8ef02 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine.h +++ b/drivers/gpu/drm/i915/gt/intel_engine.h @@ -250,8 +250,8 @@ void intel_engine_dump_active_requests(struct list_head *requests, ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine, ktime_t *now); -struct i915_request * -intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine); +void intel_engine_get_hung_entity(struct intel_engine_cs *engine, + struct intel_context **ce, struct i915_request **rq); u32 intel_engine_context_size(struct intel_gt *gt, u8 class); struct 
intel_context * diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index a86bdbee7a6be..9f703f255d721 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -2114,17 +2114,6 @@ static void print_request_ring(struct drm_printer *m, struct i915_request *rq) } } -static unsigned long list_count(struct list_head *list) -{ - struct list_head *pos; - unsigned long count = 0; - - list_for_each(pos, list) - count++; - - return count; -} - static unsigned long read_ul(void *p, size_t x) { return *(unsigned long *)(p + x); @@ -2216,11 +2205,11 @@ void intel_engine_dump_active_requests(struct list_head *requests, } } -static void engine_dump_active_requests(struct intel_engine_cs *engine, struct drm_printer *m) +static void engine_dump_active_requests(struct intel_engine_cs *engine, + struct drm_printer *m) { + struct intel_context *hung_ce = NULL; struct i915_request *hung_rq = NULL; - struct intel_context *ce; - bool guc; /* * No need for an engine->irq_seqno_barrier() before the seqno reads. @@ -2229,29 +2218,20 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d * But the intention here is just to report an instantaneous snapshot * so that's fine. 
*/ - lockdep_assert_held(&engine->sched_engine->lock); + intel_engine_get_hung_entity(engine, &hung_ce, &hung_rq); drm_printf(m, "\tRequests:\n"); - guc = intel_uc_uses_guc_submission(&engine->gt->uc); - if (guc) { - ce = intel_engine_get_hung_context(engine); - if (ce) - hung_rq = intel_context_get_active_request(ce); - } else { - hung_rq = intel_engine_execlist_find_hung_request(engine); - if (hung_rq) - hung_rq = i915_request_get_rcu(hung_rq); - } - if (hung_rq) engine_dump_request(hung_rq, m, "\t\thung"); + else if (hung_ce) + drm_printf(m, "\t\tGot hung ce but no hung rq!\n"); - if (guc) + if (intel_uc_uses_guc_submission(&engine->gt->uc)) intel_guc_dump_active_requests(engine, hung_rq, m); else - intel_engine_dump_active_requests(&engine->sched_engine->requests, - hung_rq, m); + intel_execlists_dump_active_requests(engine, hung_rq, m); + if (hung_rq) i915_request_put(hung_rq); } @@ -2263,7 +2243,6 @@ void intel_engine_dump(struct intel_engine_cs *engine, struct i915_gpu_error * const error = &engine->i915->gpu_error; struct i915_request *rq; intel_wakeref_t wakeref; - unsigned long flags; ktime_t dummy; if (header) { @@ -2300,13 +2279,8 @@ void intel_engine_dump(struct intel_engine_cs *engine, i915_reset_count(error)); print_properties(engine, m); - spin_lock_irqsave(&engine->sched_engine->lock, flags); engine_dump_active_requests(engine, m); - drm_printf(m, "\tOn hold?: %lu\n", - list_count(&engine->sched_engine->hold)); - spin_unlock_irqrestore(&engine->sched_engine->lock, flags); - drm_printf(m, "\tMMIO base: 0x%08x\n", engine->mmio_base); wakeref = intel_runtime_pm_get_if_in_use(engine->uncore->rpm); if (wakeref) { @@ -2352,8 +2326,7 @@ intel_engine_create_virtual(struct intel_engine_cs **siblings, return siblings[0]->cops->create_virtual(siblings, count, flags); } -struct i915_request * -intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine) +static struct i915_request *engine_execlist_find_hung_request(struct intel_engine_cs *engine) { 
struct i915_request *request, *active = NULL; @@ -2405,6 +2378,33 @@ intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine) return active; } +void intel_engine_get_hung_entity(struct intel_engine_cs *engine, + struct intel_context **ce, struct i915_request **rq) +{ + unsigned long flags; + + *ce = intel_engine_get_hung_context(engine); + if (*ce) { + intel_engine_clear_hung_context(engine); + + *rq = intel_context_get_active_request(*ce); + return; + } + + /* + * Getting here with GuC enabled means it is a forced error capture + * with no actual hang. So, no need to attempt the execlist search. + */ + if (intel_uc_uses_guc_submission(&engine->gt->uc)) + return; + + spin_lock_irqsave(&engine->sched_engine->lock, flags); + *rq = engine_execlist_find_hung_request(engine); + if (*rq) + *rq = i915_request_get_rcu(*rq); + spin_unlock_irqrestore(&engine->sched_engine->lock, flags); +} + void xehp_enable_ccs_engines(struct intel_engine_cs *engine) { /* diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c index 18ffe55282e59..3c573d41d4046 100644 --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c @@ -4150,6 +4150,33 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine, spin_unlock_irqrestore(&sched_engine->lock, flags); } +static unsigned long list_count(struct list_head *list) +{ + struct list_head *pos; + unsigned long count = 0; + + list_for_each(pos, list) + count++; + + return count; +} + +void intel_execlists_dump_active_requests(struct intel_engine_cs *engine, + struct i915_request *hung_rq, + struct drm_printer *m) +{ + unsigned long flags; + + spin_lock_irqsave(&engine->sched_engine->lock, flags); + + intel_engine_dump_active_requests(&engine->sched_engine->requests, hung_rq, m); + + drm_printf(m, "\tOn hold?: %lu\n", + list_count(&engine->sched_engine->hold)); + + 
spin_unlock_irqrestore(&engine->sched_engine->lock, flags); +} + #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST) #include "selftest_execlists.c" #endif diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h index a1aa92c983a51..d2c7d45ea0623 100644 --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h @@ -32,6 +32,10 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine, int indent), unsigned int max); +void intel_execlists_dump_active_requests(struct intel_engine_cs *engine, + struct i915_request *hung_rq, + struct drm_printer *m); + bool intel_engine_in_execlists_submission_mode(const struct intel_engine_cs *engine); diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 9e2d17785a9a8..b20bd6365615b 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1596,35 +1596,15 @@ capture_engine(struct intel_engine_cs *engine, { struct intel_engine_capture_vma *capture = NULL; struct intel_engine_coredump *ee; - struct intel_context *ce; + struct intel_context *ce = NULL; struct i915_request *rq = NULL; - unsigned long flags; ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL, dump_flags); if (!ee) return NULL; - ce = intel_engine_get_hung_context(engine); - if (ce) { - intel_engine_clear_hung_context(engine); - rq = intel_context_get_active_request(ce); - if (!rq || !i915_request_started(rq)) - goto no_request_capture; - } else { - /* - * Getting here with GuC enabled means it is a forced error capture - * with no actual hang. So, no need to attempt the execlist search. 
- */ - if (!intel_uc_uses_guc_submission(&engine->gt->uc)) { - spin_lock_irqsave(&engine->sched_engine->lock, flags); - rq = intel_engine_execlist_find_hung_request(engine); - if (rq) - rq = i915_request_get_rcu(rq); - spin_unlock_irqrestore(&engine->sched_engine->lock, - flags); - } - } - if (!rq) + intel_engine_get_hung_entity(engine, &ce, &rq); + if (!rq || !i915_request_started(rq)) goto no_request_capture; capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); From patchwork Thu Jan 26 00:54:16 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13116474 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 1FDE3C27C76 for ; Thu, 26 Jan 2023 00:54:54 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id B1D9510E8D9; Thu, 26 Jan 2023 00:54:41 +0000 (UTC) Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by gabe.freedesktop.org (Postfix) with ESMTPS id 4E8E510E8D2; Thu, 26 Jan 2023 00:54:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674694473; x=1706230473; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=zTd7lf6ad9RbSljaLkuZ/But6bcl+1S+wpat7nkSU1I=; b=K6wvk6pEgOdKqD2nTEcoX28jRWIfy0Sno5kfSgtLZwt+R0SPxi++Ys49 s6/jnB+GTVel2KnDBq3Sj9l+1vsBJsmdY6zVcPHxv5G27RuGrDSxoBe2W NaTCSRRdxV3HlmKTV2ZIYRAl6TkBodGh8mqOfq4xxFJrgjOhQNvhwhihI XIpqEOGztp/EXCbxEOzuKtCNMQcAXDRRy89N999jt47FZ3XHET8PPmDDk B7Thy8FThZzhKiEYLZs9tPEMJ9DqCBxkqcImc0OPN8MLPo3HqgaGdxOWt 
rzvTZYCC5KczUwGTEq3aspqP+lXvAyhNTVCAEvvUfC8GkqOb/H53R8wYJ Q==; X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="389064428" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="389064428" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Jan 2023 16:54:32 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="751404289" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="751404289" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by FMSMGA003.fm.intel.com with ESMTP; 25 Jan 2023 16:54:32 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v5 4/8] drm/i915: Allow error capture without a request Date: Wed, 25 Jan 2023 16:54:16 -0800 Message-Id: <20230126005420.160070-5-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230126005420.160070-1-John.C.Harrison@Intel.com> References: <20230126005420.160070-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Daniele Ceraolo Spurio , Umesh Nerlige Ramappa , John Harrison , DRI-Devel@Lists.FreeDesktop.Org, Tvrtko Ursulin Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison There was a report of error captures occurring without any hung context being indicated despite the capture being initiated by a 'hung context notification' from GuC. The problem was not reproducible. However, it is possible to happen if the context in question has no active requests. For example, if the hang was in the context switch itself then the breadcrumb write would have occurred and the KMD would see an idle context. 
In the interests of attempting to provide as much information as possible about a hang, it seems wise to include the engine info regardless of whether a request was found or not. As opposed to just pretending there was no hang at all. So update the error capture code to always record engine information if a context is given. Which means updating record_context() to take a context instead of a request (which it only ever used to find the context anyway). And split the request agnostic parts of intel_engine_coredump_add_request() out into a separate function. v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null pointer. v3: Tidy up request locking code flow (Tvrtko) v4: Pull in improved info message from next patch and fix up potential leak of GuC register state (Daniele) Signed-off-by: John Harrison Reviewed-by: Umesh Nerlige Ramappa (v2) Reviewed-by: Daniele Ceraolo Spurio Acked-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/i915_gpu_error.c | 74 ++++++++++++++++++--------- 1 file changed, 50 insertions(+), 24 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index b20bd6365615b..225f1b11a6b93 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee) } static bool record_context(struct i915_gem_context_coredump *e, - const struct i915_request *rq) + struct intel_context *ce) { struct i915_gem_context *ctx; struct task_struct *task; bool simulated; rcu_read_lock(); - ctx = rcu_dereference(rq->context->gem_context); + ctx = rcu_dereference(ce->gem_context); if (ctx && !kref_get_unless_zero(&ctx->ref)) ctx = NULL; rcu_read_unlock(); @@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e, e->guilty = atomic_read(&ctx->guilty_count); e->active = atomic_read(&ctx->active_count); - e->total_runtime = 
intel_context_get_total_runtime_ns(rq->context); - e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context); + e->total_runtime = intel_context_get_total_runtime_ns(ce); + e->avg_runtime = intel_context_get_avg_runtime_ns(ce); simulated = i915_gem_context_no_error_capture(ctx); @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_ return ee; } +static struct intel_engine_capture_vma * +engine_coredump_add_context(struct intel_engine_coredump *ee, + struct intel_context *ce, + gfp_t gfp) +{ + struct intel_engine_capture_vma *vma = NULL; + + ee->simulated |= record_context(&ee->context, ce); + if (ee->simulated) + return NULL; + + /* + * We need to copy these to an anonymous buffer + * as the simplest method to avoid being overwritten + * by userspace. + */ + vma = capture_vma(vma, ce->ring->vma, "ring", gfp); + vma = capture_vma(vma, ce->state, "HW context", gfp); + + return vma; +} + struct intel_engine_capture_vma * intel_engine_coredump_add_request(struct intel_engine_coredump *ee, struct i915_request *rq, gfp_t gfp) { - struct intel_engine_capture_vma *vma = NULL; + struct intel_engine_capture_vma *vma; - ee->simulated |= record_context(&ee->context, rq); - if (ee->simulated) + vma = engine_coredump_add_context(ee, rq->context, gfp); + if (!vma) return NULL; /* @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee, */ vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch"); vma = capture_user(vma, rq, gfp); - vma = capture_vma(vma, rq->ring->vma, "ring", gfp); - vma = capture_vma(vma, rq->context->state, "HW context", gfp); ee->rq_head = rq->head; ee->rq_post = rq->postfix; @@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine, return NULL; intel_engine_get_hung_entity(engine, &ce, &rq); - if (!rq || !i915_request_started(rq)) - goto no_request_capture; + if (rq && !i915_request_started(rq)) { + drm_info(&engine->gt->i915->drm, "Got hung context on %s 
with active request %lld:%lld [0x%04X] not yet started\n", + engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id); + i915_request_put(rq); + rq = NULL; + } - capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); - if (!capture) - goto no_request_capture; - if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) - intel_guc_capture_get_matching_node(engine->gt, ee, ce); + if (rq) { + capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); + i915_request_put(rq); + } else if (ce) { + capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL); + } - intel_engine_coredump_add_vma(ee, capture, compress); - i915_request_put(rq); + if (capture) { + intel_engine_coredump_add_vma(ee, capture, compress); - return ee; + if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE) + intel_guc_capture_get_matching_node(engine->gt, ee, ce); + } else { + kfree(ee); + ee = NULL; + } -no_request_capture: - if (rq) - i915_request_put(rq); - kfree(ee); - return NULL; + return ee; } static void From patchwork Thu Jan 26 00:54:17 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13116469 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id AD8FDC27C76 for ; Thu, 26 Jan 2023 00:54:43 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 5FC2210E8D4; Thu, 26 Jan 2023 00:54:38 +0000 (UTC) Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by gabe.freedesktop.org (Postfix) with ESMTPS id 78D6610E8D3; Thu, 26 Jan 2023 00:54:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; 
d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674694473; x=1706230473; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=MMGVM3oKDx1++G5z9tXhnD/tSjRvUWveU/E16hB4uys=; b=PrLZrwyyAdqx7YMq8FtoN39GFTkvc5yfSugS9vfHuw7LavOZEiO/4pTg PtVlPsQ2sLJlaTfPvB7aVR4A7s/N25J7/hy/Y/44xDmxGpWklin3YakTf YcOQijBc5TenQ0Ufto+t+dq/MchUYQ35sI8a6psrslWiEnDfZsVle3vAP FMxpmT05F8smkGzT09KXzKDgJkgCsxB34IS5pAz9stDiWBj8byBpjccGB Sf0z7W5V8BnFlZsIKlsjbqLi+FoZ64JKto0dL4i8ravypJsJNywnG4bCS 5c7ujvGIs0tkJzcbxJF1zgXukletB1TF/2svzArcGKjEu+Enq38gt5RDf A==; X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="389064429" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="389064429" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Jan 2023 16:54:32 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="751404293" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="751404293" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by FMSMGA003.fm.intel.com with ESMTP; 25 Jan 2023 16:54:32 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v5 5/8] drm/i915: Allow error capture of a pending request Date: Wed, 25 Jan 2023 16:54:17 -0800 Message-Id: <20230126005420.160070-6-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230126005420.160070-1-John.C.Harrison@Intel.com> References: <20230126005420.160070-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. 
#1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: John Harrison , DRI-Devel@Lists.FreeDesktop.Org, Tvrtko Ursulin Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison A hang situation has been observed where the only requests on the context were either completed or not yet started according to the breadcrumbs. However, the register state claimed a batch was (maybe) in progress. So, allow capture of the pending request on the grounds that this might be better than nothing. v2: Reword 'not started' warning message (Tvrtko) Signed-off-by: John Harrison Reviewed-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/i915_gpu_error.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 225f1b11a6b93..904f21e1380cd 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -1624,12 +1624,9 @@ capture_engine(struct intel_engine_cs *engine, return NULL; intel_engine_get_hung_entity(engine, &ce, &rq); - if (rq && !i915_request_started(rq)) { + if (rq && !i915_request_started(rq)) drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n", engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id); - i915_request_put(rq); - rq = NULL; - } if (rq) { capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL); From patchwork Thu Jan 26 00:54:18 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13116472 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from
gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 965D4C27C76 for ; Thu, 26 Jan 2023 00:54:50 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 050CC10E8D7; Thu, 26 Jan 2023 00:54:40 +0000 (UTC) Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by gabe.freedesktop.org (Postfix) with ESMTPS id AC7B710E8CF; Thu, 26 Jan 2023 00:54:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674694473; x=1706230473; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=rdM+o6H8ybxuEDzfnAh/Dh6qJbeerttIapfW/5uZKIk=; b=MKcFmqOfTxBxVNA978qMpQlVZ0jyQTWmvF3VG+l5UNliQ/VUGsxYJbvv G/T2g8aFnZ1STPJOQHt4vG05MXdYwnygOfl+CKtAqXHMKqRfeZvDa9igu r9Ig9L6El1Qc+GPcpR6lT9wciE+OuvodaDITfvhyNvxlDbsqLCTuw6S4I vb9EZVHAj8svTyCLEtOQ8MClryHveV394VZS1tXlRuRkKaqJiY4RHq0RY uZmZljFlPJlOrGt9pvjVsDSa6pA4MsB+p5r4O7oMAyarlNeVEv6s87AAM 6/hHFFnveZqfiGrV44vv804lBaRXouaFa/GdmNOurBMP2dVNvZSUb2lKu g==; X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="389064431" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="389064431" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Jan 2023 16:54:33 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="751404297" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="751404297" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by FMSMGA003.fm.intel.com with ESMTP; 25 Jan 2023 16:54:32 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v5 6/8] drm/i915/guc: Look for a guilty context when an engine reset fails Date: Wed, 25 Jan 2023 16:54:18 -0800 Message-Id: 
<20230126005420.160070-7-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230126005420.160070-1-John.C.Harrison@Intel.com> References: <20230126005420.160070-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Daniele Ceraolo Spurio , John Harrison , DRI-Devel@Lists.FreeDesktop.Org, Tvrtko Ursulin Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison Engine resets are supposed to never fail. But in the case when one does (due to unknown reasons that normally come down to a missing w/a), it is useful to get as much information out of the system as possible. Given that the GuC intentionally dies on such a situation, it is not possible to get a guilty context notification back. So do a manual search instead. Given that GuC is dead, this is safe because GuC won't be changing the engine state asynchronously. 
v2: Change comment to be less alarming (Tvrtko) Signed-off-by: John Harrison Acked-by: Tvrtko Ursulin Reviewed-by: Daniele Ceraolo Spurio --- .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 17 +++++++++++++++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index a2b263e5fd667..7adc35bd4435a 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -4755,11 +4755,24 @@ static void reset_fail_worker_func(struct work_struct *w) guc->submission_state.reset_fail_mask = 0; spin_unlock_irqrestore(&guc->submission_state.lock, flags); - if (likely(reset_fail_mask)) + if (likely(reset_fail_mask)) { + struct intel_engine_cs *engine; + enum intel_engine_id id; + + /* + * GuC is toast at this point - it dead loops after sending the failed + * reset notification. So need to manually determine the guilty context. + * Note that it should be reliable to do this here because the GuC is + * toast and will not be scheduling behind the KMD's back. 
+ */ + for_each_engine_masked(engine, gt, reset_fail_mask, id) + intel_guc_find_hung_context(engine); + intel_gt_handle_error(gt, reset_fail_mask, I915_ERROR_CAPTURE, - "GuC failed to reset engine mask=0x%x\n", + "GuC failed to reset engine mask=0x%x", reset_fail_mask); + } } int intel_guc_engine_failure_process_msg(struct intel_guc *guc, From patchwork Thu Jan 26 00:54:19 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13116475 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id D3976C61D97 for ; Thu, 26 Jan 2023 00:54:58 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 09D6310E8D8; Thu, 26 Jan 2023 00:54:48 +0000 (UTC) Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by gabe.freedesktop.org (Postfix) with ESMTPS id D0A8210E8D4; Thu, 26 Jan 2023 00:54:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674694473; x=1706230473; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ge027wYS9Kpvkw04aKFhgn6e7W9uzBegmpREjabV0iY=; b=YwdUyb5SxrDwB7DnHeFXvED3DDu6QHNuciDcWkPy77JmB3702ikQpRsj iwt1QhfAetgCgmtEMCGps9nxuu4oUAVTGvVmSAxmga4CiYtFVgyG6xKin +a5zqOMs63cLOtkj3dVQgGG/dWXQx7iHoTRasrf7Durg0bhrN5caBn42W DRjUxKzmcXvndt+3RxK0+/DAS1YmM9NSyl/Xq9taZSy/KvY2bLyjLEsF3 Cy45/1xr4EPnHgbNKsq1apNNAqTZC3I6QHhHRoiB44hLkyeY6/q1bgOC3 sCdcAXvYl35UKmGyUizpiZFcBnLQbc4H40x6Ai13Pojj+FL8o8swKKlCl w==; X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="389064432" X-IronPort-AV: 
E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="389064432" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Jan 2023 16:54:33 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="751404306" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="751404306" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by FMSMGA003.fm.intel.com with ESMTP; 25 Jan 2023 16:54:33 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v5 7/8] drm/i915/guc: Add a debug print on GuC triggered reset Date: Wed, 25 Jan 2023 16:54:19 -0800 Message-Id: <20230126005420.160070-8-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230126005420.160070-1-John.C.Harrison@Intel.com> References: <20230126005420.160070-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: John Harrison , DRI-Devel@Lists.FreeDesktop.Org, Tvrtko Ursulin Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison For understanding bug reports, it can be useful to have an explicit dmesg print when a reset notification is received from GuC. As opposed to simply inferring that this happened from other messages. 
Signed-off-by: John Harrison Reviewed-by: Tvrtko Ursulin --- drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c index 7adc35bd4435a..2e6ab0bb5c2b6 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c @@ -4666,6 +4666,10 @@ static void guc_handle_context_reset(struct intel_guc *guc, { trace_intel_context_reset(ce); + drm_dbg(&guc_to_gt(guc)->i915->drm, "Got GuC reset of 0x%04X, exiting = %d, banned = %d\n", + ce->guc_id.id, test_bit(CONTEXT_EXITING, &ce->flags), + test_bit(CONTEXT_BANNED, &ce->flags)); + if (likely(intel_context_is_schedulable(ce))) { capture_error_state(guc, ce); guc_context_replay(ce); From patchwork Thu Jan 26 00:54:20 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Harrison X-Patchwork-Id: 13116473 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 6F975C54EAA for ; Thu, 26 Jan 2023 00:54:52 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 1D64510E8D2; Thu, 26 Jan 2023 00:54:41 +0000 (UTC) Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31]) by gabe.freedesktop.org (Postfix) with ESMTPS id 93C9710E8CF; Thu, 26 Jan 2023 00:54:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1674694474; x=1706230474; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; 
bh=gE2TL8/wmSdfF/qfLiTs7dtrj7rx1/jqcUGqBVk0hBM=; b=dyw/m0NBVRRwh5yKCRe7aEN3m9TpjF8FG+TbQTYG9TakGdUg3/gjqs78 HunVAk+DKhEdQtX7raVH61ZCp9L0ygKHWYZqzVKDVtiVP2gFMObr3bUQG ZMuYIaLYtf44xez8Of5/JSEqITxgvhXWKtgCcpDezItGjonrZQ/+VgxxL jh+Ml32+MS38+BpYoE1SSCx83mkVF5z1gFwRphUE41uZczP8acpUxLise Ia4L0KVRO6FkL3alUrY8swOUB/5ayqg74Qn3aqBPev++Xa5dq4dQKx8u9 yiYkcWIefGO6bQ+p4DhWbwe8uj1vTG3YPdUGtfLWIZIwHmzHR8TW8xCFF A==; X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="389064433" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="389064433" Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 25 Jan 2023 16:54:33 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6500,9779,10601"; a="751404311" X-IronPort-AV: E=Sophos;i="5.97,246,1669104000"; d="scan'208";a="751404311" Received: from relo-linux-5.jf.intel.com ([10.165.21.152]) by FMSMGA003.fm.intel.com with ESMTP; 25 Jan 2023 16:54:33 -0800 From: John.C.Harrison@Intel.com To: Intel-GFX@Lists.FreeDesktop.Org Subject: [PATCH v5 8/8] drm/i915/guc: Rename GuC register state capture node to be more obvious Date: Wed, 25 Jan 2023 16:54:20 -0800 Message-Id: <20230126005420.160070-9-John.C.Harrison@Intel.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20230126005420.160070-1-John.C.Harrison@Intel.com> References: <20230126005420.160070-1-John.C.Harrison@Intel.com> MIME-Version: 1.0 Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ X-BeenThere: dri-devel@lists.freedesktop.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Direct Rendering Infrastructure - Development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Daniele Ceraolo Spurio , John Harrison , DRI-Devel@Lists.FreeDesktop.Org Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" From: John Harrison The GuC specific register state entry in the error capture object was just called 'capture'. 
Although the companion 'node' entry was called 'guc_capture_node'. Rename the base entry to be 'guc_capture' instead so that it is a) more consistent and b) more obvious what it is. Signed-off-by: John Harrison Reviewed-by: Daniele Ceraolo Spurio --- drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c | 8 ++++---- drivers/gpu/drm/i915/i915_gpu_error.h | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c index 1c1b85073b4bd..fc3b994626a4f 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c @@ -1506,7 +1506,7 @@ int intel_guc_capture_print_engine_node(struct drm_i915_error_state_buf *ebuf, if (!ebuf || !ee) return -EINVAL; - cap = ee->capture; + cap = ee->guc_capture; if (!cap || !ee->engine) return -ENODEV; @@ -1576,8 +1576,8 @@ void intel_guc_capture_free_node(struct intel_engine_coredump *ee) if (!ee || !ee->guc_capture_node) return; - guc_capture_add_node_to_cachelist(ee->capture, ee->guc_capture_node); - ee->capture = NULL; + guc_capture_add_node_to_cachelist(ee->guc_capture, ee->guc_capture_node); + ee->guc_capture = NULL; ee->guc_capture_node = NULL; } @@ -1611,7 +1611,7 @@ void intel_guc_capture_get_matching_node(struct intel_gt *gt, (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK)) { list_del(&n->link); ee->guc_capture_node = n; - ee->capture = guc->capture; + ee->guc_capture = guc->capture; return; } } diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h index efc75cc2ffdb9..56027ffbce51f 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.h +++ b/drivers/gpu/drm/i915/i915_gpu_error.h @@ -94,7 +94,7 @@ struct intel_engine_coredump { struct intel_instdone instdone; /* GuC matched capture-lists info */ - struct intel_guc_state_capture *capture; + struct intel_guc_state_capture *guc_capture; struct __guc_capture_parsed_output *guc_capture_node; struct 
i915_gem_context_coredump {