From patchwork Tue Jul 20 20:57:41 2021
X-Patchwork-Submitter: Matthew Brost
X-Patchwork-Id: 12389271
From: Matthew Brost
Date: Tue, 20 Jul 2021 13:57:41 -0700
Message-Id: <20210720205802.39610-22-matthew.brost@intel.com>
In-Reply-To: <20210720205802.39610-1-matthew.brost@intel.com>
References: <20210720205802.39610-1-matthew.brost@intel.com>
Subject: [Intel-gfx] [RFC PATCH 21/42] drm/i915/guc: Add hang check to GuC submit engine

The heartbeat uses a single instance of a GuC submit engine (GSE) to do
the hang check. As such, if a different GSE's state machine hangs, the
heartbeat cannot detect this hang. Add a timer to each GSE which in turn
can disable all submissions if that GSE is hung.

Cc: John Harrison
Signed-off-by: Matthew Brost
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 36 +++++++++++++++++++
 .../i915/gt/uc/intel_guc_submission_types.h   |  3 ++
 2 files changed, 39 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index d8be5a41d0ca..4cf233d39bea 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -105,15 +105,21 @@ static bool tasklet_blocked(struct guc_submit_engine *gse)
 	return test_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
+/* 2 seconds seems like a reasonable timeout waiting for a G2H */
+#define MAX_TASKLET_BLOCKED_NS	2000000000
 static void set_tasklet_blocked(struct guc_submit_engine *gse)
 {
 	lockdep_assert_held(&gse->sched_engine.lock);
+	hrtimer_start_range_ns(&gse->hang_timer,
+			       ns_to_ktime(MAX_TASKLET_BLOCKED_NS), 0,
+			       HRTIMER_MODE_REL_PINNED);
 	set_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
 static void __clr_tasklet_blocked(struct guc_submit_engine *gse)
 {
 	lockdep_assert_held(&gse->sched_engine.lock);
+	hrtimer_cancel(&gse->hang_timer);
 	clear_bit(GSE_STATE_TASKLET_BLOCKED, &gse->flags);
 }
 
@@ -1021,6 +1027,7 @@ static void disable_submission(struct intel_guc *guc)
 		if (__tasklet_is_enabled(&sched_engine->tasklet)) {
 			GEM_BUG_ON(!guc->ct.enabled);
 			__tasklet_disable_sync_once(&sched_engine->tasklet);
+			hrtimer_try_to_cancel(&guc->gse[i]->hang_timer);
 			sched_engine->tasklet.callback = NULL;
 		}
 	}
@@ -3716,6 +3723,33 @@ static void guc_sched_engine_destroy(struct kref *kref)
 	kfree(gse);
 }
 
+static enum hrtimer_restart gse_hang(struct hrtimer *hrtimer)
+{
+	struct guc_submit_engine *gse =
+		container_of(hrtimer, struct guc_submit_engine, hang_timer);
+	struct intel_guc *guc = gse->sched_engine.private_data;
+
+#if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
+	if (guc->gse_hang_expected)
+		drm_dbg(&guc_to_gt(guc)->i915->drm,
+			"GSE[%i] hung, disabling submission", gse->id);
+	else
+		drm_err(&guc_to_gt(guc)->i915->drm,
+			"GSE[%i] hung, disabling submission", gse->id);
+#else
+	drm_err(&guc_to_gt(guc)->i915->drm,
+		"GSE[%i] hung, disabling submission", gse->id);
+#endif
+
+	/*
+	 * Tasklet not making forward progress, disable submission which in turn
+	 * will kick in the heartbeat to do a full GPU reset.
+	 */
+	disable_submission(guc);
+
+	return HRTIMER_NORESTART;
+}
+
 static void guc_submit_engine_init(struct intel_guc *guc,
 				   struct guc_submit_engine *gse,
 				   int id)
@@ -3733,6 +3767,8 @@ static void guc_submit_engine_init(struct intel_guc *guc,
 	sched_engine->retire_inflight_request_prio =
 		guc_retire_inflight_request_prio;
 	sched_engine->private_data = guc;
+	hrtimer_init(&gse->hang_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	gse->hang_timer.function = gse_hang;
 	gse->id = id;
 }
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
index a5933e07bdd2..eae2e9725ede 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission_types.h
@@ -6,6 +6,8 @@
 #ifndef _INTEL_GUC_SUBMISSION_TYPES_H_
 #define _INTEL_GUC_SUBMISSION_TYPES_H_
 
+#include
+
 #include "gt/intel_engine_types.h"
 #include "gt/intel_context_types.h"
 #include "i915_scheduler_types.h"
@@ -41,6 +43,7 @@ struct guc_submit_engine {
 	unsigned long flags;
 	int total_num_rq_with_no_guc_id;
 	atomic_t num_guc_ids_not_ready;
+	struct hrtimer hang_timer;
 	int id;
 
 	/*