From patchwork Mon Jul 6 14:01:23 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: =?utf-8?q?Micha=C5=82_Winiarski?=
X-Patchwork-Id: 11645927
From: =?utf-8?q?Micha=C5=82_Winiarski?=
To: intel-gfx@lists.freedesktop.org
Date: Mon, 6 Jul 2020 16:01:23 +0200
Message-Id: <20200706140125.172844-1-michal@hardline.pl>
X-Mailer: git-send-email 2.27.0
Subject: [Intel-gfx] [PATCH v3 1/3] drm/i915: Reboot CI if we get wedged during driver init
List-Id: Intel graphics driver community testing & development
Cc: =?utf-8?q?Micha=C5=82_Winiarski?=, Chris Wilson
Sender: "Intel-gfx"

From: Michał Winiarski

Getting wedged device on driver init is pretty much unrecoverable.
Since we're running various scenarios that may potentially hit this in
CI (module reload / selftests / hotunplug), and if it happens, it means
that we can't trust any subsequent CI results, we should just apply the
taint to let the CI know that it should reboot (CI checks taint between
test runs).

v2: Comment that WEDGED_ON_INIT is non-recoverable, distinguish
    WEDGED_ON_INIT from WEDGED_ON_FINI (Chris)
v3: Appease checkpatch, fixup search-replace logic expression mindbomb
    in assert (Chris)

Signed-off-by: Michał Winiarski
Cc: Chris Wilson
Cc: Michal Wajdeczko
Cc: Petri Latvala
Reviewed-by: Chris Wilson
---
 drivers/gpu/drm/i915/gt/intel_engine_user.c |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt.c          |  2 +-
 drivers/gpu/drm/i915/gt/intel_gt.h          | 12 ++++++++----
 drivers/gpu/drm/i915/gt/intel_gt_pm.c       |  2 +-
 drivers/gpu/drm/i915/gt/intel_reset.c       | 13 +++++++++++--
 drivers/gpu/drm/i915/gt/intel_reset.h       | 10 ++--------
 drivers/gpu/drm/i915/gt/intel_reset_types.h |  7 ++++++-
 7 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_user.c b/drivers/gpu/drm/i915/gt/intel_engine_user.c
index 848decee9066..34e6096f196e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_user.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_user.c
@@ -201,7 +201,7 @@ void intel_engines_driver_register(struct drm_i915_private *i915)
 				     uabi_node);
 		char old[sizeof(engine->name)];
 
-		if (intel_gt_has_init_error(engine->gt))
+		if (intel_gt_has_unrecoverable_error(engine->gt))
 			continue; /* ignore incomplete engines */
 
 		GEM_BUG_ON(engine->class >= ARRAY_SIZE(uabi_classes));
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.c b/drivers/gpu/drm/i915/gt/intel_gt.c
index ebc29b6ee86c..876f78759095 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt.c
@@ -510,7 +510,7 @@ static int __engines_verify_workarounds(struct intel_gt *gt)
 
 static void __intel_gt_disable(struct intel_gt *gt)
 {
-	intel_gt_set_wedged_on_init(gt);
+	intel_gt_set_wedged_on_fini(gt);
 
 	intel_gt_suspend_prepare(gt);
 	intel_gt_suspend_late(gt);
diff --git a/drivers/gpu/drm/i915/gt/intel_gt.h b/drivers/gpu/drm/i915/gt/intel_gt.h
index 4fac043750aa..982957ca4e62 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gt.h
@@ -58,14 +58,18 @@ static inline u32 intel_gt_scratch_offset(const struct intel_gt *gt,
 	return i915_ggtt_offset(gt->scratch) + field;
 }
 
-static inline bool intel_gt_is_wedged(const struct intel_gt *gt)
+static inline bool intel_gt_has_unrecoverable_error(const struct intel_gt *gt)
 {
-	return __intel_reset_failed(&gt->reset);
+	return test_bit(I915_WEDGED_ON_INIT, &gt->reset.flags) ||
+	       test_bit(I915_WEDGED_ON_FINI, &gt->reset.flags);
 }
 
-static inline bool intel_gt_has_init_error(const struct intel_gt *gt)
+static inline bool intel_gt_is_wedged(const struct intel_gt *gt)
 {
-	return test_bit(I915_WEDGED_ON_INIT, &gt->reset.flags);
+	GEM_BUG_ON(intel_gt_has_unrecoverable_error(gt) &&
+		   !test_bit(I915_WEDGED, &gt->reset.flags));
+
+	return unlikely(test_bit(I915_WEDGED, &gt->reset.flags));
 }
 
 #endif /* __INTEL_GT_H__ */
diff --git a/drivers/gpu/drm/i915/gt/intel_gt_pm.c b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
index f1d5333f9456..274aa0dd7050 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_pm.c
@@ -188,7 +188,7 @@ int intel_gt_resume(struct intel_gt *gt)
 	enum intel_engine_id id;
 	int err;
 
-	err = intel_gt_has_init_error(gt);
+	err = intel_gt_has_unrecoverable_error(gt);
 	if (err)
 		return err;
 
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 0156f1f5c736..6f94b6479a2f 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -880,7 +880,7 @@ static bool __intel_gt_unset_wedged(struct intel_gt *gt)
 		return true;
 
 	/* Never fully initialised, recovery impossible */
-	if (test_bit(I915_WEDGED_ON_INIT, &gt->reset.flags))
+	if (intel_gt_has_unrecoverable_error(gt))
 		return false;
 
 	GT_TRACE(gt, "start\n");
@@ -1342,7 +1342,7 @@ int intel_gt_terminally_wedged(struct intel_gt *gt)
 	if (!intel_gt_is_wedged(gt))
 		return 0;
 
-	if (intel_gt_has_init_error(gt))
+	if (intel_gt_has_unrecoverable_error(gt))
 		return -EIO;
 
 	/* Reset still in progress? Maybe we will recover? */
@@ -1360,6 +1360,15 @@ void intel_gt_set_wedged_on_init(struct intel_gt *gt)
 		     I915_WEDGED_ON_INIT);
 	intel_gt_set_wedged(gt);
 	set_bit(I915_WEDGED_ON_INIT, &gt->reset.flags);
+
+	/* Wedged on init is non-recoverable */
+	add_taint_for_CI(TAINT_WARN);
+}
+
+void intel_gt_set_wedged_on_fini(struct intel_gt *gt)
+{
+	intel_gt_set_wedged(gt);
+	set_bit(I915_WEDGED_ON_FINI, &gt->reset.flags);
 }
 
 void intel_gt_init_reset(struct intel_gt *gt)
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h
index 8e8d5f761166..a0eec7c11c0c 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.h
+++ b/drivers/gpu/drm/i915/gt/intel_reset.h
@@ -47,8 +47,10 @@ int intel_gt_terminally_wedged(struct intel_gt *gt);
 /*
  * There's no unset_wedged_on_init paired with this one.
  * Once we're wedged on init, there's no going back.
+ * Same thing for unset_wedged_on_fini.
  */
 void intel_gt_set_wedged_on_init(struct intel_gt *gt);
+void intel_gt_set_wedged_on_fini(struct intel_gt *gt);
 
 int __intel_gt_reset(struct intel_gt *gt, intel_engine_mask_t engine_mask);
 
@@ -71,14 +73,6 @@ void __intel_fini_wedge(struct intel_wedge_me *w);
 	     (W)->gt;							\
 	     __intel_fini_wedge((W)))
 
-static inline bool __intel_reset_failed(const struct intel_reset *reset)
-{
-	GEM_BUG_ON(test_bit(I915_WEDGED_ON_INIT, &reset->flags) ?
-		   !test_bit(I915_WEDGED, &reset->flags) : false);
-
-	return unlikely(test_bit(I915_WEDGED, &reset->flags));
-}
-
 bool intel_has_gpu_reset(const struct intel_gt *gt);
 bool intel_has_reset_engine(const struct intel_gt *gt);
 
diff --git a/drivers/gpu/drm/i915/gt/intel_reset_types.h b/drivers/gpu/drm/i915/gt/intel_reset_types.h
index f43bc3a0fe4f..add6b86d9d03 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_reset_types.h
@@ -34,12 +34,17 @@ struct intel_reset {
 	 * longer use the GPU - similar to #I915_WEDGED bit. The difference in
 	 * in the way we're handling "forced" unwedged (e.g. through debugfs),
 	 * which is not allowed in case we failed to initialize.
+	 *
+	 * #I915_WEDGED_ON_FINI - Similar to #I915_WEDGED_ON_INIT, except we
+	 * use it to mark that the GPU is no longer available (and prevent
+	 * users from using it).
 	 */
 	unsigned long flags;
 #define I915_RESET_BACKOFF	0
 #define I915_RESET_MODESET	1
 #define I915_RESET_ENGINE	2
-#define I915_WEDGED_ON_INIT	(BITS_PER_LONG - 2)
+#define I915_WEDGED_ON_INIT	(BITS_PER_LONG - 3)
+#define I915_WEDGED_ON_FINI	(BITS_PER_LONG - 2)
 #define I915_WEDGED		(BITS_PER_LONG - 1)
 
 	struct mutex mutex; /* serialises wedging/unwedging */
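
Note for readers tracing the flag logic: the patch relies on the invariant
that WEDGED_ON_INIT or WEDGED_ON_FINI always implies WEDGED, and that
either one blocks unwedging. Below is a minimal userspace sketch of that
invariant, not kernel code. struct reset_model and every model_* helper
are hypothetical stand-ins: plain bit operations replace the atomic
test_bit()/set_bit(), assert() replaces GEM_BUG_ON(), the CI taint is left
out, and model_unset_wedged() only imitates the early exits of
__intel_gt_unset_wedged() as I understand them from the hunk above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are i915 API. */
#define MODEL_BITS_PER_LONG	((int)(sizeof(unsigned long) * CHAR_BIT))
#define MODEL_WEDGED_ON_INIT	(MODEL_BITS_PER_LONG - 3)
#define MODEL_WEDGED_ON_FINI	(MODEL_BITS_PER_LONG - 2)
#define MODEL_WEDGED		(MODEL_BITS_PER_LONG - 1)

struct reset_model {
	unsigned long flags;
};

/* Non-atomic stand-ins for the kernel's test_bit()/set_bit(). */
static bool model_test_bit(int nr, const struct reset_model *r)
{
	return (r->flags >> nr) & 1UL;
}

static void model_set_bit(int nr, struct reset_model *r)
{
	r->flags |= 1UL << nr;
}

static bool model_has_unrecoverable_error(const struct reset_model *r)
{
	return model_test_bit(MODEL_WEDGED_ON_INIT, r) ||
	       model_test_bit(MODEL_WEDGED_ON_FINI, r);
}

static bool model_is_wedged(const struct reset_model *r)
{
	/* Stand-in for GEM_BUG_ON(): ON_INIT/ON_FINI must imply WEDGED. */
	assert(!(model_has_unrecoverable_error(r) &&
		 !model_test_bit(MODEL_WEDGED, r)));

	return model_test_bit(MODEL_WEDGED, r);
}

/* Both setters raise WEDGED first, then the marker bit, as in the patch. */
static void model_set_wedged_on_init(struct reset_model *r)
{
	model_set_bit(MODEL_WEDGED, r);
	model_set_bit(MODEL_WEDGED_ON_INIT, r);
}

static void model_set_wedged_on_fini(struct reset_model *r)
{
	model_set_bit(MODEL_WEDGED, r);
	model_set_bit(MODEL_WEDGED_ON_FINI, r);
}

/* Simplified: a plain wedge can be cleared, an unrecoverable one cannot. */
static bool model_unset_wedged(struct reset_model *r)
{
	if (!model_test_bit(MODEL_WEDGED, r))
		return true;

	if (model_has_unrecoverable_error(r))
		return false;

	r->flags &= ~(1UL << MODEL_WEDGED);
	return true;
}
```

In this model a plain wedge can be undone through model_unset_wedged(),
while a state set through either model_set_wedged_on_init() or
model_set_wedged_on_fini() stays wedged for good.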