From patchwork Mon May 15 21:14:22 2017 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Michel Thierry X-Patchwork-Id: 9727979 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id 5052260386 for ; Mon, 15 May 2017 21:14:30 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5227B28971 for ; Mon, 15 May 2017 21:14:30 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 45AA9289BC; Mon, 15 May 2017 21:14:30 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-4.2 required=2.0 tests=BAYES_00, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 41D5428971 for ; Mon, 15 May 2017 21:14:28 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 0F12189E0C; Mon, 15 May 2017 21:14:27 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) by gabe.freedesktop.org (Postfix) with ESMTPS id 2CFAE89E0C for ; Mon, 15 May 2017 21:14:25 +0000 (UTC) Received: from fmsmga003.fm.intel.com ([10.253.24.29]) by fmsmga105.fm.intel.com with ESMTP; 15 May 2017 14:14:23 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.38,346,1491289200"; d="scan'208";a="857078564" Received: from relo-linux-11.sc.intel.com ([10.3.160.214]) by FMSMGA003.fm.intel.com with ESMTP; 15 May 2017 14:14:22 -0700 From: Michel Thierry To: intel-gfx@lists.freedesktop.org Date: Mon, 15 May 2017 14:14:22 -0700 Message-Id: <20170515211422.15763-1-michel.thierry@intel.com> X-Mailer: git-send-email 2.11.0 In-Reply-To: <20170427231300.32841-3-michel.thierry@intel.com> References: <20170427231300.32841-3-michel.thierry@intel.com> Subject: [Intel-gfx] [PATCH 02/20] drm/i915: Modify error handler for per engine hang recovery X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP This is a preparatory patch which modifies error handler to do per engine hang recovery. The actual patch which implements this sequence follows later in the series. The aim is to prepare existing recovery function to adapt to this new function where applicable (which fails at this point because core implementation is lacking) and continue recovery using legacy full gpu reset. A helper function is also added to query the availability of engine reset. The error events that are used to notify user of reset are currently ommited in case of engine reset. In legacy we report an error event, a reset event before resetting the gpu and a reset done event marking the completion of reset. The same behaviour can be adapted but reset event will only dispatched once even when multiple engines are hung, and there needs to be some control to prevent multiple events in case the reset-engine fails and the driver fall backs to full chip reset. Note that this implementation of engine reset is for i915 directly submitting to the ELSP, where the driver manages the hang detection, recovery and resubmission. With GuC submission these tasks are shared between driver and firmware; i915 will still responsible for detecting a hang, and when it does it will have to request GuC to reset that Engine and remind the firmware about the outstanding submissions. This will be added in different patch. v2: rebase, advertise engine reset availability in platform definition, add note about GuC submission. v3: s/*engine_reset*/*reset_engine*/. (Chris) Handle reset as 2 level resets, by first going to engine only and fall backing to full/chip reset as needed, i.e. reset_engine will need the struct_mutex. v4: Pass the engine mask to i915_reset. (Chris) v5: Rebase, update selftests. v6: Rebase, prepare for mutex-less reset engine. v7: Pass reset_engine mask as a function parameter, and iterate over the engine mask for reset_engine. (Chris) v8: Use i915.reset >=2 in has_reset_engine; remove redundant reset logging; add a reset-engine-in-progress flag to prevent concurrent resets, and avoid dual purposing of reset-backoff. (Chris) Cc: Chris Wilson Cc: Mika Kuoppala Signed-off-by: Ian Lister Signed-off-by: Tomas Elf Signed-off-by: Arun Siluvery Signed-off-by: Michel Thierry --- drivers/gpu/drm/i915/i915_drv.c | 13 +++++++++++++ drivers/gpu/drm/i915/i915_drv.h | 8 ++++++++ drivers/gpu/drm/i915/i915_irq.c | 24 ++++++++++++++++++++++++ drivers/gpu/drm/i915/i915_pci.c | 5 ++++- drivers/gpu/drm/i915/intel_uncore.c | 11 +++++++++++ 5 files changed, 60 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index 72fb47a439d2..fa21458d6e1e 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -1877,6 +1877,19 @@ void i915_reset(struct drm_i915_private *dev_priv) goto finish; } +/** + * i915_reset_engine - reset GPU engine to recover from a hang + * @engine: engine to reset + * + * Reset a specific GPU engine. Useful if a hang is detected. + * Returns zero on successful reset or otherwise an error code. + */ +int i915_reset_engine(struct intel_engine_cs *engine) +{ + /* FIXME: replace me with engine reset sequence */ + return -ENODEV; +} + static int i915_pm_suspend(struct device *kdev) { struct pci_dev *pdev = to_pci_dev(kdev); diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index ebc38f33723d..f0ff918e5d0b 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -705,6 +705,7 @@ struct intel_csr { func(has_ddi); \ func(has_decoupled_mmio); \ func(has_dp_mst); \ + func(has_reset_engine); \ func(has_fbc); \ func(has_fpga_dbg); \ func(has_full_ppgtt); \ @@ -1498,6 +1499,10 @@ struct i915_gpu_error { * inspect the bit and do the reset directly, otherwise the worker * waits for the struct_mutex. * + * #I915_RESET_ENGINE_IN_PROGRESS - Since the driver doesn't need to + * acquire the struct_mutex to reset an engine, we need an explicit + * flag to prevent two concurrent reset-engine attempts. + * * #I915_WEDGED - If reset fails and we can no longer use the GPU, * we set the #I915_WEDGED bit. Prior to command submission, e.g. * i915_gem_request_alloc(), this bit is checked and the sequence @@ -1506,6 +1511,7 @@ struct i915_gpu_error { unsigned long flags; #define I915_RESET_BACKOFF 0 #define I915_RESET_HANDOFF 1 +#define I915_RESET_ENGINE_IN_PROGRESS 2 #define I915_WEDGED (BITS_PER_LONG - 1) /** @@ -2989,6 +2995,8 @@ extern void i915_driver_unload(struct drm_device *dev); extern int intel_gpu_reset(struct drm_i915_private *dev_priv, u32 engine_mask); extern bool intel_has_gpu_reset(struct drm_i915_private *dev_priv); extern void i915_reset(struct drm_i915_private *dev_priv); +extern int i915_reset_engine(struct intel_engine_cs *engine); +extern bool intel_has_reset_engine(struct drm_i915_private *dev_priv); extern int intel_guc_reset(struct drm_i915_private *dev_priv); extern void intel_engine_init_hangcheck(struct intel_engine_cs *engine); extern void intel_hangcheck_init(struct drm_i915_private *dev_priv); diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 9f5ae1e938be..958d56f49dad 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -2741,6 +2741,30 @@ void i915_handle_error(struct drm_i915_private *dev_priv, if (!engine_mask) goto out; + /* try engine reset first, and continue if fails */ + if (intel_has_reset_engine(dev_priv)) { + struct intel_engine_cs *engine; + unsigned int tmp; + + /* protect against concurrent reset attempts */ + if (test_and_set_bit(I915_RESET_ENGINE_IN_PROGRESS, + &dev_priv->gpu_error.flags)) + goto out; + + for_each_engine_masked(engine, dev_priv, engine_mask, tmp) { + if (i915_reset_engine(engine) == 0) + engine_mask &= ~intel_engine_flag(engine); + } + + /* clear unconditionally, full reset won't care about it */ + clear_bit(I915_RESET_ENGINE_IN_PROGRESS, + &dev_priv->gpu_error.flags); + + if (!engine_mask) + goto out; + } + + /* full reset needs the mutex, stop any other user trying to do so */ if (test_and_set_bit(I915_RESET_BACKOFF, &dev_priv->gpu_error.flags)) goto out; diff --git a/drivers/gpu/drm/i915/i915_pci.c b/drivers/gpu/drm/i915/i915_pci.c index f80db2ccd92f..4dfb400aef85 100644 --- a/drivers/gpu/drm/i915/i915_pci.c +++ b/drivers/gpu/drm/i915/i915_pci.c @@ -310,7 +310,8 @@ static const struct intel_device_info intel_haswell_info = { BDW_COLORS, \ .has_logical_ring_contexts = 1, \ .has_full_48bit_ppgtt = 1, \ - .has_64bit_reloc = 1 + .has_64bit_reloc = 1, \ + .has_reset_engine = 1 static const struct intel_device_info intel_broadwell_info = { BDW_FEATURES, @@ -341,6 +342,7 @@ static const struct intel_device_info intel_cherryview_info = { .has_gmch_display = 1, .has_aliasing_ppgtt = 1, .has_full_ppgtt = 1, + .has_reset_engine = 1, .display_mmio_offset = VLV_DISPLAY_BASE, GEN_CHV_PIPEOFFSETS, CURSOR_OFFSETS, @@ -389,6 +391,7 @@ static const struct intel_device_info intel_skylake_gt3_info = { .has_aliasing_ppgtt = 1, \ .has_full_ppgtt = 1, \ .has_full_48bit_ppgtt = 1, \ + .has_reset_engine = 1, \ GEN_DEFAULT_PIPEOFFSETS, \ IVB_CURSOR_OFFSETS, \ BDW_COLORS diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c index a9a6933afda2..5f55cb57127a 100644 --- a/drivers/gpu/drm/i915/intel_uncore.c +++ b/drivers/gpu/drm/i915/intel_uncore.c @@ -1776,6 +1776,17 @@ bool intel_has_gpu_reset(struct drm_i915_private *dev_priv) return intel_get_gpu_reset(dev_priv) != NULL; } +/* + * When GuC submission is enabled, GuC manages ELSP and can initiate the + * engine reset too. For now, fall back to full GPU reset if it is enabled. + */ +bool intel_has_reset_engine(struct drm_i915_private *dev_priv) +{ + return (dev_priv->info.has_reset_engine && + !dev_priv->guc.execbuf_client && + i915.reset >= 2); +} + int intel_guc_reset(struct drm_i915_private *dev_priv) { int ret;