From patchwork Thu Feb 14 02:57:08 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Santa, Carlos" X-Patchwork-Id: 10811639 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 54B44746 for ; Thu, 14 Feb 2019 02:57:46 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 40C1F2DB83 for ; Thu, 14 Feb 2019 02:57:46 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 34BB92DB9C; Thu, 14 Feb 2019 02:57:46 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id AB2E92DB83 for ; Thu, 14 Feb 2019 02:57:45 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id B92F26E846; Thu, 14 Feb 2019 02:57:44 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by gabe.freedesktop.org (Postfix) with ESMTPS id 953286E846 for ; Thu, 14 Feb 2019 02:57:41 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga105.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 13 Feb 2019 18:57:41 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.58,367,1544515200"; d="scan'208";a="124345890" Received: from miryad.jf.intel.com ([10.54.74.35]) by fmsmga008.fm.intel.com with ESMTP; 13 Feb 2019 18:57:41 -0800 From: Carlos Santa To: intel-gfx@lists.freedesktop.org Date: Wed, 13 Feb 2019 18:57:08 -0800 Message-Id: <20190214025713.34150-2-carlos.santa@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190214025713.34150-1-carlos.santa@intel.com> References: <20190214025713.34150-1-carlos.santa@intel.com> Subject: [Intel-gfx] [PATCH v3 1/6] drm/i915: Add engine reset count in get-reset-stats ioctl X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Michel Thierry MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP From: Michel Thierry Users/tests relying on the total reset count will start seeing a smaller number since most of the hangs can be handled by engine reset. Note that if reset engine x, context a running on engine y will be unaware and unaffected. To start the discussion, include just a total engine reset count. If it is deemed useful, it can be extended to report each engine separately. Our igt's gem_reset_stats test will need changes to ignore the pad field, since it can now return reset_engine_count. v2: s/engine_reset/reset_engine/, use union in uapi to not break compatibility. v3: Keep rejecting attempts to use pad as input (Antonio) v4: Rebased. v5: Rebased. Cc: Chris Wilson Cc: Mika Kuoppala Cc: Antonio Argenziano Cc: Cc: Tvrtko Ursulin Signed-off-by: Michel Thierry Signed-off-by: Carlos Santa --- drivers/gpu/drm/i915/i915_gem_context.c | 12 ++++++++++-- include/uapi/drm/i915_drm.h | 6 +++++- 2 files changed, 15 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c index 459f8eae1c39..cbfe8f2eb3f2 100644 --- a/drivers/gpu/drm/i915/i915_gem_context.c +++ b/drivers/gpu/drm/i915/i915_gem_context.c @@ -1889,6 +1889,8 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev, struct drm_i915_private *dev_priv = to_i915(dev); struct drm_i915_reset_stats *args = data; struct i915_gem_context *ctx; + struct intel_engine_cs *engine; + enum intel_engine_id id; int ret; if (args->flags || args->pad) @@ -1907,10 +1909,16 @@ int i915_gem_context_reset_stats_ioctl(struct drm_device *dev, * we should wrap the hangstats with a seqlock. */ - if (capable(CAP_SYS_ADMIN)) + if (capable(CAP_SYS_ADMIN)) { args->reset_count = i915_reset_count(&dev_priv->gpu_error); - else + for_each_engine(engine, dev_priv, id) + args->reset_engine_count += + i915_reset_engine_count(&dev_priv->gpu_error, + engine); + } else { args->reset_count = 0; + args->reset_engine_count = 0; + } args->batch_active = atomic_read(&ctx->guilty_count); args->batch_pending = atomic_read(&ctx->active_count); diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h index cc03ef9f885f..3f2c89740b0e 100644 --- a/include/uapi/drm/i915_drm.h +++ b/include/uapi/drm/i915_drm.h @@ -1642,7 +1642,11 @@ struct drm_i915_reset_stats { /* Number of batches lost pending for execution, for this context */ __u32 batch_pending; - __u32 pad; + union { + __u32 pad; + /* Engine resets since boot/module reload, for all contexts */ + __u32 reset_engine_count; + }; }; struct drm_i915_gem_userptr { From patchwork Thu Feb 14 02:57:09 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Santa, Carlos" X-Patchwork-Id: 10811643 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id ECB0C922 for ; Thu, 14 Feb 2019 02:57:48 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D84E52DB83 for ; Thu, 14 Feb 2019 02:57:48 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id CC7B42DB9C; Thu, 14 Feb 2019 02:57:48 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id D92A52DB83 for ; Thu, 14 Feb 2019 02:57:47 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 11D546E84A; Thu, 14 Feb 2019 02:57:45 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by gabe.freedesktop.org (Postfix) with ESMTPS id D100E6E846 for ; Thu, 14 Feb 2019 02:57:42 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga105.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 13 Feb 2019 18:57:42 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.58,367,1544515200"; d="scan'208";a="124345899" Received: from miryad.jf.intel.com ([10.54.74.35]) by fmsmga008.fm.intel.com with ESMTP; 13 Feb 2019 18:57:42 -0800 From: Carlos Santa To: intel-gfx@lists.freedesktop.org Date: Wed, 13 Feb 2019 18:57:09 -0800 Message-Id: <20190214025713.34150-3-carlos.santa@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190214025713.34150-1-carlos.santa@intel.com> References: <20190214025713.34150-1-carlos.santa@intel.com> Subject: [Intel-gfx] [PATCH v3 2/6] drm/i915: Watchdog timeout: IRQ handler for gen8+ X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Michel Thierry MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP From: Michel Thierry *** General *** Watchdog timeout (or "media engine reset") is a feature that allows userland applications to enable hang detection on individual batch buffers. The detection mechanism itself is mostly bound to the hardware and the only thing that the driver needs to do to support this form of hang detection is to implement the interrupt handling support as well as watchdog command emission before and after the emitted batch buffer start instruction in the ring buffer. The principle of the hang detection mechanism is as follows: 1. Once the decision has been made to enable watchdog timeout for a particular batch buffer and the driver is in the process of emitting the batch buffer start instruction into the ring buffer it also emits a watchdog timer start instruction before and a watchdog timer cancellation instruction after the batch buffer start instruction in the ring buffer. 2. Once the GPU execution reaches the watchdog timer start instruction the hardware watchdog counter is started by the hardware. The counter keeps counting until either reaching a previously configured threshold value or the timer cancellation instruction is executed. 2a. If the counter reaches the threshold value the hardware fires a watchdog interrupt that is picked up by the watchdog interrupt handler. This means that a hang has been detected and the driver needs to deal with it the same way it would deal with a engine hang detected by the periodic hang checker. The only difference between the two is that we already blamed the active request (to ensure an engine reset). 2b. If the batch buffer completes and the execution reaches the watchdog cancellation instruction before the watchdog counter reaches its threshold value the watchdog is cancelled and nothing more comes of it. No hang is detected. Note about future interaction with preemption: Preemption could happen in a command sequence prior to watchdog counter getting disabled, resulting in watchdog being triggered following preemption (e.g. when watchdog had been enabled in the low priority batch). The driver will need to explicitly disable the watchdog counter as part of the preemption sequence. *** This patch introduces: *** 1. IRQ handler code for watchdog timeout allowing direct hang recovery based on hardware-driven hang detection, which then integrates directly with the hang recovery path. This is independent of having per-engine reset or just full gpu reset. 2. Watchdog specific register information. Currently the render engine and all available media engines support watchdog timeout (VECS is only supported in GEN9). The specifications elude to the BCS engine being supported but that is currently not supported by this commit. Note that the value to stop the counter is different between render and non-render engines in GEN8; GEN9 onwards it's the same. v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check against false-positives. v3: Don't use high priority tasklet, use engine_last_submit while checking for false-positives. From GEN9 onwards, the stop counter bit is the same for all engines. v4: Remove unnecessary brackets, use current_seqno to mark the request as guilty in the hangcheck/capture code. v5: Rebased after RESET_ENGINEs flag. v6: Don't capture error state in case of watchdog timeout. The capture process is time consuming and this will align to what happens when we use GuC to handle the watchdog timeout. (Chris) v7: Rebase. v8: Rebase, use HZ to reschedule. v9: Rebase, get forcewake domains in function (no longer in execlists struct). v10: Rebase. v11: Rebase, remove extra braces (Tvrtko), implement watchdog_to_clock_counts helper (Tvrtko), Move tasklet_kill(watchdog_tasklet) inside intel_engines (Tvrtko), Use a global heartbeat seqno instead of engine seqno (Chris) Make all engines checks all class based checks (Tvrtko) Cc: Antonio Argenziano Cc: Tvrtko Ursulin Signed-off-by: Michel Thierry Signed-off-by: Carlos Santa --- drivers/gpu/drm/i915/i915_drv.h | 8 +++ drivers/gpu/drm/i915/i915_gpu_error.h | 4 ++ drivers/gpu/drm/i915/i915_irq.c | 12 +++- drivers/gpu/drm/i915/i915_reg.h | 6 ++ drivers/gpu/drm/i915/intel_engine_cs.c | 1 + drivers/gpu/drm/i915/intel_hangcheck.c | 17 ++++-- drivers/gpu/drm/i915/intel_lrc.c | 81 +++++++++++++++++++++++++ drivers/gpu/drm/i915/intel_ringbuffer.h | 6 ++ 8 files changed, 129 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 63a008aebfcd..0fcb2df869a2 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -3120,6 +3120,14 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id) return ctx; } +static inline u32 +watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us) +{ + u64 threshold = 0; + + return threshold; +} + int i915_perf_open_ioctl(struct drm_device *dev, void *data, struct drm_file *file); int i915_perf_add_config_ioctl(struct drm_device *dev, void *data, diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h index f408060e0667..bd1821c73ecd 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.h +++ b/drivers/gpu/drm/i915/i915_gpu_error.h @@ -233,6 +233,9 @@ struct i915_gpu_error { * i915_mutex_lock_interruptible()?). I915_RESET_BACKOFF serves a * secondary role in preventing two concurrent global reset attempts. * + * #I915_RESET_WATCHDOG - When hw detects a hang before us, we can use + * I915_RESET_WATCHDOG to report the hang detection cause accurately. + * * #I915_RESET_ENGINE[num_engines] - Since the driver doesn't need to * acquire the struct_mutex to reset an engine, we need an explicit * flag to prevent two concurrent reset attempts in the same engine. @@ -248,6 +251,7 @@ struct i915_gpu_error { #define I915_RESET_BACKOFF 0 #define I915_RESET_MODESET 1 #define I915_RESET_ENGINE 2 +#define I915_RESET_WATCHDOG 3 #define I915_WEDGED (BITS_PER_LONG - 1) /** Number of times an engine has been reset */ diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 4b23b2fd1fad..e2a1a07b0f2c 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -1456,6 +1456,9 @@ gen8_cs_irq_handler(struct intel_engine_cs *engine, u32 iir) if (tasklet) tasklet_hi_schedule(&engine->execlists.tasklet); + + if (iir & GT_GEN8_WATCHDOG_INTERRUPT) + tasklet_schedule(&engine->execlists.watchdog_tasklet); } static void gen8_gt_irq_ack(struct drm_i915_private *i915, @@ -3883,17 +3886,24 @@ static void gen8_gt_irq_postinstall(struct drm_i915_private *dev_priv) u32 gt_interrupts[] = { GT_RENDER_USER_INTERRUPT << GEN8_RCS_IRQ_SHIFT | GT_CONTEXT_SWITCH_INTERRUPT << GEN8_RCS_IRQ_SHIFT | + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_RCS_IRQ_SHIFT | GT_RENDER_USER_INTERRUPT << GEN8_BCS_IRQ_SHIFT | GT_CONTEXT_SWITCH_INTERRUPT << GEN8_BCS_IRQ_SHIFT, GT_RENDER_USER_INTERRUPT << GEN8_VCS1_IRQ_SHIFT | GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS1_IRQ_SHIFT | + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS1_IRQ_SHIFT | GT_RENDER_USER_INTERRUPT << GEN8_VCS2_IRQ_SHIFT | - GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT, + GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VCS2_IRQ_SHIFT | + GT_GEN8_WATCHDOG_INTERRUPT << GEN8_VCS2_IRQ_SHIFT, 0, GT_RENDER_USER_INTERRUPT << GEN8_VECS_IRQ_SHIFT | GT_CONTEXT_SWITCH_INTERRUPT << GEN8_VECS_IRQ_SHIFT }; + /* VECS watchdog is only available in skl+ */ + if (INTEL_GEN(dev_priv) >= 9) + gt_interrupts[3] |= GT_GEN8_WATCHDOG_INTERRUPT; + dev_priv->pm_ier = 0x0; dev_priv->pm_imr = ~dev_priv->pm_ier; GEN8_IRQ_INIT_NDX(GT, 0, ~gt_interrupts[0], gt_interrupts[0]); diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h index 1eca166d95bb..a0e101bbcbce 100644 --- a/drivers/gpu/drm/i915/i915_reg.h +++ b/drivers/gpu/drm/i915/i915_reg.h @@ -2335,6 +2335,11 @@ enum i915_power_well_id { #define RING_START(base) _MMIO((base) + 0x38) #define RING_CTL(base) _MMIO((base) + 0x3c) #define RING_CTL_SIZE(size) ((size) - PAGE_SIZE) /* in bytes -> pages */ +#define RING_CNTR(base) _MMIO((base) + 0x178) +#define GEN8_WATCHDOG_ENABLE 0 +#define GEN8_WATCHDOG_DISABLE 1 +#define GEN8_XCS_WATCHDOG_DISABLE 0xFFFFFFFF /* GEN8 & non-render only */ +#define RING_THRESH(base) _MMIO((base) + 0x17C) #define RING_SYNC_0(base) _MMIO((base) + 0x40) #define RING_SYNC_1(base) _MMIO((base) + 0x44) #define RING_SYNC_2(base) _MMIO((base) + 0x48) @@ -2894,6 +2899,7 @@ enum i915_power_well_id { #define GT_BSD_USER_INTERRUPT (1 << 12) #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT_S1 (1 << 11) /* hsw+; rsvd on snb, ivb, vlv */ #define GT_CONTEXT_SWITCH_INTERRUPT (1 << 8) +#define GT_GEN8_WATCHDOG_INTERRUPT (1 << 6) /* gen8+ */ #define GT_RENDER_L3_PARITY_ERROR_INTERRUPT (1 << 5) /* !snb */ #define GT_RENDER_PIPECTL_NOTIFY_INTERRUPT (1 << 4) #define GT_RENDER_CS_MASTER_ERROR_INTERRUPT (1 << 3) diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c index 7ae753358a6d..74f563d23cc8 100644 --- a/drivers/gpu/drm/i915/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/intel_engine_cs.c @@ -1106,6 +1106,7 @@ void intel_engines_park(struct drm_i915_private *i915) /* Flush the residual irq tasklets first. */ intel_engine_disarm_breadcrumbs(engine); tasklet_kill(&engine->execlists.tasklet); + tasklet_kill(&engine->execlists.watchdog_tasklet); /* * We are committed now to parking the engines, make sure there diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c index 58b6ff8453dc..bc10acb24d9a 100644 --- a/drivers/gpu/drm/i915/intel_hangcheck.c +++ b/drivers/gpu/drm/i915/intel_hangcheck.c @@ -218,7 +218,8 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine, static void hangcheck_declare_hang(struct drm_i915_private *i915, unsigned int hung, - unsigned int stuck) + unsigned int stuck, + unsigned int watchdog) { struct intel_engine_cs *engine; char msg[80]; @@ -231,13 +232,16 @@ static void hangcheck_declare_hang(struct drm_i915_private *i915, if (stuck != hung) hung &= ~stuck; len = scnprintf(msg, sizeof(msg), - "%s on ", stuck == hung ? "no progress" : "hang"); + "%s on ", watchdog ? "watchdog timeout" : + stuck == hung ? "no progress" : "hang"); for_each_engine_masked(engine, i915, hung, tmp) len += scnprintf(msg + len, sizeof(msg) - len, "%s, ", engine->name); msg[len-2] = '\0'; - return i915_handle_error(i915, hung, I915_ERROR_CAPTURE, "%s", msg); + return i915_handle_error(i915, hung, + watchdog ? 0 : I915_ERROR_CAPTURE, + "%s", msg); } /* @@ -255,7 +259,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work) gpu_error.hangcheck_work.work); struct intel_engine_cs *engine; enum intel_engine_id id; - unsigned int hung = 0, stuck = 0, wedged = 0; + unsigned int hung = 0, stuck = 0, wedged = 0, watchdog = 0; if (!i915_modparams.enable_hangcheck) return; @@ -266,6 +270,9 @@ static void i915_hangcheck_elapsed(struct work_struct *work) if (i915_terminally_wedged(&dev_priv->gpu_error)) return; + if (test_and_clear_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags)) + watchdog = 1; + /* As enabling the GPU requires fairly extensive mmio access, * periodically arm the mmio checker to see if we are triggering * any invalid access. @@ -311,7 +318,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work) } if (hung) - hangcheck_declare_hang(dev_priv, hung, stuck); + hangcheck_declare_hang(dev_priv, hung, stuck, watchdog); /* Reset timer in case GPU hangs without another request being added */ i915_queue_hangcheck(dev_priv); diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index 9ca7dc7a6fa5..41697fb468e3 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -2352,6 +2352,69 @@ static int gen8_emit_flush_render(struct i915_request *request, return 0; } +/* From GEN9 onwards, all engines use the same RING_CNTR format */ +static inline u32 get_watchdog_disable(struct intel_engine_cs *engine) +{ + if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9) + return GEN8_WATCHDOG_DISABLE; + else + return GEN8_XCS_WATCHDOG_DISABLE; +} + +#define GEN8_WATCHDOG_1000US(dev_priv) watchdog_to_clock_counts(dev_priv, 1000) +static void gen8_watchdog_irq_handler(unsigned long data) +{ + struct intel_engine_cs *engine = (struct intel_engine_cs *)data; + struct drm_i915_private *dev_priv = engine->i915; + enum forcewake_domains fw_domains; + u32 current_seqno; + + switch (engine->class) { + default: + MISSING_CASE(engine->id); + /* fall through */ + case RENDER_CLASS: + fw_domains = FORCEWAKE_RENDER; + break; + case VIDEO_DECODE_CLASS: + case VIDEO_ENHANCEMENT_CLASS: + fw_domains = FORCEWAKE_MEDIA; + break; + } + + intel_uncore_forcewake_get(dev_priv, fw_domains); + + /* Stop the counter to prevent further timeout interrupts */ + I915_WRITE_FW(RING_CNTR(engine->mmio_base), get_watchdog_disable(engine)); + + current_seqno = intel_engine_get_hangcheck_seqno(engine); + + /* did the request complete after the timer expired? */ + if (engine->hangcheck.next_seqno == current_seqno) + goto fw_put; + + if (engine->hangcheck.watchdog == current_seqno) { + /* Make sure the active request will be marked as guilty */ + engine->hangcheck.acthd = intel_engine_get_active_head(engine); + engine->hangcheck.last_seqno = current_seqno; + + /* And try to run the hangcheck_work as soon as possible */ + set_bit(I915_RESET_WATCHDOG, &dev_priv->gpu_error.flags); + queue_delayed_work(system_long_wq, + &dev_priv->gpu_error.hangcheck_work, + round_jiffies_up_relative(HZ)); + } else { + engine->hangcheck.watchdog = current_seqno; + /* Re-start the counter, if really hung, it will expire again */ + I915_WRITE_FW(RING_THRESH(engine->mmio_base), + GEN8_WATCHDOG_1000US(dev_priv)); + I915_WRITE_FW(RING_CNTR(engine->mmio_base), GEN8_WATCHDOG_ENABLE); + } + +fw_put: + intel_uncore_forcewake_put(dev_priv, fw_domains); +} + /* * Reserve space for 2 NOOPs at the end of each request to be * used as a workaround for not being allowed to do lite @@ -2539,6 +2602,21 @@ logical_ring_default_irqs(struct intel_engine_cs *engine) engine->irq_enable_mask = GT_RENDER_USER_INTERRUPT << shift; engine->irq_keep_mask = GT_CONTEXT_SWITCH_INTERRUPT << shift; + + switch (engine->class) { + default: + /* BCS engine does not support hw watchdog */ + break; + case RENDER_CLASS: + case VIDEO_DECODE_CLASS: + engine->irq_keep_mask |= GT_GEN8_WATCHDOG_INTERRUPT << shift; + break; + case VIDEO_ENHANCEMENT_CLASS: + if (INTEL_GEN(engine->i915) >= 9) + engine->irq_keep_mask |= + GT_GEN8_WATCHDOG_INTERRUPT << shift; + break; + } } static int @@ -2556,6 +2634,9 @@ logical_ring_setup(struct intel_engine_cs *engine) tasklet_init(&engine->execlists.tasklet, execlists_submission_tasklet, (unsigned long)engine); + tasklet_init(&engine->execlists.watchdog_tasklet, + gen8_watchdog_irq_handler, (unsigned long)engine); + logical_ring_default_vfuncs(engine); logical_ring_default_irqs(engine); diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h index 465094e38d32..9c0c1d68f3a1 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.h +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h @@ -122,6 +122,7 @@ struct intel_engine_hangcheck { u64 acthd; u32 last_seqno; u32 next_seqno; + u32 watchdog; unsigned long action_timestamp; struct intel_instdone instdone; }; @@ -222,6 +223,11 @@ struct intel_engine_execlists { */ struct tasklet_struct tasklet; + /** + * @watchdog_tasklet: stop counter and re-schedule hangcheck_work asap + */ + struct tasklet_struct watchdog_tasklet; + /** * @default_priolist: priority list for I915_PRIORITY_NORMAL */ From patchwork Thu Feb 14 02:57:10 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Santa, Carlos" X-Patchwork-Id: 10811641 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 41130922 for ; Thu, 14 Feb 2019 02:57:47 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 2DEB42DB83 for ; Thu, 14 Feb 2019 02:57:47 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 21FAE2DB9C; Thu, 14 Feb 2019 02:57:47 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 88E972DB83 for ; Thu, 14 Feb 2019 02:57:46 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id D3A966E849; Thu, 14 Feb 2019 02:57:44 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by gabe.freedesktop.org (Postfix) with ESMTPS id 0DA786E846 for ; Thu, 14 Feb 2019 02:57:44 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga105.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 13 Feb 2019 18:57:43 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.58,367,1544515200"; d="scan'208";a="124345905" Received: from miryad.jf.intel.com ([10.54.74.35]) by fmsmga008.fm.intel.com with ESMTP; 13 Feb 2019 18:57:43 -0800 From: Carlos Santa To: intel-gfx@lists.freedesktop.org Date: Wed, 13 Feb 2019 18:57:10 -0800 Message-Id: <20190214025713.34150-4-carlos.santa@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190214025713.34150-1-carlos.santa@intel.com> References: <20190214025713.34150-1-carlos.santa@intel.com> Subject: [Intel-gfx] [PATCH v3 3/6] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+ X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Michel Thierry MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP From: Michel Thierry Emit the required commands into the ring buffer for starting and stopping the watchdog timer before/after batch buffer start during batch buffer submission. v2: Support watchdog threshold per context engine, merge lri commands, and move watchdog commands emission to emit_bb_start. Request space of combined start_watchdog, bb_start and stop_watchdog to avoid any error after emitting bb_start. v3: There were too many req->engine in emit_bb_start. Use GEM_BUG_ON instead of returning a very late EINVAL in the remote case of watchdog misprogramming; set correct LRI cmd size in emit_stop_watchdog. (Chris) v4: Rebase. v5: use to_intel_context instead of ctx->engine. v6: Rebase. v7: Rebase, Store gpu watchdog capability in engine flag (Tvrtko) Store WATCHDOG_DISABLE magic # in engine (Tvrtko) No need to declare emit_{start|stop}_watchdog as vfuncs (Tvrtko) Replace flag watchdog_running with enable_watchdog (Tvrtko) Emit a single MI_NOOP by conditionally checking whether the # of emitted OPs is odd (Tvrtko) Cc: Chris Wilson Cc: Antonio Argenziano Cc: Tvrtko Ursulin Signed-off-by: Michel Thierry Signed-off-by: Carlos Santa --- drivers/gpu/drm/i915/i915_gem_context.h | 4 ++ drivers/gpu/drm/i915/intel_engine_cs.c | 2 + drivers/gpu/drm/i915/intel_lrc.c | 78 +++++++++++++++++++++++-- drivers/gpu/drm/i915/intel_lrc.h | 2 + drivers/gpu/drm/i915/intel_ringbuffer.h | 18 ++++-- 5 files changed, 96 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gem_context.h b/drivers/gpu/drm/i915/i915_gem_context.h index b1eeac64da8b..dcf4e98666a6 100644 --- a/drivers/gpu/drm/i915/i915_gem_context.h +++ b/drivers/gpu/drm/i915/i915_gem_context.h @@ -183,6 +183,10 @@ struct i915_gem_context { u32 *lrc_reg_state; u64 lrc_desc; int pin_count; + /** watchdog_threshold: hw watchdog threshold value, + * in clock counts + */ + u32 watchdog_threshold; /** * active_tracker: Active tracker for the external rq activity diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c index 74f563d23cc8..438bf93a4340 100644 --- a/drivers/gpu/drm/i915/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/intel_engine_cs.c @@ -324,6 +324,8 @@ intel_engine_setup(struct drm_i915_private *dev_priv, if (engine->context_size) DRIVER_CAPS(dev_priv)->has_logical_contexts = true; + engine->watchdog_disable_id = get_watchdog_disable(engine); + /* Nothing to do here, execute in order of dependencies */ engine->schedule = NULL; diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c index 41697fb468e3..abbac267c6f5 100644 --- a/drivers/gpu/drm/i915/intel_lrc.c +++ b/drivers/gpu/drm/i915/intel_lrc.c @@ -2193,16 +2193,74 @@ static void execlists_reset_finish(struct intel_engine_cs *engine) atomic_read(&execlists->tasklet.count)); } +static u32 *gen8_emit_start_watchdog(struct i915_request *rq, u32 *cs) +{ + struct intel_engine_cs *engine = rq->engine; + struct i915_gem_context *ctx = rq->gem_context; + struct intel_context *ce = to_intel_context(ctx, engine); + + GEM_BUG_ON(!intel_engine_supports_watchdog(engine)); + + /* + * watchdog register must never be programmed to zero. This would + * cause the watchdog counter to exceed and not allow the engine to + * go into IDLE state + */ + GEM_BUG_ON(ce->watchdog_threshold == 0); + + /* Set counter period */ + *cs++ = MI_LOAD_REGISTER_IMM(2); + *cs++ = i915_mmio_reg_offset(RING_THRESH(engine->mmio_base)); + *cs++ = ce->watchdog_threshold; + /* Start counter */ + *cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base)); + *cs++ = GEN8_WATCHDOG_ENABLE; + + return cs; +} + +static u32 *gen8_emit_stop_watchdog(struct i915_request *rq, u32 *cs) +{ + struct intel_engine_cs *engine = rq->engine; + + GEM_BUG_ON(!intel_engine_supports_watchdog(engine)); + + *cs++ = MI_LOAD_REGISTER_IMM(1); + *cs++ = i915_mmio_reg_offset(RING_CNTR(engine->mmio_base)); + *cs++ = engine->watchdog_disable_id; + + return cs; +} + static int gen8_emit_bb_start(struct i915_request *rq, u64 offset, u32 len, const unsigned int flags) { + struct intel_engine_cs *engine = rq->engine; u32 *cs; + u32 num_dwords; + bool enable_watchdog = false; - cs = intel_ring_begin(rq, 6); + /* bb_start only */ + num_dwords = 6; + + /* check if watchdog will be required */ + if (to_intel_context(rq->gem_context, engine)->watchdog_threshold != 0) { + + /* + start_watchdog (6) + stop_watchdog (4) */ + num_dwords += 10; + enable_watchdog = true; + } + + cs = intel_ring_begin(rq, num_dwords); if (IS_ERR(cs)) return PTR_ERR(cs); + if (enable_watchdog) { + /* Start watchdog timer */ + cs = gen8_emit_start_watchdog(rq, cs); + } + /* * WaDisableCtxRestoreArbitration:bdw,chv * @@ -2229,10 +2287,16 @@ static int gen8_emit_bb_start(struct i915_request *rq, *cs++ = upper_32_bits(offset); *cs++ = MI_ARB_ON_OFF | MI_ARB_DISABLE; - *cs++ = MI_NOOP; - intel_ring_advance(rq, cs); + if (enable_watchdog) { + /* Cancel watchdog timer */ + cs = gen8_emit_stop_watchdog(rq, cs); + } + + if (*cs%2 != 0) + *cs++ = MI_NOOP; + intel_ring_advance(rq, cs); return 0; } @@ -2353,7 +2417,7 @@ static int gen8_emit_flush_render(struct i915_request *request, } /* From GEN9 onwards, all engines use the same RING_CNTR format */ -static inline u32 get_watchdog_disable(struct intel_engine_cs *engine) +u32 get_watchdog_disable(struct intel_engine_cs *engine) { if (engine->id == RCS || INTEL_GEN(engine->i915) >= 9) return GEN8_WATCHDOG_DISABLE; @@ -2548,6 +2612,9 @@ void intel_execlists_set_default_submission(struct intel_engine_cs *engine) I915_SCHEDULER_CAP_PRIORITY; if (intel_engine_has_preemption(engine)) engine->i915->caps.scheduler |= I915_SCHEDULER_CAP_PREEMPTION; + + if(engine->id != BCS) + engine->flags |= I915_ENGINE_SUPPORTS_WATCHDOG; } static void @@ -2726,6 +2793,9 @@ int logical_xcs_ring_init(struct intel_engine_cs *engine) if (err) return err; + /* BCS engine does not have a watchdog-expired irq */ + GEM_BUG_ON(!intel_engine_supports_watchdog(engine)); + return logical_ring_init(engine); } diff --git a/drivers/gpu/drm/i915/intel_lrc.h b/drivers/gpu/drm/i915/intel_lrc.h index 5779e776cc3f..9db4f6369574 100644 --- a/drivers/gpu/drm/i915/intel_lrc.h +++ b/drivers/gpu/drm/i915/intel_lrc.h @@ -120,4 +120,6 @@ void intel_virtual_engine_put(struct intel_engine_cs *engine); u32 gen8_make_rpcs(struct drm_i915_private *i915, struct intel_sseu *ctx_sseu); +u32 get_watchdog_disable(struct intel_engine_cs *engine); + #endif /* _INTEL_LRC_H_ */ diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h index 9c0c1d68f3a1..d5db04069e91 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.h +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h @@ -359,6 +359,7 @@ struct intel_engine_cs { unsigned int hw_id; unsigned int guc_id; unsigned long mask; + u32 watchdog_disable_id; u8 uabi_class; @@ -462,6 +463,7 @@ struct intel_engine_cs { int (*init_context)(struct i915_request *rq); int (*emit_flush)(struct i915_request *request, u32 mode); + #define EMIT_INVALIDATE BIT(0) #define EMIT_FLUSH BIT(1) #define EMIT_BARRIER (EMIT_INVALIDATE | EMIT_FLUSH) @@ -519,10 +521,12 @@ struct intel_engine_cs { struct intel_engine_hangcheck hangcheck; -#define I915_ENGINE_NEEDS_CMD_PARSER BIT(0) -#define I915_ENGINE_SUPPORTS_STATS BIT(1) -#define I915_ENGINE_HAS_PREEMPTION BIT(2) -#define I915_ENGINE_IS_VIRTUAL BIT(3) +#define I915_ENGINE_NEEDS_CMD_PARSER BIT(0) +#define I915_ENGINE_SUPPORTS_STATS BIT(1) +#define I915_ENGINE_HAS_PREEMPTION BIT(2) +#define I915_ENGINE_IS_VIRTUAL BIT(3) +#define I915_ENGINE_SUPPORTS_WATCHDOG BIT(4) + unsigned int flags; /* @@ -611,6 +615,12 @@ intel_engine_is_virtual(const struct intel_engine_cs *engine) return engine->flags & I915_ENGINE_IS_VIRTUAL; } +static inline bool +intel_engine_supports_watchdog(const struct intel_engine_cs *engine) +{ + return engine->flags & I915_ENGINE_SUPPORTS_WATCHDOG; +} + static inline void execlists_set_active(struct intel_engine_execlists *execlists, unsigned int bit) From patchwork Thu Feb 14 02:57:11 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Santa, Carlos" X-Patchwork-Id: 10811647 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D4CAB746 for ; Thu, 14 Feb 2019 02:57:50 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BDF392DB83 for ; Thu, 14 Feb 2019 02:57:50 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id B248C2DB9C; Thu, 14 Feb 2019 02:57:50 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 26CF32DB83 for ; Thu, 14 Feb 2019 02:57:50 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id EDBB46E84E; Thu, 14 Feb 2019 02:57:48 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by gabe.freedesktop.org (Postfix) with ESMTPS id 2A4EB6E84E for ; Thu, 14 Feb 2019 02:57:45 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga105.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 13 Feb 2019 18:57:44 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.58,367,1544515200"; d="scan'208";a="124345908" Received: from miryad.jf.intel.com ([10.54.74.35]) by fmsmga008.fm.intel.com with ESMTP; 13 Feb 2019 18:57:44 -0800 From: Carlos Santa To: intel-gfx@lists.freedesktop.org Date: Wed, 13 Feb 2019 18:57:11 -0800 Message-Id: <20190214025713.34150-5-carlos.santa@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190214025713.34150-1-carlos.santa@intel.com> References: <20190214025713.34150-1-carlos.santa@intel.com> Subject: [Intel-gfx] [PATCH v3 4/6] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Michel Thierry MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP From: Michel Thierry Final enablement patch for GPU hang detection using watchdog timeout. Using the gem_context_setparam ioctl, users can specify the desired timeout value in microseconds, and the driver will do the conversion to 'timestamps'. The recommended default watchdog threshold for video engines is 60000 us, since this has been _empirically determined_ to be a good compromise for low-latency requirements and low rate of false positives. The default register value is ~106000us and the theoretical max value (all 1s) is 353 seconds. [1] http://patchwork.freedesktop.org/patch/msgid/20170329135831.30254-2-chris@chris-wilson.co.uk v2: Fixed get api to return values in microseconds. Threshold updated to be per context engine. Check for u32 overflow. Capture ctx threshold value in error state. v3: Add a way to get array size, short-cut to disable all thresholds, return EFAULT / EINVAL as needed. Move the capture of the threshold value in the error state into a new patch. BXT has a different timestamp base (because why not?). v4: Checking if watchdog is available should be the first thing to do, instead of giving false hopes to abi users; remove unnecessary & in set_watchdog; ignore args->size in getparam. v5: GEN9-LP platforms have a different crystal clock frequency, use the right timestamp base for them (magic 8-ball predicts this will change again later on, so future-proof it). (Daniele) v6: Rebase, no more mutex BLK in getparam_ioctl. v7: use to_intel_context instead of ctx->engine. v8: Rebase, remove extra mutex from i915_gem_context_set_watchdog (Tvrtko), Update UAPI to use engine class while keeping thresholds per engine class (Michel). v9: Rebase, Remove outdated comment from the commit message (Tvrtko) Use the engine->flag to verify for gpu watchdog support (Tvrtko) Use the standard copy_to_user() instead (Tvrtko) Use the correct type when declaring engine class iterator (Tvrtko) Remove yet another unncessary mutex_lock (Tvrtko) Cc: Antonio Argenziano Cc: Tvrtko Ursulin Cc: Daniele Ceraolo Spurio Signed-off-by: Michel Thierry Signed-off-by: Carlos Santa --- drivers/gpu/drm/i915/i915_drv.h | 50 +++++++++++++- drivers/gpu/drm/i915/i915_gem_context.c | 91 +++++++++++++++++++++++++ include/uapi/drm/i915_drm.h | 1 + 3 files changed, 141 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 0fcb2df869a2..aaa5810ba76c 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -1582,6 +1582,9 @@ struct drm_i915_private { struct drm_i915_fence_reg fence_regs[I915_MAX_NUM_FENCES]; /* assume 965 */ int num_fence_regs; /* 8 on pre-965, 16 otherwise */ + /* Command stream timestamp base - helps define watchdog threshold */ + u32 cs_timestamp_base; + unsigned int fsb_freq, mem_freq, is_ddr3; unsigned int skl_preferred_vco_freq; unsigned int max_cdclk_freq; @@ -3120,10 +3123,55 @@ i915_gem_context_lookup(struct drm_i915_file_private *file_priv, u32 id) return ctx; } +/* + * BDW, CHV & SKL+ Timestamp timer resolution = 0.080 uSec, + * or 12500000 counts per second, or ~12 counts per microsecond. + * + * But BXT/GLK Timestamp timer resolution is different, 0.052 uSec, + * or 19200000 counts per second, or ~19 counts per microsecond. + * + * Future-proofing, some day it won't be as simple as just GEN & IS_LP. + */ +#define GEN8_TIMESTAMP_CNTS_PER_USEC 12 +#define GEN9_LP_TIMESTAMP_CNTS_PER_USEC 19 +static inline u32 cs_timestamp_in_us(struct drm_i915_private *dev_priv) +{ + u32 cs_timestamp_base = dev_priv->cs_timestamp_base; + + if (cs_timestamp_base) + return cs_timestamp_base; + + switch (INTEL_GEN(dev_priv)) { + default: + MISSING_CASE(INTEL_GEN(dev_priv)); + /* fall through */ + case 9: + cs_timestamp_base = IS_GEN9_LP(dev_priv) ? + GEN9_LP_TIMESTAMP_CNTS_PER_USEC : + GEN8_TIMESTAMP_CNTS_PER_USEC; + break; + case 8: + cs_timestamp_base = GEN8_TIMESTAMP_CNTS_PER_USEC; + break; + } + + dev_priv->cs_timestamp_base = cs_timestamp_base; + return cs_timestamp_base; +} + +static inline u32 +watchdog_to_us(struct drm_i915_private *dev_priv, u32 value_in_clock_counts) +{ + return value_in_clock_counts / cs_timestamp_in_us(dev_priv); +} + static inline u32 watchdog_to_clock_counts(struct drm_i915_private *dev_priv, u64 value_in_us) { - u64 threshold = 0; + u64 threshold = value_in_us * cs_timestamp_in_us(dev_priv); + + if (overflows_type(threshold, u32)) + return -EINVAL; return threshold; } diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c index cbfe8f2eb3f2..e1abca28140b 100644 --- a/drivers/gpu/drm/i915/i915_gem_context.c +++ b/drivers/gpu/drm/i915/i915_gem_context.c @@ -1573,6 +1573,89 @@ get_engines(struct i915_gem_context *ctx, return err; } +/* Return the timer count threshold in microseconds. */ +int i915_gem_context_get_watchdog(struct i915_gem_context *ctx, + struct drm_i915_gem_context_param *args) +{ + struct drm_i915_private *dev_priv = ctx->i915; + struct intel_engine_cs *engine; + enum intel_engine_id id; + u32 threshold_in_us[OTHER_CLASS]; + + if(!intel_engine_supports_watchdog(dev_priv->engine[VCS])) + return -ENODEV; + + for_each_engine(engine, dev_priv, id) { + struct intel_context *ce = to_intel_context(ctx, engine); + + threshold_in_us[engine->class] = watchdog_to_us(dev_priv, + ce->watchdog_threshold); + } + + if (copy_to_user(u64_to_user_ptr(args->value), + &threshold_in_us, + sizeof(threshold_in_us))) { + return -EFAULT; + } + + args->size = sizeof(threshold_in_us); + + return 0; +} + +/* + * Based on time out value in microseconds (us) calculate + * timer count thresholds needed based on core frequency. + * Watchdog can be disabled by setting it to 0. + */ +int i915_gem_context_set_watchdog(struct i915_gem_context *ctx, + struct drm_i915_gem_context_param *args) +{ + struct drm_i915_private *dev_priv = ctx->i915; + struct intel_engine_cs *engine; + enum intel_engine_id id; + int i; + u32 threshold[OTHER_CLASS]; + + if(!intel_engine_supports_watchdog(dev_priv->engine[VCS])) + return -ENODEV; + + memset(threshold, 0, sizeof(threshold)); + + /* shortcut to disable in all engines */ + if (args->size == 0) + goto set_watchdog; + + if (args->size < sizeof(threshold)) + return -EFAULT; + + if (copy_from_user(threshold, + u64_to_user_ptr(args->value), + sizeof(threshold))) { + return -EFAULT; + } + + /* not supported in blitter engine */ + if (threshold[COPY_ENGINE_CLASS] != 0) + return -EINVAL; + + for (i = RENDER_CLASS; i < OTHER_CLASS; i++) { + threshold[i] = watchdog_to_clock_counts(dev_priv, threshold[i]); + + if (threshold[i] == -EINVAL) + return -EINVAL; + } + +set_watchdog: + for_each_engine(engine, dev_priv, id) { + struct intel_context *ce = to_intel_context(ctx, engine); + + ce->watchdog_threshold = threshold[engine->class]; + } + + return 0; +} + static int ctx_setparam(struct i915_gem_context *ctx, struct drm_i915_gem_context_param *args) { @@ -1640,6 +1723,10 @@ static int ctx_setparam(struct i915_gem_context *ctx, ret = set_engines(ctx, args); break; + case I915_CONTEXT_PARAM_WATCHDOG: + ret = i915_gem_context_set_watchdog(ctx, args); + break; + case I915_CONTEXT_PARAM_BAN_PERIOD: default: ret = -EINVAL; @@ -1843,6 +1930,10 @@ int i915_gem_context_getparam_ioctl(struct drm_device *dev, void *data, args->value = ctx->sched.priority >> I915_USER_PRIORITY_SHIFT; break; + case I915_CONTEXT_PARAM_WATCHDOG: + ret = i915_gem_context_get_watchdog(ctx, args); + break; + case I915_CONTEXT_PARAM_SSEU: ret = get_sseu(ctx, args); break; diff --git a/include/uapi/drm/i915_drm.h b/include/uapi/drm/i915_drm.h index 3f2c89740b0e..7dabdb3e0fad 100644 --- a/include/uapi/drm/i915_drm.h +++ b/include/uapi/drm/i915_drm.h @@ -1492,6 +1492,7 @@ struct drm_i915_gem_context_param { * See struct i915_context_param_engines. */ #define I915_CONTEXT_PARAM_ENGINES 0x9 +#define I915_CONTEXT_PARAM_WATCHDOG 0x10 __u64 value; }; From patchwork Thu Feb 14 02:57:12 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Santa, Carlos" X-Patchwork-Id: 10811645 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A543E746 for ; Thu, 14 Feb 2019 02:57:49 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 917FB2DB83 for ; Thu, 14 Feb 2019 02:57:49 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 85D2F2DB9C; Thu, 14 Feb 2019 02:57:49 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 36A0D2DB83 for ; Thu, 14 Feb 2019 02:57:49 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 8AE5C6E851; Thu, 14 Feb 2019 02:57:48 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by gabe.freedesktop.org (Postfix) with ESMTPS id 0A5696E855 for ; Thu, 14 Feb 2019 02:57:46 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga105.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 13 Feb 2019 18:57:45 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.58,367,1544515200"; d="scan'208";a="124345916" Received: from miryad.jf.intel.com ([10.54.74.35]) by fmsmga008.fm.intel.com with ESMTP; 13 Feb 2019 18:57:45 -0800 From: Carlos Santa To: intel-gfx@lists.freedesktop.org Date: Wed, 13 Feb 2019 18:57:12 -0800 Message-Id: <20190214025713.34150-6-carlos.santa@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190214025713.34150-1-carlos.santa@intel.com> References: <20190214025713.34150-1-carlos.santa@intel.com> Subject: [Intel-gfx] [PATCH v3 5/6] drm/i915: Watchdog timeout: Include threshold value in error state X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Michel Thierry MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP From: Michel Thierry Save the watchdog threshold (in us) as part of the engine state. v2: Only do it for gen8+ (and prevent a missing-case warn). v3: use ctx->__engine. v4: Rebase. v5: Rebase. Cc: Antonio Argenziano Cc: Tvrtko Ursulin Signed-off-by: Michel Thierry Signed-off-by: Carlos Santa --- drivers/gpu/drm/i915/i915_gpu_error.c | 12 ++++++++---- drivers/gpu/drm/i915/i915_gpu_error.h | 1 + 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 8792ad12373d..a2dddaaeb215 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -460,10 +460,12 @@ static void error_print_context(struct drm_i915_error_state_buf *m, const char *header, const struct drm_i915_error_context *ctx) { - err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, ban score %d%s guilty %d active %d\n", + err_printf(m, "%s%s[%d] user_handle %d hw_id %d, prio %d, ban score %d%s guilty %d active %d, watchdog %dus\n", header, ctx->comm, ctx->pid, ctx->handle, ctx->hw_id, ctx->sched_attr.priority, ctx->ban_score, bannable(ctx), - ctx->guilty, ctx->active); + ctx->guilty, ctx->active, + INTEL_GEN(m->i915) >= 8 ? + watchdog_to_us(m->i915, ctx->watchdog_threshold) : 0); } static void error_print_engine(struct drm_i915_error_state_buf *m, @@ -1348,7 +1350,8 @@ static void error_record_engine_execlists(struct intel_engine_cs *engine, } static void record_context(struct drm_i915_error_context *e, - struct i915_gem_context *ctx) + struct i915_gem_context *ctx, + u32 engine_id) { if (ctx->pid) { struct task_struct *task; @@ -1369,6 +1372,7 @@ static void record_context(struct drm_i915_error_context *e, e->bannable = i915_gem_context_is_bannable(ctx); e->guilty = atomic_read(&ctx->guilty_count); e->active = atomic_read(&ctx->active_count); + e->watchdog_threshold = ctx->__engine[engine_id].watchdog_threshold; } static void request_record_user_bo(struct i915_request *request, @@ -1452,7 +1456,7 @@ static void gem_record_rings(struct i915_gpu_state *error) ee->vm = ctx->ppgtt ? &ctx->ppgtt->vm : &ggtt->vm; - record_context(&ee->context, ctx); + record_context(&ee->context, ctx, engine->id); /* We need to copy these to an anonymous buffer * as the simplest method to avoid being overwritten diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h index bd1821c73ecd..454707848248 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.h +++ b/drivers/gpu/drm/i915/i915_gpu_error.h @@ -122,6 +122,7 @@ struct i915_gpu_state { int ban_score; int active; int guilty; + int watchdog_threshold; bool bannable; struct i915_sched_attr sched_attr; } context; From patchwork Thu Feb 14 02:57:13 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Santa, Carlos" X-Patchwork-Id: 10811649 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id DE1B2922 for ; Thu, 14 Feb 2019 02:57:53 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CBDBF2DB83 for ; Thu, 14 Feb 2019 02:57:53 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id C02932DB9C; Thu, 14 Feb 2019 02:57:53 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id 7A5BC2DB83 for ; Thu, 14 Feb 2019 02:57:53 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id ECE7A6E853; Thu, 14 Feb 2019 02:57:52 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by gabe.freedesktop.org (Postfix) with ESMTPS id EDC716E84E for ; Thu, 14 Feb 2019 02:57:46 +0000 (UTC) X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga008.fm.intel.com ([10.253.24.58]) by orsmga105.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 13 Feb 2019 18:57:46 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.58,367,1544515200"; d="scan'208";a="124345921" Received: from miryad.jf.intel.com ([10.54.74.35]) by fmsmga008.fm.intel.com with ESMTP; 13 Feb 2019 18:57:46 -0800 From: Carlos Santa To: intel-gfx@lists.freedesktop.org Date: Wed, 13 Feb 2019 18:57:13 -0800 Message-Id: <20190214025713.34150-7-carlos.santa@intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20190214025713.34150-1-carlos.santa@intel.com> References: <20190214025713.34150-1-carlos.santa@intel.com> Subject: [Intel-gfx] [PATCH v3 6/6] drm/i915: Watchdog timeout: Blindly trust watchdog timeout for reset? X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Michel Thierry MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP From: Michel Thierry XXX: What to do when the watchdog irq fired twice but our hangcheck logic thinks the engine is not hung? For example, what if the active-head moved since the irq handler? One option is to just ignore the watchdog, if the engine is really hung, then the driver will detect the hang by itself later on (I'm inclined to this). But the other option is to blindly trust the HW, which is what this patch does... v1: Rebase. CC: Antonio Argenziano Cc: Tvrtko Ursulin Signed-off-by: Michel Thierry Signed-off-by: Carlos Santa --- drivers/gpu/drm/i915/intel_hangcheck.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c index bc10acb24d9a..223b79001854 100644 --- a/drivers/gpu/drm/i915/intel_hangcheck.c +++ b/drivers/gpu/drm/i915/intel_hangcheck.c @@ -288,7 +288,8 @@ static void i915_hangcheck_elapsed(struct work_struct *work) hangcheck_accumulate_sample(engine, &hc); hangcheck_store_sample(engine, &hc); - if (hc.stalled) { + if (hc.stalled || + engine->hangcheck.watchdog == intel_engine_get_hangcheck_seqno(engine)) { hung |= engine->mask; if (hc.action != ENGINE_DEAD) stuck |= engine->mask;