From patchwork Sat Jun 2 10:48:53 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Wilson X-Patchwork-Id: 10444745 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id CE2B960375 for ; Sat, 2 Jun 2018 10:49:11 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id BBFDA28985 for ; Sat, 2 Jun 2018 10:49:11 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id AD862289F0; Sat, 2 Jun 2018 10:49:11 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.2 required=2.0 tests=BAYES_00, MAILING_LIST_MULTI, RCVD_IN_DNSWL_MED autolearn=ham version=3.3.1 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) (using TLSv1.2 with cipher DHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.wl.linuxfoundation.org (Postfix) with ESMTPS id DAF3C28985 for ; Sat, 2 Jun 2018 10:49:10 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 7A3046E006; Sat, 2 Jun 2018 10:49:09 +0000 (UTC) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from fireflyinternet.com (mail.fireflyinternet.com [109.228.58.192]) by gabe.freedesktop.org (Postfix) with ESMTPS id 4E4D16E006 for ; Sat, 2 Jun 2018 10:49:07 +0000 (UTC) X-Default-Received-SPF: pass (skip=forwardok (res=PASS)) x-ip-name=78.156.65.138; Received: from haswell.alporthouse.com (unverified [78.156.65.138]) by fireflyinternet.com (Firefly Internet (M1)) with ESMTP id 11920775-1500050 for multiple; Sat, 02 Jun 2018 11:48:55 +0100 Received: by haswell.alporthouse.com (sSMTP sendmail emulation); Sat, 02 Jun 2018 11:48:54 +0100 From: Chris Wilson To: intel-gfx@lists.freedesktop.org Date: Sat, 2 Jun 2018 11:48:53 +0100 Message-Id: <20180602104853.17140-1-chris@chris-wilson.co.uk> X-Mailer: git-send-email 2.17.1 X-Originating-IP: 78.156.65.138 X-Country: code=GB country="United Kingdom" ip=78.156.65.138 Subject: [Intel-gfx] [PATCH] drm/i915: Declare the driver wedged if hangcheck makes no progress X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Mika Kuoppala MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Virus-Scanned: ClamAV using ClamSMTP Hangcheck is our back up in case the GPU or the driver gets stuck. It detects when the GPU is not making any progress and issues a GPU reset. However, if the driver is failing to make any progress, we can get ourselves into a situation where we continually try resetting the GPU to no avail. Employ a second timeout such that if we continue to see the same seqno (the stalled engine has made no progress at all) over the course of several hangchecks, declare the driver wedged and attempt to start afresh. Signed-off-by: Chris Wilson Cc: Mika Kuoppala Reviewed-by: Mika Kuoppala --- drivers/gpu/drm/i915/i915_debugfs.c | 5 +++-- drivers/gpu/drm/i915/i915_drv.h | 2 ++ drivers/gpu/drm/i915/intel_hangcheck.c | 17 ++++++++++++++++- drivers/gpu/drm/i915/intel_ringbuffer.h | 3 ++- 4 files changed, 23 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c index 82d06a03b22f..7e68684be1e4 100644 --- a/drivers/gpu/drm/i915/i915_debugfs.c +++ b/drivers/gpu/drm/i915/i915_debugfs.c @@ -1362,11 +1362,12 @@ static int i915_hangcheck_info(struct seq_file *m, void *unused) seq_printf(m, "\tseqno = %x [current %x, last %x]\n", engine->hangcheck.seqno, seqno[id], intel_engine_last_submit(engine)); - seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s\n", + seq_printf(m, "\twaiters? %s, fake irq active? %s, stalled? %s, wedged? %s\n", yesno(intel_engine_has_waiter(engine)), yesno(test_bit(engine->id, &dev_priv->gpu_error.missed_irq_rings)), - yesno(engine->hangcheck.stalled)); + yesno(engine->hangcheck.stalled), + yesno(engine->hangcheck.wedged)); spin_lock_irq(&b->rb_lock); for (rb = rb_first(&b->waiters); rb; rb = rb_next(rb)) { diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 6649962e991a..d254a29a59a0 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -909,6 +909,8 @@ struct i915_gem_mm { #define I915_ENGINE_DEAD_TIMEOUT (4 * HZ) /* Seqno, head and subunits dead */ #define I915_SEQNO_DEAD_TIMEOUT (12 * HZ) /* Seqno dead with active head */ +#define I915_ENGINE_WEDGED_TIMEOUT (60 * HZ) /* Reset but no recovery? */ + enum modeset_restore { MODESET_ON_LID_OPEN, MODESET_DONE, diff --git a/drivers/gpu/drm/i915/intel_hangcheck.c b/drivers/gpu/drm/i915/intel_hangcheck.c index d47e346bd49e..2fc7a0dd0df9 100644 --- a/drivers/gpu/drm/i915/intel_hangcheck.c +++ b/drivers/gpu/drm/i915/intel_hangcheck.c @@ -294,6 +294,7 @@ static void hangcheck_store_sample(struct intel_engine_cs *engine, engine->hangcheck.seqno = hc->seqno; engine->hangcheck.action = hc->action; engine->hangcheck.stalled = hc->stalled; + engine->hangcheck.wedged = hc->wedged; } static enum intel_engine_hangcheck_action @@ -368,6 +369,9 @@ static void hangcheck_accumulate_sample(struct intel_engine_cs *engine, hc->stalled = time_after(jiffies, engine->hangcheck.action_timestamp + timeout); + hc->wedged = time_after(jiffies, + engine->hangcheck.action_timestamp + + I915_ENGINE_WEDGED_TIMEOUT); } static void hangcheck_declare_hang(struct drm_i915_private *i915, @@ -409,7 +413,7 @@ static void i915_hangcheck_elapsed(struct work_struct *work) gpu_error.hangcheck_work.work); struct intel_engine_cs *engine; enum intel_engine_id id; - unsigned int hung = 0, stuck = 0; + unsigned int hung = 0, stuck = 0, wedged = 0; if (!i915_modparams.enable_hangcheck) return; @@ -440,6 +444,17 @@ static void i915_hangcheck_elapsed(struct work_struct *work) if (hc.action != ENGINE_DEAD) stuck |= intel_engine_flag(engine); } + + if (engine->hangcheck.wedged) + wedged |= intel_engine_flag(engine); + } + + if (wedged) { + dev_err(dev_priv->drm.dev, + "GPU recovery timed out," + " cancelling all in-flight rendering.\n"); + GEM_TRACE_DUMP(); + i915_gem_set_wedged(dev_priv); } if (hung) diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.h b/drivers/gpu/drm/i915/intel_ringbuffer.h index bed66500ca80..2c1b28a33df8 100644 --- a/drivers/gpu/drm/i915/intel_ringbuffer.h +++ b/drivers/gpu/drm/i915/intel_ringbuffer.h @@ -120,7 +120,8 @@ struct intel_engine_hangcheck { int deadlock; struct intel_instdone instdone; struct i915_request *active_request; - bool stalled; + bool stalled:1; + bool wedged:1; }; struct intel_ring {