From patchwork Sun Sep 13 14:51:00 2009 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ben Gamari X-Patchwork-Id: 47148 Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) by demeter.kernel.org (8.14.2/8.14.2) with ESMTP id n8DEvRwY028762 for ; Sun, 13 Sep 2009 14:57:27 GMT Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id E47DA9E83C; Sun, 13 Sep 2009 07:57:26 -0700 (PDT) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mail-qy0-f200.google.com (mail-qy0-f200.google.com [209.85.221.200]) by gabe.freedesktop.org (Postfix) with ESMTP id 9C2649E765 for ; Sun, 13 Sep 2009 07:57:24 -0700 (PDT) Received: by qyk38 with SMTP id 38so1883792qyk.27 for ; Sun, 13 Sep 2009 07:57:23 -0700 (PDT) Received: by 10.224.30.209 with SMTP id v17mr4417745qac.188.1252853489785; Sun, 13 Sep 2009 07:51:29 -0700 (PDT) Received: from localhost.localdomain (c-24-63-135-98.hsd1.ma.comcast.net [24.63.135.98]) by mx.google.com with ESMTPS id 2sm4219036qwi.36.2009.09.13.07.51.27 (version=SSLv3 cipher=RC4-MD5); Sun, 13 Sep 2009 07:51:28 -0700 (PDT) From: Ben Gamari To: Jesse Barnes , Chris Wilson , Eric Anholt Date: Sun, 13 Sep 2009 10:51:00 -0400 Message-Id: <1252853462-9236-5-git-send-email-bgamari.foss@gmail.com> X-Mailer: git-send-email 1.6.3.3 In-Reply-To: <1252853462-9236-4-git-send-email-bgamari.foss@gmail.com> References: <1252853462-9236-1-git-send-email-bgamari.foss@gmail.com> <1252853462-9236-2-git-send-email-bgamari.foss@gmail.com> <1252853462-9236-3-git-send-email-bgamari.foss@gmail.com> <1252853462-9236-4-git-send-email-bgamari.foss@gmail.com> Cc: intel-gfx@lists.freedesktop.org Subject: [Intel-gfx] [PATCH 4/6] drm/i915: Add hangcheck timer X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.9 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Sender: intel-gfx-bounces@lists.freedesktop.org Errors-To: intel-gfx-bounces@lists.freedesktop.org We set a periodic timer to check on the GPU, resetting it every time a batch is completed. If the timer elapses, we check acthd. If acthd hasn't changed in two timer periods, we assume the chip is wedged. This is implemented in such a way that it leaves the option open to employ adaptive timer intervals in the future. One could wait until several timer periods have elapsed before declaring the chip dead. If the chip comes back after several periods but before the "dead" threshold, the timer interval or dead threshold could be raised. It is important to note that while checking for active requests, we need to account for the fact that requests are removed from the list (i.e. retired) in a deferred work queue handler. This means that merely checking for an empty request_list is insufficient; the list could be non-empty yet the GPU still idle, causing the hangcheck timer to incorrectly mark the GPU as wedged (it took me a while to figure that out---sigh...) Signed-off-by: Ben Gamari --- drivers/gpu/drm/i915/i915_dma.c | 3 ++ drivers/gpu/drm/i915/i915_drv.h | 7 ++++ drivers/gpu/drm/i915/i915_gem.c | 8 ++++- drivers/gpu/drm/i915/i915_irq.c | 59 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 75 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c index 5a6b731..aabe41b 100644 --- a/drivers/gpu/drm/i915/i915_dma.c +++ b/drivers/gpu/drm/i915/i915_dma.c @@ -1447,6 +1447,8 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags) if (!IS_IGDNG(dev)) intel_opregion_init(dev, 0); + setup_timer(&dev_priv->hangcheck_timer, i915_hangcheck_elapsed, + (unsigned long) dev); return 0; out_workqueue_free: @@ -1467,6 +1469,7 @@ int i915_driver_unload(struct drm_device *dev) struct drm_i915_private *dev_priv = dev->dev_private; destroy_workqueue(dev_priv->wq); + del_timer_sync(&dev_priv->hangcheck_timer); io_mapping_free(dev_priv->mm.gtt_mapping); if (dev_priv->mm.gtt_mtrr >= 0) { diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 933d832..afbcaa9 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -203,6 +203,12 @@ typedef struct drm_i915_private { unsigned int sr01, adpa, ppcr, dvob, dvoc, lvds; int vblank_pipe; + /* For hangcheck timer */ +#define DRM_I915_HANGCHECK_PERIOD 75 /* in jiffies */ + struct timer_list hangcheck_timer; + int hangcheck_count; + uint32_t last_acthd; + bool cursor_needs_physical; struct drm_mm vram; @@ -620,6 +626,7 @@ extern int i915_emit_box(struct drm_device *dev, int i, int DR1, int DR4); /* i915_irq.c */ +void i915_hangcheck_elapsed(unsigned long data); extern int i915_irq_emit(struct drm_device *dev, void *data, struct drm_file *file_priv); extern int i915_irq_wait(struct drm_device *dev, void *data, diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index c70e91b..579b3b0 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -1584,8 +1584,11 @@ i915_add_request(struct drm_device *dev, struct drm_file *file_priv, } - if (was_empty && !dev_priv->mm.suspended) - queue_delayed_work(dev_priv->wq, &dev_priv->mm.retire_work, HZ); + if (!dev_priv->mm.suspended) { + mod_timer(&dev_priv->hangcheck_timer, jiffies + DRM_I915_HANGCHECK_PERIOD); + if (was_empty) + queue_delayed_work(dev_priv->wq, &dev_priv->mm.retire_work, HZ); + } return seqno; } @@ -3896,6 +3899,7 @@ i915_gem_idle(struct drm_device *dev) * We need to replace this with a semaphore, or something. */ dev_priv->mm.suspended = 1; + del_timer(&dev_priv->hangcheck_timer); /* Cancel the retire work handler, wait for it to finish if running */ diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 6c89f2f..126696a 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -601,6 +601,8 @@ irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS) if (iir & I915_USER_INTERRUPT) { dev_priv->mm.irq_gem_seqno = i915_get_gem_seqno(dev); DRM_WAKEUP(&dev_priv->irq_queue); + dev_priv->hangcheck_count = 0; + mod_timer(&dev_priv->hangcheck_timer, jiffies + DRM_I915_HANGCHECK_PERIOD); } if (pipea_stats & vblank_status) { @@ -880,6 +882,63 @@ int i915_vblank_swap(struct drm_device *dev, void *data, return -EINVAL; } +/** + * Returns whether all requests have been completed by the chip + */ +bool i915_all_requests_complete(struct drm_device *dev) +{ + struct drm_i915_gem_request *req; + drm_i915_private_t *dev_priv = dev->dev_private; + uint32_t seqno; + + seqno = i915_get_gem_seqno(dev); + list_for_each_entry(req, &dev_priv->mm.request_list, list) { + if (!i915_seqno_passed(seqno, req->seqno)) + return false; + } + return true; +} + +/** + * This is called when the chip hasn't reported back with completed + * batchbuffers in a long time. The first time this is called we simply record + * ACTHD. If ACTHD hasn't changed by the time the hangcheck timer elapses + * again, we assume the chip is wedged and try to fix it. + */ +void i915_hangcheck_elapsed(unsigned long data) +{ + struct drm_device *dev = (struct drm_device *)data; + drm_i915_private_t *dev_priv = dev->dev_private; + uint32_t acthd; + + if (!IS_I965G(dev)) + acthd = I915_READ(ACTHD); + else + acthd = I915_READ(ACTHD_I965); + + if (i915_all_requests_complete(dev)) { + dev_priv->hangcheck_count = 0; + return; + } + + if (dev_priv->last_acthd == acthd && dev_priv->hangcheck_count > 0) { + DRM_ERROR("Hangcheck timer elapsed... GPU hung\n"); + dev_priv->mm.wedged = true; /* Hopefully this is atomic */ + i915_handle_error(dev); + return; + } + + // Reset timer case chip hangs without another request being added + mod_timer(&dev_priv->hangcheck_timer, jiffies + DRM_I915_HANGCHECK_PERIOD); + + if (acthd != dev_priv->last_acthd) + dev_priv->hangcheck_count = 0; + else + dev_priv->hangcheck_count++; + + dev_priv->last_acthd = acthd; +} + /* drm_dma.h hooks */ static void igdng_irq_preinstall(struct drm_device *dev)