From patchwork Tue Nov 13 16:40:39 2012
X-Patchwork-Submitter: Daniel Vetter
X-Patchwork-Id: 1735311
From: Daniel Vetter
To: Intel Graphics Development
Cc: Daniel Vetter
Date: Tue, 13 Nov 2012 17:40:39 +0100
Message-Id: <1352824839-18911-2-git-send-email-daniel.vetter@ffwll.ch>
In-Reply-To: <1352824839-18911-1-git-send-email-daniel.vetter@ffwll.ch>
References: <1352824839-18911-1-git-send-email-daniel.vetter@ffwll.ch>
List-Id: Intel graphics driver community testing & development
Subject: [Intel-gfx] [PATCH 2/2] drm/i915: clear up wedged transitions

We have two important transitions of the wedged state in the current
code:

- 0 -> 1: This means a hang has been detected, and signals to everyone
  that they please get off any locks, so that the reset work item can do
  its job.

- 1 -> 0: The reset handler has completed.

Now the last transition mixes up two states: "Reset completed and
successful" and "Reset failed". To distinguish these two we do some
tricks with the reset completion, but I simply could not convince
myself that this doesn't race under odd circumstances.

Hence split this up, and add a new terminal state indicating that the
hw is gone for good.

Also add explicit #defines for both states, update comments.

v2: Split out the reset handling bugfix for the throttle ioctl.

Signed-off-by: Daniel Vetter
---
 drivers/gpu/drm/i915/i915_drv.h | 17 ++++++++++++++++-
 drivers/gpu/drm/i915/i915_gem.c | 37 +++++++++++--------------------------
 drivers/gpu/drm/i915/i915_irq.c | 23 +++++++++++++++--------
 3 files changed, 42 insertions(+), 35 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 6958bb0..423541b 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -723,11 +723,26 @@ struct i915_gpu_error {
 	/* Protected by the above dev->gpu_error.lock. */
 	struct drm_i915_error_state *first_error;
 	struct work_struct work;
-	struct completion completion;
 
 	unsigned long last_reset;
 
+	/**
+	 * State variable controlling the reset flow
+	 *
+	 * 1 means a reset is in progress. This state will (presuming we don't
+	 * have any bugs) decay into either 0 (successful reset) or 2 (hw
+	 * terminally sour). All waiters on the reset_queue will be woken when
+	 * that happens.
+	 */
 	atomic_t wedged;
+#define I915_RESET_IN_PROGRESS	1
+#define I915_WEDGED		2
+
+	/**
+	 * Waitqueue to signal when the reset has completed. Used by clients
+	 * that wait for dev_priv->gpu_error.wedged to settle.
+	 */
+	wait_queue_head_t reset_queue;
 
 	/* For gpu hang simulation. */
 	unsigned int stop_rings;
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 55cdad9..3a1f72c 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -89,36 +89,28 @@ static void i915_gem_info_remove_obj(struct drm_i915_private *dev_priv,
 static int
 i915_gem_wait_for_error(struct i915_gpu_error *error)
 {
-	struct completion *x = &error->completion;
-	unsigned long flags;
 	int ret;
 
 	if (!atomic_read(&error->wedged))
 		return 0;
 
+#define EXIT_COND (atomic_read(&error->wedged) != I915_RESET_IN_PROGRESS)
 	/*
 	 * Only wait 10 seconds for the gpu reset to complete to avoid hanging
 	 * userspace. If it takes that long something really bad is going on and
 	 * we should simply try to bail out and fail as gracefully as possible.
 	 */
-	ret = wait_for_completion_interruptible_timeout(x, 10*HZ);
+	ret = wait_event_interruptible_timeout(error->reset_queue,
+					       EXIT_COND,
+					       10*HZ);
 	if (ret == 0) {
 		DRM_ERROR("Timed out waiting for the gpu reset to complete\n");
 		return -EIO;
 	} else if (ret < 0) {
 		return ret;
 	}
+#undef EXIT_COND
 
-	if (atomic_read(&error->wedged)) {
-		/* GPU is hung, bump the completion count to account for
-		 * the token we just consumed so that we never hit zero and
-		 * end up waiting upon a subsequent completion event that
-		 * will never happen.
-		 */
-		spin_lock_irqsave(&x->wait.lock, flags);
-		x->done++;
-		spin_unlock_irqrestore(&x->wait.lock, flags);
-	}
 	return 0;
 }
 
@@ -943,23 +935,16 @@ int
 i915_gem_check_wedge(struct i915_gpu_error *error,
 		     bool interruptible)
 {
-	if (atomic_read(&error->wedged)) {
-		struct completion *x = &error->completion;
-		bool recovery_complete;
-		unsigned long flags;
-
-		/* Give the error handler a chance to run. */
-		spin_lock_irqsave(&x->wait.lock, flags);
-		recovery_complete = x->done > 0;
-		spin_unlock_irqrestore(&x->wait.lock, flags);
+	unsigned tmp = atomic_read(&error->wedged);
 
+	if (tmp) {
 		/* Non-interruptible callers can't handle -EAGAIN, hence return
 		 * -EIO unconditionally for these. */
 		if (!interruptible)
 			return -EIO;
 
-		/* Recovery complete, but still wedged means reset failure. */
-		if (recovery_complete)
+		/* Recovery complete, but the reset failed ... */
+		if (tmp == I915_WEDGED)
 			return -EIO;
 
 		return -EAGAIN;
@@ -1385,7 +1370,7 @@ out:
 		/* If this -EIO is due to a gpu hang, give the reset code a
 		 * chance to clean up the mess. Otherwise return the proper
 		 * SIGBUS. */
-		if (!atomic_read(&dev_priv->gpu_error.wedged))
+		if (atomic_read(&dev_priv->gpu_error.wedged) == I915_WEDGED)
 			return VM_FAULT_SIGBUS;
 	case -EAGAIN:
 		/* Give the error handler a chance to run and move the
@@ -4112,7 +4097,7 @@ i915_gem_load(struct drm_device *dev)
 		INIT_LIST_HEAD(&dev_priv->fence_regs[i].lru_list);
 	INIT_DELAYED_WORK(&dev_priv->mm.retire_work,
 			  i915_gem_retire_work_handler);
-	init_completion(&dev_priv->gpu_error.completion);
+	init_waitqueue_head(&dev_priv->gpu_error.reset_queue);
 
 	/* On GEN3 we really need to make sure the ARB C3 LP bit is set */
 	if (IS_GEN3(dev)) {
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 9d8921a..73d46f4 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -838,8 +838,10 @@ done:
  */
 static void i915_error_work_func(struct work_struct *work)
 {
-	drm_i915_private_t *dev_priv = container_of(work, drm_i915_private_t,
-						    gpu_error.work);
+	struct i915_gpu_error *error = container_of(work, struct i915_gpu_error,
+						    work);
+	drm_i915_private_t *dev_priv = container_of(error, drm_i915_private_t,
+						    gpu_error);
 	struct drm_device *dev = dev_priv->dev;
 	char *error_event[] = { "ERROR=1", NULL };
 	char *reset_event[] = { "RESET=1", NULL };
@@ -847,14 +849,19 @@ static void i915_error_work_func(struct work_struct *work)
 
 	kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE, error_event);
 
-	if (atomic_read(&dev_priv->gpu_error.wedged)) {
+	if (atomic_read(&error->wedged) == I915_RESET_IN_PROGRESS) {
 		DRM_DEBUG_DRIVER("resetting chip\n");
 		kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE, reset_event);
+
 		if (!i915_reset(dev)) {
-			atomic_set(&dev_priv->gpu_error.wedged, 0);
+			atomic_set(&error->wedged, 0);
 			kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE, reset_done_event);
+		} else {
+			atomic_set(&error->wedged,
+				   I915_WEDGED);
 		}
-		complete_all(&dev_priv->gpu_error.completion);
+
+		wake_up_all(&dev_priv->gpu_error.reset_queue);
 	}
 }
 
@@ -1434,11 +1441,11 @@ void i915_handle_error(struct drm_device *dev, bool wedged)
 	i915_report_and_clear_eir(dev);
 
 	if (wedged) {
-		INIT_COMPLETION(dev_priv->gpu_error.completion);
-		atomic_set(&dev_priv->gpu_error.wedged, 1);
+		atomic_set(&dev_priv->gpu_error.wedged, I915_RESET_IN_PROGRESS);
 
 		/*
-		 * Wakeup waiting processes so they don't hang
+		 * Wakeup waiting processes so that the reset work item
+		 * doesn't deadlock trying to grab various locks.
 		 */
 		for_each_ring(ring, dev_priv, i)
 			wake_up_all(&ring->irq_queue);
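
For anyone who wants to poke at the three-state flow outside the kernel, here is a rough userspace sketch of the transitions described in the commit message. It is not driver code: struct gpu_error_model and the model_* helpers are made up for illustration, and C11 atomic_int stands in for the kernel's atomic_t. Only the two #define values and the return-code table of i915_gem_check_wedge() are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Values as in the patch; 0 has no #define there and means "GPU healthy". */
#define I915_RESET_IN_PROGRESS 1
#define I915_WEDGED            2

/* Hypothetical userspace stand-in for the wedged field of struct i915_gpu_error. */
struct gpu_error_model {
	atomic_int wedged;
};

/* 0 -> 1: a hang was detected, waiters should get off their locks. */
static void model_handle_hang(struct gpu_error_model *e)
{
	atomic_store(&e->wedged, I915_RESET_IN_PROGRESS);
}

/*
 * 1 -> 0 if the reset worked, 1 -> 2 if it failed.
 * Does nothing unless a reset is actually in progress.
 */
static void model_reset_done(struct gpu_error_model *e, bool reset_ok)
{
	if (atomic_load(&e->wedged) != I915_RESET_IN_PROGRESS)
		return;
	atomic_store(&e->wedged, reset_ok ? 0 : I915_WEDGED);
}

/* Same decision table as the patched i915_gem_check_wedge(). */
static int model_check_wedge(struct gpu_error_model *e, bool interruptible)
{
	int tmp = atomic_load(&e->wedged);

	/* Only the three states 0, 1 and 2 are valid. */
	assert(tmp >= 0 && tmp <= I915_WEDGED);

	if (!tmp)
		return 0;
	/* Non-interruptible callers can't handle -EAGAIN. */
	if (!interruptible)
		return -EIO;
	/* Reset finished but failed: the hw is gone for good. */
	if (tmp == I915_WEDGED)
		return -EIO;
	/* Reset still pending: ask the caller to back off and retry. */
	return -EAGAIN;
}
```

Starting from 0, model_handle_hang() followed by model_reset_done(e, false) leaves the state at I915_WEDGED, after which interruptible callers get -EIO instead of -EAGAIN. That is the distinction the old completion-based code tried to encode indirectly.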