From patchwork Mon Nov 12 22:07:52 2012 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Vetter X-Patchwork-Id: 1731471 Return-Path: X-Original-To: patchwork-intel-gfx@patchwork.kernel.org Delivered-To: patchwork-process-083081@patchwork1.kernel.org Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) by patchwork1.kernel.org (Postfix) with ESMTP id 90AE43FCDE for ; Mon, 12 Nov 2012 22:21:38 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id 81EEC9E7E3 for ; Mon, 12 Nov 2012 14:21:38 -0800 (PST) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from mail-ee0-f49.google.com (mail-ee0-f49.google.com [74.125.83.49]) by gabe.freedesktop.org (Postfix) with ESMTP id 5A95C9E796 for ; Mon, 12 Nov 2012 14:19:12 -0800 (PST) Received: by mail-ee0-f49.google.com with SMTP id c1so3704090eek.36 for ; Mon, 12 Nov 2012 14:19:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ffwll.ch; s=google; h=from:to:cc:subject:date:message-id:x-mailer:in-reply-to:references; bh=Y9eUcfDauEBeByU4uxHsIjoW0Y5mFYbcp9J6WeIMtEU=; b=jIyPvq2SzREJZgCCuxCJ0ReS4Xik/AvonASigc3VCpbYYgyOACxfa9kCDzLZrYhcNg i4QovGh1etyIqp2ZVqXdQFk/+DTji4Wg1Zog0mKGLl8nnypdAZBKWrsAiZcL75+TMdcz +EFV0GRvZWGVNCYxiIY/tKaeIE3jz+VGvR0vw= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=from:to:cc:subject:date:message-id:x-mailer:in-reply-to:references :x-gm-message-state; bh=Y9eUcfDauEBeByU4uxHsIjoW0Y5mFYbcp9J6WeIMtEU=; b=fyn+J8HmVQN9ooy/FEGFsxtjoDeEKTz3moepCQHfzhOPHGabHjm6a8RjBm5LdA5WJ1 Bc1Rhqu/4IMhalqfqw3djVMjqb8P5Q9rrNuoxAk+tQKR6zp49EFgQsSC0iz6N73qdUkS eqobhHK39KveFsap5rx86kCE0sFcxXio5NBqi3p9HYUPTemIirqFKlN1l/lMNXUBvkTE 9I67F5ZwMszc+mf/VwxGYhHk222h3NMrTuc+sfeV6NvwYMGVSQETDC7lh4qn7wY8Icf0 1hWHUryRmFNQoWkDKJk9bTPpmc4S+pURKqRkZuv27EtSdpQh8zupr/zWYiXHnNPff6o7 M1wQ== Received: by 10.14.194.71 with SMTP id l47mr66923055een.6.1352758751906; Mon, 12 Nov 2012 14:19:11 -0800 (PST) Received: from fliege.ffwll.local (178-83-130-250.dynamic.hispeed.ch. [178.83.130.250]) by mx.google.com with ESMTPS id d44sm18553552eeo.10.2012.11.12.14.19.10 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 12 Nov 2012 14:19:11 -0800 (PST) From: Daniel Vetter To: Intel Graphics Development Date: Mon, 12 Nov 2012 23:07:52 +0100 Message-Id: <1352758073-31330-5-git-send-email-daniel.vetter@ffwll.ch> X-Mailer: git-send-email 1.7.11.4 In-Reply-To: <1352758073-31330-1-git-send-email-daniel.vetter@ffwll.ch> References: <1352758073-31330-1-git-send-email-daniel.vetter@ffwll.ch> X-Gm-Message-State: ALoCoQl/XkyjJnBEzxCanfjJRQq8nmQ6HzEq3tmvu1Iy15QYYtZpKfrLEgmdlY/eoPIZ888T/ohB Cc: Daniel Vetter Subject: [Intel-gfx] [PATCH 4/5] drm/i915: clear up wedged transitions X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Sender: intel-gfx-bounces+patchwork-intel-gfx=patchwork.kernel.org@lists.freedesktop.org Errors-To: intel-gfx-bounces+patchwork-intel-gfx=patchwork.kernel.org@lists.freedesktop.org We have two important transitions of the wedged state in the current code: - 0 -> 1: This means a hang has been detected, and signals to everyone that they please get of any locks, so that the reset work item can do its job. - 1 -> 0: The reset handler has completed. Now the last transition mixes up two states: "Reset completed and successful" and "Reset failed". To distinguish these two we do some tricks with the reset completion, but I simply could not convince myself that this doesn't race under odd circumstances. Hence split this up, and add a new terminal state indicating that the hw is gone for good. Also add explicit #defines for both states, update comments. While auditing the code I've noticed one place (the throttle ioctl) which does not yet wait for the reset handler to complete and doesn't properly decode the wedge state into -EAGAIN/-EIO. Fix this up by calling the right helpers. This might explain the oddball "my compositor just died in a successfull gpu reset" reports. Signed-Off-by: Daniel Vetter --- drivers/gpu/drm/i915/i915_drv.h | 17 ++++++++++++++- drivers/gpu/drm/i915/i915_gem.c | 46 ++++++++++++++++------------------------- drivers/gpu/drm/i915/i915_irq.c | 23 ++++++++++++++------- 3 files changed, 49 insertions(+), 37 deletions(-) diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 6958bb0..423541b 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -723,11 +723,26 @@ struct i915_gpu_error { /* Protected by the above dev->gpu_error.lock. */ struct drm_i915_error_state *first_error; struct work_struct work; - struct completion completion; unsigned long last_reset; + /** + * State variable controlling the reset flow + * + * 1 means a reset is in progress. This state will (presuming we don't + * have any bugs) decay into either 0 (successful reset) or 2 (hw + * terminally sour). All waiters on the reset_queue will be woken when + * that happens. + */ atomic_t wedged; +#define I915_RESET_IN_PROGRESS 1 +#define I915_WEDGED 2 + + /** + * Waitqueue to signal when the reset has completed. Used by clients + * that wait for dev_priv->mm.wedged to settle. + */ + wait_queue_head_t reset_queue; /* For gpu hang simulation. */ unsigned int stop_rings; diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c index b2620c7..3a1f72c 100644 --- a/drivers/gpu/drm/i915/i915_gem.c +++ b/drivers/gpu/drm/i915/i915_gem.c @@ -89,36 +89,28 @@ static void i915_gem_info_remove_obj(struct drm_i915_private *dev_priv, static int i915_gem_wait_for_error(struct i915_gpu_error *error) { - struct completion *x = &error->completion; - unsigned long flags; int ret; if (!atomic_read(&error->wedged)) return 0; +#define EXIT_COND (atomic_read(&error->reset_queue) != I915_RESET_IN_PROGRESS) /* * Only wait 10 seconds for the gpu reset to complete to avoid hanging * userspace. If it takes that long something really bad is going on and * we should simply try to bail out and fail as gracefully as possible. */ - ret = wait_for_completion_interruptible_timeout(x, 10*HZ); + ret = wait_event_interruptible_timeout(error->reset_queue, + EXIT_COND, + 10*HZ); if (ret == 0) { DRM_ERROR("Timed out waiting for the gpu reset to complete\n"); return -EIO; } else if (ret < 0) { return ret; } +#undef EXIT_COND - if (atomic_read(&error->wedged)) { - /* GPU is hung, bump the completion count to account for - * the token we just consumed so that we never hit zero and - * end up waiting upon a subsequent completion event that - * will never happen. - */ - spin_lock_irqsave(&x->wait.lock, flags); - x->done++; - spin_unlock_irqrestore(&x->wait.lock, flags); - } return 0; } @@ -943,23 +935,16 @@ int i915_gem_check_wedge(struct i915_gpu_error *error, bool interruptible) { - if (atomic_read(&error->wedged)) { - struct completion *x = &error->completion; - bool recovery_complete; - unsigned long flags; - - /* Give the error handler a chance to run. */ - spin_lock_irqsave(&x->wait.lock, flags); - recovery_complete = x->done > 0; - spin_unlock_irqrestore(&x->wait.lock, flags); + unsigned tmp = atomic_read(&error->wedged); + if (tmp) { /* Non-interruptible callers can't handle -EAGAIN, hence return * -EIO unconditionally for these. */ if (!interruptible) return -EIO; - /* Recovery complete, but still wedged means reset failure. */ - if (recovery_complete) + /* Recovery complete, but the reset failed ... */ + if (tmp == I915_WEDGED) return -EIO; return -EAGAIN; @@ -1385,7 +1370,7 @@ out: /* If this -EIO is due to a gpu hang, give the reset code a * chance to clean up the mess. Otherwise return the proper * SIGBUS. */ - if (!atomic_read(&dev_priv->gpu_error.wedged)) + if (atomic_read(&dev_priv->gpu_error.wedged) == I915_WEDGED) return VM_FAULT_SIGBUS; case -EAGAIN: /* Give the error handler a chance to run and move the @@ -3402,8 +3387,13 @@ i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file) u32 seqno = 0; int ret; - if (atomic_read(&dev_priv->gpu_error.wedged)) - return -EIO; + ret = i915_gem_wait_for_error(&dev_priv->gpu_error); + if (ret) + return ret; + + ret = i915_gem_check_wedge(&dev_priv->gpu_error, false); + if (ret) + return ret; spin_lock(&file_priv->mm.lock); list_for_each_entry(request, &file_priv->mm.request_list, client_list) { @@ -4107,7 +4097,7 @@ i915_gem_load(struct drm_device *dev) INIT_LIST_HEAD(&dev_priv->fence_regs[i].lru_list); INIT_DELAYED_WORK(&dev_priv->mm.retire_work, i915_gem_retire_work_handler); - init_completion(&dev_priv->gpu_error.completion); + init_waitqueue_head(&dev_priv->gpu_error.reset_queue); /* On GEN3 we really need to make sure the ARB C3 LP bit is set */ if (IS_GEN3(dev)) { diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index 9d8921a..73d46f4 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -838,8 +838,10 @@ done: */ static void i915_error_work_func(struct work_struct *work) { - drm_i915_private_t *dev_priv = container_of(work, drm_i915_private_t, - gpu_error.work); + struct i915_gpu_error *error = container_of(work, struct i915_gpu_error, + work); + drm_i915_private_t *dev_priv = container_of(error, drm_i915_private_t, + gpu_error); struct drm_device *dev = dev_priv->dev; char *error_event[] = { "ERROR=1", NULL }; char *reset_event[] = { "RESET=1", NULL }; @@ -847,14 +849,19 @@ static void i915_error_work_func(struct work_struct *work) kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE, error_event); - if (atomic_read(&dev_priv->gpu_error.wedged)) { + if (atomic_read(&error->wedged) == I915_RESET_IN_PROGRESS) { DRM_DEBUG_DRIVER("resetting chip\n"); kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE, reset_event); + if (!i915_reset(dev)) { - atomic_set(&dev_priv->gpu_error.wedged, 0); + atomic_set(&error->wedged, 0); kobject_uevent_env(&dev->primary->kdev.kobj, KOBJ_CHANGE, reset_done_event); + } else { + atomic_set(&error->wedged, + I915_WEDGED); } - complete_all(&dev_priv->gpu_error.completion); + + wake_up_all(&dev_priv->gpu_error.reset_queue); } } @@ -1434,11 +1441,11 @@ void i915_handle_error(struct drm_device *dev, bool wedged) i915_report_and_clear_eir(dev); if (wedged) { - INIT_COMPLETION(dev_priv->gpu_error.completion); - atomic_set(&dev_priv->gpu_error.wedged, 1); + atomic_set(&dev_priv->gpu_error.wedged, I915_RESET_IN_PROGRESS); /* - * Wakeup waiting processes so they don't hang + * Wakeup waiting processes so that the reset work item + * doesn't deadlock trying to grab various locks. */ for_each_ring(ring, dev_priv, i) wake_up_all(&ring->irq_queue);