From patchwork Fri Oct 9 12:21:45 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chris Wilson X-Patchwork-Id: 7361271 Return-Path: X-Original-To: patchwork-intel-gfx@patchwork.kernel.org Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.136]) by patchwork2.web.kernel.org (Postfix) with ESMTP id 73E24BEEA4 for ; Fri, 9 Oct 2015 12:22:00 +0000 (UTC) Received: from mail.kernel.org (localhost [127.0.0.1]) by mail.kernel.org (Postfix) with ESMTP id 75CF620866 for ; Fri, 9 Oct 2015 12:21:59 +0000 (UTC) Received: from gabe.freedesktop.org (gabe.freedesktop.org [131.252.210.177]) by mail.kernel.org (Postfix) with ESMTP id 7156820860 for ; Fri, 9 Oct 2015 12:21:58 +0000 (UTC) Received: from gabe.freedesktop.org (localhost [127.0.0.1]) by gabe.freedesktop.org (Postfix) with ESMTP id A75446E578; Fri, 9 Oct 2015 05:21:57 -0700 (PDT) X-Original-To: intel-gfx@lists.freedesktop.org Delivered-To: intel-gfx@lists.freedesktop.org Received: from fireflyinternet.com (mail.fireflyinternet.com [87.106.93.118]) by gabe.freedesktop.org (Postfix) with ESMTP id 551D46E578 for ; Fri, 9 Oct 2015 05:21:56 -0700 (PDT) X-Default-Received-SPF: pass (skip=forwardok (res=PASS)) x-ip-name=78.156.65.138; Received: from haswell.alporthouse.com (unverified [78.156.65.138]) by fireflyinternet.com (Firefly Internet (M1)) with ESMTP id 46404268-1500048 for multiple; Fri, 09 Oct 2015 13:22:08 +0100 Received: by haswell.alporthouse.com (sSMTP sendmail emulation); Fri, 09 Oct 2015 13:21:47 +0100 From: Chris Wilson To: intel-gfx@lists.freedesktop.org Date: Fri, 9 Oct 2015 13:21:45 +0100 Message-Id: <1444393305-20977-1-git-send-email-chris@chris-wilson.co.uk> X-Mailer: git-send-email 2.6.1 X-Originating-IP: 78.156.65.138 X-Country: code=GB country="United Kingdom" ip=78.156.65.138 Subject: [Intel-gfx] [PATCH] RFC drm/i915: Stop the machine whilst capturing the GPU crash dump X-BeenThere: intel-gfx@lists.freedesktop.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: Intel graphics driver community testing & development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , MIME-Version: 1.0 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_MED, T_RP_MATCHES_RCVD, UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP The error state is purposefully racy as we expect it to be called at any time and so have avoided any locking whilst capturing the crash dump. However, with multi-engine GPUs and multiple CPUs, those races can manifest into OOPSes as we attempt to chase dangling pointers freed on other CPUs. Under discussion are lots of ways to slow down normal operation in order to protect the post-mortem error capture, but what it we take the opposite approach and freeze the machine whilst the error catpure runs (note the GPU may still running, but as long as we don't process any of the results the driver's bookkeeping will be static). Signed-off-by: Chris Wilson --- drivers/gpu/drm/i915/Kconfig | 1 + drivers/gpu/drm/i915/i915_drv.h | 2 ++ drivers/gpu/drm/i915/i915_gpu_error.c | 31 +++++++++++++++++++++---------- 3 files changed, 24 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig index 99e819767204..63df28910c8f 100644 --- a/drivers/gpu/drm/i915/Kconfig +++ b/drivers/gpu/drm/i915/Kconfig @@ -7,6 +7,7 @@ config DRM_I915 select AGP_INTEL if AGP select INTERVAL_TREE select ZLIB_DEFLATE + select STOP_MACHINE # we need shmfs for the swappable backing store, and in particular # the shmem_readpage() which depends upon tmpfs select SHMEM diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index 9d16fc1189d6..14a882fe486c 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -488,6 +488,8 @@ struct drm_i915_error_state { struct kref ref; struct timeval time; + struct drm_i915_private *i915; + char error_msg[128]; int iommu; u32 reset_count; diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 64bdffcffb50..29cbec67bcfc 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -29,6 +29,7 @@ #include #include +#include #include "i915_drv.h" static const char *yesno(int v) @@ -1352,6 +1353,24 @@ static void i915_capture_gen_state(struct drm_i915_private *dev_priv, error->suspend_count = dev_priv->suspend_count; } +static int capture(void *data) +{ + struct drm_i915_error_state *error = data; + + i915_capture_gen_state(error->i915, error); + i915_capture_reg_state(error->i915, error); + i915_gem_capture_buffers(error->i915, error); + i915_gem_record_fences(error->i915->dev, error); + i915_gem_record_rings(error->i915->dev, error); + + do_gettimeofday(&error->time); + + error->overlay = intel_overlay_capture_error_state(error->i915->dev); + error->display = intel_display_capture_error_state(error->i915->dev); + + return 0; +} + /** * i915_capture_error_state - capture an error record for later analysis * @dev: drm device @@ -1377,17 +1396,9 @@ void i915_capture_error_state(struct drm_device *dev, bool wedged, } kref_init(&error->ref); + error->i915 = dev_priv; - i915_capture_gen_state(dev_priv, error); - i915_capture_reg_state(dev_priv, error); - i915_gem_capture_buffers(dev_priv, error); - i915_gem_record_fences(dev, error); - i915_gem_record_rings(dev, error); - - do_gettimeofday(&error->time); - - error->overlay = intel_overlay_capture_error_state(dev); - error->display = intel_display_capture_error_state(dev); + stop_machine(capture, error, NULL); i915_error_capture_msg(dev, error, wedged, error_msg); DRM_INFO("%s\n", error->error_msg);