From patchwork Thu Oct 6 07:00:09 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mika Kuoppala
X-Patchwork-Id: 9363693
From: Mika Kuoppala
To: intel-gfx@lists.freedesktop.org
Date: Thu, 6 Oct 2016 10:00:09 +0300
Message-Id: <1475737209-3380-1-git-send-email-mika.kuoppala@intel.com>
X-Mailer: git-send-email 2.7.4
Subject: [Intel-gfx] [PATCH] drm/i915: Save hangcheck score across resets
List-Id: Intel graphics driver community testing & development

The hangcheck score has been zeroed on engine init, which happens after
reset recovery. This has worked well, as we always reset all engines on
a hang and also discarded all work submitted to the engines.

With commit 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete
requests") the driver gained the capability to discard only the request
or requests that were directly involved in the hang; those deemed
innocent are replayed intact.

Our hangcheck works by periodically sampling the engine state and then
doing checks in multiple stages to see if the engine is making progress.
The engine capabilities differ. With the render engine, we have more
ways to measure progress and thus more checks and stages. With the other
engines, we only sample the seqno and head movement.

Now consider that the blitter engine is waiting on render, and the
render engine has a batch which is stuck. Due to its simpler checks, the
blitter engine's hangcheck score accumulates faster and reaches the
reset threshold sooner. We also blame the blitter for the hang, as it
had the highest score when recovery started.

Having blamed the wrong engine, we don't find the actual guilty request
and, most critically, make no progress after the reset. That leads to a
second hang with the same pattern, ad infinitum.
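
As a rough illustration of the failure mode above, here is a toy model
of the scoring (not driver code). It assumes two engines, a stuck render
batch, blitter work that keeps waiting on render, and one tick of
blitter progress after each wrong blame. All constants, and the names
simulate() and struct outcome, are made up for this sketch and are not
the driver's values or API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical numbers chosen for illustration; not the driver's values. */
enum {
	RENDER_STEP  = 10, /* score added per tick to the stuck render engine */
	BLITTER_STEP = 20, /* score added per tick to the waiting blitter */
	DECAY        = 15, /* score removed on a tick where progress is seen */
	THRESHOLD    = 31, /* score at which an engine is declared hung */
};

struct outcome {
	int resets;        /* resets triggered in total */
	int wrong_blames;  /* resets that blamed only the innocent blitter */
	bool recovered;    /* true once the render engine itself is blamed */
};

/*
 * Run the toy hangcheck for at most max_ticks samples.
 * keep_scores == false models zeroing the scores on every reset,
 * keep_scores == true models preserving them across resets.
 */
struct outcome simulate(bool keep_scores, int max_ticks)
{
	struct outcome out = { 0, 0, false };
	int render = 0, blitter = 0;
	bool blitter_progress = false;

	assert(max_ticks >= 0);

	for (int tick = 1; tick <= max_ticks; tick++) {
		/* The render batch is stuck for the whole run. */
		render += RENDER_STEP;

		/* After a wrong blame the blitter progresses for one tick. */
		if (blitter_progress) {
			blitter = blitter > DECAY ? blitter - DECAY : 0;
			blitter_progress = false;
		} else {
			blitter += BLITTER_STEP;
		}

		if (render < THRESHOLD && blitter < THRESHOLD)
			continue;

		out.resets++;
		if (render >= THRESHOLD) {
			/* The guilty request is found and discarded. */
			out.recovered = true;
			break;
		}

		/* Only the blitter crossed the threshold: wrong blame. */
		out.wrong_blames++;
		blitter_progress = true;
		if (!keep_scores)
			render = blitter = 0;
	}

	return out;
}
```

With these made-up numbers, simulate(false, 20) never blames the render
engine, while simulate(true, 20) blames the blitter once and then
reaches the render engine on the second reset.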

Previously, falsely blaming an engine was not critical, as the score was
only used as a trigger for a full reset and as a debug aid in error
states. But now the score is essential for finding the culprit request.

To fix this, keep the hangcheck scores across resets. We already have a
decay mechanism in place if progress is being made. This ensures that
even if we blame the wrong engine once, we don't do it twice or
consistently: the real culprit request will be cleared, real progress
will be made, and this untangles the rest of the engines and leads to
successful recovery.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98104
Cc: Chris Wilson
Signed-off-by: Mika Kuoppala
---
 drivers/gpu/drm/i915/intel_engine_cs.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index d00ec805f93d..4bb869eb11bc 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -209,7 +209,6 @@ void intel_engine_init_seqno(struct intel_engine_cs *engine, u32 seqno)
 
 void intel_engine_init_hangcheck(struct intel_engine_cs *engine)
 {
-	memset(&engine->hangcheck, 0, sizeof(engine->hangcheck));
 	clear_bit(engine->id, &engine->i915->gpu_error.missed_irq_rings);
 	if (intel_engine_has_waiter(engine))
 		i915_queue_hangcheck(engine->i915);