From patchwork Thu Oct  6 07:00:09 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Mika Kuoppala <mika.kuoppala@linux.intel.com>
X-Patchwork-Id: 9363693
From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
To: intel-gfx@lists.freedesktop.org
Date: Thu,  6 Oct 2016 10:00:09 +0300
Message-Id: <1475737209-3380-1-git-send-email-mika.kuoppala@intel.com>
X-Mailer: git-send-email 2.7.4
Subject: [Intel-gfx] [PATCH] drm/i915: Save hangcheck score across resets

The hangcheck score has been zeroed on engine init, which happens
after reset recovery. This worked well as long as we always reset
all engines on a hang and discarded all work submitted to the
engines.

With commit 821ed7df6e2a ("drm/i915: Update reset path to fix
incomplete requests") the driver gained the capability to discard
only the request or requests directly involved in the hang; those
deemed innocent are replayed intact.

Our hangcheck works by periodically sampling the engine state and
then doing checks in multiple stages to see if the engine is making
progress. The engine capabilities differ. With the render engine, we
have more ways to measure progress and thus more checks and
stages. With other engines, we only sample the seqno and head movement.

Now consider that the blitter engine is waiting on render and the
render engine has a batch which is stuck. Due to its simpler checks,
the blitter engine's hangcheck score accumulates faster and reaches
the reset threshold sooner. We also blame the blitter for the hang as
it had the highest score when recovery started.

Blaming the wrong engine, we don't find the actual guilty request and,
most critically, won't make any progress after the reset. That will
lead to a second hang, with the same pattern, ad infinitum.

Previously, blaming the wrong engine was not critical as the score was
only used as a trigger for a full reset and as a debug aid in error
states. But now the score is essential for finding the culprit request.

To fix this, keep the hangcheck scores across resets. We already
have a decay mechanism in place if progress is being made. This
ensures that even if we blame the wrong engine once, we don't
do it twice or consistently: the real culprit request will be
cleared, real progress will be made, and this untangles the rest of
the engines and leads to a successful recovery.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98104
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
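Note for reviewers: the scenario in the commit message can be sketched
as a toy model. This is only an illustration under stated assumptions,
not the driver's hangcheck code. Every name and constant below
(simulate(), RENDER_STEP, BLITTER_STEP, DECAY, RESET_THRESHOLD,
MAX_RESETS) is hypothetical and chosen for the example. The model also
assumes that the wrongly blamed blitter advances for one sample after
its request is discarded, which is what lets its score decay.

```c
#include <stdbool.h>

/* Hypothetical constants for the toy model; not the values used by i915. */
enum { RENDER, BLITTER, NUM_ENGINES };
#define RENDER_STEP     2	/* render has more check stages: slower score */
#define BLITTER_STEP    3	/* simpler checks: score accumulates faster */
#define DECAY           9	/* subtracted when an engine makes progress */
#define RESET_THRESHOLD 9
#define MAX_RESETS      8	/* give up after this many wrong blames */

/*
 * Simulate a hang caused by RENDER while BLITTER waits on it.
 * Returns the number of resets that blamed the wrong engine before
 * RENDER is blamed, or -1 if RENDER is never blamed within MAX_RESETS.
 */
static int simulate(bool keep_scores)
{
	int score[NUM_ENGINES] = { 0, 0 };
	bool blitter_progress = false;
	int wrong = 0;

	while (wrong < MAX_RESETS) {
		/* One hangcheck sample; RENDER stays stuck until blamed. */
		score[RENDER] += RENDER_STEP;
		if (blitter_progress) {
			/* Assumed: one sample of progress after the reset. */
			score[BLITTER] -= DECAY;
			if (score[BLITTER] < 0)
				score[BLITTER] = 0;
			blitter_progress = false;
		} else {
			score[BLITTER] += BLITTER_STEP;
		}

		if (score[RENDER] < RESET_THRESHOLD &&
		    score[BLITTER] < RESET_THRESHOLD)
			continue;

		/* Reset: blame the engine with the highest score. */
		if (score[RENDER] >= score[BLITTER])
			return wrong;

		wrong++;
		blitter_progress = true;
		if (!keep_scores)	/* models the memset() this patch removes */
			score[RENDER] = score[BLITTER] = 0;
	}
	return -1;
}
```

With these made-up constants, simulate(true) returns 1: the blitter is
blamed once, its score decays, and the second reset blames render.
simulate(false) returns -1: zeroing the scores makes every reset repeat
the same pattern and render is never blamed.
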
 drivers/gpu/drm/i915/intel_engine_cs.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index d00ec805f93d..4bb869eb11bc 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -209,7 +209,6 @@ void intel_engine_init_seqno(struct intel_engine_cs *engine, u32 seqno)
 
 void intel_engine_init_hangcheck(struct intel_engine_cs *engine)
 {
-	memset(&engine->hangcheck, 0, sizeof(engine->hangcheck));
 	clear_bit(engine->id, &engine->i915->gpu_error.missed_irq_rings);
 	if (intel_engine_has_waiter(engine))
 		i915_queue_hangcheck(engine->i915);