drm/i915: Save hangcheck score across resets

Message ID	1475737209-3380-1-git-send-email-mika.kuoppala@intel.com (mailing list archive)
State	New, archived
Headers
Return-Path: <intel-gfx-bounces@lists.freedesktop.org>
From: Mika Kuoppala <mika.kuoppala@linux.intel.com>
To: intel-gfx@lists.freedesktop.org
Date: Thu, 6 Oct 2016 10:00:09 +0300
Message-Id: <1475737209-3380-1-git-send-email-mika.kuoppala@intel.com>
Subject: [Intel-gfx] [PATCH] drm/i915: Save hangcheck score across resets
Precedence: list
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Commit Message

Mika Kuoppala Oct. 6, 2016, 7 a.m. UTC

Hangcheck score has been zeroed on engine init, which happens
after reset recovery. This has worked well as we always reset
all engines on hang, and also discarded all work submitted
to engines.

With commit 821ed7df6e2a ("drm/i915: Update reset path to fix
incomplete requests") the driver gained the capability to discard only
the request or requests that were directly involved in the hang;
those that were deemed innocent are replayed intact.

Our hangcheck works by periodically sampling the engine state and
then doing checks in multiple stages to see if engine is making
progress. The engine capabilities differ. With the render engine, we
have more ways to measure progress and thus more checks and
stages. With other engines, we only sample the seqno and head movement.
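
As a rough illustration of the simple seqno/head sampling described
above, here is a minimal sketch. It is not the i915 code: the struct
names, the function name and the score constants are all hypothetical
and chosen only to make the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model; these are not the i915 structures. */
struct engine_sample {
	uint32_t seqno;	/* last completed sequence number */
	uint32_t head;	/* command streamer head pointer */
};

struct engine_hangcheck {
	struct engine_sample last;	/* state seen at the previous sample */
	int score;			/* accumulated suspicion */
};

/* Illustrative values only; the real driver uses its own constants. */
enum { SCORE_STUCK = 5, SCORE_DECAY = 3, SCORE_THRESHOLD = 20 };

/*
 * Take one periodic sample. The score grows while neither seqno nor
 * head moves, and decays (never below zero) when progress is seen.
 * Returns true once the score reaches the reset threshold.
 */
bool hangcheck_sample(struct engine_hangcheck *hc, struct engine_sample now)
{
	bool progressed;

	assert(hc != NULL);

	progressed = now.seqno != hc->last.seqno ||
		     now.head != hc->last.head;

	if (progressed)
		hc->score = hc->score > SCORE_DECAY ?
			    hc->score - SCORE_DECAY : 0;
	else
		hc->score += SCORE_STUCK;

	hc->last = now;
	return hc->score >= SCORE_THRESHOLD;
}
```

In this model an engine with fewer checking stages is effectively one
that adds its penalty on every stalled sample, which is why its score
can outrun that of an engine with more elaborate checks.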

Now consider that the blitter engine is waiting on render and the render
engine has a batch which is stuck. Due to its simpler checks, the blitter
engine's hangcheck score accumulates faster and reaches the reset
threshold sooner. We also blame the blitter for the hang as it had the
highest score when recovery started.

Blaming the wrong engine, we don't find the actual guilty request and,
most critically, won't make any progress after the reset. That will
lead to a second hang, with the same pattern, ad infinitum.

Previously, falsely blaming an engine was not critical, as the score was
only used as a trigger for a full reset and as a debug aid in error
states. But now the score is essential for finding the culprit request.

To fix this, keep the hangcheck scores across resets. We already
have a decay mechanism in place if progress is being made. This
ensures that even if we blame the wrong engine once, we don't
do it twice or consistently, and the real culprit request will be
cleared, real progress will be made, and this untangles the rest of
the engines and leads to a successful recovery.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=98104
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/intel_engine_cs.c | 1 -
 1 file changed, 1 deletion(-)

Comments

Chris Wilson Oct. 6, 2016, 7:10 a.m. UTC | #1

On Thu, Oct 06, 2016 at 10:00:09AM +0300, Mika Kuoppala wrote:
> Hangcheck score has been zeroed on engine init, which happens
> after reset recovery. This has worked well as we always reset
> all engines on hang, and also discarded all work submitted
> to engines.
> 
> With commit 821ed7df6e2a ("drm/i915: Update reset path to fix
> incomplete requests") the driver gained the capability to discard only
> the request or requests that were directly involved in the hang;
> those that were deemed innocent are replayed intact.
> 
> Our hangcheck works by periodically sampling the engine state and
> then doing checks in multiple stages to see if engine is making
> progress. The engine capabilities differ. With the render engine, we
> have more ways to measure progress and thus more checks and
> stages. With other engines, we only sample the seqno and head movement.
> 
> Now consider that the blitter engine is waiting on render and the render
> engine has a batch which is stuck. Due to its simpler checks, the blitter
> engine's hangcheck score accumulates faster and reaches the reset
> threshold sooner. We also blame the blitter for the hang as it had the
> highest score when recovery started.

This is the bug. It shouldn't accumulate any score in this case as the
engine is not active.
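
A minimal sketch of that idea, purely for illustration: an engine
that is only stalled waiting on another engine is not charged. The
struct, its fields and the penalty value are hypothetical and do not
correspond to the actual i915 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified engine state; field names are illustrative. */
struct engine_state {
	bool has_progressed;	/* seqno or head moved since last sample */
	bool waiting_on_peer;	/* stalled on work owned by another engine */
	int score;
};

enum { STUCK_PENALTY = 5 };	/* illustrative value only */

/* Charge only engines that are stuck on their own work. */
void hangcheck_account(struct engine_state *e)
{
	assert(e != NULL);

	/* Progressing or merely waiting on a peer: not this engine's fault. */
	if (e->has_progressed || e->waiting_on_peer)
		return;

	e->score += STUCK_PENALTY;
}
```

With this accounting, a blitter that waits on a stuck render batch
keeps a score of zero while the render engine's score grows.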

This patch is not the right approach for the issue as described here.
Because as soon as the blitter engine is active again, there is a very
real danger of it being declared guilty and reset.

The patch has merit, but not for this issue...
-Chris

diff --git a/drivers/gpu/drm/i915/intel_engine_cs.c b/drivers/gpu/drm/i915/intel_engine_cs.c
index d00ec805f93d..4bb869eb11bc 100644
--- a/drivers/gpu/drm/i915/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/intel_engine_cs.c
@@ -209,7 +209,6 @@  void intel_engine_init_seqno(struct intel_engine_cs *engine, u32 seqno)
 
 void intel_engine_init_hangcheck(struct intel_engine_cs *engine)
 {
-	memset(&engine->hangcheck, 0, sizeof(engine->hangcheck));
 	clear_bit(engine->id, &engine->i915->gpu_error.missed_irq_rings);
 	if (intel_engine_has_waiter(engine))
 		i915_queue_hangcheck(engine->i915);
