[v5,2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+

From: Michel Thierry <michel.thierry@intel.com>

*** General ***

Watchdog timeout (or "media engine reset") is a feature that allows
userland applications to enable hang detection on individual batch buffers.
The detection mechanism is implemented mostly in hardware; to support it,
the driver only needs to handle the watchdog interrupt and emit watchdog
commands before and after the batch buffer start instruction in the ring
buffer.

The principle of the hang detection mechanism is as follows:

1. Once the decision has been made to enable watchdog timeout for a
particular batch buffer, and the driver is emitting the batch buffer start
instruction into the ring buffer, it also emits a watchdog timer start
instruction before it and a watchdog timer cancellation instruction
after it.

2. Once the GPU execution reaches the watchdog timer start instruction
the hardware watchdog counter is started by the hardware. The counter
keeps counting until either reaching a previously configured threshold
value or the timer cancellation instruction is executed.

2a. If the counter reaches the threshold value the hardware fires a
watchdog interrupt that is picked up by the watchdog interrupt handler.
This means that a hang has been detected and the driver needs to deal with
it the same way it would deal with an engine hang detected by the periodic
hang checker. The only difference between the two is that here the active
request has already been blamed (to ensure an engine reset).

2b. If the batch buffer completes and the execution reaches the watchdog
cancellation instruction before the watchdog counter reaches its
threshold value the watchdog is cancelled and nothing more comes of it.
No hang is detected.
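
The sequence above can be sketched as a small user-space model. This is
only an illustration of the start/count/cancel principle, not driver code:
struct wd_model and the wd_* helpers are hypothetical names, and the counter
is advanced in abstract "ticks" rather than real hardware clock counts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of the hardware watchdog counter. */
struct wd_model {
	uint32_t threshold;	/* previously configured threshold, in ticks */
	uint32_t count;		/* ticks elapsed since the start instruction */
	bool running;		/* set by start, cleared by cancel or expiry */
	bool irq_fired;		/* latched once the threshold is reached */
};

/* Step 2: the timer start instruction arms the counter. */
static void wd_start(struct wd_model *wd, uint32_t threshold)
{
	assert(threshold > 0);
	wd->threshold = threshold;
	wd->count = 0;
	wd->running = true;
	wd->irq_fired = false;
}

/* Advance one tick; returns true when the watchdog interrupt fires (2a). */
static bool wd_tick(struct wd_model *wd)
{
	if (!wd->running)
		return false;

	if (++wd->count >= wd->threshold) {
		wd->running = false;
		wd->irq_fired = true;
		return true;
	}
	return false;
}

/* Step 2b: the cancellation instruction stops the counter. */
static void wd_cancel(struct wd_model *wd)
{
	wd->running = false;
}

/*
 * Model a batch that runs for batch_ticks, bracketed by start and cancel.
 * Returns true if a hang is detected before the batch completes.
 */
static bool wd_run_batch(struct wd_model *wd, uint32_t threshold,
			 uint32_t batch_ticks)
{
	wd_start(wd, threshold);
	for (uint32_t i = 0; i < batch_ticks; i++) {
		if (wd_tick(wd))
			return true;
	}
	wd_cancel(wd);
	return false;
}
```

In this model a batch that finishes in fewer ticks than the threshold is
cancelled cleanly (case 2b), while one that runs for at least the threshold
trips the interrupt (case 2a).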

Note about future interaction with preemption: Preemption could happen
in a command sequence prior to watchdog counter getting disabled,
resulting in watchdog being triggered following preemption (e.g. when
watchdog had been enabled in the low priority batch). The driver will
need to explicitly disable the watchdog counter as part of the
preemption sequence.
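
To make the preemption concern concrete, here is a minimal sketch under
simplifying assumptions: the counter is modelled in abstract ticks, it keeps
running across the preemption unless explicitly disabled, and the function
name and parameters are hypothetical rather than taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: a low priority batch arms the watchdog, runs for
 * ticks_before_preempt and is then preempted by a batch that runs for
 * high_prio_ticks. Returns true if the watchdog expires while the
 * preempting batch is running, i.e. a false positive.
 */
static bool example_preempt_false_positive(uint32_t threshold,
					   uint32_t ticks_before_preempt,
					   uint32_t high_prio_ticks,
					   bool disable_on_preempt)
{
	uint64_t total;

	/* Expired before the preemption point: a genuine timeout. */
	if (ticks_before_preempt >= threshold)
		return false;

	/* Counter stopped as part of the preemption sequence. */
	if (disable_on_preempt)
		return false;

	/* Counter kept running while the preempting batch executed. */
	total = (uint64_t)ticks_before_preempt + high_prio_ticks;
	return total >= threshold;
}
```

With a threshold of 100 ticks, a low priority batch preempted after 10 ticks
by a 200-tick batch would trip the watchdog in this model unless the counter
is disabled during the preemption sequence.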

*** This patch introduces: ***

1. IRQ handler code for watchdog timeout allowing direct hang recovery
based on hardware-driven hang detection, which then integrates directly
with the hang recovery path. This is independent of having per-engine reset
or just full gpu reset.

2. Watchdog specific register information.

Currently the render engine and all available media engines support
watchdog timeout (VECS is only supported in GEN9). The specifications allude
to the BCS engine being supported, but this commit does not implement
support for it.

Note that the value to stop the counter is different between render and
non-render engines in GEN8; GEN9 onwards it's the same.
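
A minimal sketch of how the stop value could be selected per generation and
engine class follows. The enum, the function name and the EXAMPLE_* constants
are all hypothetical placeholders chosen for illustration; the actual register
values are defined by the patch in i915_reg.h and are not reproduced here.

```c
#include <stdint.h>

/* Hypothetical engine classes; not the driver's own enum. */
enum example_engine_class {
	EXAMPLE_RENDER,
	EXAMPLE_VIDEO_DECODE,
	EXAMPLE_VIDEO_ENHANCE,
};

/* Placeholder values only; the real ones live in i915_reg.h. */
#define EXAMPLE_STOP_RENDER	0x1u
#define EXAMPLE_STOP_GEN8_XCS	0x2u

/*
 * On GEN8, non-render engines use a different value to stop the counter.
 * From GEN9 onwards all engines share the render value.
 */
static uint32_t example_watchdog_stop_value(int gen,
					    enum example_engine_class class)
{
	if (gen == 8 && class != EXAMPLE_RENDER)
		return EXAMPLE_STOP_GEN8_XCS;

	return EXAMPLE_STOP_RENDER;
}
```

Keeping the GEN8 special case in one helper means the command emission code
does not need to know about the difference.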

v2: Move irq handler to tasklet, arm watchdog for a 2nd time to check
against false-positives.

v3: Don't use high priority tasklet, use engine_last_submit while
checking for false-positives. From GEN9 onwards, the stop counter bit is
the same for all engines.

v4: Remove unnecessary brackets, use current_seqno to mark the request
as guilty in the hangcheck/capture code.

v5: Rebased after RESET_ENGINEs flag.

v6: Don't capture error state in case of watchdog timeout. The capture
process is time consuming and this will align to what happens when we
use GuC to handle the watchdog timeout. (Chris)

v7: Rebase.

v8: Rebase, use HZ to reschedule.

v9: Rebase, get forcewake domains in function (no longer in execlists
struct).

v10: Rebase.

v11: Rebase,
     remove extra braces (Tvrtko),
     implement watchdog_to_clock_counts helper (Tvrtko),
     Move tasklet_kill(watchdog_tasklet) inside intel_engines (Tvrtko),
     Use a global heartbeat seqno instead of engine seqno (Chris)
     Make all engine checks class-based checks (Tvrtko)

v12: Rebase,
     Reset immediately upon entering the IRQ (Chris)
     Make reset_engine_to_str a helper (Tvrtko)
     Rename watchdog_irq_handler as watchdog_tasklet (Tvrtko)
     Let the compiler itself do the inline (Tvrtko)

v13: Rebase
v14: Rebase, skip checking for the guilty seqno in the tasklet (Tvrtko)

Cc: Antonio Argenziano <antonio.argenziano@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Signed-off-by: Michel Thierry <michel.thierry@intel.com>
Signed-off-by: Carlos Santa <carlos.santa@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.h     |  4 ++
 drivers/gpu/drm/i915/i915_irq.c           | 14 ++++--
 drivers/gpu/drm/i915/i915_reg.h           |  6 +++
 drivers/gpu/drm/i915/i915_reset.c         | 20 +++++++++
 drivers/gpu/drm/i915/i915_reset.h         |  6 +++
 drivers/gpu/drm/i915/intel_engine_cs.c    |  1 +
 drivers/gpu/drm/i915/intel_engine_types.h |  5 +++
 drivers/gpu/drm/i915/intel_hangcheck.c    | 11 +----
 drivers/gpu/drm/i915/intel_lrc.c          | 52 +++++++++++++++++++++++
 9 files changed, 107 insertions(+), 12 deletions(-)

Message ID	20190322234118.65980-3-carlos.santa@intel.com (mailing list archive)
State	New, archived
From	Carlos Santa <carlos.santa@intel.com>
To	intel-gfx@lists.freedesktop.org
Cc	Michel Thierry <michel.thierry@intel.com>
Date	Fri, 22 Mar 2019 16:41:15 -0700
In-Reply-To	<20190322234118.65980-1-carlos.santa@intel.com>
Series	GEN8+ GPU Watchdog Reset Support
  [v5,0/5] GEN8+ GPU Watchdog Reset Support
  [v5,1/5] drm/i915: Add engine reset count in get-reset-stats ioctl
  [v5,2/5] drm/i915: Watchdog timeout: IRQ handler for gen8+
  [v5,3/5] drm/i915: Watchdog timeout: Ringbuffer command emission for gen8+
  [v5,4/5] drm/i915: Watchdog timeout: DRM kernel interface to set the timeout
  [v5,5/5] drm/i915: Watchdog timeout: Include threshold value in error state
