[26/37] drm/i915/dg1: Handle GRF/IC ECC error irq


Message ID

20200521003803.18936-27-lucas.demarchi@intel.com (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8A1542075F
IronPort-SDR: 
 DcGRUwIhFmYCJRRgFW/6pvlL9ZH7RE4evpCot/uRragEC89ap520H4rrW2tX5NgpN2kkP9oTWt
 S/msF0xpWeog==
IronPort-SDR: 
 PQbm9RzLI4qUmQVKbkEZyyjU1qAj7tZdN6F0qXAKnj3SinqDV+sGfHXxl8uS2e0B22j/m6U13k
 egiS6xA3Iqaw==
From: Lucas De Marchi <lucas.demarchi@intel.com>
To: intel-gfx@lists.freedesktop.org
Date: Wed, 20 May 2020 17:37:52 -0700
Message-Id: <20200521003803.18936-27-lucas.demarchi@intel.com>
In-Reply-To: <20200521003803.18936-1-lucas.demarchi@intel.com>
References: <20200521003803.18936-1-lucas.demarchi@intel.com>
MIME-Version: 1.0
Subject: [Intel-gfx] [PATCH 26/37] drm/i915/dg1: Handle GRF/IC ECC error irq
Precedence: list
Cc: fernando.pacheco@intel.com, Matthew Auld <matthew.auld@intel.com>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Series

Introduce DG1

Commit Message

Lucas De Marchi May 21, 2020, 12:37 a.m. UTC

From: Fernando Pacheco <fernando.pacheco@intel.com>

The error detection and correction capability
for GRF and instruction cache (IC) will utilize
the new interrupt and error handling infrastructure
for dgfx products. The GFX device can generate
a number of classes of error under the new
infrastructure: correctable, non-fatal, and
fatal errors.

The non-fatal and fatal error classes distinguish
between levels of severity for uncorrectable errors.
All ECC uncorrectable errors will be reported as
fatal to produce the desired system response. Fatal
errors are expected to route as PCIe error messages
which should result in OS issuing a GFX device FLR.
But the option exists to route fatal errors as
interrupts.

Driver will only handle logging of errors. Anything
more will be handled at system level.

For errors that will route as interrupts, three
bits in the Master Interrupt Register will be used
to convey the class of error.
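
As an illustration of that mapping, here is a minimal standalone sketch in
plain C, not the driver code. It assumes the three class bits sit at bits
26 to 28 of the master control value, in the order correctable, non-fatal,
fatal, which is what the GEN12_ERROR_IRQ(x) macro in this patch encodes.
The names hw_error_class, error_irq_bit() and pending_error_classes() are
made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Error classes, numbered as in the patch's enum hardware_error. */
enum hw_error_class {
	HW_ERR_CORRECTABLE = 0,
	HW_ERR_NONFATAL = 1,
	HW_ERR_FATAL = 2,
	HW_ERR_MAX,
};

/* Master-control bit for a class: bit (26 + class), as in GEN12_ERROR_IRQ(x). */
static uint32_t error_irq_bit(enum hw_error_class cls)
{
	assert(cls < HW_ERR_MAX);
	return UINT32_C(1) << (26 + cls);
}

/* Collect the pending classes into a small mask where bit n means class n. */
static unsigned int pending_error_classes(uint32_t master_ctl)
{
	unsigned int pending = 0;
	int cls;

	for (cls = 0; cls < HW_ERR_MAX; cls++) {
		if (master_ctl & error_irq_bit(cls))
			pending |= 1u << cls;
	}
	return pending;
}
```

With master_ctl = 0x14000000 (bits 28 and 26 set), pending_error_classes()
reports the fatal and correctable classes and ignores every other bit.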

For each class of error:
1. Determine source of error (IP block) by reading
   the Device Error Source Register (RW1C) that
   corresponds to the class of error being serviced.
2. If the generating IP block is GT, read and log the
   GT Error Register (RW1C) that corresponds to the
   class of error being serviced. Non-GT errors will
   be logged in aggregate for now.
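
The two steps above can be sketched against simulated registers. This is a
user-space model, not the driver code: struct fake_error_regs, rw1c_write()
and service_error_class() are hypothetical names, and the only bit taken
from the patch is bit 0 of the source register, which marks GT as the
source (DEV_ERR_STAT_GT_ERROR).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

enum hw_error_class {
	HW_ERR_CORRECTABLE,
	HW_ERR_NONFATAL,
	HW_ERR_FATAL,
	HW_ERR_MAX,
};

/* Bit 0 of the source register flags GT as the reporting IP block. */
#define DEV_ERR_SRC_GT	(UINT32_C(1) << 0)

/* Hypothetical stand-in for hardware: one source and one GT register per class. */
struct fake_error_regs {
	uint32_t dev_err_src[HW_ERR_MAX];
	uint32_t gt_err[HW_ERR_MAX];
};

/* RW1C semantics: writing 1 to a bit clears it, writing 0 leaves it alone. */
static void rw1c_write(uint32_t *reg, uint32_t val)
{
	*reg &= ~val;
}

/* Service one class; returns the GT error bits seen (0 if GT was not a source). */
static uint32_t service_error_class(struct fake_error_regs *regs,
				    enum hw_error_class cls)
{
	uint32_t src, gt = 0;

	assert(cls < HW_ERR_MAX);

	/* Step 1: find which IP block(s) raised the error. */
	src = regs->dev_err_src[cls];
	if (!src)
		return 0;

	/* Step 2: for GT, read, log and clear the per-class GT error register. */
	if (src & DEV_ERR_SRC_GT) {
		gt = regs->gt_err[cls];
		printf("class %d: GT error bits 0x%08x\n", (int)cls, (unsigned int)gt);
		rw1c_write(&regs->gt_err[cls], gt);
	}

	/* Non-GT sources are only logged in aggregate. */
	if (src & ~DEV_ERR_SRC_GT)
		printf("class %d: non-GT sources 0x%08x\n", (int)cls,
		       (unsigned int)(src & ~DEV_ERR_SRC_GT));

	/* Acknowledge the source bits last, once the GT register is cleared. */
	rw1c_write(&regs->dev_err_src[cls], src);
	return gt;
}
```

Writing back exactly the value that was read clears only the bits that were
observed, so an error latched between the read and the write is not lost.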

Bspec: 50875

Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Fernando Pacheco <fernando.pacheco@intel.com>
Cc: Radhakrishna Sripada <radhakrishna.sripada@intel.com>
Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c | 121 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/i915/i915_reg.h |  28 ++++++++
 2 files changed, 149 insertions(+)
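
A quick check of the per-class register selection added to i915_reg.h.
The device error source registers are laid out in descending address order
(fatal lowest, correctable highest) while the GT error registers ascend, so
the two macros end up with opposite strides. The sketch below assumes that
_PICK_EVEN(index, a, b) expands to a + index * (b - a); PICK_EVEN and the
two helper functions are local to the example, and the offsets are copied
from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror i915's _PICK_EVEN(): evenly spaced registers from two samples. */
#define PICK_EVEN(index, a, b)	((a) + (index) * ((b) - (a)))

enum {
	HW_ERR_CORRECTABLE = 0,
	HW_ERR_NONFATAL = 1,
	HW_ERR_FATAL = 2,
	HW_ERR_MAX,
};

/* Offsets copied from the patch. */
#define DEV_ERR_STAT_FATAL		0x100174
#define DEV_ERR_STAT_NONFATAL		0x100178
#define DEV_ERR_STAT_CORRECTABLE	0x10017c

#define ERR_STAT_GT_COR			0x100160
#define ERR_STAT_GT_NONFATAL		0x100164
#define ERR_STAT_GT_FATAL		0x100168

/* Source register offset for a class; the stride is -4 here. */
static uint32_t dev_err_stat_offset(int cls)
{
	assert(cls >= 0 && cls < HW_ERR_MAX);
	return PICK_EVEN(cls, DEV_ERR_STAT_CORRECTABLE, DEV_ERR_STAT_NONFATAL);
}

/* GT error register offset for a class; the stride is +4 here. */
static uint32_t err_stat_gt_offset(int cls)
{
	assert(cls >= 0 && cls < HW_ERR_MAX);
	return PICK_EVEN(cls, ERR_STAT_GT_COR, ERR_STAT_GT_NONFATAL);
}
```

Under that assumption, index 2 (fatal) resolves to 0x100174 for the source
register and 0x100168 for the GT register, matching the explicit
_DEV_ERR_STAT_FATAL and _ERR_STAT_GT_FATAL defines.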

Comments

Chris Wilson May 21, 2020, 8:19 a.m. UTC | #1

Quoting Lucas De Marchi (2020-05-21 01:37:52)
> From: Fernando Pacheco <fernando.pacheco@intel.com>
> 
> The error detection and correction capability
> for GRF and instruction cache (IC) will utilize
> the new interrupt and error handling infrastructure
> for dgfx products. The GFX device can generate
> a number of classes of error under the new
> infrastructure: correctable, non-fatal, and
> fatal errors.
> 
> The non-fatal and fatal error classes distinguish
> between levels of severity for uncorrectable errors.
> All ECC uncorrectable errors will be reported as
> fatal to produce the desired system response. Fatal
> errors are expected to route as PCIe error messages
> which should result in OS issuing a GFX device FLR.
> But the option exists to route fatal errors as
> interrupts.
> 
> Driver will only handle logging of errors. Anything
> more will be handled at system level.
> 
> For errors that will route as interrupts, three
> bits in the Master Interrupt Register will be used
> to convey the class of error.
> 
> For each class of error:
> 1. Determine source of error (IP block) by reading
>    the Device Error Source Register (RW1C) that
>    corresponds to the class of error being serviced.
> 2. If the generating IP block is GT, read and log the
>    GT Error Register (RW1C) that corresponds to the
>    class of error being serviced. Non-GT errors will
>    be logged in aggregate for now.
> 
> Bspec: 50875
> 
> Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Fernando Pacheco <fernando.pacheco@intel.com>
> Cc: Radhakrishna Sripada <radhakrishna.sripada@intel.com>
> Signed-off-by: Fernando Pacheco <fernando.pacheco@intel.com>
> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
> ---
>  drivers/gpu/drm/i915/i915_irq.c | 121 ++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/i915/i915_reg.h |  28 ++++++++
>  2 files changed, 149 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
> index ebc80e8b1599..17e679b910da 100644
> --- a/drivers/gpu/drm/i915/i915_irq.c
> +++ b/drivers/gpu/drm/i915/i915_irq.c
> @@ -2515,6 +2515,124 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg)
>         return IRQ_HANDLED;
>  }
>  
> +static const char *
> +hardware_error_type_to_str(const enum hardware_error hw_err)
> +{
> +       switch (hw_err) {
> +       case HARDWARE_ERROR_CORRECTABLE:
> +               return "CORRECTABLE";
> +       case HARDWARE_ERROR_NONFATAL:
> +               return "NONFATAL";
> +       case HARDWARE_ERROR_FATAL:
> +               return "FATAL";
> +       default:
> +               return "UNKNOWN";
> +       }
> +}
> +
> +static void
> +gen12_gt_hw_error_handler(struct drm_i915_private * const i915,
> +                         const enum hardware_error hw_err)
> +{
> +       void __iomem * const regs = i915->uncore.regs;
> +       const char *hw_err_str = hardware_error_type_to_str(hw_err);
> +       u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
> +       u32 errstat;
> +
> +       lockdep_assert_held(&i915->irq_lock);

Wrong place and wrong locks.
-Chris

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index ebc80e8b1599..17e679b910da 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -2515,6 +2515,124 @@  static irqreturn_t gen8_irq_handler(int irq, void *arg)
 	return IRQ_HANDLED;
 }
 
+static const char *
+hardware_error_type_to_str(const enum hardware_error hw_err)
+{
+	switch (hw_err) {
+	case HARDWARE_ERROR_CORRECTABLE:
+		return "CORRECTABLE";
+	case HARDWARE_ERROR_NONFATAL:
+		return "NONFATAL";
+	case HARDWARE_ERROR_FATAL:
+		return "FATAL";
+	default:
+		return "UNKNOWN";
+	}
+}
+
+static void
+gen12_gt_hw_error_handler(struct drm_i915_private * const i915,
+			  const enum hardware_error hw_err)
+{
+	void __iomem * const regs = i915->uncore.regs;
+	const char *hw_err_str = hardware_error_type_to_str(hw_err);
+	u32 other_errors = ~(EU_GRF_ERROR | EU_IC_ERROR);
+	u32 errstat;
+
+	lockdep_assert_held(&i915->irq_lock);
+
+	errstat = raw_reg_read(regs, ERR_STAT_GT_REG(hw_err));
+
+	if (unlikely(!errstat)) {
+		DRM_ERROR("ERR_STAT_GT_REG_%s blank!\n", hw_err_str);
+		return;
+	}
+
+	/*
+	 * TODO: The GT Non Fatal Error Status Register
+	 * only has reserved bitfields defined.
+	 * Remove once there is something to service.
+	 */
+	if (hw_err == HARDWARE_ERROR_NONFATAL) {
+		DRM_ERROR("detected Non-Fatal hardware error\n");
+		raw_reg_write(regs, ERR_STAT_GT_REG(hw_err), errstat);
+		return;
+	}
+
+	if (errstat & EU_GRF_ERROR)
+		DRM_ERROR("detected EU GRF %s hardware error\n", hw_err_str);
+
+	if (errstat & EU_IC_ERROR)
+		DRM_ERROR("detected EU IC %s hardware error\n", hw_err_str);
+
+	/*
+	 * TODO: The remaining GT errors don't have a
+	 * need for targeted logging at the moment. We
+	 * still want to log detection of these errors, but
+	 * let's aggregate them until someone has a need for them.
+	 */
+	if (errstat & other_errors)
+		DRM_ERROR("detected hardware error(s) in ERR_STAT_GT_REG_%s: 0x%08x\n",
+			  hw_err_str, errstat & other_errors);
+
+	raw_reg_write(regs, ERR_STAT_GT_REG(hw_err), errstat);
+}
+
+static void
+gen12_hw_error_source_handler(struct drm_i915_private * const i915,
+			      const enum hardware_error hw_err)
+{
+	void __iomem * const regs = i915->uncore.regs;
+	const char *hw_err_str = hardware_error_type_to_str(hw_err);
+	u32 errsrc;
+
+	spin_lock(&i915->irq_lock);
+	errsrc = raw_reg_read(regs, DEV_ERR_STAT_REG(hw_err));
+
+	if (unlikely(!errsrc)) {
+		DRM_ERROR("DEV_ERR_STAT_REG_%s blank!\n", hw_err_str);
+		goto out_unlock;
+	}
+
+	if (errsrc & DEV_ERR_STAT_GT_ERROR)
+		gen12_gt_hw_error_handler(i915, hw_err);
+
+	if (errsrc & ~DEV_ERR_STAT_GT_ERROR)
+		DRM_ERROR("non-GT hardware error(s) in DEV_ERR_STAT_REG_%s: 0x%08x\n",
+			  hw_err_str, errsrc & ~DEV_ERR_STAT_GT_ERROR);
+
+	raw_reg_write(regs, DEV_ERR_STAT_REG(hw_err), errsrc);
+
+out_unlock:
+	spin_unlock(&i915->irq_lock);
+}
+
+/*
+ * GEN12+ adds three Error bits to the Master Interrupt
+ * Register to support dgfx card error handling.
+ * These three bits are used to convey the class of error:
+ * FATAL, NONFATAL, or CORRECTABLE.
+ *
+ * To process an interrupt:
+ *	1. Determine source of error (IP block) by reading
+ *	   the Device Error Source Register (RW1C) that
+ *	   corresponds to the class of error being serviced.
+ *	2. For GT as the generating IP block, read and log
+ *	   the GT Error Register (RW1C) that corresponds to
+ *	   the class of error being serviced.
+ */
+static void
+gen12_hw_error_irq_handler(struct drm_i915_private * const i915,
+			   const u32 master_ctl)
+{
+	enum hardware_error hw_err;
+
+	for (hw_err = 0; hw_err < HARDWARE_ERROR_MAX; hw_err++) {
+		if (master_ctl & GEN12_ERROR_IRQ(hw_err))
+			gen12_hw_error_source_handler(i915, hw_err);
+	}
+}
+
 static u32
 gen11_gu_misc_irq_ack(struct intel_gt *gt, const u32 master_ctl)
 {
@@ -2597,6 +2715,9 @@  __gen11_irq_handler(struct drm_i915_private * const i915,
 	/* Find, queue (onto bottom-halves), then clear each source */
 	gen11_gt_irq_handler(gt, master_ctl);
 
+	if (IS_DG1(i915))
+		gen12_hw_error_irq_handler(i915, master_ctl);
+
 	/* IRQs are synced during runtime_suspend, we don't require a wakeref */
 	if (master_ctl & GEN11_DISPLAY_IRQ)
 		gen11_display_irq_handler(i915);
diff --git a/drivers/gpu/drm/i915/i915_reg.h b/drivers/gpu/drm/i915/i915_reg.h
index e0bd9e02c3d1..40cb361b4254 100644
--- a/drivers/gpu/drm/i915/i915_reg.h
+++ b/drivers/gpu/drm/i915/i915_reg.h
@@ -7647,6 +7647,10 @@  enum {
 #define  GEN11_MASTER_IRQ		(1 << 31)
 #define  GEN11_PCU_IRQ			(1 << 30)
 #define  GEN11_GU_MISC_IRQ		(1 << 29)
+#define  GEN12_FATAL_ERROR_IRQ		(1 << 28)
+#define  GEN12_NON_FATAL_ERROR_IRQ	(1 << 27)
+#define  GEN12_CORRECTABLE_ERROR_IRQ	(1 << 26)
+#define  GEN12_ERROR_IRQ(x)		(1 << (26 + (x)))
 #define  GEN11_DISPLAY_IRQ		(1 << 16)
 #define  GEN11_GT_DW_IRQ(x)		(1 << (x))
 #define  GEN11_GT_DW1_IRQ		(1 << 1)
@@ -7738,6 +7742,30 @@  enum {
 
 #define GEN11_IIR_REG_SELECTOR(x)	_MMIO(0x190070 + ((x) * 4))
 
+enum hardware_error {
+	HARDWARE_ERROR_CORRECTABLE = 0,
+	HARDWARE_ERROR_NONFATAL = 1,
+	HARDWARE_ERROR_FATAL = 2,
+	HARDWARE_ERROR_MAX,
+};
+
+#define _DEV_ERR_STAT_FATAL		0x100174
+#define _DEV_ERR_STAT_NONFATAL		0x100178
+#define _DEV_ERR_STAT_CORRECTABLE	0x10017c
+#define DEV_ERR_STAT_REG(x)		_MMIO(_PICK_EVEN((x), \
+						_DEV_ERR_STAT_CORRECTABLE, \
+						_DEV_ERR_STAT_NONFATAL))
+#define  DEV_ERR_STAT_GT_ERROR		(1 << 0)
+
+#define _ERR_STAT_GT_COR		0x100160
+#define _ERR_STAT_GT_NONFATAL		0x100164
+#define _ERR_STAT_GT_FATAL		0x100168
+#define ERR_STAT_GT_REG(x)		_MMIO(_PICK_EVEN((x), \
+						_ERR_STAT_GT_COR, \
+						_ERR_STAT_GT_NONFATAL))
+#define  EU_GRF_ERROR			(1 << 15)
+#define  EU_IC_ERROR			(1 << 14)
+
 #define GEN11_RENDER_COPY_INTR_ENABLE	_MMIO(0x190030)
 #define GEN11_VCS_VECS_INTR_ENABLE	_MMIO(0x190034)
 #define GEN11_GUC_SG_INTR_ENABLE	_MMIO(0x190038)
