[RFC,7/8] drm/i915: Skip CSB processing on invalid CSB tail

Message ID	20180321172625.6415-8-jeff.mcgee@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: jeff.mcgee@intel.com To: intel-gfx@lists.freedesktop.org Date: Wed, 21 Mar 2018 10:26:24 -0700 Message-Id: <20180321172625.6415-8-jeff.mcgee@intel.com> In-Reply-To: <20180321172625.6415-1-jeff.mcgee@intel.com> References: <20180321172625.6415-1-jeff.mcgee@intel.com> Subject: [Intel-gfx] [RFC 7/8] drm/i915: Skip CSB processing on invalid CSB tail Precedence: list Cc: ben@bwidawsk.net, kalyan.kondapally@intel.com MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Message ID

20180321172625.6415-8-jeff.mcgee@intel.com (mailing list archive)

State

New, archived

Headers

From: jeff.mcgee@intel.com
To: intel-gfx@lists.freedesktop.org
Date: Wed, 21 Mar 2018 10:26:24 -0700
Message-Id: <20180321172625.6415-8-jeff.mcgee@intel.com>
In-Reply-To: <20180321172625.6415-1-jeff.mcgee@intel.com>
References: <20180321172625.6415-1-jeff.mcgee@intel.com>
Subject: [Intel-gfx] [RFC 7/8] drm/i915: Skip CSB processing on invalid CSB
	tail
Precedence: list
Cc: ben@bwidawsk.net, kalyan.kondapally@intel.com
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Commit Message

jeff.mcgee@intel.com March 21, 2018, 5:26 p.m. UTC

From: Jeff McGee <jeff.mcgee@intel.com>

Engine reset is fast. A context switch interrupt may be generated just
prior to the reset such that the top half handler is racing with reset
post-processing. The handler may set the irq_posted bit again after
the reset code has cleared it to start fresh. Then the re-enabled
tasklet will read the CSB head and tail from MMIO, which will be at
the hardware reset values of 0 and 7 respectively, given that no
actual CSB event has occurred since the reset. Mayhem then ensues as
the tasklet starts processing invalid CSB entries.

We can handle this corner case without adding any new synchronization
between the irq top half and the reset work item. The tasklet can
just skip CSB processing if the tail is not sane.

Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
---
 drivers/gpu/drm/i915/intel_lrc.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

jeff.mcgee@intel.com March 21, 2018, 5:31 p.m. UTC | #1

On Wed, Mar 21, 2018 at 10:26:24AM -0700, jeff.mcgee@intel.com wrote:
> From: Jeff McGee <jeff.mcgee@intel.com>
> 
> Engine reset is fast. A context switch interrupt may be generated just
> prior to the reset such that the top half handler is racing with reset
> post-processing. The handler may set the irq_posted bit again after
> the reset code has cleared it to start fresh. Then the re-enabled
> tasklet will read the CSB head and tail from MMIO, which will be at
> the hardware reset values of 0 and 7 respectively, given that no
> actual CSB event has occurred since the reset. Mayhem then ensues as
> the tasklet starts processing invalid CSB entries.
> 
> We can handle this corner case without adding any new synchronization
> between the irq top half and the reset work item. The tasklet can
> just skip CSB processing if the tail is not sane.
> 
> Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> ---
If I drop this patch and substitute https://patchwork.freedesktop.org/patch/211831/
I will see irq_posted get set after reset which causes the first tasklet
run to re-process a previous CSB event and hit GEM_BUG_ON that nothing
was active.
-Jeff

Chris Wilson March 21, 2018, 6:12 p.m. UTC | #2

Quoting Jeff McGee (2018-03-21 17:31:45)
> On Wed, Mar 21, 2018 at 10:26:24AM -0700, jeff.mcgee@intel.com wrote:
> > From: Jeff McGee <jeff.mcgee@intel.com>
> > 
> > Engine reset is fast. A context switch interrupt may be generated just
> > prior to the reset such that the top half handler is racing with reset
> > post-processing. The handler may set the irq_posted bit again after
> > the reset code has cleared it to start fresh. Then the re-enabled
> > tasklet will read the CSB head and tail from MMIO, which will be at
> > the hardware reset values of 0 and 7 respectively, given that no
> > actual CSB event has occurred since the reset. Mayhem then ensues as
> > the tasklet starts processing invalid CSB entries.
> > 
> > We can handle this corner case without adding any new synchronization
> > between the irq top half and the reset work item. The tasklet can
> > just skip CSB processing if the tail is not sane.
> > 
> > Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> > ---
> If I drop this patch and substitute https://patchwork.freedesktop.org/patch/211831/
> I will see irq_posted get set after reset which causes the first tasklet
> run to re-process a previous CSB event and hit GEM_BUG_ON that nothing
> was active.

However, for reset+sync to be followed by an interrupt is surprising.
What more do we need to do after the reset to flush the last interrupt?

What I've been trying to get right is disabling the RING_IMR across the
reset so that we do not get any new interrupts generated for this engine.
(So couple that also with a sync_irq.) We've been kicking around that
plan for yonks, since the beginning of doing per-engine reset.
-Chris

Chris Wilson March 21, 2018, 7:06 p.m. UTC | #3

Quoting Chris Wilson (2018-03-21 18:12:51)
> Quoting Jeff McGee (2018-03-21 17:31:45)
> > On Wed, Mar 21, 2018 at 10:26:24AM -0700, jeff.mcgee@intel.com wrote:
> > > From: Jeff McGee <jeff.mcgee@intel.com>
> > > 
> > > Engine reset is fast. A context switch interrupt may be generated just
> > > prior to the reset such that the top half handler is racing with reset
> > > post-processing. The handler may set the irq_posted bit again after
> > > the reset code has cleared it to start fresh. Then the re-enabled
> > > tasklet will read the CSB head and tail from MMIO, which will be at
> > > the hardware reset values of 0 and 7 respectively, given that no
> > > actual CSB event has occurred since the reset. Mayhem then ensues as
> > > the tasklet starts processing invalid CSB entries.
> > > 
> > > We can handle this corner case without adding any new synchronization
> > > between the irq top half and the reset work item. The tasklet can
> > > just skip CSB processing if the tail is not sane.
> > > 
> > > Signed-off-by: Jeff McGee <jeff.mcgee@intel.com>
> > > ---
> > If I drop this patch and substitute https://patchwork.freedesktop.org/patch/211831/
> > I will see irq_posted get set after reset which causes the first tasklet
> > run to re-process a previous CSB event and hit GEM_BUG_ON that nothing
> > was active.
> 
> However, for reset+sync to be followed by an interrupt is surprising.
> What more do we need to do after the reset to flush the last interrupt?

Actually, it may not be a late interrupt, just a late cacheline flush
from one processor to another. __set_bit bites.
-Chris

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index cec4e1653daf..d420c2ecb50a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -865,6 +865,14 @@  static void process_csb(struct intel_engine_cs *engine)
 		head = readl(i915->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
 		tail = GEN8_CSB_WRITE_PTR(head);
 		head = GEN8_CSB_READ_PTR(head);
+
+		/* The MMIO read CSB tail may be at the reset value of
+		 * 0x7 if there hasn't been a valid CSB event since
+		 * the engine reset.
+		 */
+		if (tail >= GEN8_CSB_ENTRIES)
+			goto out;
+
 		execlists->csb_head = head;
 	} else {
 		const int write_idx =
@@ -873,6 +881,7 @@  static void process_csb(struct intel_engine_cs *engine)
 
 		head = execlists->csb_head;
 		tail = READ_ONCE(buf[write_idx]);
+		GEM_BUG_ON(tail >= GEN8_CSB_ENTRIES);
 	}
 	GEM_TRACE("%s cs-irq head=%d [%d%s], tail=%d [%d%s]\n",
 		  engine->name,
@@ -981,7 +990,7 @@  static void process_csb(struct intel_engine_cs *engine)
 		writel(_MASKED_FIELD(GEN8_CSB_READ_PTR_MASK, head << 8),
 		       i915->regs + i915_mmio_reg_offset(RING_CONTEXT_STATUS_PTR(engine)));
 	}
-
+out:
 	if (unlikely(fw))
 		intel_uncore_forcewake_put(i915, execlists->fw_domains);
 }

[RFC,7/8] drm/i915: Skip CSB processing on invalid CSB tail

Commit Message

Comments

Patch