tests/gem_eio: Resilience against "hanging too fast"

Message ID	1448468364-20610-1-git-send-email-daniel.vetter@ffwll.ch (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Daniel Vetter <daniel.vetter@ffwll.ch> To: Intel Graphics Development <intel-gfx@lists.freedesktop.org> Date: Wed, 25 Nov 2015 17:19:24 +0100 Message-Id: <1448468364-20610-1-git-send-email-daniel.vetter@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@intel.com>, Daniel Vetter <daniel.vetter@ffwll.ch> Subject: [Intel-gfx] [PATCH] tests/gem_eio: Resilience against "hanging too fast" Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Message ID

1448468364-20610-1-git-send-email-daniel.vetter@ffwll.ch (mailing list archive)

State

New, archived

Headers

From: Daniel Vetter <daniel.vetter@ffwll.ch>
To: Intel Graphics Development <intel-gfx@lists.freedesktop.org>
Date: Wed, 25 Nov 2015 17:19:24 +0100
Message-Id: <1448468364-20610-1-git-send-email-daniel.vetter@ffwll.ch>
Cc: Daniel Vetter <daniel.vetter@intel.com>,
	Daniel Vetter <daniel.vetter@ffwll.ch>
Subject: [Intel-gfx] [PATCH] tests/gem_eio: Resilience against "hanging too
	fast"
Precedence: list
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Commit Message

Daniel Vetter Nov. 25, 2015, 4:19 p.m. UTC

Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
hang we need to be careful to not run into the "hanging too fast
check":

- don't restore the ban period, but instead keep it at 0.
- make sure we idle the gpu fully before hanging it again (wait
  subtest missted that).

With this gem_eio works now reliable even when I don't run the
subtests individually.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 tests/gem_eio.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Comments

Chris Wilson Nov. 25, 2015, 4:28 p.m. UTC | #1

On Wed, Nov 25, 2015 at 05:19:24PM +0100, Daniel Vetter wrote:
> Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> hang we need to be careful to not run into the "hanging too fast
> check":

That's not how it works. It restores the GPU by triggering a manual
reset.
-Chris

Daniel Vetter Nov. 26, 2015, 8:34 a.m. UTC | #2

On Wed, Nov 25, 2015 at 04:28:02PM +0000, Chris Wilson wrote:
> On Wed, Nov 25, 2015 at 05:19:24PM +0100, Daniel Vetter wrote:
> > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > hang we need to be careful to not run into the "hanging too fast
> > check":
> 
> That's not how it works. It restores the GPU by triggering a manual
> reset.

Maybe in your tree, but here it goes through the full hang recovery. And
the gem_eio default context _does_ get banned without this. Maybe it's a
bug in the ban logic, but it didn't look like that. After all
debugfs/i915_wedged is a hack for developers - by that time userspace has
already either died or fallen back to sw rendering anyway. So I figured no
point in fixing this in the kernel with special cases.
-Daniel

Chris Wilson Nov. 26, 2015, 10:05 a.m. UTC | #3

On Thu, Nov 26, 2015 at 09:34:01AM +0100, Daniel Vetter wrote:
> On Wed, Nov 25, 2015 at 04:28:02PM +0000, Chris Wilson wrote:
> > On Wed, Nov 25, 2015 at 05:19:24PM +0100, Daniel Vetter wrote:
> > > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > > hang we need to be careful to not run into the "hanging too fast
> > > check":
> > 
> > That's not how it works. It restores the GPU by triggering a manual
> > reset.
> 
> Maybe in your tree, but here it goes through the full hang recovery.

Right.

> And
> the gem_eio default context _does_ get banned without this.

Wrong.

> Maybe it's a
> bug in the ban logic, but it didn't look like that.

Indeed, that is not where the *kernel* bug is.
-Chris

diff --git a/tests/gem_eio.c b/tests/gem_eio.c
index ad67332eae59..f6e41db63a7d 100644
--- a/tests/gem_eio.c
+++ b/tests/gem_eio.c
@@ -84,13 +84,17 @@  static void trigger_reset(int fd)
 
 static void wedge_gpu(int fd)
 {
+	igt_hang_ring_t hang;
+
 	/* First idle the GPU then disable GPU resets before injecting a hang */
 	gem_quiescent_gpu(fd);
 
 	igt_require(i915_reset_control(false));
 
 	igt_debug("Wedging GPU by injecting hang\n");
-	igt_post_hang_ring(fd, igt_hang_ring(fd, I915_EXEC_DEFAULT));
+	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
+	hang.ban = 0;
+	igt_post_hang_ring(fd, hang);
 
 	igt_assert(i915_reset_control(true));
 }
@@ -161,10 +165,14 @@  static void test_wait(int fd)
 {
 	igt_hang_ring_t hang;
 
+	/* First idle the GPU then disable GPU resets before injecting a hang */
+	gem_quiescent_gpu(fd);
+
 	igt_require(i915_reset_control(false));
 
 	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
 	igt_assert_eq(__gem_wait(fd, hang.handle, -1), -EIO);
+	hang.ban = 0;
 	igt_post_hang_ring(fd, hang);
 
 	igt_assert(i915_reset_control(true));

tests/gem_eio: Resilience against "hanging too fast"

Commit Message

Comments

Patch