diff mbox

[i-g-t] lib: don't hang on blt on snb

Message ID 1501862842-14601-1-git-send-email-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived
Headers show

Commit Message

Daniel Vetter Aug. 4, 2017, 4:07 p.m. UTC
We now have full (or a lot at least) igt running in beta CI, and snb
blt hangs are really unhappy:

- drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
  reliably result in insta-machine death when we try to reset the gpu,
  both on the CI snb and the one I have here.

- Other testcases also randomly (and sometimes rather rarely) die on
  snb.

We can't use the endless batch because that results in a reset failure
and wedged gpu, so also not really better.

Until this works reliably it's best to take the tests out of igt,
since machine death has massive impact in creating noise due to the
per-build sharding changing the victimized tests all the time.

Most tests use igt_hang, but gem_exec_capture needed to be switched to
the igt_require_hang_ring helper.

Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 lib/igt_gt.c             | 4 ++++
 tests/gem_exec_capture.c | 3 ++-
 2 files changed, 6 insertions(+), 1 deletion(-)

Comments

Chris Wilson Aug. 4, 2017, 5:05 p.m. UTC | #1
Quoting Daniel Vetter (2017-08-04 17:07:22)
> We now have full (or a lot at least) igt running in beta CI, and snb
> blt hangs are really unhappy:
> 
> - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
>   reliably result in insta-machine death when we try to reset the gpu,
>   both on the CI snb and the one I have here.
> 
> - Other testcases also randomly (and sometimes rather rarely) die on
>   snb.
> 
> We can't use the endless batch because that results in a reset failure
> and wedged gpu, so also not really better.

It shouldn't be the recursion, but the invalid instruction we use to try
and trigger the hang quicker (otherwise hangcheck may see the advancing
ACTHD and give us longer to escape the loop).

In gem_exec_capture we shouldn't even need that invalid instruction, we
just need the busy batch as we pull the trigger ourselves, and if that
fails to reset a simple recursive batch we have some issues to resolve.
-Chris
Daniel Vetter Aug. 7, 2017, 4:26 p.m. UTC | #2
On Fri, Aug 04, 2017 at 06:05:10PM +0100, Chris Wilson wrote:
> Quoting Daniel Vetter (2017-08-04 17:07:22)
> > We now have full (or a lot at least) igt running in beta CI, and snb
> > blt hangs are really unhappy:
> > 
> > - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
> >   reliably result in insta-machine death when we try to reset the gpu,
> >   both on the CI snb and the one I have here.
> > 
> > - Other testcases also randomly (and sometimes rather rarely) die on
> >   snb.
> > 
> > We can't use the endless batch because that results in a reset failure
> > and wedged gpu, so also not really better.
> 
> It shouldn't be the recursion, but the invalid instruction we use to try
> and trigger the hang quicker (otherwise hangcheck may see the advancing
> ACTHD and give us longer to escape the loop).
> 
> In gem_exec_capture we shouldn't even need that invalid instruction, we
> just need the busy batch as we pull the trigger ourselves, and if that
> fails to reset a simple recursive batch we have some issues to resolve.

Endless loop for haning results in a reset failure on blt as described in
the commit message. We end up with a permanent and unrecoverable -EIO,
which is as deadly to CI as outright killing the machine.
-Daniel
Chris Wilson Aug. 7, 2017, 4:34 p.m. UTC | #3
Quoting Daniel Vetter (2017-08-07 17:26:56)
> On Fri, Aug 04, 2017 at 06:05:10PM +0100, Chris Wilson wrote:
> > Quoting Daniel Vetter (2017-08-04 17:07:22)
> > > We now have full (or a lot at least) igt running in beta CI, and snb
> > > blt hangs are really unhappy:
> > > 
> > > - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
> > >   reliably result in insta-machine death when we try to reset the gpu,
> > >   both on the CI snb and the one I have here.
> > > 
> > > - Other testcases also randomly (and sometimes rather rarely) die on
> > >   snb.
> > > 
> > > We can't use the endless batch because that results in a reset failure
> > > and wedged gpu, so also not really better.
> > 
> > It shouldn't be the recursion, but the invalid instruction we use to try
> > and trigger the hang quicker (otherwise hangcheck may see the advancing
> > ACTHD and give us longer to escape the loop).
> > 
> > In gem_exec_capture we shouldn't even need that invalid instruction, we
> > just need the busy batch as we pull the trigger ourselves, and if that
> > fails to reset a simple recursive batch we have some issues to resolve.
> 
> Endless loop for haning results in a reset failure on blt as described in
> the commit message. We end up with a permanent and unrecoverable -EIO,
> which is as deadly to CI as outright killing the machine.

No, it doesn't. snb-gt1 exhibiting the machine death on invalid blt
instruction as reported, after fixes:

Subtest error-state-basic: SUCCESS (0.001s)
Subtest error-state-capture-render: SUCCESS (7.740s)
Subtest error-state-capture-bsd: SUCCESS (6.024s)
Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-bsd1: SKIP (0.000s)
Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-bsd2: SKIP (0.000s)
Subtest error-state-capture-blt: SUCCESS (13.965s)
Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-vebox: SKIP (0.000s)

Subtest capture-render: SUCCESS (0.003s)
Test requirement not met in function __real_main175, file gem_exec_capture.c:202:
Test requirement: gem_can_store_dword(fd, e->exec_id | e->flags)
Subtest capture-bsd: SKIP (0.000s)
Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-bsd1: SKIP (0.000s)
Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-bsd2: SKIP (0.000s)
Subtest capture-blt: SUCCESS (0.002s)
Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-vebox: SKIP (0.000s)

-Chris
Daniel Vetter Aug. 8, 2017, 9:01 a.m. UTC | #4
On Mon, Aug 7, 2017 at 6:34 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> Quoting Daniel Vetter (2017-08-07 17:26:56)
>> On Fri, Aug 04, 2017 at 06:05:10PM +0100, Chris Wilson wrote:
>> > Quoting Daniel Vetter (2017-08-04 17:07:22)
>> > > We now have full (or a lot at least) igt running in beta CI, and snb
>> > > blt hangs are really unhappy:
>> > >
>> > > - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
>> > >   reliably result in insta-machine death when we try to reset the gpu,
>> > >   both on the CI snb and the one I have here.
>> > >
>> > > - Other testcases also randomly (and sometimes rather rarely) die on
>> > >   snb.
>> > >
>> > > We can't use the endless batch because that results in a reset failure
>> > > and wedged gpu, so also not really better.
>> >
>> > It shouldn't be the recursion, but the invalid instruction we use to try
>> > and trigger the hang quicker (otherwise hangcheck may see the advancing
>> > ACTHD and give us longer to escape the loop).
>> >
>> > In gem_exec_capture we shouldn't even need that invalid instruction, we
>> > just need the busy batch as we pull the trigger ourselves, and if that
>> > fails to reset a simple recursive batch we have some issues to resolve.
>>
>> Endless loop for haning results in a reset failure on blt as described in
>> the commit message. We end up with a permanent and unrecoverable -EIO,
>> which is as deadly to CI as outright killing the machine.
>
> No, it doesn't. snb-gt1 exhibiting the machine death on invalid blt
> instruction as reported, after fixes:

Well my gt2 disagreed, but I guess we can push your patches to igt and
then ask CI whether we need more.
-Daniel

>
> Subtest error-state-basic: SUCCESS (0.001s)
> Subtest error-state-capture-render: SUCCESS (7.740s)
> Subtest error-state-capture-bsd: SUCCESS (6.024s)
> Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-bsd1: SKIP (0.000s)
> Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-bsd2: SKIP (0.000s)
> Subtest error-state-capture-blt: SUCCESS (13.965s)
> Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-vebox: SKIP (0.000s)
>
> Subtest capture-render: SUCCESS (0.003s)
> Test requirement not met in function __real_main175, file gem_exec_capture.c:202:
> Test requirement: gem_can_store_dword(fd, e->exec_id | e->flags)
> Subtest capture-bsd: SKIP (0.000s)
> Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-bsd1: SKIP (0.000s)
> Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-bsd2: SKIP (0.000s)
> Subtest capture-blt: SUCCESS (0.002s)
> Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-vebox: SKIP (0.000s)
>
> -Chris
Chris Wilson Aug. 8, 2017, 9:25 a.m. UTC | #5
Quoting Daniel Vetter (2017-08-08 10:01:59)
> On Mon, Aug 7, 2017 at 6:34 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> > Quoting Daniel Vetter (2017-08-07 17:26:56)
> >> On Fri, Aug 04, 2017 at 06:05:10PM +0100, Chris Wilson wrote:
> >> > Quoting Daniel Vetter (2017-08-04 17:07:22)
> >> > > We now have full (or a lot at least) igt running in beta CI, and snb
> >> > > blt hangs are really unhappy:
> >> > >
> >> > > - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
> >> > >   reliably result in insta-machine death when we try to reset the gpu,
> >> > >   both on the CI snb and the one I have here.
> >> > >
> >> > > - Other testcases also randomly (and sometimes rather rarely) die on
> >> > >   snb.
> >> > >
> >> > > We can't use the endless batch because that results in a reset failure
> >> > > and wedged gpu, so also not really better.
> >> >
> >> > It shouldn't be the recursion, but the invalid instruction we use to try
> >> > and trigger the hang quicker (otherwise hangcheck may see the advancing
> >> > ACTHD and give us longer to escape the loop).
> >> >
> >> > In gem_exec_capture we shouldn't even need that invalid instruction, we
> >> > just need the busy batch as we pull the trigger ourselves, and if that
> >> > fails to reset a simple recursive batch we have some issues to resolve.
> >>
> >> Endless loop for haning results in a reset failure on blt as described in
> >> the commit message. We end up with a permanent and unrecoverable -EIO,
> >> which is as deadly to CI as outright killing the machine.
> >
> > No, it doesn't. snb-gt1 exhibiting the machine death on invalid blt
> > instruction as reported, after fixes:
> 
> Well my gt2 disagreed, but I guess we can push your patches to igt and
> then ask CI whether we need more.

Fine, dug out the snb-gt2,

[ickle@huronriver tests]$ sudo ./drv_hangman 
IGT-Version: 1.19-gcfd42d1 (i686) (Linux: 4.12.0+ i686)
Subtest error-state-sysfs-entry: SUCCESS (0.000s)
Subtest error-state-basic: SUCCESS (0.004s)
Subtest error-state-capture-render: SUCCESS (13.711s)
Subtest error-state-capture-bsd: SUCCESS (8.006s)
Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-bsd1: SKIP (0.000s)
Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-bsd2: SKIP (0.000s)
Subtest error-state-capture-blt: SUCCESS (6.049s)
Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
Test requirement: gem_has_ring(device, ring_id)
Subtest error-state-capture-vebox: SKIP (0.000s)
Test requirement not met in function hangcheck_unterminated, file drv_hangman.c:213:
Test requirement: gem_uses_full_ppgtt(device)
Subtest hangcheck-unterminated: SKIP (0.000s)
[ickle@huronriver tests]$ sudo ./gem_exec_capture 
IGT-Version: 1.19-gcfd42d1 (i686) (Linux: 4.12.0+ i686)
Subtest capture-render: SUCCESS (0.009s)
Test requirement not met in function __real_main175, file gem_exec_capture.c:202:
Test requirement: gem_can_store_dword(fd, e->exec_id | e->flags)
Subtest capture-bsd: SKIP (0.000s)
Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-bsd1: SKIP (0.000s)
Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-bsd2: SKIP (0.000s)
Subtest capture-blt: SUCCESS (0.005s)
Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
Test requirement: gem_has_ring(fd, ring)
Subtest capture-vebox: SKIP (0.000s)

Seems solid to me.
-Chris
Daniel Vetter Aug. 8, 2017, 9:30 a.m. UTC | #6
On Tue, Aug 8, 2017 at 11:25 AM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
> Quoting Daniel Vetter (2017-08-08 10:01:59)
>> On Mon, Aug 7, 2017 at 6:34 PM, Chris Wilson <chris@chris-wilson.co.uk> wrote:
>> > Quoting Daniel Vetter (2017-08-07 17:26:56)
>> >> On Fri, Aug 04, 2017 at 06:05:10PM +0100, Chris Wilson wrote:
>> >> > Quoting Daniel Vetter (2017-08-04 17:07:22)
>> >> > > We now have full (or a lot at least) igt running in beta CI, and snb
>> >> > > blt hangs are really unhappy:
>> >> > >
>> >> > > - drv_hangman@error-state-capture-blt and gem_exec_capture@capture-blt
>> >> > >   reliably result in insta-machine death when we try to reset the gpu,
>> >> > >   both on the CI snb and the one I have here.
>> >> > >
>> >> > > - Other testcases also randomly (and sometimes rather rarely) die on
>> >> > >   snb.
>> >> > >
>> >> > > We can't use the endless batch because that results in a reset failure
>> >> > > and wedged gpu, so also not really better.
>> >> >
>> >> > It shouldn't be the recursion, but the invalid instruction we use to try
>> >> > and trigger the hang quicker (otherwise hangcheck may see the advancing
>> >> > ACTHD and give us longer to escape the loop).
>> >> >
>> >> > In gem_exec_capture we shouldn't even need that invalid instruction, we
>> >> > just need the busy batch as we pull the trigger ourselves, and if that
>> >> > fails to reset a simple recursive batch we have some issues to resolve.
>> >>
>> >> Endless loop for haning results in a reset failure on blt as described in
>> >> the commit message. We end up with a permanent and unrecoverable -EIO,
>> >> which is as deadly to CI as outright killing the machine.
>> >
>> > No, it doesn't. snb-gt1 exhibiting the machine death on invalid blt
>> > instruction as reported, after fixes:
>>
>> Well my gt2 disagreed, but I guess we can push your patches to igt and
>> then ask CI whether we need more.
>
> Fine, dug out the snb-gt2,
>
> [ickle@huronriver tests]$ sudo ./drv_hangman
> IGT-Version: 1.19-gcfd42d1 (i686) (Linux: 4.12.0+ i686)
> Subtest error-state-sysfs-entry: SUCCESS (0.000s)
> Subtest error-state-basic: SUCCESS (0.004s)
> Subtest error-state-capture-render: SUCCESS (13.711s)
> Subtest error-state-capture-bsd: SUCCESS (8.006s)
> Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-bsd1: SKIP (0.000s)
> Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-bsd2: SKIP (0.000s)
> Subtest error-state-capture-blt: SUCCESS (6.049s)
> Test requirement not met in function test_error_state_capture, file drv_hangman.c:187:
> Test requirement: gem_has_ring(device, ring_id)
> Subtest error-state-capture-vebox: SKIP (0.000s)
> Test requirement not met in function hangcheck_unterminated, file drv_hangman.c:213:
> Test requirement: gem_uses_full_ppgtt(device)
> Subtest hangcheck-unterminated: SKIP (0.000s)
> [ickle@huronriver tests]$ sudo ./gem_exec_capture
> IGT-Version: 1.19-gcfd42d1 (i686) (Linux: 4.12.0+ i686)
> Subtest capture-render: SUCCESS (0.009s)
> Test requirement not met in function __real_main175, file gem_exec_capture.c:202:
> Test requirement: gem_can_store_dword(fd, e->exec_id | e->flags)
> Subtest capture-bsd: SKIP (0.000s)
> Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-bsd1: SKIP (0.000s)
> Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-bsd2: SKIP (0.000s)
> Subtest capture-blt: SUCCESS (0.005s)
> Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1642:
> Test requirement: gem_has_ring(fd, ring)
> Subtest capture-vebox: SKIP (0.000s)
>
> Seems solid to me.

Ok I'll retest once your patches have landed, could very well be that
I screwed up something with my looping batch.
-Daniel
diff mbox

Patch

diff --git a/lib/igt_gt.c b/lib/igt_gt.c
index 6f7daa5ef982..99d709fe4086 100644
--- a/lib/igt_gt.c
+++ b/lib/igt_gt.c
@@ -118,6 +118,10 @@  void igt_require_hang_ring(int fd, int ring)
 	if (!igt_check_boolean_env_var("IGT_HANG", true))
 		igt_skip("hang injection disabled by user");
 
+	igt_require_f(ring != I915_EXEC_BLT ||
+		      intel_gen(intel_get_drm_devid(fd)) != 6,
+		      "blt hang can causes insta-death on snb.\n");
+
 	gem_require_ring(fd, ring);
 	gem_context_require_bannable(fd);
 	if (!igt_check_boolean_env_var("IGT_HANG_WITHOUT_RESET", false))
diff --git a/tests/gem_exec_capture.c b/tests/gem_exec_capture.c
index f8f43d2903a9..fb4a5e85f6b2 100644
--- a/tests/gem_exec_capture.c
+++ b/tests/gem_exec_capture.c
@@ -69,6 +69,8 @@  static void capture(int fd, int dir, unsigned ring)
 	uint32_t *batch;
 	int i;
 
+	igt_require_hang_ring(fd, ring);
+
 	memset(obj, 0, sizeof(obj));
 	obj[SCRATCH].handle = gem_create(fd, 4096);
 	obj[CAPTURE].handle = gem_create(fd, 4096);
@@ -166,7 +168,6 @@  igt_main
 			continue;
 
 		igt_subtest_f("capture-%s", e->name) {
-			gem_require_ring(fd, e->exec_id | e->flags);
 			capture(fd, dir, e->exec_id | e->flags);
 		}
 	}