| Message ID | 1448537675-481-2-git-send-email-daniel.vetter@ffwll.ch (mailing list archive) |
|---|---|
| State | New, archived |
On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> hang we need to be careful to not run into the "hanging too fast
> check":
>
> - don't restore the ban period, but instead keep it at 0.
> - make sure we idle the gpu fully before hanging it again (wait
>   subtest missed that).
>
> With this gem_eio now works reliably even when I don't run the
> subtests individually.
>
> Of course it's a bit fishy that the default ctx gets blamed for
> essentially doing nothing, but until that's figured out in upstream
> it's better to make the test work for now.

This used to be reliable. And just disabling all banning in the kernel
forever more is silly.

During igt_post_hang_ring:
1. we wait upon the hanging batch
   - this returns when hangcheck fires
2. reset the ban period to normal
   - this takes mutex_lock_interruptible and so must wait for the reset
     handler to run before it can make the change,
   - ergo the hanging batch never triggers a ban for itself.
   - (a subsequent nonsimulated GPU hang may trigger the ban though)

Nak.
-Chris
On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > hang we need to be careful to not run into the "hanging too fast
> > check":
> >
> > - don't restore the ban period, but instead keep it at 0.
> > - make sure we idle the gpu fully before hanging it again (wait
> >   subtest missed that).
> >
> > With this gem_eio now works reliably even when I don't run the
> > subtests individually.
> >
> > Of course it's a bit fishy that the default ctx gets blamed for
> > essentially doing nothing, but until that's figured out in upstream
> > it's better to make the test work for now.
>
> This used to be reliable. And just disabling all banning in the kernel
> forever more is silly.
>
> During igt_post_hang_ring:
> 1. we wait upon the hanging batch
>    - this returns when hangcheck fires
> 2. reset the ban period to normal
>    - this takes mutex_lock_interruptible and so must wait for the reset
>      handler to run before it can make the change,
>    - ergo the hanging batch never triggers a ban for itself.
>    - (a subsequent nonsimulated GPU hang may trigger the ban though)

This isn't where it dies. It dies when we do the echo 1 > i915_wedged.
I suspect quiescent_gpu or whatever is getting in the way, but I really
only wanted to get things to run first. And since i915_wedged is a
developer feature, and it does work perfectly if you don't intend to
reuse contexts, I didn't see any point in first trying to fix it up.

So I still maintain that this is a good enough approach, at least if
there's no obvious fix in-flight already.
-Daniel
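[Editor's note: the debugfs step the thread keeps circling around can be exercised from a shell roughly as below. This is a hedged sketch: only $debugfs/i915_wedged itself comes from the thread; the card index (dri/0), the debugfs mount point, and the dry-run wrapper are assumptions for illustration.]

```shell
# Sketch of the "echo 1 > i915_wedged" recovery step discussed above.
# The card index (dri/0) and the debugfs mount point are assumptions
# and vary per machine; the real write needs root and an i915 GPU.
I915_DEBUGFS=/sys/kernel/debug/dri/0

# Dry-run wrapper so the sketch can be read and exercised without
# touching real hardware; drop "run" to actually poke debugfs.
run() { echo "+ $*"; }

# Declare the GPU wedged and kick off reset recovery.
run "echo 1 > $I915_DEBUGFS/i915_wedged"
```

With the wrapper dropped, the real write is what triggers the recovery path under discussion; per Daniel's comment below, it returns -EAGAIN when the GPU is terminally wedged.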
On Thu, Nov 26, 2015 at 03:46:06PM +0100, Daniel Vetter wrote:
> On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> > On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > > hang we need to be careful to not run into the "hanging too fast
> > > check":
> > >
> > > - don't restore the ban period, but instead keep it at 0.
> > > - make sure we idle the gpu fully before hanging it again (wait
> > >   subtest missed that).
> > >
> > > With this gem_eio now works reliably even when I don't run the
> > > subtests individually.
> > >
> > > Of course it's a bit fishy that the default ctx gets blamed for
> > > essentially doing nothing, but until that's figured out in upstream
> > > it's better to make the test work for now.
> >
> > This used to be reliable. And just disabling all banning in the kernel
> > forever more is silly.
> >
> > During igt_post_hang_ring:
> > 1. we wait upon the hanging batch
> >    - this returns when hangcheck fires
> > 2. reset the ban period to normal
> >    - this takes mutex_lock_interruptible and so must wait for the reset
> >      handler to run before it can make the change,
> >    - ergo the hanging batch never triggers a ban for itself.
> >    - (a subsequent nonsimulated GPU hang may trigger the ban though)
>
> This isn't where it dies. It dies when we do the echo 1 > i915_wedged.

That is not where it dies.

> I suspect quiescent_gpu or whatever is getting in the way, but I really
> only wanted to get things to run first. And since i915_wedged is a
> developer feature, and it does work perfectly if you don't intend to
> reuse contexts, I didn't see any point in first trying to fix it up.
>
> So I still maintain that this is a good enough approach, at least if
> there's no obvious fix in-flight already.

No way. This is a kernel regression since 4.0, having just tested with
v4.0 on snb/ivb/hsw.
-Chris
On Thu, Nov 26, 2015 at 03:34:05PM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 03:46:06PM +0100, Daniel Vetter wrote:
> > On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> > > On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > > > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > > > hang we need to be careful to not run into the "hanging too fast
> > > > check":
> > > >
> > > > - don't restore the ban period, but instead keep it at 0.
> > > > - make sure we idle the gpu fully before hanging it again (wait
> > > >   subtest missed that).
> > > >
> > > > With this gem_eio now works reliably even when I don't run the
> > > > subtests individually.
> > > >
> > > > Of course it's a bit fishy that the default ctx gets blamed for
> > > > essentially doing nothing, but until that's figured out in upstream
> > > > it's better to make the test work for now.
> > >
> > > This used to be reliable. And just disabling all banning in the kernel
> > > forever more is silly.
> > >
> > > During igt_post_hang_ring:
> > > 1. we wait upon the hanging batch
> > >    - this returns when hangcheck fires
> > > 2. reset the ban period to normal
> > >    - this takes mutex_lock_interruptible and so must wait for the reset
> > >      handler to run before it can make the change,
> > >    - ergo the hanging batch never triggers a ban for itself.
> > >    - (a subsequent nonsimulated GPU hang may trigger the ban though)
> >
> > This isn't where it dies. It dies when we do the echo 1 > i915_wedged.
>
> That is not where it dies.

Well, at least it happens after we start the hang recovery from
i915_wedged.

> > I suspect quiescent_gpu or whatever is getting in the way, but I really
> > only wanted to get things to run first. And since i915_wedged is a
> > developer feature, and it does work perfectly if you don't intend to
> > reuse contexts, I didn't see any point in first trying to fix it up.
> >
> > So I still maintain that this is a good enough approach, at least if
> > there's no obvious fix in-flight already.
>
> No way. This is a kernel regression since 4.0, having just tested with
> v4.0 on snb/ivb/hsw.

Ok, I didn't realize that. I figured that since i915_wedged will return
-EAGAIN anyway when we are terminally wedged, and that seems to have
been the case ever since we started with reset_counter, this has been
broken forever. I guess I missed something.
-Daniel
On Thu, Nov 26, 2015 at 04:51:13PM +0100, Daniel Vetter wrote:
> On Thu, Nov 26, 2015 at 03:34:05PM +0000, Chris Wilson wrote:
> > On Thu, Nov 26, 2015 at 03:46:06PM +0100, Daniel Vetter wrote:
> > > On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> > > > On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > > > > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > > > > hang we need to be careful to not run into the "hanging too fast
> > > > > check":
> > > > >
> > > > > - don't restore the ban period, but instead keep it at 0.
> > > > > - make sure we idle the gpu fully before hanging it again (wait
> > > > >   subtest missed that).
> > > > >
> > > > > With this gem_eio now works reliably even when I don't run the
> > > > > subtests individually.
> > > > >
> > > > > Of course it's a bit fishy that the default ctx gets blamed for
> > > > > essentially doing nothing, but until that's figured out in upstream
> > > > > it's better to make the test work for now.
> > > >
> > > > This used to be reliable. And just disabling all banning in the kernel
> > > > forever more is silly.
> > > >
> > > > During igt_post_hang_ring:
> > > > 1. we wait upon the hanging batch
> > > >    - this returns when hangcheck fires
> > > > 2. reset the ban period to normal
> > > >    - this takes mutex_lock_interruptible and so must wait for the reset
> > > >      handler to run before it can make the change,
> > > >    - ergo the hanging batch never triggers a ban for itself.
> > > >    - (a subsequent nonsimulated GPU hang may trigger the ban though)
> > >
> > > This isn't where it dies. It dies when we do the echo 1 > i915_wedged.
> >
> > That is not where it dies.
>
> Well, at least it happens after we start the hang recovery from
> i915_wedged.
>
> > > I suspect quiescent_gpu or whatever is getting in the way, but I really
> > > only wanted to get things to run first. And since i915_wedged is a
> > > developer feature, and it does work perfectly if you don't intend to
> > > reuse contexts, I didn't see any point in first trying to fix it up.
> > >
> > > So I still maintain that this is a good enough approach, at least if
> > > there's no obvious fix in-flight already.
> >
> > No way. This is a kernel regression since 4.0, having just tested with
> > v4.0 on snb/ivb/hsw.
>
> Ok, I didn't realize that. I figured that since i915_wedged will return
> -EAGAIN anyway when we are terminally wedged, and that seems to have
> been the case ever since we started with reset_counter, this has been
> broken forever. I guess I missed something.

The bug I see is SNB specific, and introduced between v4.0 and v4.1.
-Chris
On Thu, Nov 26, 2015 at 09:10:57PM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 04:51:13PM +0100, Daniel Vetter wrote:
> > On Thu, Nov 26, 2015 at 03:34:05PM +0000, Chris Wilson wrote:
> > > On Thu, Nov 26, 2015 at 03:46:06PM +0100, Daniel Vetter wrote:
> > > > On Thu, Nov 26, 2015 at 12:59:37PM +0000, Chris Wilson wrote:
> > > > > On Thu, Nov 26, 2015 at 12:34:35PM +0100, Daniel Vetter wrote:
> > > > > > Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
> > > > > > hang we need to be careful to not run into the "hanging too fast
> > > > > > check":
> > > > > >
> > > > > > - don't restore the ban period, but instead keep it at 0.
> > > > > > - make sure we idle the gpu fully before hanging it again (wait
> > > > > >   subtest missed that).
> > > > > >
> > > > > > With this gem_eio now works reliably even when I don't run the
> > > > > > subtests individually.
> > > > > >
> > > > > > Of course it's a bit fishy that the default ctx gets blamed for
> > > > > > essentially doing nothing, but until that's figured out in upstream
> > > > > > it's better to make the test work for now.
> > > > >
> > > > > This used to be reliable. And just disabling all banning in the kernel
> > > > > forever more is silly.
> > > > >
> > > > > During igt_post_hang_ring:
> > > > > 1. we wait upon the hanging batch
> > > > >    - this returns when hangcheck fires
> > > > > 2. reset the ban period to normal
> > > > >    - this takes mutex_lock_interruptible and so must wait for the reset
> > > > >      handler to run before it can make the change,
> > > > >    - ergo the hanging batch never triggers a ban for itself.
> > > > >    - (a subsequent nonsimulated GPU hang may trigger the ban though)
> > > >
> > > > This isn't where it dies. It dies when we do the echo 1 > i915_wedged.
> > >
> > > That is not where it dies.
> >
> > Well, at least it happens after we start the hang recovery from
> > i915_wedged.
> >
> > > > I suspect quiescent_gpu or whatever is getting in the way, but I really
> > > > only wanted to get things to run first. And since i915_wedged is a
> > > > developer feature, and it does work perfectly if you don't intend to
> > > > reuse contexts, I didn't see any point in first trying to fix it up.
> > > >
> > > > So I still maintain that this is a good enough approach, at least if
> > > > there's no obvious fix in-flight already.
> > >
> > > No way. This is a kernel regression since 4.0, having just tested with
> > > v4.0 on snb/ivb/hsw.
> >
> > Ok, I didn't realize that. I figured that since i915_wedged will return
> > -EAGAIN anyway when we are terminally wedged, and that seems to have
> > been the case ever since we started with reset_counter, this has been
> > broken forever. I guess I missed something.
>
> The bug I see is SNB specific, and introduced between v4.0 and v4.1.

The irony: I hacked on gem_eio on snb too ;-)
-Daniel
diff --git a/tests/gem_eio.c b/tests/gem_eio.c
index 8345d1a7a429..6c07d9175b95 100644
--- a/tests/gem_eio.c
+++ b/tests/gem_eio.c
@@ -84,13 +84,17 @@ static void trigger_reset(int fd)
 
 static void wedge_gpu(int fd)
 {
+	igt_hang_ring_t hang;
+
 	/* First idle the GPU then disable GPU resets before injecting a hang */
 	gem_quiescent_gpu(fd);
 
 	igt_require(i915_reset_control(false));
 
 	igt_debug("Wedging GPU by injecting hang\n");
-	igt_post_hang_ring(fd, igt_hang_ring(fd, I915_EXEC_DEFAULT));
+	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
+	hang.ban = 0;
+	igt_post_hang_ring(fd, hang);
 
 	igt_assert(i915_reset_control(true));
 }
@@ -161,10 +165,14 @@ static void test_wait(int fd)
 {
 	igt_hang_ring_t hang;
 
+	/* First idle the GPU then disable GPU resets before injecting a hang */
+	gem_quiescent_gpu(fd);
+
 	igt_require(i915_reset_control(false));
 
 	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
 	igt_assert_eq(__gem_wait(fd, hang.handle, -1), 0);
+	hang.ban = 0;
 	igt_post_hang_ring(fd, hang);
 
 	igt_assert(i915_reset_control(true));
Since $debugfs/i915_wedged restores a wedged gpu by using a normal gpu
hang we need to be careful to not run into the "hanging too fast
check":

- don't restore the ban period, but instead keep it at 0.
- make sure we idle the gpu fully before hanging it again (wait
  subtest missed that).

With this gem_eio now works reliably even when I don't run the
subtests individually.

Of course it's a bit fishy that the default ctx gets blamed for
essentially doing nothing, but until that's figured out in upstream
it's better to make the test work for now.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 tests/gem_eio.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)