Message ID | 1448467099-1876-1-git-send-email-daniel.vetter@ffwll.ch (mailing list archive)
---|---
State | New, archived
On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> This testcase tries to validate -EIO behaviour by disabling gpu reset
> support in the kernel. Except that the wait subtest forgot to do that,
> and therefore gets a return value of 0 instead of the expected -EIO.

Wrong. It was intentionally not using reset=false.
-Chris
On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > support in the kernel. Except that the wait subtest forgot to do that,
> > and therefore gets a return value of 0 instead of the expected -EIO.
>
> Wrong. It was intentionally not using reset=false.

To be more precise, the reason here is that we are not wedging the GPU
but the expectation is that a wait upon a request that hangs reports the
hang. Since the wait on GPU activity is explicit in the ioctl, the
presumption is that the user actually cares about that activity and so
should be given greater information about how it completes (timeout, GPU
hung, or success).
-Chris
On Wed, Nov 25, 2015 at 04:34:13PM +0000, Chris Wilson wrote:
> On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> > On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > > support in the kernel. Except that the wait subtest forgot to do that,
> > > and therefore gets a return value of 0 instead of the expected -EIO.
> >
> > Wrong. It was intentionally not using reset=false.
>
> To be more precise, the reason here is that we are not wedging the GPU
> but the expectation is that a wait upon a request that hangs reports the
> hang. Since the wait on GPU activity is explicit in the ioctl, the
> presumption is that the user actually cares about that activity and so
> should be given greater information about how it completes (timeout, GPU
> hung, or success).

The only place we report hangs is in the reset_stats_ioctl. And
fundamentally wait_ioctl can't do (right now) what you want, since if the
reset recovery happens before userspace calls wait_ioctl then it will
happily return 0 for success.

So if you want this then we need:
- a reason for userspace to want this
- reorg all our reset handling and move (or well copy) the reset stats to
  every object

I don't see the point in that.
-Daniel
On Thu, Nov 26, 2015 at 09:36:14AM +0100, Daniel Vetter wrote:
> On Wed, Nov 25, 2015 at 04:34:13PM +0000, Chris Wilson wrote:
> > On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> > > On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > > > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > > > support in the kernel. Except that the wait subtest forgot to do that,
> > > > and therefore gets a return value of 0 instead of the expected -EIO.
> > >
> > > Wrong. It was intentionally not using reset=false.
> >
> > To be more precise, the reason here is that we are not wedging the GPU
> > but the expectation is that a wait upon a request that hangs reports the
> > hang. Since the wait on GPU activity is explicit in the ioctl, the
> > presumption is that the user actually cares about that activity and so
> > should be given greater information about how it completes (timeout, GPU
> > hung, or success).
>
> The only place we report hangs is in the reset_stats_ioctl. And
> fundamentally wait_ioctl can't do (right now) what you want, since if the
> reset recovery happens before userspace calls wait_ioctl then it will
> happily return 0 for success.
>
> So if you want this then we need:
> - a reason for userspace to want this
> - reorg all our reset handling and move (or well copy) the reset stats to
>   every object

I buy that if it resets before the wait_ioctl then it is invisible,
which makes the information incomplete. I still think that as a caller
of wait_ioctl, knowing that my wait was broken by a reset is valuable
information. The drivers use some form of waiting to throttle
themselves, therefore it is a convenient way of having the information
that a reset has occurred and take action. This usecase also wouldn't
miss a reset happening.

You may disagree about wait_ioctl, but a wait_fence must definitely
report EIO. (And this is as close to wait_fence as we currently have in
the code and in use today.)
-Chris
On Thu, Nov 26, 2015 at 10:03:01AM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 09:36:14AM +0100, Daniel Vetter wrote:
> > On Wed, Nov 25, 2015 at 04:34:13PM +0000, Chris Wilson wrote:
> > > On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> > > > On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > > > > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > > > > support in the kernel. Except that the wait subtest forgot to do that,
> > > > > and therefore gets a return value of 0 instead of the expected -EIO.
> > > >
> > > > Wrong. It was intentionally not using reset=false.
> > >
> > > To be more precise, the reason here is that we are not wedging the GPU
> > > but the expectation is that a wait upon a request that hangs reports the
> > > hang. Since the wait on GPU activity is explicit in the ioctl, the
> > > presumption is that the user actually cares about that activity and so
> > > should be given greater information about how it completes (timeout, GPU
> > > hung, or success).
> >
> > The only place we report hangs is in the reset_stats_ioctl. And
> > fundamentally wait_ioctl can't do (right now) what you want, since if the
> > reset recovery happens before userspace calls wait_ioctl then it will
> > happily return 0 for success.
> >
> > So if you want this then we need:
> > - a reason for userspace to want this
> > - reorg all our reset handling and move (or well copy) the reset stats to
> >   every object
>
> I buy that if it resets before the wait_ioctl then it is invisible,
> which makes the information incomplete. I still think that as a caller
> of wait_ioctl, knowing that my wait was broken by a reset is valuable
> information. The drivers use some form of waiting to throttle
> themselves, therefore it is a convenient way of having the information
> that a reset has occurred and take action. This usecase also wouldn't
> miss a reset happening.

Iirc we even discussed this when doing arb robustness, but the trouble
is that arb robustness really wants its reset stats per context. Until
that changes (with fences or not) I think we should just curb the EIO
from wait_ioctl like in all the other places.

> You may disagree about wait_ioctl, but a wait_fence must definitely
> report EIO. (And this is as close to wait_fence as we currently have in
> the code and in use today.)

Hm, the plan for fence waiting is to use the android syncpt stuff. I
didn't check whether that bothers with reporting failures, but should be
simple to add. And definitely makes sense.
-Daniel
diff --git a/tests/gem_eio.c b/tests/gem_eio.c
index a24c8f1c53b5..ad67332eae59 100644
--- a/tests/gem_eio.c
+++ b/tests/gem_eio.c
@@ -161,10 +161,14 @@ static void test_wait(int fd)
 {
 	igt_hang_ring_t hang;
 
+	igt_require(i915_reset_control(false));
+
 	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
 
 	igt_assert_eq(__gem_wait(fd, hang.handle, -1), -EIO);
 
 	igt_post_hang_ring(fd, hang);
 
+	igt_assert(i915_reset_control(true));
+	trigger_reset(fd);
 }
This testcase tries to validate -EIO behaviour by disabling gpu reset
support in the kernel. Except that the wait subtest forgot to do that,
and therefore gets a return value of 0 instead of the expected -EIO.

With this fix gem_eio passes on a kernel with my fixes completely.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 tests/gem_eio.c | 4 ++++
 1 file changed, 4 insertions(+)