
tests/gem_eio: Disable reset for wait subtests

Message ID 1448467099-1876-1-git-send-email-daniel.vetter@ffwll.ch (mailing list archive)
State New, archived

Commit Message

Daniel Vetter Nov. 25, 2015, 3:58 p.m. UTC
This testcase tries to validate -EIO behaviour by disabling gpu reset
support in the kernel. Except that the wait subtest forgot to do that,
and therefore gets a return value of 0 instead of the expected -EIO.

With this fix, gem_eio passes completely on a kernel with my fixes.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 tests/gem_eio.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Chris Wilson Nov. 25, 2015, 4:29 p.m. UTC | #1
On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> This testcase tries to validate -EIO behaviour by disabling gpu reset
> support in the kernel. Except that the wait subtest forgot to do that,
> and therefore gets a return value of 0 instead of the expected -EIO.
> 

Wrong. It was intentionally not using reset=false.
-Chris
Chris Wilson Nov. 25, 2015, 4:34 p.m. UTC | #2
On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > support in the kernel. Except that the wait subtest forgot to do that,
> > and therefore gets a return value of 0 instead of the expected -EIO.
> > 
> 
> Wrong. It was intentionally not using reset=false.

To be more precise, the reason here is that we are not wedging the GPU
but the expectation is that a wait upon a request that hangs reports the
hang. Since the wait on GPU activity is explicit in the ioctl, the
presumption is that the user actually cares about that activity and so
should be given greater information about how it completes (timeout, GPU
hung, or success).
-Chris
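
To make the three outcomes Chris lists concrete, a caller of the wait
ioctl could classify them roughly as in the sketch below. This is
illustrative only: it assumes the stock drm_i915_gem_wait interface and
the wait-reports-the-hang semantics argued for above, and classify_wait
is a hypothetical helper, not code from gem_eio or the kernel.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

#include <i915_drm.h>

/* Hypothetical helper: report how a wait on a buffer object completed.
 * Assumes the semantics discussed above: 0 on success, ETIME when the
 * timeout expires, EIO when the request was broken by a GPU hang. */
static int classify_wait(int fd, uint32_t handle, int64_t timeout_ns)
{
	struct drm_i915_gem_wait wait = {
		.bo_handle = handle,
		.timeout_ns = timeout_ns,
	};

	if (ioctl(fd, DRM_IOCTL_I915_GEM_WAIT, &wait) == 0) {
		printf("wait: GPU work completed successfully\n");
		return 0;
	}

	switch (errno) {
	case ETIME:
		printf("wait: timed out, request still busy\n");
		break;
	case EIO:
		printf("wait: broken by a GPU hang\n");
		break;
	default:
		printf("wait: failed (%s)\n", strerror(errno));
		break;
	}

	return -errno;
}
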
Daniel Vetter Nov. 26, 2015, 8:36 a.m. UTC | #3
On Wed, Nov 25, 2015 at 04:34:13PM +0000, Chris Wilson wrote:
> On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> > On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > > support in the kernel. Except that the wait subtest forgot to do that,
> > > and therefore gets a return value of 0 instead of the expected -EIO.
> > > 
> > 
> > Wrong. It was intentionally not using reset=false.
> 
> To be more precise, the reason here is that we are not wedging the GPU
> but the expectation is that a wait upon a request that hangs reports the
> hang. Since the wait on GPU activity is explicit in the ioctl, the
> presumption is that the user actually cares about that activity and so
> should be given greater information about how it completes (timeout, GPU
> hung, or success).

The only place we report hangs is in the reset_stats_ioctl. And
fundamentally wait_ioctl can't do (right now) what you want, since if the
reset recovery happens before userspace calls wait_ioctl then it will
happily return 0 for success.

So if you want this then we need:
- a reason for userspace to want this
- reorg all our reset handling and move (or well copy) the reset stats to
  every object

I don't see the point in that.
-Daniel
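
For reference, the reset_stats_ioctl Daniel mentions is queried per
context with DRM_IOCTL_I915_GET_RESET_STATS, roughly as sketched below.
The helper name and error handling are illustrative, not taken from any
real client; the privilege note is an assumption about older kernels.

#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

#include <i915_drm.h>

/* Illustrative helper: read the reset statistics for a context.
 * ctx_id 0 asks about the default context; on some kernels the global
 * reset_count is only reported to privileged callers, so a
 * robustness-aware client would normally pass its own context id. */
static int query_reset_stats(int fd, uint32_t ctx_id,
			     struct drm_i915_reset_stats *stats)
{
	memset(stats, 0, sizeof(*stats));
	stats->ctx_id = ctx_id;

	if (ioctl(fd, DRM_IOCTL_I915_GET_RESET_STATS, stats))
		return -errno;

	/* reset_count: how often the GPU has been reset.
	 * batch_active/batch_pending: how many of this context's batches
	 * were running or queued when those resets happened. */
	return 0;
}

Sampling reset_count before submitting work and again after a wait
returns is one way to close the race window described above, at the cost
of an extra ioctl per wait.
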
Chris Wilson Nov. 26, 2015, 10:03 a.m. UTC | #4
On Thu, Nov 26, 2015 at 09:36:14AM +0100, Daniel Vetter wrote:
> On Wed, Nov 25, 2015 at 04:34:13PM +0000, Chris Wilson wrote:
> > On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> > > On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > > > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > > > support in the kernel. Except that the wait subtest forgot to do that,
> > > > and therefore gets a return value of 0 instead of the expected -EIO.
> > > > 
> > > 
> > > Wrong. It was intentionally not using reset=false.
> > 
> > To be more precise, the reason here is that we are not wedging the GPU
> > but the expectation is that a wait upon a request that hangs reports the
> > hang. Since the wait on GPU activity is explicit in the ioctl, the
> > presumption is that the user actually cares about that activity and so
> > should be given greater information about how it completes (timeout, GPU
> > hung, or success).
> 
> The only place we report hangs is in the reset_stats_ioctl. And
> fundamentally wait_ioctl can't do (right now) what you want, since if the
> reset recovery happens before userspace calls wait_ioctl then it will
> happily return 0 for success.
> 
> So if you want this then we need:
> - a reason for userspace to want this
> - reorg all our reset handling and move (or well copy) the reset stats to
>   every object

I buy that if it resets before the wait_ioctl then it is invisible,
which makes the information incomplete. I still think that as a caller
of wait_ioctl, knowing that my wait was broken by a reset is valuable
information. Drivers use some form of waiting to throttle themselves,
so a wait is a convenient point at which to learn that a reset has
occurred and to take action. This use case also wouldn't miss a reset
happening.

You may disagree about wait_ioctl, but a wait_fence must definitely
report EIO. (And this is as close to wait_fence as we currently have in
the code and in use today.)
-Chris
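
The throttling pattern Chris has in mind might look something like the
sketch below: block on an older batch and treat a hang-broken wait as a
distinct outcome. The function and the enum are hypothetical; only the
ioctl and the proposed -EIO behaviour come from the discussion above.

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

#include <i915_drm.h>

enum throttle_result { THROTTLE_OK, THROTTLE_HUNG, THROTTLE_ERROR };

/* Hypothetical driver-style throttle: block until an older batch buffer
 * retires before queuing more work.  If the wait is broken by a GPU
 * hang, surface that to the caller instead of folding it into success,
 * so the client can rebuild any lost state. */
static enum throttle_result throttle_on_bo(int fd, uint32_t bo_handle)
{
	struct drm_i915_gem_wait wait = {
		.bo_handle = bo_handle,
		.timeout_ns = -1,	/* wait until the batch retires */
	};

	if (ioctl(fd, DRM_IOCTL_I915_GEM_WAIT, &wait) == 0)
		return THROTTLE_OK;

	return errno == EIO ? THROTTLE_HUNG : THROTTLE_ERROR;
}
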
Daniel Vetter Nov. 26, 2015, 10:41 a.m. UTC | #5
On Thu, Nov 26, 2015 at 10:03:01AM +0000, Chris Wilson wrote:
> On Thu, Nov 26, 2015 at 09:36:14AM +0100, Daniel Vetter wrote:
> > On Wed, Nov 25, 2015 at 04:34:13PM +0000, Chris Wilson wrote:
> > > On Wed, Nov 25, 2015 at 04:29:01PM +0000, Chris Wilson wrote:
> > > > On Wed, Nov 25, 2015 at 04:58:19PM +0100, Daniel Vetter wrote:
> > > > > This testcase tries to validate -EIO behaviour by disabling gpu reset
> > > > > support in the kernel. Except that the wait subtest forgot to do that,
> > > > > and therefore gets a return value of 0 instead of the expected -EIO.
> > > > > 
> > > > 
> > > > Wrong. It was intentionally not using reset=false.
> > > 
> > > To be more precise, the reason here is that we are not wedging the GPU
> > > but the expectation is that a wait upon a request that hangs reports the
> > > hang. Since the wait on GPU activity is explicit in the ioctl, the
> > > presumption is that the user actually cares about that activity and so
> > > should be given greater information about how it completes (timeout, GPU
> > > hung, or success).
> > 
> > The only place we report hangs is in the reset_stats_ioctl. And
> > fundamentally wait_ioctl can't do (right now) what you want, since if the
> > reset recovery happens before userspace calls wait_ioctl then it will
> > happily return 0 for success.
> > 
> > So if you want this then we need:
> > - a reason for userspace to want this
> > - reorg all our reset handling and move (or well copy) the reset stats to
> >   every object
> 
> I buy that if it resets before the wait_ioctl then it is invisible,
> which makes the information incomplete. I still think that as a caller
> of wait_ioctl, knowing that my wait was broken by a reset is valuable
> information. Drivers use some form of waiting to throttle themselves,
> so a wait is a convenient point at which to learn that a reset has
> occurred and to take action. This use case also wouldn't miss a reset
> happening.

Iirc we even discussed this when doing arb robustness, but the trouble is
that arb robustness really wants its reset stats per context. Until that
changes (with fences or not) I think we should just curb the EIO from
wait_ioctl like in all the other places.

> You may disagree about wait_ioctl, but a wait_fence must definitely
> report EIO. (And this is as close to wait_fence as we currently have in
> the code and in use today.)

Hm, the plan for fence waiting is to use the android syncpt stuff. I
didn't check whether that bothers with reporting failures, but it should
be simple to add. And it definitely makes sense.
-Daniel

Patch

diff --git a/tests/gem_eio.c b/tests/gem_eio.c
index a24c8f1c53b5..ad67332eae59 100644
--- a/tests/gem_eio.c
+++ b/tests/gem_eio.c
@@ -161,10 +161,14 @@  static void test_wait(int fd)
 {
 	igt_hang_ring_t hang;
 
+	igt_require(i915_reset_control(false));
+
 	hang = igt_hang_ring(fd, I915_EXEC_DEFAULT);
 	igt_assert_eq(__gem_wait(fd, hang.handle, -1), -EIO);
 	igt_post_hang_ring(fd, hang);
 
+	igt_assert(i915_reset_control(true));
+
 	trigger_reset(fd);
 }
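
The i915_reset_control() calls added above are defined elsewhere in
gem_eio.c. A helper of that shape typically just toggles the i915 reset
module parameter; the sketch below shows the general idea, with the
sysfs path and the write format being assumptions rather than the test's
actual implementation.

#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/* Sketch of a reset-control helper: enable or disable GPU reset support
 * by writing the i915 "reset" module parameter.  The path and format
 * are assumptions; the real helper in gem_eio.c may differ. */
static bool i915_reset_control_sketch(bool enable)
{
	const char *path = "/sys/module/i915/parameters/reset";
	bool ok;
	int fd;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return false;	/* parameter missing or not writable */

	ok = write(fd, enable ? "1" : "0", 1) == 1;
	close(fd);

	return ok;
}

The test then wraps the disable in igt_require() (skip if the knob is
unavailable) and the re-enable in igt_assert(), as seen in the hunk
above.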