diff mbox

[i-g-t] lib/igt_gt.c : allow changes to stop_rings mode bits

Message ID 1436533566-11048-1-git-send-email-tim.gore@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

tim.gore@intel.com July 10, 2015, 1:06 p.m. UTC
From: Tim Gore <tim.gore@intel.com>

In function igt_set_stop_rings, the test
  igt_assert_f(flags == 0 || current == 0, ..

will fail if we are trying to force a hang but the
STOP_RINGS_ALLOW_BAN or STOP_RINGS_ALLOW_ERROR bit is set.
With the introduction of per ring resets in the driver
(in android) these bits do not get cleared to zero when
an individual ring is reset. This causes subsequent
attempt to cause a ring hang via this function to fail,
leading to several igt tests failing (ie gem_reset_stats
subtest ban-xxx etc).

So, modify this test to look only at the bits that are
used to hang the gpu rings.

Signed-off-by: Tim Gore <tim.gore@intel.com>
---
 lib/igt_gt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Daniel Vetter July 13, 2015, 9:30 a.m. UTC | #1
On Fri, Jul 10, 2015 at 02:06:06PM +0100, tim.gore@intel.com wrote:
> From: Tim Gore <tim.gore@intel.com>
> 
> In function igt_set_stop_rings, the test
>   igt_assert_f(flags == 0 || current == 0, ..
> 
> will fail if we are trying to force a hang but the
> STOP_RINGS_ALLOW_BAN or STOP_RINGS_ALLOW_ERROR bit is set.
> With the introduction of per ring resets in the driver
> (in android) these bits do not get cleared to zero when
> an individual ring is reset. This causes subsequent
> attempt to cause a ring hang via this function to fail,
> leading to several igt tests failing (ie gem_reset_stats
> subtest ban-xxx etc).

Fix tdr to reset these instead?
-Daniel

> So, modify this test to look only at the bits that are
> used to hang the gpu rings.
> 
> Signed-off-by: Tim Gore <tim.gore@intel.com>
> ---
>  lib/igt_gt.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/igt_gt.c b/lib/igt_gt.c
> index 8e5e076..12c56fa 100644
> --- a/lib/igt_gt.c
> +++ b/lib/igt_gt.c
> @@ -345,8 +345,8 @@ void igt_set_stop_rings(enum stop_ring_flags flags)
>  			      STOP_RING_ALLOW_ERRORS)) == 0);
>  
>  	current = igt_get_stop_rings();
> -	igt_assert_f(flags == 0 || current == 0,
> -		     "previous i915_ring_stop is still 0x%x\n", current);
> +	igt_assert_f( (flags & STOP_RING_ALL) == 0 || (current & STOP_RING_ALL) == 0,
> +			     "previous i915_ring_stop is still 0x%x\n", current);
>  
>  	stop_rings_write(flags);
>  	current = igt_get_stop_rings();
> -- 
> 1.9.1
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/intel-gfx
tim.gore@intel.com July 13, 2015, 9:43 a.m. UTC | #2
Tim Gore 
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ

> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: Monday, July 13, 2015 10:30 AM
> To: Gore, Tim
> Cc: intel-gfx@lists.freedesktop.org; Wood, Thomas
> Subject: Re: [Intel-gfx] [PATCH i-g-t] lib/igt_gt.c : allow changes to stop_rings
> mode bits
> 
> On Fri, Jul 10, 2015 at 02:06:06PM +0100, tim.gore@intel.com wrote:
> > From: Tim Gore <tim.gore@intel.com>
> >
> > In function igt_set_stop_rings, the test
> >   igt_assert_f(flags == 0 || current == 0, ..
> >
> > will fail if we are trying to force a hang but the
> > STOP_RINGS_ALLOW_BAN or STOP_RINGS_ALLOW_ERROR bit is set.
> > With the introduction of per ring resets in the driver (in android)
> > these bits do not get cleared to zero when an individual ring is
> > reset. This causes subsequent attempt to cause a ring hang via this
> > function to fail, leading to several igt tests failing (ie
> > gem_reset_stats subtest ban-xxx etc).
> 
> Fix tdr to reset these instead?
> -Daniel
> 
I could change tdr, but why. When the TDR handles a ring hang and resets the ring,
why would it modify the flag that defines if the driver should ban a frequently hanging
context? If we get rid of the stop_rings interface, as Chris Wilson suggested, we would
still need to keep the STOP_RING_ALLOW_BAN/ALLOW_ERRORS bits in debugfs,
but you would not expect to have to re-write these bits each time there is a ring
reset. 

  Tim

> > So, modify this test to look only at the bits that are used to hang
> > the gpu rings.
> >
> > Signed-off-by: Tim Gore <tim.gore@intel.com>
> > ---
> >  lib/igt_gt.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/igt_gt.c b/lib/igt_gt.c index 8e5e076..12c56fa 100644
> > --- a/lib/igt_gt.c
> > +++ b/lib/igt_gt.c
> > @@ -345,8 +345,8 @@ void igt_set_stop_rings(enum stop_ring_flags flags)
> >  			      STOP_RING_ALLOW_ERRORS)) == 0);
> >
> >  	current = igt_get_stop_rings();
> > -	igt_assert_f(flags == 0 || current == 0,
> > -		     "previous i915_ring_stop is still 0x%x\n", current);
> > +	igt_assert_f( (flags & STOP_RING_ALL) == 0 || (current &
> STOP_RING_ALL) == 0,
> > +			     "previous i915_ring_stop is still 0x%x\n", current);
> >
> >  	stop_rings_write(flags);
> >  	current = igt_get_stop_rings();
> > --
> > 1.9.1
> >
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Daniel Vetter July 13, 2015, 2:59 p.m. UTC | #3
On Mon, Jul 13, 2015 at 09:43:11AM +0000, Gore, Tim wrote:
> 
> 
> Tim Gore 
> Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ
> 
> > -----Original Message-----
> > From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> > Vetter
> > Sent: Monday, July 13, 2015 10:30 AM
> > To: Gore, Tim
> > Cc: intel-gfx@lists.freedesktop.org; Wood, Thomas
> > Subject: Re: [Intel-gfx] [PATCH i-g-t] lib/igt_gt.c : allow changes to stop_rings
> > mode bits
> > 
> > On Fri, Jul 10, 2015 at 02:06:06PM +0100, tim.gore@intel.com wrote:
> > > From: Tim Gore <tim.gore@intel.com>
> > >
> > > In function igt_set_stop_rings, the test
> > >   igt_assert_f(flags == 0 || current == 0, ..
> > >
> > > will fail if we are trying to force a hang but the
> > > STOP_RINGS_ALLOW_BAN or STOP_RINGS_ALLOW_ERROR bit is set.
> > > With the introduction of per ring resets in the driver (in android)
> > > these bits do not get cleared to zero when an individual ring is
> > > reset. This causes subsequent attempt to cause a ring hang via this
> > > function to fail, leading to several igt tests failing (ie
> > > gem_reset_stats subtest ban-xxx etc).
> > 
> > Fix tdr to reset these instead?
> > -Daniel
> > 
> I could change tdr, but why. When the TDR handles a ring hang and resets the ring,
> why would it modify the flag that defines if the driver should ban a frequently hanging
> context? If we get rid of the stop_rings interface, as Chris Wilson suggested, we would
> still need to keep the STOP_RING_ALLOW_BAN/ALLOW_ERRORS bits in debugfs,
> but you would not expect to have to re-write these bits each time there is a ring
> reset. 

The fix current hang recover code to no reset this, add some grace period,
then push this patch to igt. We don't have full-blown abi guarantees for
debugfs/igt stuff, but I want at least a few months (really last released
kernel&igt) of backwards/forward compatibility. And inconsistent behaviour
isn't great imo.
-Daniel
tim.gore@intel.com July 13, 2015, 4:07 p.m. UTC | #4
Tim Gore 
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ


> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: Monday, July 13, 2015 3:59 PM
> To: Gore, Tim
> Cc: Daniel Vetter; intel-gfx@lists.freedesktop.org; Wood, Thomas
> Subject: Re: [Intel-gfx] [PATCH i-g-t] lib/igt_gt.c : allow changes to stop_rings
> mode bits
> 
> On Mon, Jul 13, 2015 at 09:43:11AM +0000, Gore, Tim wrote:
> >
> >
> > Tim Gore
> > Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon
> > SN3 1RJ
> >
> > > -----Original Message-----
> > > From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of
> > > Daniel Vetter
> > > Sent: Monday, July 13, 2015 10:30 AM
> > > To: Gore, Tim
> > > Cc: intel-gfx@lists.freedesktop.org; Wood, Thomas
> > > Subject: Re: [Intel-gfx] [PATCH i-g-t] lib/igt_gt.c : allow changes
> > > to stop_rings mode bits
> > >
> > > On Fri, Jul 10, 2015 at 02:06:06PM +0100, tim.gore@intel.com wrote:
> > > > From: Tim Gore <tim.gore@intel.com>
> > > >
> > > > In function igt_set_stop_rings, the test
> > > >   igt_assert_f(flags == 0 || current == 0, ..
> > > >
> > > > will fail if we are trying to force a hang but the
> > > > STOP_RINGS_ALLOW_BAN or STOP_RINGS_ALLOW_ERROR bit is set.
> > > > With the introduction of per ring resets in the driver (in
> > > > android) these bits do not get cleared to zero when an individual
> > > > ring is reset. This causes subsequent attempt to cause a ring hang
> > > > via this function to fail, leading to several igt tests failing
> > > > (ie gem_reset_stats subtest ban-xxx etc).
> > >
> > > Fix tdr to reset these instead?
> > > -Daniel
> > >
> > I could change tdr, but why. When the TDR handles a ring hang and
> > resets the ring, why would it modify the flag that defines if the
> > driver should ban a frequently hanging context? If we get rid of the
> > stop_rings interface, as Chris Wilson suggested, we would still need
> > to keep the STOP_RING_ALLOW_BAN/ALLOW_ERRORS bits in debugfs, but
> you
> > would not expect to have to re-write these bits each time there is a ring
> reset.
> 
> The fix current hang recover code to no reset this, add some grace period,
> then push this patch to igt. We don't have full-blown abi guarantees for
> debugfs/igt stuff, but I want at least a few months (really last released
> kernel&igt) of backwards/forward compatibility. And inconsistent behaviour
> isn't great imo.
> -Daniel

Sorry Daniel, I didn't really follow that.
I didn't want a gpu reset to clear the ALLOW_BAN bit, since this will defeat the
point of this bit, except perhaps in test situations where you can keep setting it
each time you deliberately cause a hang. It seems like the ALLOW_BAN bit has
uses in real world situations, although I don't know it anyone uses it.

Would you be more comfortable with
Igt_assert_f ( ( flags ==0 ) || 
     (( current & STOP_RING_ALL) ==0)  && ((current ^ flags) & ~ STOP_RING_ALL == 0 ) )

So the either the new flags must be 0 (currently allowed) or the existing flags must
indicate that  all hangs are cleared (0 except possibly the mode bits) AND the mode
bits you are writing are the same as the current values. ??

  Tim
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
Daniel Vetter July 14, 2015, 8:08 a.m. UTC | #5
On Mon, Jul 13, 2015 at 04:07:14PM +0000, Gore, Tim wrote:
> 
> 
> Tim Gore 
> Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ
> 
> 
> > -----Original Message-----
> > From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> > Vetter
> > Sent: Monday, July 13, 2015 3:59 PM
> > To: Gore, Tim
> > Cc: Daniel Vetter; intel-gfx@lists.freedesktop.org; Wood, Thomas
> > Subject: Re: [Intel-gfx] [PATCH i-g-t] lib/igt_gt.c : allow changes to stop_rings
> > mode bits
> > 
> > On Mon, Jul 13, 2015 at 09:43:11AM +0000, Gore, Tim wrote:
> > >
> > >
> > > Tim Gore
> > > Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon
> > > SN3 1RJ
> > >
> > > > -----Original Message-----
> > > > From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of
> > > > Daniel Vetter
> > > > Sent: Monday, July 13, 2015 10:30 AM
> > > > To: Gore, Tim
> > > > Cc: intel-gfx@lists.freedesktop.org; Wood, Thomas
> > > > Subject: Re: [Intel-gfx] [PATCH i-g-t] lib/igt_gt.c : allow changes
> > > > to stop_rings mode bits
> > > >
> > > > On Fri, Jul 10, 2015 at 02:06:06PM +0100, tim.gore@intel.com wrote:
> > > > > From: Tim Gore <tim.gore@intel.com>
> > > > >
> > > > > In function igt_set_stop_rings, the test
> > > > >   igt_assert_f(flags == 0 || current == 0, ..
> > > > >
> > > > > will fail if we are trying to force a hang but the
> > > > > STOP_RINGS_ALLOW_BAN or STOP_RINGS_ALLOW_ERROR bit is set.
> > > > > With the introduction of per ring resets in the driver (in
> > > > > android) these bits do not get cleared to zero when an individual
> > > > > ring is reset. This causes subsequent attempt to cause a ring hang
> > > > > via this function to fail, leading to several igt tests failing
> > > > > (ie gem_reset_stats subtest ban-xxx etc).
> > > >
> > > > Fix tdr to reset these instead?
> > > > -Daniel
> > > >
> > > I could change tdr, but why. When the TDR handles a ring hang and
> > > resets the ring, why would it modify the flag that defines if the
> > > driver should ban a frequently hanging context? If we get rid of the
> > > stop_rings interface, as Chris Wilson suggested, we would still need
> > > to keep the STOP_RING_ALLOW_BAN/ALLOW_ERRORS bits in debugfs, but
> > you
> > > would not expect to have to re-write these bits each time there is a ring
> > reset.
> > 
> > The fix current hang recover code to no reset this, add some grace period,
> > then push this patch to igt. We don't have full-blown abi guarantees for
> > debugfs/igt stuff, but I want at least a few months (really last released
> > kernel&igt) of backwards/forward compatibility. And inconsistent behaviour
> > isn't great imo.
> > -Daniel
> 
> Sorry Daniel, I didn't really follow that.
> I didn't want a gpu reset to clear the ALLOW_BAN bit, since this will defeat the
> point of this bit, except perhaps in test situations where you can keep setting it
> each time you deliberately cause a hang. It seems like the ALLOW_BAN bit has
> uses in real world situations, although I don't know it anyone uses it.

ALLOW_BAN is only for testing. It's meant to re-enable auto-banning
because we disable that by default for automated tests. This is all meant
to be used together with the stop_rings stuff only.

> Would you be more comfortable with
> Igt_assert_f ( ( flags ==0 ) || 
>      (( current & STOP_RING_ALL) ==0)  && ((current ^ flags) & ~ STOP_RING_ALL == 0 ) )
> 
> So the either the new flags must be 0 (currently allowed) or the existing flags must
> indicate that  all hangs are cleared (0 except possibly the mode bits) AND the mode
> bits you are writing are the same as the current values. ??

Iirc gem_reset_stats uses stop_rings only to supress gpu reset warnings in
dmesg (since they're expected) and not for the actual stop_rings logic. It
sets stop_rings after submitting the hanging batch, but before that one is
detected as hung. We could just add another bit to stop_rings for that
case.

What I still don't understand is why tdr can't just keep on properly
resetting stop_rings. It'll break tons of existing tests, and I don't
understand the upside. stop_rings has become quasi-abi (that's why the
ALLOW bits have such funky semantics, it's for backwards compat), if you
need to change it you need to extend it, not break it.
-Daniel
tim.gore@intel.com July 14, 2015, 8:55 a.m. UTC | #6
Tim Gore 
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ

> -----Original Message-----
> From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of Daniel
> Vetter
> Sent: Tuesday, July 14, 2015 9:09 AM
> To: Gore, Tim
> Cc: Daniel Vetter; intel-gfx@lists.freedesktop.org; Wood, Thomas
> Subject: Re: [Intel-gfx] [PATCH i-g-t] lib/igt_gt.c : allow changes to stop_rings
> mode bits
> 
> On Mon, Jul 13, 2015 at 04:07:14PM +0000, Gore, Tim wrote:
> >
> >
> > Tim Gore
> > Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon
> > SN3 1RJ
> >
> >
> > > -----Original Message-----
> > > From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of
> > > Daniel Vetter
> > > Sent: Monday, July 13, 2015 3:59 PM
> > > To: Gore, Tim
> > > Cc: Daniel Vetter; intel-gfx@lists.freedesktop.org; Wood, Thomas
> > > Subject: Re: [Intel-gfx] [PATCH i-g-t] lib/igt_gt.c : allow changes
> > > to stop_rings mode bits
> > >
> > > On Mon, Jul 13, 2015 at 09:43:11AM +0000, Gore, Tim wrote:
> > > >
> > > >
> > > > Tim Gore
> > > > Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way,
> > > > Swindon
> > > > SN3 1RJ
> > > >
> > > > > -----Original Message-----
> > > > > From: Daniel Vetter [mailto:daniel.vetter@ffwll.ch] On Behalf Of
> > > > > Daniel Vetter
> > > > > Sent: Monday, July 13, 2015 10:30 AM
> > > > > To: Gore, Tim
> > > > > Cc: intel-gfx@lists.freedesktop.org; Wood, Thomas
> > > > > Subject: Re: [Intel-gfx] [PATCH i-g-t] lib/igt_gt.c : allow
> > > > > changes to stop_rings mode bits
> > > > >
> > > > > On Fri, Jul 10, 2015 at 02:06:06PM +0100, tim.gore@intel.com wrote:
> > > > > > From: Tim Gore <tim.gore@intel.com>
> > > > > >
> > > > > > In function igt_set_stop_rings, the test
> > > > > >   igt_assert_f(flags == 0 || current == 0, ..
> > > > > >
> > > > > > will fail if we are trying to force a hang but the
> > > > > > STOP_RINGS_ALLOW_BAN or STOP_RINGS_ALLOW_ERROR bit is
> set.
> > > > > > With the introduction of per ring resets in the driver (in
> > > > > > android) these bits do not get cleared to zero when an
> > > > > > individual ring is reset. This causes subsequent attempt to
> > > > > > cause a ring hang via this function to fail, leading to
> > > > > > several igt tests failing (ie gem_reset_stats subtest ban-xxx etc).
> > > > >
> > > > > Fix tdr to reset these instead?
> > > > > -Daniel
> > > > >
> > > > I could change tdr, but why. When the TDR handles a ring hang and
> > > > resets the ring, why would it modify the flag that defines if the
> > > > driver should ban a frequently hanging context? If we get rid of
> > > > the stop_rings interface, as Chris Wilson suggested, we would
> > > > still need to keep the STOP_RING_ALLOW_BAN/ALLOW_ERRORS bits in
> > > > debugfs, but
> > > you
> > > > would not expect to have to re-write these bits each time there is
> > > > a ring
> > > reset.
> > >
> > > The fix current hang recover code to no reset this, add some grace
> > > period, then push this patch to igt. We don't have full-blown abi
> > > guarantees for debugfs/igt stuff, but I want at least a few months
> > > (really last released
> > > kernel&igt) of backwards/forward compatibility. And inconsistent
> > > behaviour isn't great imo.
> > > -Daniel
> >
> > Sorry Daniel, I didn't really follow that.
> > I didn't want a gpu reset to clear the ALLOW_BAN bit, since this will
> > defeat the point of this bit, except perhaps in test situations where
> > you can keep setting it each time you deliberately cause a hang. It
> > seems like the ALLOW_BAN bit has uses in real world situations, although I
> don't know it anyone uses it.
> 
> ALLOW_BAN is only for testing. It's meant to re-enable auto-banning
> because we disable that by default for automated tests. This is all meant to
> be used together with the stop_rings stuff only.
> 
> > Would you be more comfortable with
> > Igt_assert_f ( ( flags ==0 ) ||
> >      (( current & STOP_RING_ALL) ==0)  && ((current ^ flags) & ~
> > STOP_RING_ALL == 0 ) )
> >
> > So the either the new flags must be 0 (currently allowed) or the
> > existing flags must indicate that  all hangs are cleared (0 except
> > possibly the mode bits) AND the mode bits you are writing are the same as
> the current values. ??
> 
> Iirc gem_reset_stats uses stop_rings only to supress gpu reset warnings in
> dmesg (since they're expected) and not for the actual stop_rings logic. It sets
> stop_rings after submitting the hanging batch, but before that one is
> detected as hung. We could just add another bit to stop_rings for that case.
> 
> What I still don't understand is why tdr can't just keep on properly resetting
> stop_rings. It'll break tons of existing tests, and I don't understand the
> upside. stop_rings has become quasi-abi (that's why the ALLOW bits have
> such funky semantics, it's for backwards compat), if you need to change it
> you need to extend it, not break it.
> -Daniel
> --

OK, I hadn't spotted that it was a debug only flag. In that case it is just up to the
tests to keep setting it. It only seems to be used in gem_reset_stats anyway. So
I'll change TDR to clear ALLOW_BAN/WARN when stop_bits gets cleared to 0.

  Tim

> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
diff mbox

Patch

diff --git a/lib/igt_gt.c b/lib/igt_gt.c
index 8e5e076..12c56fa 100644
--- a/lib/igt_gt.c
+++ b/lib/igt_gt.c
@@ -345,8 +345,8 @@  void igt_set_stop_rings(enum stop_ring_flags flags)
 			      STOP_RING_ALLOW_ERRORS)) == 0);
 
 	current = igt_get_stop_rings();
-	igt_assert_f(flags == 0 || current == 0,
-		     "previous i915_ring_stop is still 0x%x\n", current);
+	igt_assert_f( (flags & STOP_RING_ALL) == 0 || (current & STOP_RING_ALL) == 0,
+			     "previous i915_ring_stop is still 0x%x\n", current);
 
 	stop_rings_write(flags);
 	current = igt_get_stop_rings();