[igt] igt/drv_hangman: Use manual error-state generation

Message ID	20161020090739.6009-1-chris@chris-wilson.co.uk (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Chris Wilson <chris@chris-wilson.co.uk> To: intel-gfx@lists.freedesktop.org Date: Thu, 20 Oct 2016 10:07:39 +0100 Message-Id: <20161020090739.6009-1-chris@chris-wilson.co.uk> Subject: [Intel-gfx] [PATCH igt] igt/drv_hangman: Use manual error-state generation Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Chris Wilson Oct. 20, 2016, 9:07 a.m. UTC

For the basic error state, we only desire that an error state be created
following a hang. For that purpose, we do not need a real hang (slow
6-12s) but can inject one instead (fast <1s).

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 tests/drv_hangman.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

Daniel Vetter Oct. 20, 2016, 9:29 a.m. UTC | #1

On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote:
> For the basic error state, we only desire that an error state be created
> following a hang. For that purpose, we do not need a real hang (slow
> 6-12s) but can inject one instead (fast <1s).
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Should we instead speed up hangcheck? I think there's lots of value in
making sure not just error dumping, but also hang detection works somewhat
in BAT. Since if it doesn't any attempt at a full run will lead to pretty
serious disasters. And I have this dream that BAT is the gating thing
deciding whether a patch series deserves a complete pre-merge run ;-)

But since this is a controlled enviromnent we could make hangcheck
super-fast at timing out with some debugfs knob. Would probably also help
a lot with speeding up the gazillion of testcases in gem_reset_stats.
-Daniel

> ---
>  tests/drv_hangman.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/tests/drv_hangman.c b/tests/drv_hangman.c
> index 953a4c6..bfe5eaf 100644
> --- a/tests/drv_hangman.c
> +++ b/tests/drv_hangman.c
> @@ -32,6 +32,8 @@
>  #include <sys/stat.h>
>  #include <fcntl.h>
>  
> +#include "igt_debugfs.h"
> +
>  #ifndef I915_PARAM_CMD_PARSER_VERSION
>  #define I915_PARAM_CMD_PARSER_VERSION       28
>  #endif
> @@ -166,15 +168,21 @@ static void test_error_state_basic(void)
>  {
>  	int fd;
>  
> -	fd = drm_open_driver(DRIVER_INTEL);
> -
>  	clear_error_state();
>  	assert_error_state_clear();
>  
> -	submit_hang(fd, I915_EXEC_RENDER);
> +	/* Manually trigger a hang by request a reset */
> +	fd = igt_debugfs_open("i915_wedged", O_WRONLY);
> +	igt_ignore_warn(write(fd, "1\n", 2));
> +	close(fd);
> +
> +	/* Wait for the error capture and gpu reset to complete */
> +	fd = drm_open_driver(DRIVER_INTEL);
> +	gem_quiescent_gpu(fd);
>  	close(fd);
>  
>  	assert_error_state_collected();
> +
>  	clear_error_state();
>  	assert_error_state_clear();
>  }
> -- 
> 2.9.3
> 
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Chris Wilson Oct. 20, 2016, 9:46 a.m. UTC | #2

On Thu, Oct 20, 2016 at 11:29:05AM +0200, Daniel Vetter wrote:
> On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote:
> > For the basic error state, we only desire that an error state be created
> > following a hang. For that purpose, we do not need a real hang (slow
> > 6-12s) but can inject one instead (fast <1s).
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> Should we instead speed up hangcheck? I think there's lots of value in
> making sure not just error dumping, but also hang detection works somewhat
> in BAT. Since if it doesn't any attempt at a full run will lead to pretty
> serious disasters. And I have this dream that BAT is the gating thing
> deciding whether a patch series deserves a complete pre-merge run ;-)

We have full-hang detection in BAT elsewhere as well. This particular
test was only asking the question "do we generate an error state", hence
why I felt it was safe to just do that and skip a simulated hang.
 
> But since this is a controlled enviromnent we could make hangcheck
> super-fast at timing out with some debugfs knob. Would probably also help
> a lot with speeding up the gazillion of testcases in gem_reset_stats.

I have considered i915.hangcheck_interval_ms many a time. It is not just
the interval but the hangcheck score threshold to consider. If we can
trust our activity detection, we would be safe with a hangcheck every
jiffie (at some overhead mind you), but we would declare a dos too soon.
-Chris

Chris Wilson Oct. 20, 2016, 10:05 a.m. UTC | #3

On Thu, Oct 20, 2016 at 10:46:01AM +0100, Chris Wilson wrote:
> On Thu, Oct 20, 2016 at 11:29:05AM +0200, Daniel Vetter wrote:
> > On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote:
> > > For the basic error state, we only desire that an error state be created
> > > following a hang. For that purpose, we do not need a real hang (slow
> > > 6-12s) but can inject one instead (fast <1s).
> > > 
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > 
> > Should we instead speed up hangcheck? I think there's lots of value in
> > making sure not just error dumping, but also hang detection works somewhat
> > in BAT. Since if it doesn't any attempt at a full run will lead to pretty
> > serious disasters. And I have this dream that BAT is the gating thing
> > deciding whether a patch series deserves a complete pre-merge run ;-)
> 
> We have full-hang detection in BAT elsewhere as well. This particular
> test was only asking the question "do we generate an error state", hence
> why I felt it was safe to just do that and skip a simulated hang.
>  
> > But since this is a controlled enviromnent we could make hangcheck
> > super-fast at timing out with some debugfs knob. Would probably also help
> > a lot with speeding up the gazillion of testcases in gem_reset_stats.
> 
> I have considered i915.hangcheck_interval_ms many a time. It is not just
> the interval but the hangcheck score threshold to consider. If we can
> trust our activity detection, we would be safe with a hangcheck every
> jiffie (at some overhead mind you), but we would declare a dos too soon.

Thinking of which, Mika did have some patches to move towards a time
accrued metric...
-Chris

Daniel Vetter Oct. 20, 2016, 1:14 p.m. UTC | #4

On Thu, Oct 20, 2016 at 10:46:01AM +0100, Chris Wilson wrote:
> On Thu, Oct 20, 2016 at 11:29:05AM +0200, Daniel Vetter wrote:
> > On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote:
> > > For the basic error state, we only desire that an error state be created
> > > following a hang. For that purpose, we do not need a real hang (slow
> > > 6-12s) but can inject one instead (fast <1s).
> > > 
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > 
> > Should we instead speed up hangcheck? I think there's lots of value in
> > making sure not just error dumping, but also hang detection works somewhat
> > in BAT. Since if it doesn't any attempt at a full run will lead to pretty
> > serious disasters. And I have this dream that BAT is the gating thing
> > deciding whether a patch series deserves a complete pre-merge run ;-)
> 
> We have full-hang detection in BAT elsewhere as well. This particular
> test was only asking the question "do we generate an error state", hence
> why I felt it was safe to just do that and skip a simulated hang.

Hm, is it worth it then in BAT? Or does the other test not check whether
the error capture part was mildly successful? Might be worth it to just
combine them (in BAT) for even more time saved. Either way ack on this.

> > But since this is a controlled enviromnent we could make hangcheck
> > super-fast at timing out with some debugfs knob. Would probably also help
> > a lot with speeding up the gazillion of testcases in gem_reset_stats.
> 
> I have considered i915.hangcheck_interval_ms many a time. It is not just
> the interval but the hangcheck score threshold to consider. If we can
> trust our activity detection, we would be safe with a hangcheck every
> jiffie (at some overhead mind you), but we would declare a dos too soon.

Yeah, we'd still need to tune hangcheck in normal time. But for regression
testing I think a linear speed-up (i.e. just scaling the hangcheck
timeout, but leaving all the cadence and accumulation logic intact). And
ofc also scaling the dos loads with the same linear factor. Maybe a
--normal-time option in gem_reset_stats or similar could then be used for
manual testing of changes, or re-tuning.
-Daniel

Chris Wilson Oct. 20, 2016, 1:22 p.m. UTC | #5

On Thu, Oct 20, 2016 at 03:14:30PM +0200, Daniel Vetter wrote:
> On Thu, Oct 20, 2016 at 10:46:01AM +0100, Chris Wilson wrote:
> > On Thu, Oct 20, 2016 at 11:29:05AM +0200, Daniel Vetter wrote:
> > > On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote:
> > > > For the basic error state, we only desire that an error state be created
> > > > following a hang. For that purpose, we do not need a real hang (slow
> > > > 6-12s) but can inject one instead (fast <1s).
> > > > 
> > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > 
> > > Should we instead speed up hangcheck? I think there's lots of value in
> > > making sure not just error dumping, but also hang detection works somewhat
> > > in BAT. Since if it doesn't any attempt at a full run will lead to pretty
> > > serious disasters. And I have this dream that BAT is the gating thing
> > > deciding whether a patch series deserves a complete pre-merge run ;-)
> > 
> > We have full-hang detection in BAT elsewhere as well. This particular
> > test was only asking the question "do we generate an error state", hence
> > why I felt it was safe to just do that and skip a simulated hang.
> 
> Hm, is it worth it then in BAT? Or does the other test not check whether
> the error capture part was mildly successful? Might be worth it to just
> combine them (in BAT) for even more time saved. Either way ack on this.

No, the other tests are to check we survive a hang, not that we generate
post-mortem error state. This test takes approximately 0.2s on a slow
device (at mild debug levels), and I think is concise enough to keep
separate.
-Chris

Daniel Vetter Oct. 24, 2016, 8:24 a.m. UTC | #6

On Thu, Oct 20, 2016 at 02:22:56PM +0100, Chris Wilson wrote:
> On Thu, Oct 20, 2016 at 03:14:30PM +0200, Daniel Vetter wrote:
> > On Thu, Oct 20, 2016 at 10:46:01AM +0100, Chris Wilson wrote:
> > > On Thu, Oct 20, 2016 at 11:29:05AM +0200, Daniel Vetter wrote:
> > > > On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote:
> > > > > For the basic error state, we only desire that an error state be created
> > > > > following a hang. For that purpose, we do not need a real hang (slow
> > > > > 6-12s) but can inject one instead (fast <1s).
> > > > > 
> > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > > 
> > > > Should we instead speed up hangcheck? I think there's lots of value in
> > > > making sure not just error dumping, but also hang detection works somewhat
> > > > in BAT. Since if it doesn't any attempt at a full run will lead to pretty
> > > > serious disasters. And I have this dream that BAT is the gating thing
> > > > deciding whether a patch series deserves a complete pre-merge run ;-)
> > > 
> > > We have full-hang detection in BAT elsewhere as well. This particular
> > > test was only asking the question "do we generate an error state", hence
> > > why I felt it was safe to just do that and skip a simulated hang.
> > 
> > Hm, is it worth it then in BAT? Or does the other test not check whether
> > the error capture part was mildly successful? Might be worth it to just
> > combine them (in BAT) for even more time saved. Either way ack on this.
> 
> No, the other tests are to check we survive a hang, not that we generate
> post-mortem error state. This test takes approximately 0.2s on a slow
> device (at mild debug levels), and I think is concise enough to keep
> separate.

Ok, makes sense.
-Daniel

[igt] igt/drv_hangman: Use manual error-state generation

Commit Message

Comments

Patch