Message ID | 20161020090739.6009-1-chris@chris-wilson.co.uk (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote: > For the basic error state, we only desire that an error state be created > following a hang. For that purpose, we do not need a real hang (slow > 6-12s) but can inject one instead (fast <1s). > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Should we instead speed up hangcheck? I think there's lots of value in making sure not just error dumping, but also hang detection works somewhat in BAT. Since if it doesn't any attempt at a full run will lead to pretty serious disasters. And I have this dream that BAT is the gating thing deciding whether a patch series deserves a complete pre-merge run ;-) But since this is a controlled enviromnent we could make hangcheck super-fast at timing out with some debugfs knob. Would probably also help a lot with speeding up the gazillion of testcases in gem_reset_stats. -Daniel > --- > tests/drv_hangman.c | 14 +++++++++++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/tests/drv_hangman.c b/tests/drv_hangman.c > index 953a4c6..bfe5eaf 100644 > --- a/tests/drv_hangman.c > +++ b/tests/drv_hangman.c > @@ -32,6 +32,8 @@ > #include <sys/stat.h> > #include <fcntl.h> > > +#include "igt_debugfs.h" > + > #ifndef I915_PARAM_CMD_PARSER_VERSION > #define I915_PARAM_CMD_PARSER_VERSION 28 > #endif > @@ -166,15 +168,21 @@ static void test_error_state_basic(void) > { > int fd; > > - fd = drm_open_driver(DRIVER_INTEL); > - > clear_error_state(); > assert_error_state_clear(); > > - submit_hang(fd, I915_EXEC_RENDER); > + /* Manually trigger a hang by request a reset */ > + fd = igt_debugfs_open("i915_wedged", O_WRONLY); > + igt_ignore_warn(write(fd, "1\n", 2)); > + close(fd); > + > + /* Wait for the error capture and gpu reset to complete */ > + fd = drm_open_driver(DRIVER_INTEL); > + gem_quiescent_gpu(fd); > close(fd); > > assert_error_state_collected(); > + > clear_error_state(); > assert_error_state_clear(); > } > -- > 2.9.3 > > _______________________________________________ > Intel-gfx mailing list > Intel-gfx@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
On Thu, Oct 20, 2016 at 11:29:05AM +0200, Daniel Vetter wrote: > On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote: > > For the basic error state, we only desire that an error state be created > > following a hang. For that purpose, we do not need a real hang (slow > > 6-12s) but can inject one instead (fast <1s). > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > Should we instead speed up hangcheck? I think there's lots of value in > making sure not just error dumping, but also hang detection works somewhat > in BAT. Since if it doesn't any attempt at a full run will lead to pretty > serious disasters. And I have this dream that BAT is the gating thing > deciding whether a patch series deserves a complete pre-merge run ;-) We have full-hang detection in BAT elsewhere as well. This particular test was only asking the question "do we generate an error state", hence why I felt it was safe to just do that and skip a simulated hang. > But since this is a controlled enviromnent we could make hangcheck > super-fast at timing out with some debugfs knob. Would probably also help > a lot with speeding up the gazillion of testcases in gem_reset_stats. I have considered i915.hangcheck_interval_ms many a time. It is not just the interval but the hangcheck score threshold to consider. If we can trust our activity detection, we would be safe with a hangcheck every jiffie (at some overhead mind you), but we would declare a dos too soon. -Chris
On Thu, Oct 20, 2016 at 10:46:01AM +0100, Chris Wilson wrote: > On Thu, Oct 20, 2016 at 11:29:05AM +0200, Daniel Vetter wrote: > > On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote: > > > For the basic error state, we only desire that an error state be created > > > following a hang. For that purpose, we do not need a real hang (slow > > > 6-12s) but can inject one instead (fast <1s). > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > > Should we instead speed up hangcheck? I think there's lots of value in > > making sure not just error dumping, but also hang detection works somewhat > > in BAT. Since if it doesn't any attempt at a full run will lead to pretty > > serious disasters. And I have this dream that BAT is the gating thing > > deciding whether a patch series deserves a complete pre-merge run ;-) > > We have full-hang detection in BAT elsewhere as well. This particular > test was only asking the question "do we generate an error state", hence > why I felt it was safe to just do that and skip a simulated hang. > > > But since this is a controlled enviromnent we could make hangcheck > > super-fast at timing out with some debugfs knob. Would probably also help > > a lot with speeding up the gazillion of testcases in gem_reset_stats. > > I have considered i915.hangcheck_interval_ms many a time. It is not just > the interval but the hangcheck score threshold to consider. If we can > trust our activity detection, we would be safe with a hangcheck every > jiffie (at some overhead mind you), but we would declare a dos too soon. Thinking of which, Mika did have some patches to move towards a time accrued metric... -Chris
On Thu, Oct 20, 2016 at 10:46:01AM +0100, Chris Wilson wrote: > On Thu, Oct 20, 2016 at 11:29:05AM +0200, Daniel Vetter wrote: > > On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote: > > > For the basic error state, we only desire that an error state be created > > > following a hang. For that purpose, we do not need a real hang (slow > > > 6-12s) but can inject one instead (fast <1s). > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > > Should we instead speed up hangcheck? I think there's lots of value in > > making sure not just error dumping, but also hang detection works somewhat > > in BAT. Since if it doesn't any attempt at a full run will lead to pretty > > serious disasters. And I have this dream that BAT is the gating thing > > deciding whether a patch series deserves a complete pre-merge run ;-) > > We have full-hang detection in BAT elsewhere as well. This particular > test was only asking the question "do we generate an error state", hence > why I felt it was safe to just do that and skip a simulated hang. Hm, is it worth it then in BAT? Or does the other test not check whether the error capture part was mildly successful? Might be worth it to just combine them (in BAT) for even more time saved. Either way ack on this. > > But since this is a controlled enviromnent we could make hangcheck > > super-fast at timing out with some debugfs knob. Would probably also help > > a lot with speeding up the gazillion of testcases in gem_reset_stats. > > I have considered i915.hangcheck_interval_ms many a time. It is not just > the interval but the hangcheck score threshold to consider. If we can > trust our activity detection, we would be safe with a hangcheck every > jiffie (at some overhead mind you), but we would declare a dos too soon. Yeah, we'd still need to tune hangcheck in normal time. But for regression testing I think a linear speed-up (i.e. just scaling the hangcheck timeout, but leaving all the cadence and accumulation logic intact). And ofc also scaling the dos loads with the same linear factor. Maybe a --normal-time option in gem_reset_stats or similar could then be used for manual testing of changes, or re-tuning. -Daniel
On Thu, Oct 20, 2016 at 03:14:30PM +0200, Daniel Vetter wrote: > On Thu, Oct 20, 2016 at 10:46:01AM +0100, Chris Wilson wrote: > > On Thu, Oct 20, 2016 at 11:29:05AM +0200, Daniel Vetter wrote: > > > On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote: > > > > For the basic error state, we only desire that an error state be created > > > > following a hang. For that purpose, we do not need a real hang (slow > > > > 6-12s) but can inject one instead (fast <1s). > > > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > > > > Should we instead speed up hangcheck? I think there's lots of value in > > > making sure not just error dumping, but also hang detection works somewhat > > > in BAT. Since if it doesn't any attempt at a full run will lead to pretty > > > serious disasters. And I have this dream that BAT is the gating thing > > > deciding whether a patch series deserves a complete pre-merge run ;-) > > > > We have full-hang detection in BAT elsewhere as well. This particular > > test was only asking the question "do we generate an error state", hence > > why I felt it was safe to just do that and skip a simulated hang. > > Hm, is it worth it then in BAT? Or does the other test not check whether > the error capture part was mildly successful? Might be worth it to just > combine them (in BAT) for even more time saved. Either way ack on this. No, the other tests are to check we survive a hang, not that we generate post-mortem error state. This test takes approximately 0.2s on a slow device (at mild debug levels), and I think is concise enough to keep separate. -Chris
On Thu, Oct 20, 2016 at 02:22:56PM +0100, Chris Wilson wrote: > On Thu, Oct 20, 2016 at 03:14:30PM +0200, Daniel Vetter wrote: > > On Thu, Oct 20, 2016 at 10:46:01AM +0100, Chris Wilson wrote: > > > On Thu, Oct 20, 2016 at 11:29:05AM +0200, Daniel Vetter wrote: > > > > On Thu, Oct 20, 2016 at 10:07:39AM +0100, Chris Wilson wrote: > > > > > For the basic error state, we only desire that an error state be created > > > > > following a hang. For that purpose, we do not need a real hang (slow > > > > > 6-12s) but can inject one instead (fast <1s). > > > > > > > > > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > > > > > > > > Should we instead speed up hangcheck? I think there's lots of value in > > > > making sure not just error dumping, but also hang detection works somewhat > > > > in BAT. Since if it doesn't any attempt at a full run will lead to pretty > > > > serious disasters. And I have this dream that BAT is the gating thing > > > > deciding whether a patch series deserves a complete pre-merge run ;-) > > > > > > We have full-hang detection in BAT elsewhere as well. This particular > > > test was only asking the question "do we generate an error state", hence > > > why I felt it was safe to just do that and skip a simulated hang. > > > > Hm, is it worth it then in BAT? Or does the other test not check whether > > the error capture part was mildly successful? Might be worth it to just > > combine them (in BAT) for even more time saved. Either way ack on this. > > No, the other tests are to check we survive a hang, not that we generate > post-mortem error state. This test takes approximately 0.2s on a slow > device (at mild debug levels), and I think is concise enough to keep > separate. Ok, makes sense. -Daniel
diff --git a/tests/drv_hangman.c b/tests/drv_hangman.c index 953a4c6..bfe5eaf 100644 --- a/tests/drv_hangman.c +++ b/tests/drv_hangman.c @@ -32,6 +32,8 @@ #include <sys/stat.h> #include <fcntl.h> +#include "igt_debugfs.h" + #ifndef I915_PARAM_CMD_PARSER_VERSION #define I915_PARAM_CMD_PARSER_VERSION 28 #endif @@ -166,15 +168,21 @@ static void test_error_state_basic(void) { int fd; - fd = drm_open_driver(DRIVER_INTEL); - clear_error_state(); assert_error_state_clear(); - submit_hang(fd, I915_EXEC_RENDER); + /* Manually trigger a hang by request a reset */ + fd = igt_debugfs_open("i915_wedged", O_WRONLY); + igt_ignore_warn(write(fd, "1\n", 2)); + close(fd); + + /* Wait for the error capture and gpu reset to complete */ + fd = drm_open_driver(DRIVER_INTEL); + gem_quiescent_gpu(fd); close(fd); assert_error_state_collected(); + clear_error_state(); assert_error_state_clear(); }
For the basic error state, we only desire that an error state be created following a hang. For that purpose, we do not need a real hang (slow 6-12s) but can inject one instead (fast <1s). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> --- tests/drv_hangman.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-)