diff mbox

drm/i915: Add option to list load failure checkpoints

Message ID 20180131182347.19860-1-michal.wajdeczko@intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Michal Wajdeczko Jan. 31, 2018, 6:23 p.m. UTC
Our inject_load_failure functionality allows to insert one
failure during driver load, but it is hard to guess which
number should passed as modparam to select specific checkpoint.

Use negative number as option to list all available failure
checkpoints without triggering any failure.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/i915_drv.c    | 18 +++++++++++-------
 drivers/gpu/drm/i915/i915_params.c |  6 ++++--
 drivers/gpu/drm/i915/i915_params.h |  2 +-
 3 files changed, 16 insertions(+), 10 deletions(-)

Comments

Chris Wilson Jan. 31, 2018, 8:48 p.m. UTC | #1
Quoting Michal Wajdeczko (2018-01-31 18:23:47)
> Our inject_load_failure functionality allows to insert one
> failure during driver load, but it is hard to guess which
> number should passed as modparam to select specific checkpoint.
> 
> Use negative number as option to list all available failure
> checkpoints without triggering any failure.

Hmm, it was only intended for use with the coupled igt test. Mind
expanding upon the use case you have? Could you not use that iterative
search for finding the injection value you want for repeated runs? For
the bisect case, do you not want to keep it iterating over all in case
the value changes? How stable do you want the modparam?
-Chris
Michal Wajdeczko Feb. 1, 2018, 2:47 p.m. UTC | #2
On Wed, 31 Jan 2018 21:48:42 +0100, Chris Wilson  
<chris@chris-wilson.co.uk> wrote:

> Quoting Michal Wajdeczko (2018-01-31 18:23:47)
>> Our inject_load_failure functionality allows to insert one
>> failure during driver load, but it is hard to guess which
>> number should passed as modparam to select specific checkpoint.
>>
>> Use negative number as option to list all available failure
>> checkpoints without triggering any failure.
>
> Hmm, it was only intended for use with the coupled igt test. Mind
> expanding upon the use case you have? Could you not use that iterative
> search for finding the injection value you want for repeated runs? For
> the bisect case, do you not want to keep it iterating over all in case
> the value changes? How stable do you want the modparam?

Iterative approach is good for validation team to verify that all
existing failure points are correctly handled (ie. we don't cause
crash/panic), but it is less useful when you are interested in adding
new checkpoints just for your code, as it requires both time and luck
as you may be hit by earlier checkpoint that no longer works ;)

Btw, IMHO this modparam should only be exposed in DEBUG config (as it
introduces some code and text), and maybe we should also consider
extending it to support more than one failure (as then it will be
easy to check several code paths (like failure in fallback)

This patch was trying to keep definition of modparam unchanged (extra
-1 value should not be noticed by any existing use case) but since
it is purely debug feature I'm not against making bigger changes here.

Michal
Chris Wilson Feb. 1, 2018, 3:28 p.m. UTC | #3
Quoting Michal Wajdeczko (2018-02-01 14:47:05)
> On Wed, 31 Jan 2018 21:48:42 +0100, Chris Wilson  
> <chris@chris-wilson.co.uk> wrote:
> 
> > Quoting Michal Wajdeczko (2018-01-31 18:23:47)
> >> Our inject_load_failure functionality allows to insert one
> >> failure during driver load, but it is hard to guess which
> >> number should passed as modparam to select specific checkpoint.
> >>
> >> Use negative number as option to list all available failure
> >> checkpoints without triggering any failure.
> >
> > Hmm, it was only intended for use with the coupled igt test. Mind
> > expanding upon the use case you have? Could you not use that iterative
> > search for finding the injection value you want for repeated runs? For
> > the bisect case, do you not want to keep it iterating over all in case
> > the value changes? How stable do you want the modparam?
> 
> Iterative approach is good for validation team to verify that all
> existing failure points are correctly handled (ie. we don't cause
> crash/panic), but it is less useful when you are interested in adding
> new checkpoints just for your code, as it requires both time and luck
> as you may be hit by earlier checkpoint that no longer works ;)
> 
> Btw, IMHO this modparam should only be exposed in DEBUG config (as it
> introduces some code and text), and maybe we should also consider
> extending it to support more than one failure (as then it will be
> easy to check several code paths (like failure in fallback)

Sure, let's get it under the IS_ENABLED(CONFIG_DRM_I915_DEBUG).
 
> This patch was trying to keep definition of modparam unchanged (extra
> -1 value should not be noticed by any existing use case) but since
> it is purely debug feature I'm not against making bigger changes here.

From my pov, it's very useful to define how you expect it to work, and
how to make it more useful than drv_module_reload and to keep it working
as you intend. A pure fault-counter doesn't seem very robust or
intuitive for me when you are trying to develop a new fault point. Otoh,
I don't mind if the developer has to fix all the earlier fails when
testing his (those fails are all our responsibility) -- or at least file
bugs so that awareness is raised.
-Chris
Michal Wajdeczko Feb. 1, 2018, 3:51 p.m. UTC | #4
On Thu, 01 Feb 2018 16:28:31 +0100, Chris Wilson  
<chris@chris-wilson.co.uk> wrote:

> Quoting Michal Wajdeczko (2018-02-01 14:47:05)
>> On Wed, 31 Jan 2018 21:48:42 +0100, Chris Wilson
>> <chris@chris-wilson.co.uk> wrote:
>>
>> > Quoting Michal Wajdeczko (2018-01-31 18:23:47)
>> >> Our inject_load_failure functionality allows to insert one
>> >> failure during driver load, but it is hard to guess which
>> >> number should passed as modparam to select specific checkpoint.
>> >>
>> >> Use negative number as option to list all available failure
>> >> checkpoints without triggering any failure.
>> >
>> > Hmm, it was only intended for use with the coupled igt test. Mind
>> > expanding upon the use case you have? Could you not use that iterative
>> > search for finding the injection value you want for repeated runs? For
>> > the bisect case, do you not want to keep it iterating over all in case
>> > the value changes? How stable do you want the modparam?
>>
>> Iterative approach is good for validation team to verify that all
>> existing failure points are correctly handled (ie. we don't cause
>> crash/panic), but it is less useful when you are interested in adding
>> new checkpoints just for your code, as it requires both time and luck
>> as you may be hit by earlier checkpoint that no longer works ;)
>>
>> Btw, IMHO this modparam should only be exposed in DEBUG config (as it
>> introduces some code and text), and maybe we should also consider
>> extending it to support more than one failure (as then it will be
>> easy to check several code paths (like failure in fallback)
>
> Sure, let's get it under the IS_ENABLED(CONFIG_DRM_I915_DEBUG).

I'll send separate patch for this.

>
>> This patch was trying to keep definition of modparam unchanged (extra
>> -1 value should not be noticed by any existing use case) but since
>> it is purely debug feature I'm not against making bigger changes here.
>
> From my pov, it's very useful to define how you expect it to work, and

I was assuming these steps:

1. add new failure checkpoint(s) to the code
2. boot driver with inject modparam=-1 to list active checkpoints
3. note the number of checkpoint that we want to trigger
4. boot driver with inject modparam=n to trigger selected checkpoint
5. repeat steps 2-4 in case of rebase/code refactor/new checkpoints

> how to make it more useful than drv_module_reload and to keep it working
> as you intend. A pure fault-counter doesn't seem very robust or
> intuitive for me when you are trying to develop a new fault point. Otoh,

Currently we identify fault checkpoint by pair [function:line] so if we  
want
to invest more in this mechanism we can change modparam to char* and then
check against function and line instead of the counter. Then no iteration
(or listing checkpoints) would be needed.

> I don't mind if the developer has to fix all the earlier fails when
> testing his (those fails are all our responsibility) -- or at least file
> bugs so that awareness is raised.

Agree.

Michal
Chris Wilson Feb. 1, 2018, 4:30 p.m. UTC | #5
Quoting Michal Wajdeczko (2018-02-01 15:51:58)
> I was assuming these steps:
> 
> 1. add new failure checkpoint(s) to the code
> 2. boot driver with inject modparam=-1 to list active checkpoints
> 3. note the number of checkpoint that we want to trigger
> 4. boot driver with inject modparam=n to trigger selected checkpoint
> 5. repeat steps 2-4 in case of rebase/code refactor/new checkpoints
> 
> > how to make it more useful than drv_module_reload and to keep it working
> > as you intend. A pure fault-counter doesn't seem very robust or
> > intuitive for me when you are trying to develop a new fault point. Otoh,
> 
> Currently we identify fault checkpoint by pair [function:line] so if we  
> want
> to invest more in this mechanism we can change modparam to char* and then
> check against function and line instead of the counter. Then no iteration
> (or listing checkpoints) would be needed.

So I've no strong objection to the overall approach. (If time were
infinite, I would push the fault-counters into a .data section so we
could probe them from the i915.ko without loading.) I think though to
make it really useful would be to add a tag to each fault-point (NULL
for boring existing ones unless you're feeling inventive). Then the
developer need only say ./tests/drv_reload_fault my-tag and that will
then be stable across different kernels.

The minimum bar for adding this dev feature would be the corresponding
tool for igt, with a small integrated test. (Hmm, if we could push the
tags to .data, we could even tell how many fault-injection sites were
hit.)

Long term desire would be kprobe/ebpf hookup, or some [soon-to-be]
existing infrastructure for fault-injection along those lines.
-Chris
diff mbox

Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 1ec12ad..0992301 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -55,16 +55,21 @@ 
 
 static struct drm_driver driver;
 
-static unsigned int i915_load_fail_count;
+static int i915_load_fail_count = 0;
 
 bool __i915_inject_load_failure(const char *func, int line)
 {
-	if (i915_load_fail_count >= i915_modparams.inject_load_failure)
+	++i915_load_fail_count;
+
+	if (unlikely(i915_modparams.inject_load_failure < 0)) {
+		DRM_NOTE("[i915] load checkpoint %d [%s:%d]\n",
+			 i915_load_fail_count, func, line);
 		return false;
+	}
 
-	if (++i915_load_fail_count == i915_modparams.inject_load_failure) {
-		DRM_INFO("Injecting failure at checkpoint %u [%s:%d]\n",
-			 i915_modparams.inject_load_failure, func, line);
+	if (i915_modparams.inject_load_failure == i915_load_fail_count) {
+		DRM_INFO("Injecting failure at checkpoint %d [%s:%d]\n",
+			 i915_load_fail_count, func, line);
 		return true;
 	}
 
@@ -107,8 +112,7 @@  bool __i915_inject_load_failure(const char *func, int line)
 
 static bool i915_error_injected(struct drm_i915_private *dev_priv)
 {
-	return i915_modparams.inject_load_failure &&
-	       i915_load_fail_count == i915_modparams.inject_load_failure;
+	return i915_modparams.inject_load_failure >= i915_load_fail_count;
 }
 
 #define i915_load_error(dev_priv, fmt, ...)				     \
diff --git a/drivers/gpu/drm/i915/i915_params.c b/drivers/gpu/drm/i915/i915_params.c
index 0b553a8..e5ab69c 100644
--- a/drivers/gpu/drm/i915/i915_params.c
+++ b/drivers/gpu/drm/i915/i915_params.c
@@ -167,8 +167,10 @@  struct i915_params i915_modparams __read_mostly = {
 i915_param_named_unsafe(enable_dp_mst, bool, 0600,
 	"Enable multi-stream transport (MST) for new DisplayPort sinks. (default: true)");
 
-i915_param_named_unsafe(inject_load_failure, uint, 0400,
-	"Force an error after a number of failure check points (0:disabled (default), N:force failure at the Nth failure check point)");
+i915_param_named_unsafe(inject_load_failure, int, 0400,
+	"Force an error after a number of failure check points. "
+	"(-1=list check points, 0=disabled [default], "
+	"N=force failure at the Nth failure check point)");
 
 i915_param_named(enable_dpcd_backlight, bool, 0600,
 	"Enable support for DPCD backlight control (default:false)");
diff --git a/drivers/gpu/drm/i915/i915_params.h b/drivers/gpu/drm/i915/i915_params.h
index 430f5f9..6ed799f 100644
--- a/drivers/gpu/drm/i915/i915_params.h
+++ b/drivers/gpu/drm/i915/i915_params.h
@@ -54,7 +54,7 @@ 
 	param(int, mmio_debug, 0) \
 	param(int, edp_vswing, 0) \
 	param(int, reset, 2) \
-	param(unsigned int, inject_load_failure, 0) \
+	param(int, inject_load_failure, 0) \
 	/* leave bools at the end to not create holes */ \
 	param(bool, alpha_support, IS_ENABLED(CONFIG_DRM_I915_ALPHA_SUPPORT)) \
 	param(bool, enable_cmd_parser, true) \