drm/i915: Reboot CI if we get wedged during driver init
diff mbox series

Message ID 20200701150721.423630-1-michal@hardline.pl
State New
Headers show
Series
  • drm/i915: Reboot CI if we get wedged during driver init
Related show

Commit Message

Michał Winiarski July 1, 2020, 3:07 p.m. UTC
From: Michał Winiarski <michal.winiarski@intel.com>

Getting wedged device on driver init is pretty much unrecoverable.
Since we're running verious scenarios that may potentially hit this in
CI (module reload / selftests / hotunplug), and if it happens, it means
that we can't trust any subsequent CI results, we should just apply the
taint to let the CI know that it should reboot (CI checks taint between
test runs).

Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Chris Wilson July 1, 2020, 3:16 p.m. UTC | #1
Quoting Michał Winiarski (2020-07-01 16:07:21)
> From: Michał Winiarski <michal.winiarski@intel.com>
> 
> Getting wedged device on driver init is pretty much unrecoverable.
> Since we're running verious scenarios that may potentially hit this in
> CI (module reload / selftests / hotunplug), and if it happens, it means
> that we can't trust any subsequent CI results, we should just apply the
> taint to let the CI know that it should reboot (CI checks taint between
> test runs).

Ok, we treat WEDGED_ON_INIT as non-recoverable [as opposed to the less
wedged WEDGED].
 
> Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris
Chris Wilson July 1, 2020, 3:17 p.m. UTC | #2
Quoting Michał Winiarski (2020-07-01 16:07:21)
> From: Michał Winiarski <michal.winiarski@intel.com>
> 
> Getting wedged device on driver init is pretty much unrecoverable.
> Since we're running verious scenarios that may potentially hit this in
> CI (module reload / selftests / hotunplug), and if it happens, it means
> that we can't trust any subsequent CI results, we should just apply the
> taint to let the CI know that it should reboot (CI checks taint between
> test runs).
> 
> Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Petri Latvala <petri.latvala@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_reset.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index 0156f1f5c736..d27e8bb7d550 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -1360,6 +1360,8 @@ void intel_gt_set_wedged_on_init(struct intel_gt *gt)
>                      I915_WEDGED_ON_INIT);
>         intel_gt_set_wedged(gt);
>         set_bit(I915_WEDGED_ON_INIT, &gt->reset.flags);
> +

Ah, we don't say around here that this WEDGED_ON_INIT is non-recoverable,
could you please add a comment to that effect?

> +       add_taint_for_CI(TAINT_WARN);
>  }
>  
>  void intel_gt_init_reset(struct intel_gt *gt)
> -- 
> 2.27.0
>
Michal Wajdeczko July 1, 2020, 5:03 p.m. UTC | #3
On 01.07.2020 17:17, Chris Wilson wrote:
> Quoting Michał Winiarski (2020-07-01 16:07:21)
>> From: Michał Winiarski <michal.winiarski@intel.com>
>>
>> Getting wedged device on driver init is pretty much unrecoverable.
>> Since we're running verious scenarios that may potentially hit this in

typo

>> CI (module reload / selftests / hotunplug), and if it happens, it means
>> that we can't trust any subsequent CI results, we should just apply the
>> taint to let the CI know that it should reboot (CI checks taint between
>> test runs).
>>
>> Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Petri Latvala <petri.latvala@intel.com>
>> ---
>>  drivers/gpu/drm/i915/gt/intel_reset.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
>> index 0156f1f5c736..d27e8bb7d550 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>> @@ -1360,6 +1360,8 @@ void intel_gt_set_wedged_on_init(struct intel_gt *gt)
>>                      I915_WEDGED_ON_INIT);
>>         intel_gt_set_wedged(gt);
>>         set_bit(I915_WEDGED_ON_INIT, &gt->reset.flags);
>> +
> 
> Ah, we don't say around here that this WEDGED_ON_INIT is non-recoverable,
> could you please add a comment to that effect?
> 

Such comment is already in WEDGED_ON_INIT description, but repeating it
will definitely help

>> +       add_taint_for_CI(TAINT_WARN);

btw, today we are tainting kernel for CI silently and from different
places, so maybe it is worth to add there some debug log with
__builtin_return_address() for better diagnose why we stopped CI?

with typo/comment fixed,
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>

>>  }
>>  
>>  void intel_gt_init_reset(struct intel_gt *gt)
>> -- 
>> 2.27.0
>>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

Patch
diff mbox series

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 0156f1f5c736..d27e8bb7d550 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1360,6 +1360,8 @@  void intel_gt_set_wedged_on_init(struct intel_gt *gt)
 		     I915_WEDGED_ON_INIT);
 	intel_gt_set_wedged(gt);
 	set_bit(I915_WEDGED_ON_INIT, &gt->reset.flags);
+
+	add_taint_for_CI(TAINT_WARN);
 }
 
 void intel_gt_init_reset(struct intel_gt *gt)