drm/i915: Reboot CI if we get wedged during driver init

Message ID	20200701150721.423630-1-michal@hardline.pl (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=V8Ks=AM=lists.freedesktop.org=intel-gfx-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6ACBA20760 From: =?utf-8?q?Micha=C5=82_Winiarski?= <michal@hardline.pl> To: intel-gfx@lists.freedesktop.org Date: Wed, 1 Jul 2020 17:07:21 +0200 Message-Id: <20200701150721.423630-1-michal@hardline.pl> MIME-Version: 1.0 Subject: [Intel-gfx] [PATCH] drm/i915: Reboot CI if we get wedged during driver init Precedence: list Cc: =?utf-8?q?Micha=C5=82_Winiarski?= <michal.winiarski@intel.com>, Chris Wilson <chris@chris-wilson.co.uk> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	drm/i915: Reboot CI if we get wedged during driver init \| expand drm/i915: Reboot CI if we get wedged during driver init

Message ID

20200701150721.423630-1-michal@hardline.pl (mailing list archive)

State

New, archived

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 6ACBA20760
From: =?utf-8?q?Micha=C5=82_Winiarski?= <michal@hardline.pl>
To: intel-gfx@lists.freedesktop.org
Date: Wed,  1 Jul 2020 17:07:21 +0200
Message-Id: <20200701150721.423630-1-michal@hardline.pl>
MIME-Version: 1.0
Subject: [Intel-gfx] [PATCH] drm/i915: Reboot CI if we get wedged during
 driver init
Precedence: list
Cc: =?utf-8?q?Micha=C5=82_Winiarski?= <michal.winiarski@intel.com>,
 Chris Wilson <chris@chris-wilson.co.uk>
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Series

drm/i915: Reboot CI if we get wedged during driver init | expand

Commit Message

Michał Winiarski July 1, 2020, 3:07 p.m. UTC

From: Michał Winiarski <michal.winiarski@intel.com>

Getting wedged device on driver init is pretty much unrecoverable.
Since we're running verious scenarios that may potentially hit this in
CI (module reload / selftests / hotunplug), and if it happens, it means
that we can't trust any subsequent CI results, we should just apply the
taint to let the CI know that it should reboot (CI checks taint between
test runs).

Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Petri Latvala <petri.latvala@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Chris Wilson July 1, 2020, 3:16 p.m. UTC | #1

Quoting Michał Winiarski (2020-07-01 16:07:21)
> From: Michał Winiarski <michal.winiarski@intel.com>
> 
> Getting wedged device on driver init is pretty much unrecoverable.
> Since we're running verious scenarios that may potentially hit this in
> CI (module reload / selftests / hotunplug), and if it happens, it means
> that we can't trust any subsequent CI results, we should just apply the
> taint to let the CI know that it should reboot (CI checks taint between
> test runs).

Ok, we treat WEDGED_ON_INIT as non-recoverable [as opposed to the less
wedged WEDGED].
 
> Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
-Chris

Chris Wilson July 1, 2020, 3:17 p.m. UTC | #2

Quoting Michał Winiarski (2020-07-01 16:07:21)
> From: Michał Winiarski <michal.winiarski@intel.com>
> 
> Getting wedged device on driver init is pretty much unrecoverable.
> Since we're running verious scenarios that may potentially hit this in
> CI (module reload / selftests / hotunplug), and if it happens, it means
> that we can't trust any subsequent CI results, we should just apply the
> taint to let the CI know that it should reboot (CI checks taint between
> test runs).
> 
> Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Petri Latvala <petri.latvala@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_reset.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index 0156f1f5c736..d27e8bb7d550 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -1360,6 +1360,8 @@ void intel_gt_set_wedged_on_init(struct intel_gt *gt)
>                      I915_WEDGED_ON_INIT);
>         intel_gt_set_wedged(gt);
>         set_bit(I915_WEDGED_ON_INIT, &gt->reset.flags);
> +

Ah, we don't say around here that this WEDGED_ON_INIT is non-recoverable,
could you please add a comment to that effect?

> +       add_taint_for_CI(TAINT_WARN);
>  }
>  
>  void intel_gt_init_reset(struct intel_gt *gt)
> -- 
> 2.27.0
>

Michal Wajdeczko July 1, 2020, 5:03 p.m. UTC | #3

On 01.07.2020 17:17, Chris Wilson wrote:
> Quoting Michał Winiarski (2020-07-01 16:07:21)
>> From: Michał Winiarski <michal.winiarski@intel.com>
>>
>> Getting wedged device on driver init is pretty much unrecoverable.
>> Since we're running verious scenarios that may potentially hit this in

typo

>> CI (module reload / selftests / hotunplug), and if it happens, it means
>> that we can't trust any subsequent CI results, we should just apply the
>> taint to let the CI know that it should reboot (CI checks taint between
>> test runs).
>>
>> Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Petri Latvala <petri.latvala@intel.com>
>> ---
>>  drivers/gpu/drm/i915/gt/intel_reset.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
>> index 0156f1f5c736..d27e8bb7d550 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>> @@ -1360,6 +1360,8 @@ void intel_gt_set_wedged_on_init(struct intel_gt *gt)
>>                      I915_WEDGED_ON_INIT);
>>         intel_gt_set_wedged(gt);
>>         set_bit(I915_WEDGED_ON_INIT, &gt->reset.flags);
>> +
> 
> Ah, we don't say around here that this WEDGED_ON_INIT is non-recoverable,
> could you please add a comment to that effect?
> 

Such comment is already in WEDGED_ON_INIT description, but repeating it
will definitely help

>> +       add_taint_for_CI(TAINT_WARN);

btw, today we are tainting kernel for CI silently and from different
places, so maybe it is worth to add there some debug log with
__builtin_return_address() for better diagnose why we stopped CI?

with typo/comment fixed,
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>

>>  }
>>  
>>  void intel_gt_init_reset(struct intel_gt *gt)
>> -- 
>> 2.27.0
>>
> _______________________________________________
> Intel-gfx mailing list
> Intel-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gfx
>

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 0156f1f5c736..d27e8bb7d550 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1360,6 +1360,8 @@  void intel_gt_set_wedged_on_init(struct intel_gt *gt)
 		     I915_WEDGED_ON_INIT);
 	intel_gt_set_wedged(gt);
 	set_bit(I915_WEDGED_ON_INIT, &gt->reset.flags);
+
+	add_taint_for_CI(TAINT_WARN);
 }
 
 void intel_gt_init_reset(struct intel_gt *gt)

drm/i915: Reboot CI if we get wedged during driver init

Commit Message

Comments

Patch