diff mbox series

[1/2] drm/i915/selftests: Fix engine reset count storage for multi-tile

Message ID 20231201122109.729006-1-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived
Headers show
Series [1/2] drm/i915/selftests: Fix engine reset count storage for multi-tile | expand

Commit Message

Tvrtko Ursulin Dec. 1, 2023, 12:21 p.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Engine->id namespace is per-tile so struct igt_live_test->reset_engine[]
needs to be two-dimensional so engine reset counts from all tiles can be
stored with no aliasing. With aliasing, if we had a real multi-tile
platform, the reset counts would be incorrect for same engine instance on
different tiles.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Fixes: 0c29efa23f5c ("drm/i915/selftests: Consider multi-gt instead of to_gt()")
Reported-by: Alan Previn Teres Alexis <alan.previn.teres.alexis@intel.com>
Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
Cc: Andi Shyti <andi.shyti@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 drivers/gpu/drm/i915/selftests/igt_live_test.c | 9 +++++----
 drivers/gpu/drm/i915/selftests/igt_live_test.h | 3 ++-
 2 files changed, 7 insertions(+), 5 deletions(-)

Comments

Andi Shyti Dec. 7, 2023, 11:26 a.m. UTC | #1
Hi Tvrtko,

> Engine->id namespace is per-tile so struct igt_live_test->reset_engine[]
> needs to be two-dimensional so engine reset counts from all tiles can be
> stored with no aliasing. With aliasing, if we had a real multi-tile
> platform, the reset counts would be incorrect for same engine instance on
> different tiles.
> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Fixes: 0c29efa23f5c ("drm/i915/selftests: Consider multi-gt instead of to_gt()")
> Reported-by: Alan Previn Teres Alexis <alan.previn.teres.alexis@intel.com>
> Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
> Cc: Andi Shyti <andi.shyti@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

sorry for being late here... the patch makes sense to me and the
CI failures don't look related.

Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Thanks,
Andi
Tvrtko Ursulin Dec. 7, 2023, 11:43 a.m. UTC | #2
On 07/12/2023 11:26, Andi Shyti wrote:
> Hi Tvrtko,
> 
>> Engine->id namespace is per-tile so struct igt_live_test->reset_engine[]
>> needs to be two-dimensional so engine reset counts from all tiles can be
>> stored with no aliasing. With aliasing, if we had a real multi-tile
>> platform, the reset counts would be incorrect for same engine instance on
>> different tiles.
>>
>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Fixes: 0c29efa23f5c ("drm/i915/selftests: Consider multi-gt instead of to_gt()")
>> Reported-by: Alan Previn Teres Alexis <alan.previn.teres.alexis@intel.com>
>> Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
>> Cc: Andi Shyti <andi.shyti@linux.intel.com>
>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> 
> sorry for being late here... the patch makes sense to me and the
> CI failures don't look related.
> 
> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>

Thanks pushed!

There is more work to be done with the fact i915_reset_engine_count has 
it's own aliasing when used like this, but I opted to leave that for 
some other time.

Regards,

Tvrtko
Andi Shyti Dec. 7, 2023, 11:46 a.m. UTC | #3
On Thu, Dec 07, 2023 at 11:43:28AM +0000, Tvrtko Ursulin wrote:
> 
> On 07/12/2023 11:26, Andi Shyti wrote:
> > Hi Tvrtko,
> > 
> > > Engine->id namespace is per-tile so struct igt_live_test->reset_engine[]
> > > needs to be two-dimensional so engine reset counts from all tiles can be
> > > stored with no aliasing. With aliasing, if we had a real multi-tile
> > > platform, the reset counts would be incorrect for same engine instance on
> > > different tiles.
> > > 
> > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > > Fixes: 0c29efa23f5c ("drm/i915/selftests: Consider multi-gt instead of to_gt()")
> > > Reported-by: Alan Previn Teres Alexis <alan.previn.teres.alexis@intel.com>
> > > Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > Cc: Andi Shyti <andi.shyti@linux.intel.com>
> > > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> > 
> > sorry for being late here... the patch makes sense to me and the
> > CI failures don't look related.
> > 
> > Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
> 
> Thanks pushed!
> 
> There is more work to be done with the fact i915_reset_engine_count has it's
> own aliasing when used like this, but I opted to leave that for some other
> time.

feel free to share if you have some preparatory work done already
and I can try to help out. Otherwise I can take a look at it, as
well.

Andi
Tvrtko Ursulin Dec. 7, 2023, 1:45 p.m. UTC | #4
On 07/12/2023 11:46, Andi Shyti wrote:
> On Thu, Dec 07, 2023 at 11:43:28AM +0000, Tvrtko Ursulin wrote:
>>
>> On 07/12/2023 11:26, Andi Shyti wrote:
>>> Hi Tvrtko,
>>>
>>>> Engine->id namespace is per-tile so struct igt_live_test->reset_engine[]
>>>> needs to be two-dimensional so engine reset counts from all tiles can be
>>>> stored with no aliasing. With aliasing, if we had a real multi-tile
>>>> platform, the reset counts would be incorrect for same engine instance on
>>>> different tiles.
>>>>
>>>> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> Fixes: 0c29efa23f5c ("drm/i915/selftests: Consider multi-gt instead of to_gt()")
>>>> Reported-by: Alan Previn Teres Alexis <alan.previn.teres.alexis@intel.com>
>>>> Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
>>>> Cc: Andi Shyti <andi.shyti@linux.intel.com>
>>>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>>>
>>> sorry for being late here... the patch makes sense to me and the
>>> CI failures don't look related.
>>>
>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>
>> Thanks pushed!
>>
>> There is more work to be done with the fact i915_reset_engine_count has it's
>> own aliasing when used like this, but I opted to leave that for some other
>> time.
> 
> feel free to share if you have some preparatory work done already
> and I can try to help out. Otherwise I can take a look at it, as
> well.

I don't have any patches I was just noticed when doing this that even 
though i915_reset_engine_count takes the engine as parameter, the 
i915->gpu_error is a single gt construct and as such I think using 
i915_reset_engine_count from per gt selftests is a mismatch.

I thought options were to add engine reset counts in the engine itself 
and use that from selftests. Leaving i915_reset_engine_count to be used 
from error capture paths. And it probably needs to be renamed 
accordingly so it is not misleading.

But then there may be issues around virtual engines though which this 
helper conveniently and quietly side stepped.

At that point I stopped thinking about it, given how real multi-tile for 
i915 is not happening, I didn't see it worth the effort. Still the sour 
taste of a mess remains so if you can think of an elegant and relatively 
cheap solution I think it would be good to tidy.

Regards,

Tvrtko
diff mbox series

Patch

diff --git a/drivers/gpu/drm/i915/selftests/igt_live_test.c b/drivers/gpu/drm/i915/selftests/igt_live_test.c
index 4ddc6d902752..7d41874a49c5 100644
--- a/drivers/gpu/drm/i915/selftests/igt_live_test.c
+++ b/drivers/gpu/drm/i915/selftests/igt_live_test.c
@@ -37,8 +37,9 @@  int igt_live_test_begin(struct igt_live_test *t,
 		}
 
 		for_each_engine(engine, gt, id)
-			t->reset_engine[id] =
-			i915_reset_engine_count(&i915->gpu_error, engine);
+			t->reset_engine[i][id] =
+				i915_reset_engine_count(&i915->gpu_error,
+							engine);
 	}
 
 	t->reset_global = i915_reset_count(&i915->gpu_error);
@@ -66,14 +67,14 @@  int igt_live_test_end(struct igt_live_test *t)
 
 	for_each_gt(gt, i915, i) {
 		for_each_engine(engine, gt, id) {
-			if (t->reset_engine[id] ==
+			if (t->reset_engine[i][id] ==
 			    i915_reset_engine_count(&i915->gpu_error, engine))
 				continue;
 
 			gt_err(gt, "%s(%s): engine '%s' was reset %d times!\n",
 			       t->func, t->name, engine->name,
 			       i915_reset_engine_count(&i915->gpu_error, engine) -
-			       t->reset_engine[id]);
+			       t->reset_engine[i][id]);
 			return -EIO;
 		}
 	}
diff --git a/drivers/gpu/drm/i915/selftests/igt_live_test.h b/drivers/gpu/drm/i915/selftests/igt_live_test.h
index 36ed42736c52..83e3ad430922 100644
--- a/drivers/gpu/drm/i915/selftests/igt_live_test.h
+++ b/drivers/gpu/drm/i915/selftests/igt_live_test.h
@@ -7,6 +7,7 @@ 
 #ifndef IGT_LIVE_TEST_H
 #define IGT_LIVE_TEST_H
 
+#include "gt/intel_gt_defines.h" /* for I915_MAX_GT */
 #include "gt/intel_engine.h" /* for I915_NUM_ENGINES */
 
 struct drm_i915_private;
@@ -17,7 +18,7 @@  struct igt_live_test {
 	const char *name;
 
 	unsigned int reset_global;
-	unsigned int reset_engine[I915_NUM_ENGINES];
+	unsigned int reset_engine[I915_MAX_GT][I915_NUM_ENGINES];
 };
 
 /*