mbox series

[i-g-t,0/2] tests/i915/perf: Add stress / race exercises

Message ID 20230131091731.5892-1-janusz.krzysztofik@linux.intel.com (mailing list archive)
Headers show
Series tests/i915/perf: Add stress / race exercises | expand

Message

Janusz Krzysztofik Jan. 31, 2023, 9:17 a.m. UTC
Users reported oopses on list corruptions when using i915 perf with a
number of concurrently running graphics applications.  That indicates we
are currently missing some important tests for such scenarios.  Cover
that gap.

Chris Wilson (1):
  i915/perf: Stress opening of new perf streams against existing
    contexts

Janusz Krzysztofik (1):
  tests/i915/perf: Exercise barrier race

 tests/i915/perf.c | 111 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 111 insertions(+)

Comments

Dixit, Ashutosh Jan. 31, 2023, 4:19 p.m. UTC | #1
On Tue, 31 Jan 2023 01:17:29 -0800, Janusz Krzysztofik wrote:
>

Hi Janusz,

> Users reported oopses on list corruptions when using i915 perf with a
> number of concurrently running graphics applications.  That indicates we
> are currently missing some important tests for such scenarios.  Cover
> that gap.

Do these oops etc. have anything to do with perf itself or rather with
persistence or non-persistence not properly supported with GuC? We should
have seen such failures with persistence tests (with GuC) itself so I am
wondering if there's any point of dragging perf into these already muddy
waters? Such failures should be isolated first with other tests without
mixing perf into this IMO.

Thanks.
--
Ashutosh
Dixit, Ashutosh Jan. 31, 2023, 4:55 p.m. UTC | #2
On Tue, 31 Jan 2023 08:19:48 -0800, Dixit, Ashutosh wrote:
>
> On Tue, 31 Jan 2023 01:17:29 -0800, Janusz Krzysztofik wrote:
> >
>
> Hi Janusz,
>
> > Users reported oopses on list corruptions when using i915 perf with a
> > number of concurrently running graphics applications.  That indicates we
> > are currently missing some important tests for such scenarios.  Cover
> > that gap.
>
> Do these oops etc. have anything to do with perf itself or rather with
> persistence or non-persistence not properly supported with GuC? We should
> have seen such failures with persistence tests (with GuC) itself so I am
> wondering if there's any point of dragging perf into these already muddy
> waters? Such failures should be isolated first with other tests without
> mixing perf into this IMO.

Basically failures in these tests indicate defects in which subsystem? If
the failures do not indicate defects in perf then these tests should not be
added as perf tests. E.g. if failures indicate defects in GuC subsystem
then they should be added as GuC tests.

Otherwise it gets hard to dispostion bugs which are filed due to these
failures. The bugs come to the wrong team and then have to be sent to the
correct team etc.

Thanks.
--
Ashutosh
Janusz Krzysztofik Jan. 31, 2023, 5:36 p.m. UTC | #3
On Tuesday, 31 January 2023 17:19:48 CET Dixit, Ashutosh wrote:
> On Tue, 31 Jan 2023 01:17:29 -0800, Janusz Krzysztofik wrote:
> >
> 
> Hi Janusz,
> 
> > Users reported oopses on list corruptions when using i915 perf with a
> > number of concurrently running graphics applications.  That indicates we
> > are currently missing some important tests for such scenarios.  Cover
> > that gap.
> 
> Do these oops etc. have anything to do with perf itself or rather with
> persistence or non-persistence not properly supported with GuC? 

My root cause analysis has revealed that these list corruptions are actually 
caused by a bug in barrier processing, then no, they are not persistence nor 
GuC related.  For details, please see my preliminary (still a bit buggy, but 
otherwise valid) fix, so far sent only to trybot:
https://patchwork.freedesktop.org/series/113268/

> We should
> have seen such failures with persistence tests (with GuC) itself so I am
> wondering if there's any point of dragging perf into these already muddy
> waters? Such failures should be isolated first with other tests without
> mixing perf into this IMO.

I see your point, but unfortunately things are not that easy.  My 
investigation has lead me to a conclusion that the bug within the barrier 
processing code is now addressed, to some extent, and probably not 
intentionally, by a kind of workaround that makes it really hard to reproduce 
without any interaction from an external user that tries to replace a barrier 
with its own request.  And I can see a very limited number of such users, one 
of them being perf.

The first patch was developed by Chris still before I found the the root cause 
of the issue.  Since the bug seemed strictly perf related at that point in 
time, that's probably why Chris decided to add the new subtest to perf.  As 
such, that subtest is more general than just focused on triggering the list 
corruption bug, and it pretty belongs to perf, I believe.

Since Chris' subtest didn't help in triggering the list corruption, I've 
developed a new subtest that can do it.  Since it is almost identical to the 
one Chris added, I decided to reuse his code, then add my new subtest to perf 
as well.  But maybe you are right that my subtest better fits to another test. 
not perf.  I'll think this over.

I hope this clarifies things for you.

Thanks,
Janusz

> 
> Thanks.
> --
> Ashutosh
>
Janusz Krzysztofik Jan. 31, 2023, 5:56 p.m. UTC | #4
On Tuesday, 31 January 2023 17:55:54 CET Dixit, Ashutosh wrote:
> On Tue, 31 Jan 2023 08:19:48 -0800, Dixit, Ashutosh wrote:
> >
> > On Tue, 31 Jan 2023 01:17:29 -0800, Janusz Krzysztofik wrote:
> > >
> >
> > Hi Janusz,
> >
> > > Users reported oopses on list corruptions when using i915 perf with a
> > > number of concurrently running graphics applications.  That indicates we
> > > are currently missing some important tests for such scenarios.  Cover
> > > that gap.
> >
> > Do these oops etc. have anything to do with perf itself or rather with
> > persistence or non-persistence not properly supported with GuC? We should
> > have seen such failures with persistence tests (with GuC) itself so I am
> > wondering if there's any point of dragging perf into these already muddy
> > waters? Such failures should be isolated first with other tests without
> > mixing perf into this IMO.
> 
> Basically failures in these tests indicate defects in which subsystem? If
> the failures do not indicate defects in perf then these tests should not be
> added as perf tests. E.g. if failures indicate defects in GuC subsystem
> then they should be added as GuC tests.

But how can a tests know in advance what bugs, in which particular subsystems, 
it is ever going to hit?  If it could, we wouldn't need any root cause 
analysis, only tests telling us which bug from a predicted set was hit.
For me your vision seems to assume an environment without cross-subsystem 
dependencies, where a test is only capable of triggering bugs in a particular 
subsystem and never in others.  That's not possible in reality, I believe, we 
need root cause analysis to tell.

> 
> Otherwise it gets hard to dispostion bugs which are filed due to these
> failures. The bugs come to the wrong team and then have to be sent to the
> correct team etc.

In my opinion, all parties, whether validation, or bug filling, or 
development, must do their job with care.  Assigning bugs to teams by test 
name, not by a signature of the issue found in test output or system logs, 
doesn't seem to be the best practice to me.

Thanks,
Janusz

> 
> Thanks.
> --
> Ashutosh
>
Dixit, Ashutosh Jan. 31, 2023, 6:36 p.m. UTC | #5
On Tue, 31 Jan 2023 09:36:30 -0800, Janusz Krzysztofik wrote:
>
> Since Chris' subtest didn't help in triggering the list corruption, I've
> developed a new subtest that can do it.  Since it is almost identical to the
> one Chris added, I decided to reuse his code, then add my new subtest to perf
> as well.  But maybe you are right that my subtest better fits to another test.
> not perf.  I'll think this over.
>
> I hope this clarifies things for you.

Thanks Janusz. Of course the test is valid but because it is not triggering
bugs in perf perhaps it is better added to a different IGT file. Maybe
gem_ctx_persistence? Maybe perf is also ok. Anyway something to think
about.

Thanks.
--
Ashutosh
Janusz Krzysztofik Feb. 1, 2023, 3:51 p.m. UTC | #6
Hi Ashutosh,

On Tuesday, 31 January 2023 19:36:50 CET Dixit, Ashutosh wrote:
> On Tue, 31 Jan 2023 09:36:30 -0800, Janusz Krzysztofik wrote:
> >
> > Since Chris' subtest didn't help in triggering the list corruption, I've
> > developed a new subtest that can do it.  Since it is almost identical to the
> > one Chris added, I decided to reuse his code, then add my new subtest to perf
> > as well.  But maybe you are right that my subtest better fits to another test.
> > not perf.  I'll think this over.
> >
> > I hope this clarifies things for you.
> 
> Thanks Janusz. Of course the test is valid but because it is not triggering
> bugs in perf perhaps it is better added to a different IGT file. Maybe
> gem_ctx_persistence? Maybe perf is also ok. Anyway something to think
> about.

I raised this point on today's IGT workroup meeting, asking for opinion.  
Since the new subtests depend on perf, we agreed that what needs to be done 
first is to provide an IGT library with perf helpers.  As soon as it is 
available, other tests can use it if needed.  Since I'm not familiar with 
perf, and I have other tasks in my queue, I've no time to spend on learning 
perf and working on that library.  Then, for now I'll keep the new subtests 
inside the perf test.  We can move them elsewhere at a later time, when the 
library is ready.

Thanks,
Janusz

> 
> Thanks.
> --
> Ashutosh
>
Umesh Nerlige Ramappa Feb. 3, 2023, 7:21 p.m. UTC | #7
On Tue, Jan 31, 2023 at 10:17:29AM +0100, Janusz Krzysztofik wrote:
>Users reported oopses on list corruptions when using i915 perf with a
>number of concurrently running graphics applications.  That indicates we
>are currently missing some important tests for such scenarios.  Cover
>that gap.

Any bug ids that were filed for the issue? Also what were the parameters 
passed to perf OA when the issues were reported? This could help 
actually root cause the issue and have a test closer to the actual use 
case.

Regards,
Umesh
>
>Chris Wilson (1):
>  i915/perf: Stress opening of new perf streams against existing
>    contexts
>
>Janusz Krzysztofik (1):
>  tests/i915/perf: Exercise barrier race
>
> tests/i915/perf.c | 111 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 111 insertions(+)
>
>-- 
>2.25.1
>
Janusz Krzysztofik Feb. 3, 2023, 8:59 p.m. UTC | #8
Hi Umesh,

Thanks for taking a look.

On Friday, 3 February 2023 20:21:38 CET Umesh Nerlige Ramappa wrote:
> On Tue, Jan 31, 2023 at 10:17:29AM +0100, Janusz Krzysztofik wrote:
> >Users reported oopses on list corruptions when using i915 perf with a
> >number of concurrently running graphics applications.  That indicates we
> >are currently missing some important tests for such scenarios.  Cover
> >that gap.
> 
> Any bug ids that were filed for the issue? 

Yes, https://gitlab.freedesktop.org/drm/intel/issues/6333.

> Also what were the parameters 
> passed to perf OA when the issues were reported? This could help 
> actually root cause the issue and have a test closer to the actual use 
> case.

Root cause has been already identified, and it lays not in perf, only in 
barriers handling code, see https://patchwork.freedesktop.org/series/113636/ 
(submitted to trybot so far, soon to the public list).

Thanks,
Janusz

> 
> Regards,
> Umesh
> >
> >Chris Wilson (1):
> >  i915/perf: Stress opening of new perf streams against existing
> >    contexts
> >
> >Janusz Krzysztofik (1):
> >  tests/i915/perf: Exercise barrier race
> >
> > tests/i915/perf.c | 111 ++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 111 insertions(+)
> >
>