Message ID | 20230131091731.5892-1-janusz.krzysztofik@linux.intel.com (mailing list archive) |
---|---|
Headers | show |
Series | tests/i915/perf: Add stress / race exercises | expand |
On Tue, 31 Jan 2023 01:17:29 -0800, Janusz Krzysztofik wrote: > Hi Janusz, > Users reported oopses on list corruptions when using i915 perf with a > number of concurrently running graphics applications. That indicates we > are currently missing some important tests for such scenarios. Cover > that gap. Do these oops etc. have anything to do with perf itself or rather with persistence or non-persistence not properly supported with GuC? We should have seen such failures with persistence tests (with GuC) itself so I am wondering if there's any point of dragging perf into these already muddy waters? Such failures should be isolated first with other tests without mixing perf into this IMO. Thanks. -- Ashutosh
On Tue, 31 Jan 2023 08:19:48 -0800, Dixit, Ashutosh wrote: > > On Tue, 31 Jan 2023 01:17:29 -0800, Janusz Krzysztofik wrote: > > > > Hi Janusz, > > > Users reported oopses on list corruptions when using i915 perf with a > > number of concurrently running graphics applications. That indicates we > > are currently missing some important tests for such scenarios. Cover > > that gap. > > Do these oops etc. have anything to do with perf itself or rather with > persistence or non-persistence not properly supported with GuC? We should > have seen such failures with persistence tests (with GuC) itself so I am > wondering if there's any point of dragging perf into these already muddy > waters? Such failures should be isolated first with other tests without > mixing perf into this IMO. Basically failures in these tests indicate defects in which subsystem? If the failures do not indicate defects in perf then these tests should not be added as perf tests. E.g. if failures indicate defects in GuC subsystem then they should be added as GuC tests. Otherwise it gets hard to dispostion bugs which are filed due to these failures. The bugs come to the wrong team and then have to be sent to the correct team etc. Thanks. -- Ashutosh
On Tuesday, 31 January 2023 17:19:48 CET Dixit, Ashutosh wrote: > On Tue, 31 Jan 2023 01:17:29 -0800, Janusz Krzysztofik wrote: > > > > Hi Janusz, > > > Users reported oopses on list corruptions when using i915 perf with a > > number of concurrently running graphics applications. That indicates we > > are currently missing some important tests for such scenarios. Cover > > that gap. > > Do these oops etc. have anything to do with perf itself or rather with > persistence or non-persistence not properly supported with GuC? My root cause analysis has revealed that these list corruptions are actually caused by a bug in barrier processing, then no, they are not persistence nor GuC related. For details, please see my preliminary (still a bit buggy, but otherwise valid) fix, so far sent only to trybot: https://patchwork.freedesktop.org/series/113268/ > We should > have seen such failures with persistence tests (with GuC) itself so I am > wondering if there's any point of dragging perf into these already muddy > waters? Such failures should be isolated first with other tests without > mixing perf into this IMO. I see your point, but unfortunately things are not that easy. My investigation has lead me to a conclusion that the bug within the barrier processing code is now addressed, to some extent, and probably not intentionally, by a kind of workaround that makes it really hard to reproduce without any interaction from an external user that tries to replace a barrier with its own request. And I can see a very limited number of such users, one of them being perf. The first patch was developed by Chris still before I found the the root cause of the issue. Since the bug seemed strictly perf related at that point in time, that's probably why Chris decided to add the new subtest to perf. As such, that subtest is more general than just focused on triggering the list corruption bug, and it pretty belongs to perf, I believe. Since Chris' subtest didn't help in triggering the list corruption, I've developed a new subtest that can do it. Since it is almost identical to the one Chris added, I decided to reuse his code, then add my new subtest to perf as well. But maybe you are right that my subtest better fits to another test. not perf. I'll think this over. I hope this clarifies things for you. Thanks, Janusz > > Thanks. > -- > Ashutosh >
On Tuesday, 31 January 2023 17:55:54 CET Dixit, Ashutosh wrote: > On Tue, 31 Jan 2023 08:19:48 -0800, Dixit, Ashutosh wrote: > > > > On Tue, 31 Jan 2023 01:17:29 -0800, Janusz Krzysztofik wrote: > > > > > > > Hi Janusz, > > > > > Users reported oopses on list corruptions when using i915 perf with a > > > number of concurrently running graphics applications. That indicates we > > > are currently missing some important tests for such scenarios. Cover > > > that gap. > > > > Do these oops etc. have anything to do with perf itself or rather with > > persistence or non-persistence not properly supported with GuC? We should > > have seen such failures with persistence tests (with GuC) itself so I am > > wondering if there's any point of dragging perf into these already muddy > > waters? Such failures should be isolated first with other tests without > > mixing perf into this IMO. > > Basically failures in these tests indicate defects in which subsystem? If > the failures do not indicate defects in perf then these tests should not be > added as perf tests. E.g. if failures indicate defects in GuC subsystem > then they should be added as GuC tests. But how can a tests know in advance what bugs, in which particular subsystems, it is ever going to hit? If it could, we wouldn't need any root cause analysis, only tests telling us which bug from a predicted set was hit. For me your vision seems to assume an environment without cross-subsystem dependencies, where a test is only capable of triggering bugs in a particular subsystem and never in others. That's not possible in reality, I believe, we need root cause analysis to tell. > > Otherwise it gets hard to dispostion bugs which are filed due to these > failures. The bugs come to the wrong team and then have to be sent to the > correct team etc. In my opinion, all parties, whether validation, or bug filling, or development, must do their job with care. Assigning bugs to teams by test name, not by a signature of the issue found in test output or system logs, doesn't seem to be the best practice to me. Thanks, Janusz > > Thanks. > -- > Ashutosh >
On Tue, 31 Jan 2023 09:36:30 -0800, Janusz Krzysztofik wrote: > > Since Chris' subtest didn't help in triggering the list corruption, I've > developed a new subtest that can do it. Since it is almost identical to the > one Chris added, I decided to reuse his code, then add my new subtest to perf > as well. But maybe you are right that my subtest better fits to another test. > not perf. I'll think this over. > > I hope this clarifies things for you. Thanks Janusz. Of course the test is valid but because it is not triggering bugs in perf perhaps it is better added to a different IGT file. Maybe gem_ctx_persistence? Maybe perf is also ok. Anyway something to think about. Thanks. -- Ashutosh
Hi Ashutosh, On Tuesday, 31 January 2023 19:36:50 CET Dixit, Ashutosh wrote: > On Tue, 31 Jan 2023 09:36:30 -0800, Janusz Krzysztofik wrote: > > > > Since Chris' subtest didn't help in triggering the list corruption, I've > > developed a new subtest that can do it. Since it is almost identical to the > > one Chris added, I decided to reuse his code, then add my new subtest to perf > > as well. But maybe you are right that my subtest better fits to another test. > > not perf. I'll think this over. > > > > I hope this clarifies things for you. > > Thanks Janusz. Of course the test is valid but because it is not triggering > bugs in perf perhaps it is better added to a different IGT file. Maybe > gem_ctx_persistence? Maybe perf is also ok. Anyway something to think > about. I raised this point on today's IGT workroup meeting, asking for opinion. Since the new subtests depend on perf, we agreed that what needs to be done first is to provide an IGT library with perf helpers. As soon as it is available, other tests can use it if needed. Since I'm not familiar with perf, and I have other tasks in my queue, I've no time to spend on learning perf and working on that library. Then, for now I'll keep the new subtests inside the perf test. We can move them elsewhere at a later time, when the library is ready. Thanks, Janusz > > Thanks. > -- > Ashutosh >
On Tue, Jan 31, 2023 at 10:17:29AM +0100, Janusz Krzysztofik wrote: >Users reported oopses on list corruptions when using i915 perf with a >number of concurrently running graphics applications. That indicates we >are currently missing some important tests for such scenarios. Cover >that gap. Any bug ids that were filed for the issue? Also what were the parameters passed to perf OA when the issues were reported? This could help actually root cause the issue and have a test closer to the actual use case. Regards, Umesh > >Chris Wilson (1): > i915/perf: Stress opening of new perf streams against existing > contexts > >Janusz Krzysztofik (1): > tests/i915/perf: Exercise barrier race > > tests/i915/perf.c | 111 ++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 111 insertions(+) > >-- >2.25.1 >
Hi Umesh, Thanks for taking a look. On Friday, 3 February 2023 20:21:38 CET Umesh Nerlige Ramappa wrote: > On Tue, Jan 31, 2023 at 10:17:29AM +0100, Janusz Krzysztofik wrote: > >Users reported oopses on list corruptions when using i915 perf with a > >number of concurrently running graphics applications. That indicates we > >are currently missing some important tests for such scenarios. Cover > >that gap. > > Any bug ids that were filed for the issue? Yes, https://gitlab.freedesktop.org/drm/intel/issues/6333. > Also what were the parameters > passed to perf OA when the issues were reported? This could help > actually root cause the issue and have a test closer to the actual use > case. Root cause has been already identified, and it lays not in perf, only in barriers handling code, see https://patchwork.freedesktop.org/series/113636/ (submitted to trybot so far, soon to the public list). Thanks, Janusz > > Regards, > Umesh > > > >Chris Wilson (1): > > i915/perf: Stress opening of new perf streams against existing > > contexts > > > >Janusz Krzysztofik (1): > > tests/i915/perf: Exercise barrier race > > > > tests/i915/perf.c | 111 ++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 111 insertions(+) > > >