
[01/12] drm/i915: Remove bogus GEM_BUG_ON in unpark

Message ID 20220712233136.1044951-2-John.C.Harrison@Intel.com (mailing list archive)
State New, archived
Series Random assortment of (mostly) GuC related patches

Commit Message

John Harrison July 12, 2022, 11:31 p.m. UTC
From: Matthew Brost <matthew.brost@intel.com>

Remove bogus GEM_BUG_ON which compared kernel context timeline seqno to
seqno in memory on engine PM unpark. If a GT reset occurred these values
might not match as a kernel context could be skipped. This bug was
hidden by always switching to a kernel context on park (execlists
requirement).

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Tvrtko Ursulin July 18, 2022, 12:15 p.m. UTC | #1
On 13/07/2022 00:31, John.C.Harrison@Intel.com wrote:
> From: Matthew Brost <matthew.brost@intel.com>
> 
> Remove bogus GEM_BUG_ON which compared kernel context timeline seqno to
> seqno in memory on engine PM unpark. If a GT reset occurred these values
> might not match as a kernel context could be skipped. This bug was
> hidden by always switching to a kernel context on park (execlists
> requirement).

Reset of the kernel context? Under which circumstances does that happen?

It is unclear whether the claim is that this is a general problem or that 
the assert is only invalid with the GuC. The lack of a CI-reported issue 
suggests it is not a generic problem?

Regards,

Tvrtko

> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
>   1 file changed, 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> index b0a4a2dbe3ee9..fb3e1599d04ec 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
> @@ -68,8 +68,6 @@ static int __engine_unpark(struct intel_wakeref *wf)
>   			 ce->timeline->seqno,
>   			 READ_ONCE(*ce->timeline->hwsp_seqno),
>   			 ce->ring->emit);
> -		GEM_BUG_ON(ce->timeline->seqno !=
> -			   READ_ONCE(*ce->timeline->hwsp_seqno));
>   	}
>   
>   	if (engine->unpark)
John Harrison July 19, 2022, 12:05 a.m. UTC | #2
On 7/18/2022 05:15, Tvrtko Ursulin wrote:
>
> On 13/07/2022 00:31, John.C.Harrison@Intel.com wrote:
>> From: Matthew Brost <matthew.brost@intel.com>
>>
>> Remove bogus GEM_BUG_ON which compared kernel context timeline seqno to
>> seqno in memory on engine PM unpark. If a GT reset occurred these values
>> might not match as a kernel context could be skipped. This bug was
>> hidden by always switching to a kernel context on park (execlists
>> requirement).
>
> Reset of the kernel context? Under which circumstances does that happen?
As per description, the issue is with full GT reset.

>
> It is unclear if the claim is this to be a general problem or the 
> assert is only invalid with the GuC. Lack of a CI reported issue 
> suggests it is not a generic problem?
Currently it is not an issue because we always switch to the kernel 
context because that's how execlists works and the entire driver is 
fundamentally based on execlist operation. When we stop using the kernel 
context as a (non-functional) barrier when using GuC submission, then 
you would see an issue without this fix.

John.


>
> Regards,
>
> Tvrtko
>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
>>   1 file changed, 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c 
>> b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>> index b0a4a2dbe3ee9..fb3e1599d04ec 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>> @@ -68,8 +68,6 @@ static int __engine_unpark(struct intel_wakeref *wf)
>>                ce->timeline->seqno,
>>                READ_ONCE(*ce->timeline->hwsp_seqno),
>>                ce->ring->emit);
>> -        GEM_BUG_ON(ce->timeline->seqno !=
>> -               READ_ONCE(*ce->timeline->hwsp_seqno));
>>       }
>>         if (engine->unpark)
Tvrtko Ursulin July 19, 2022, 9:42 a.m. UTC | #3
On 19/07/2022 01:05, John Harrison wrote:
> On 7/18/2022 05:15, Tvrtko Ursulin wrote:
>>
>> On 13/07/2022 00:31, John.C.Harrison@Intel.com wrote:
>>> From: Matthew Brost <matthew.brost@intel.com>
>>>
>>> Remove bogus GEM_BUG_ON which compared kernel context timeline seqno to
>>> seqno in memory on engine PM unpark. If a GT reset occurred these values
>>> might not match as a kernel context could be skipped. This bug was
>>> hidden by always switching to a kernel context on park (execlists
>>> requirement).
>>
>> Reset of the kernel context? Under which circumstances does that happen?
> As per description, the issue is with full GT reset.
> 
>>
>> It is unclear if the claim is this to be a general problem or the 
>> assert is only invalid with the GuC. Lack of a CI reported issue 
>> suggests it is not a generic problem?
> Currently it is not an issue because we always switch to the kernel 
> context because that's how execlists works and the entire driver is 
> fundamentally based on execlist operation. When we stop using the kernel 
> context as a (non-functional) barrier when using GuC submission, then 
> you would see an issue without this fix.

Is the issue with GuC, with GuC and full reset, or with full reset 
regardless of the backend?

If the issue is only with GuC, the patch should have the drm/i915/guc 
prefix as a minimum. But if it actually only becomes a problem when the 
GuC backend stops parking with the kernel context, then I think the whole 
unpark code should be refactored in a cleaner way than just removing the 
one assert. Otherwise what is the point of leaving everything else in there?

Or if the issue is backend agnostic, *if* a full reset happens to hit 
during parking, then it is different. Wouldn't that be a race between 
parking and reset, which probably shouldn't happen to start with?

Regards,

Tvrtko

> 
> John.
> 
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
>>>   1 file changed, 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c 
>>> b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>> index b0a4a2dbe3ee9..fb3e1599d04ec 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>> @@ -68,8 +68,6 @@ static int __engine_unpark(struct intel_wakeref *wf)
>>>                ce->timeline->seqno,
>>>                READ_ONCE(*ce->timeline->hwsp_seqno),
>>>                ce->ring->emit);
>>> -        GEM_BUG_ON(ce->timeline->seqno !=
>>> -               READ_ONCE(*ce->timeline->hwsp_seqno));
>>>       }
>>>         if (engine->unpark)
>
John Harrison July 21, 2022, 12:54 a.m. UTC | #4
On 7/19/2022 02:42, Tvrtko Ursulin wrote:
>
> On 19/07/2022 01:05, John Harrison wrote:
>> On 7/18/2022 05:15, Tvrtko Ursulin wrote:
>>>
>>> On 13/07/2022 00:31, John.C.Harrison@Intel.com wrote:
>>>> From: Matthew Brost <matthew.brost@intel.com>
>>>>
>>>> Remove bogus GEM_BUG_ON which compared kernel context timeline 
>>>> seqno to
>>>> seqno in memory on engine PM unpark. If a GT reset occurred these 
>>>> values
>>>> might not match as a kernel context could be skipped. This bug was
>>>> hidden by always switching to a kernel context on park (execlists
>>>> requirement).
>>>
>>> Reset of the kernel context? Under which circumstances does that 
>>> happen?
>> As per description, the issue is with full GT reset.
>>
>>>
>>> It is unclear if the claim is this to be a general problem or the 
>>> assert is only invalid with the GuC. Lack of a CI reported issue 
>>> suggests it is not a generic problem?
>> Currently it is not an issue because we always switch to the kernel 
>> context because that's how execlists works and the entire driver is 
>> fundamentally based on execlist operation. When we stop using the 
>> kernel context as a (non-functional) barrier when using GuC 
>> submission, then you would see an issue without this fix.
>
> Issue is with GuC, GuC and full reset, or with full reset regardless 
> of the backend?
The issue is with code making invalid assumptions. The assumption is 
currently not failing because the execlist backend requires the use of a 
barrier context for a bunch of operations. The GuC backend does not 
require this. In fact, the barrier context does not function as a 
barrier when the scheduler is external to i915. Hence the desire to 
remove the use of the barrier context from generic i915 operation and 
make it only used when in execlist mode. At that point, the invalid 
assumption will no longer work and the BUG will fire.

>
> If issue is only with GuC patch should have drm/i915/guc prefix as 
> minimum. But if it actually only becomes a problem when GuC backend 
> stops parking with the kernel context then I think the whole unpark 
> code should be refactored in a cleaner way than just removing the one 
> assert. Otherwise what is the point of leaving everything else in there?
>
> Or if the issue is backend agnostic, *if* full reset happens to hit 
> during parking, then it is different. Wouldn't that be a race with 
> parking and reset which probably shouldn't happen to start with.
>
The issue is neither with GuC nor with resets, GT or otherwise. The 
issue is with generic i915 code making assumptions about backend 
implementations that are only correct for the execlist implementation.

John.


> Regards,
>
> Tvrtko
>
>>
>> John.
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
>>>>   1 file changed, 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c 
>>>> b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>> index b0a4a2dbe3ee9..fb3e1599d04ec 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>> @@ -68,8 +68,6 @@ static int __engine_unpark(struct intel_wakeref *wf)
>>>>                ce->timeline->seqno,
>>>>                READ_ONCE(*ce->timeline->hwsp_seqno),
>>>>                ce->ring->emit);
>>>> -        GEM_BUG_ON(ce->timeline->seqno !=
>>>> -               READ_ONCE(*ce->timeline->hwsp_seqno));
>>>>       }
>>>>         if (engine->unpark)
>>
Tvrtko Ursulin July 21, 2022, 9:24 a.m. UTC | #5
On 21/07/2022 01:54, John Harrison wrote:
> On 7/19/2022 02:42, Tvrtko Ursulin wrote:
>> On 19/07/2022 01:05, John Harrison wrote:
>>> On 7/18/2022 05:15, Tvrtko Ursulin wrote:
>>>>
>>>> On 13/07/2022 00:31, John.C.Harrison@Intel.com wrote:
>>>>> From: Matthew Brost <matthew.brost@intel.com>
>>>>>
>>>>> Remove bogus GEM_BUG_ON which compared kernel context timeline 
>>>>> seqno to
>>>>> seqno in memory on engine PM unpark. If a GT reset occurred these 
>>>>> values
>>>>> might not match as a kernel context could be skipped. This bug was
>>>>> hidden by always switching to a kernel context on park (execlists
>>>>> requirement).
>>>>
>>>> Reset of the kernel context? Under which circumstances does that 
>>>> happen?
>>> As per description, the issue is with full GT reset.
>>>
>>>>
>>>> It is unclear if the claim is this to be a general problem or the 
>>>> assert is only invalid with the GuC. Lack of a CI reported issue 
>>>> suggests it is not a generic problem?
>>> Currently it is not an issue because we always switch to the kernel 
>>> context because that's how execlists works and the entire driver is 
>>> fundamentally based on execlist operation. When we stop using the 
>>> kernel context as a (non-functional) barrier when using GuC 
>>> submission, then you would see an issue without this fix.

Let me pick this point to try again - I am simply asking for a clear 
description of steps which lead to the problem, instead of, what I find 
are, generic and hard to penetrate statements of invalid assumptions etc.

I picked this spot because of this: "When we stop using the kernel 
context as a (non-functional) barrier when using GuC submission, then 
you would see an issue without this fix."

I point to 363324292710 ("drm/i915/guc: Don't call 
switch_to_kernel_context with GuC submission"). Hence it is not when but 
it already happened. Which in my mind then does not compute - I can't 
grok the explanation which appears to fall over on the first claim.

Or perhaps the bug on is already firing today on every GuC enabled 
machine in the CI? In which case there is a Fixes: link to be added?

I have asked: if we have 363324292710, and if this patch is removing 
the seqno bug on, why is it not removing something more in 
__engine_unpark, gated on "is guc"? Like, is there a point to sanitizing 
the context which wasn't lost, because it wasn't used to park the engine 
with?

Or if the problem can't be hit with execlists (in case the reset claim 
from the commit message is misleading), why shouldn't the bug on be 
changed to contain the !guc condition instead of being removed?

I am simply asking for a clear explanation of the conditions and steps 
which lead to the bug on incorrectly firing. It doesn't have to be long 
text or anything like that, just clear so we can close this and move on.

Regards,

Tvrtko

>>
>> Issue is with GuC, GuC and full reset, or with full reset regardless 
>> of the backend?
> The issue is with code making invalid assumptions. The assumption is 
> currently not failing because the execlist backend requires the use of a 
> barrier context for a bunch of operations. The GuC backend does not 
> require this. In fact, the barrier context does not function as a 
> barrier when the scheduler is external to i915. Hence the desire to 
> remove the use of the barrier context from generic i915 operation and 
> make it only used when in execlist mode. At that point, the invalid 
> assumption will no longer work and the BUG will fire.
> 
>>
>> If issue is only with GuC patch should have drm/i915/guc prefix as 
>> minimum. But if it actually only becomes a problem when GuC backend 
>> stops parking with the kernel context then I think the whole unpark 
>> code should be refactored in a cleaner way than just removing the one 
>> assert. Otherwise what is the point of leaving everything else in there?
>>
>> Or if the issue is backend agnostic, *if* full reset happens to hit 
>> during parking, then it is different. Wouldn't that be a race with 
>> parking and reset which probably shouldn't happen to start with.
>>
> The issue is neither with GuC nor with resets, GT or otherwise. The 
> issue is with generic i915 code making assumptions about backend 
> implementations that are only correct for the execlist implementation.
> 
> John.
> 
> 
>> Regards,
>>
>> Tvrtko
>>
>>>
>>> John.
>>>
>>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
>>>>>   1 file changed, 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c 
>>>>> b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>> index b0a4a2dbe3ee9..fb3e1599d04ec 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>> @@ -68,8 +68,6 @@ static int __engine_unpark(struct intel_wakeref *wf)
>>>>>                ce->timeline->seqno,
>>>>>                READ_ONCE(*ce->timeline->hwsp_seqno),
>>>>>                ce->ring->emit);
>>>>> -        GEM_BUG_ON(ce->timeline->seqno !=
>>>>> -               READ_ONCE(*ce->timeline->hwsp_seqno));
>>>>>       }
>>>>>         if (engine->unpark)
>>>
>
John Harrison July 22, 2022, 7:09 p.m. UTC | #6
On 7/21/2022 02:24, Tvrtko Ursulin wrote:
> On 21/07/2022 01:54, John Harrison wrote:
>> On 7/19/2022 02:42, Tvrtko Ursulin wrote:
>>> On 19/07/2022 01:05, John Harrison wrote:
>>>> On 7/18/2022 05:15, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 13/07/2022 00:31, John.C.Harrison@Intel.com wrote:
>>>>>> From: Matthew Brost <matthew.brost@intel.com>
>>>>>>
>>>>>> Remove bogus GEM_BUG_ON which compared kernel context timeline 
>>>>>> seqno to
>>>>>> seqno in memory on engine PM unpark. If a GT reset occurred these 
>>>>>> values
>>>>>> might not match as a kernel context could be skipped. This bug was
>>>>>> hidden by always switching to a kernel context on park (execlists
>>>>>> requirement).
>>>>>
>>>>> Reset of the kernel context? Under which circumstances does that 
>>>>> happen?
>>>> As per description, the issue is with full GT reset.
>>>>
>>>>>
>>>>> It is unclear if the claim is this to be a general problem or the 
>>>>> assert is only invalid with the GuC. Lack of a CI reported issue 
>>>>> suggests it is not a generic problem?
>>>> Currently it is not an issue because we always switch to the kernel 
>>>> context because that's how execlists works and the entire driver is 
>>>> fundamentally based on execlist operation. When we stop using the 
>>>> kernel context as a (non-functional) barrier when using GuC 
>>>> submission, then you would see an issue without this fix.
>
> Let me pick this point to try again - I am simply asking for a clear 
> description of steps which lead to the problem, instead of, what I 
> find are, generic and hard to penetrate statements of invalid 
> assumptions etc.
>
> I picked this spot because of this: "When we stop using the kernel 
> context as a (non-functional) barrier when using GuC submission, then 
> you would see an issue without this fix."
>
> I point to 363324292710 ("drm/i915/guc: Don't call 
> switch_to_kernel_context with GuC submission"). Hence it is not when 
> but it already happened. Which in my mind then does not compute - I 
> can't grok the explanation which appears to fall over on the first claim.
>
> Or perhaps the bug on is already firing today on every GuC enabled 
> machine in the CI? In which case there is a Fixes: link to be added?
>
> I have asked about, if we have 363324292710, and if this patch is 
> removing the seqno bug on, why it is not removing something more in 
> __engine_unpark, gated on "is guc"? Like, is there a point to 
> sanitizing the context which wasn't lost, because it wasn't used to 
> park the engine with?
>
> Or if the problem can't be hit with execlists (in case the reset claim 
> from the commit message is misleading), why shouldn't the bug on be 
> changed to contain the !guc condition instead of being removed?
>
> I am simply asking for a clear explanation of the conditions and steps 
> which lead to the bug on incorrectly firing. It doesn't have to be 
> long text or anything like that, just clear so we can close this and 
> move on.
>
> Regards,
>
> Tvrtko
@Matthew Brost, it's your patch, do you recall the details of when it 
was going bang? I vaguely recall something about it being hit in local 
testing pre-merge rather than by CI post-merge.

John.

>
>>>
>>> Issue is with GuC, GuC and full reset, or with full reset regardless 
>>> of the backend?
>> The issue is with code making invalid assumptions. The assumption is 
>> currently not failing because the execlist backend requires the use 
>> of a barrier context for a bunch of operations. The GuC backend does 
>> not require this. In fact, the barrier context does not function as a 
>> barrier when the scheduler is external to i915. Hence the desire to 
>> remove the use of the barrier context from generic i915 operation and 
>> make it only used when in execlist mode. At that point, the invalid 
>> assumption will no longer work and the BUG will fire.
>>
>>>
>>> If issue is only with GuC patch should have drm/i915/guc prefix as 
>>> minimum. But if it actually only becomes a problem when GuC backend 
>>> stops parking with the kernel context then I think the whole unpark 
>>> code should be refactored in a cleaner way than just removing the 
>>> one assert. Otherwise what is the point of leaving everything else 
>>> in there?
>>>
>>> Or if the issue is backend agnostic, *if* full reset happens to hit 
>>> during parking, then it is different. Wouldn't that be a race with 
>>> parking and reset which probably shouldn't happen to start with.
>>>
>> The issue is neither with GuC nor with resets, GT or otherwise. The 
>> issue is with generic i915 code making assumptions about backend 
>> implementations that are only correct for the execlist implementation.
>>
>> John.
>>
>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>
>>>> John.
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>>
>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gt/intel_engine_pm.c | 2 --
>>>>>>   1 file changed, 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c 
>>>>>> b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> index b0a4a2dbe3ee9..fb3e1599d04ec 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
>>>>>> @@ -68,8 +68,6 @@ static int __engine_unpark(struct intel_wakeref *wf)
>>>>>>                ce->timeline->seqno,
>>>>>>                READ_ONCE(*ce->timeline->hwsp_seqno),
>>>>>>                ce->ring->emit);
>>>>>> -        GEM_BUG_ON(ce->timeline->seqno !=
>>>>>> -               READ_ONCE(*ce->timeline->hwsp_seqno));
>>>>>>       }
>>>>>>         if (engine->unpark)
>>>>
>>

Patch

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_pm.c b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
index b0a4a2dbe3ee9..fb3e1599d04ec 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_pm.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_pm.c
@@ -68,8 +68,6 @@  static int __engine_unpark(struct intel_wakeref *wf)
 			 ce->timeline->seqno,
 			 READ_ONCE(*ce->timeline->hwsp_seqno),
 			 ce->ring->emit);
-		GEM_BUG_ON(ce->timeline->seqno !=
-			   READ_ONCE(*ce->timeline->hwsp_seqno));
 	}
 
 	if (engine->unpark)
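
For illustration only: comment #5 asks why the assertion is not gated on the submission backend ("the !guc condition") instead of being deleted. The sketch below is a user-space model of that alternative, not i915 code and not what this patch does. Every name in it (struct model_timeline, struct model_engine, uses_guc, model_unpark_check_fires, model_unpark) is a hypothetical stand-in, assert() stands in for GEM_BUG_ON, and the volatile read stands in for READ_ONCE. It also assumes, as the thread suggests but does not confirm, that only the execlists backend is guaranteed to have parked on the kernel context.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real i915 structures. */
struct model_timeline {
	uint32_t seqno;             /* last seqno assigned by software */
	const uint32_t *hwsp_seqno; /* last seqno written back to memory */
};

struct model_engine {
	bool uses_guc;              /* assumed flag: GuC submission in use */
	struct model_timeline kernel_timeline;
};

/*
 * Returns true if a backend-gated version of the removed check would trip.
 * With the (assumed) GuC flag set the comparison is skipped entirely,
 * because the engine is not expected to have parked on the kernel context.
 */
bool model_unpark_check_fires(const struct model_engine *engine)
{
	const struct model_timeline *tl = &engine->kernel_timeline;

	if (engine->uses_guc)
		return false;

	/* Volatile read as a rough user-space analogue of READ_ONCE(). */
	return tl->seqno != *(const volatile uint32_t *)tl->hwsp_seqno;
}

/* Model of the unpark path: assert() plays the role of GEM_BUG_ON(). */
void model_unpark(const struct model_engine *engine)
{
	assert(!model_unpark_check_fires(engine));
}
```

In this model, an engine with uses_guc set passes model_unpark() even when the two seqno values differ, while an execlists-style engine still trips the assertion on a mismatch. Whether such gating is preferable to removing the check is the open question in the thread.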