[v2,1/3] drm/i915: Drop racy markup of missed-irqs from idle-worker

Message ID 1469084259-23110-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson July 21, 2016, 6:57 a.m. UTC
During the idle-worker we disable the hangcheck and so kick any waiters
that should have been completed (since the GPU is now idle). Unlike the
hangcheck, we do not take any care to avoid the race between the irq
handler and ourselves, and so it is possible for us to declare a missed
interrupt even as the bottom-half is being scheduled to run. Let's
ignore this race to stop a potential false-positive error.

References: https://bugs.freedesktop.org/show_bug.cgi?id=96974
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Comments

Tvrtko Ursulin July 21, 2016, 9:58 a.m. UTC | #1
On 21/07/16 07:57, Chris Wilson wrote:
> During the idle-worker we disable the hangcheck and so kick any waiters
> that should have been completed (since the GPU is now idle). Unlike the
> hangcheck, we do not take any care to avoid the race between the irq
> handler and ourselves, and so it is possible for us to declare a missed
> interrupt even as the bottom-half is being scheduled to run. Let's
> ignore this race to stop a potential false-positive error.

If the bottom half is scheduled to run then...

> References: https://bugs.freedesktop.org/show_bug.cgi?id=96974
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gem.c | 7 +++----
>   1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 40047eb48826..9e826585edb2 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2706,10 +2706,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
>   	rearm_hangcheck = false;
>
>   	stuck_engines = intel_kick_waiters(dev_priv);

... this will not return a stuck engine, since there is a bh task
assigned right up until the bh exits.

So I don't get it. :)

Regards,

Tvrtko

> -	if (unlikely(stuck_engines)) {
> -		DRM_DEBUG_DRIVER("kicked stuck waiters...missed irq\n");
> -		dev_priv->gpu_error.missed_irq_rings |= stuck_engines;
> -	}
> +	if (unlikely(stuck_engines))
> +		DRM_DEBUG_DRIVER("kicked stuck waiters (%x)...missed irq?\n",
> +				 stuck_engines);
>
>   	if (INTEL_GEN(dev_priv) >= 6)
>   		gen6_rps_idle(dev_priv);
>
Chris Wilson July 21, 2016, 10:10 a.m. UTC | #2
On Thu, Jul 21, 2016 at 10:58:05AM +0100, Tvrtko Ursulin wrote:
> 
> On 21/07/16 07:57, Chris Wilson wrote:
> >During the idle-worker we disable the hangcheck and so kick any waiters
> >that should have been completed (since the GPU is now idle). Unlike the
> >hangcheck, we do not take any care to avoid the race between the irq
> >handler and ourselves, and so it is possible for us to declare a missed
> >interrupt even as the bottom-half is being scheduled to run. Let's
> >ignore this race to stop a potential false-positive error.
> 
> If the bottom half is scheduled to run then...
> 
> >References: https://bugs.freedesktop.org/show_bug.cgi?id=96974
> >Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >---
> >  drivers/gpu/drm/i915/i915_gem.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> >
> >diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >index 40047eb48826..9e826585edb2 100644
> >--- a/drivers/gpu/drm/i915/i915_gem.c
> >+++ b/drivers/gpu/drm/i915/i915_gem.c
> >@@ -2706,10 +2706,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
> >  	rearm_hangcheck = false;
> >
> >  	stuck_engines = intel_kick_waiters(dev_priv);
> 
> ... this will not return a stuck engine, since there is a bh
> task assigned right up until the bh exits.

It reports if it wakes up a waiter on any engine. If the bh is already
running, we cannot know if it has missed the seqno update. If it isn't
running yet, we cannot know if it is about to be run.
-Chris
Tvrtko Ursulin July 21, 2016, 10:28 a.m. UTC | #3
On 21/07/16 11:10, Chris Wilson wrote:
> On Thu, Jul 21, 2016 at 10:58:05AM +0100, Tvrtko Ursulin wrote:
>>
>> On 21/07/16 07:57, Chris Wilson wrote:
>>> During the idle-worker we disable the hangcheck and so kick any waiters
>>> that should have been completed (since the GPU is now idle). Unlike the
>>> hangcheck, we do not take any care to avoid the race between the irq
>>> handler and ourselves, and so it is possible for us to declare a missed
>>> interrupt even as the bottom-half is being scheduled to run. Let's
>>> ignore this race to stop a potential false-positive error.
>>
>> If the bottom half is scheduled to run then...
>>
>>> References: https://bugs.freedesktop.org/show_bug.cgi?id=96974
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_gem.c | 7 +++----
>>>   1 file changed, 3 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 40047eb48826..9e826585edb2 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -2706,10 +2706,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
>>>   	rearm_hangcheck = false;
>>>
>>>   	stuck_engines = intel_kick_waiters(dev_priv);
>>
>> ... this will not return a stuck engine, since there is a bh
>> task assigned right up until the bh exits.
>
> It reports if it wakes up a waiter on any engine. If the bh is already
> running, we cannot know if it has missed the seqno update. If it isn't
> running yet, we cannot know if it is about to be run.

Oh, I read the logic as the complete opposite of what it is.

Since the idle worker runs 100ms after the last retirement, would that mean
a really slow waiter, or what?

Regards,

Tvrtko
Chris Wilson July 21, 2016, 11:04 a.m. UTC | #4
On Thu, Jul 21, 2016 at 11:28:02AM +0100, Tvrtko Ursulin wrote:
> 
> On 21/07/16 11:10, Chris Wilson wrote:
> >On Thu, Jul 21, 2016 at 10:58:05AM +0100, Tvrtko Ursulin wrote:
> >>
> >>On 21/07/16 07:57, Chris Wilson wrote:
> >>>During the idle-worker we disable the hangcheck and so kick any waiters
> >>>that should have been completed (since the GPU is now idle). Unlike the
> >>>hangcheck, we do not take any care to avoid the race between the irq
> >>>handler and ourselves, and so it is possible for us to declare a missed
> >>>interrupt even as the bottom-half is being scheduled to run. Let's
> >>>ignore this race to stop a potential false-positive error.
> >>
> >>If the bottom half is scheduled to run then...
> >>
> >>>References: https://bugs.freedesktop.org/show_bug.cgi?id=96974
> >>>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>>Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>>Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> >>>---
> >>>  drivers/gpu/drm/i915/i915_gem.c | 7 +++----
> >>>  1 file changed, 3 insertions(+), 4 deletions(-)
> >>>
> >>>diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>>index 40047eb48826..9e826585edb2 100644
> >>>--- a/drivers/gpu/drm/i915/i915_gem.c
> >>>+++ b/drivers/gpu/drm/i915/i915_gem.c
> >>>@@ -2706,10 +2706,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
> >>>  	rearm_hangcheck = false;
> >>>
> >>>  	stuck_engines = intel_kick_waiters(dev_priv);
> >>
> >>... this will not return a stuck engine, since there is a bh
> >>task assigned right up until the bh exits.
> >
> >It reports if it wakes up a waiter on any engine. If the bh is already
> >running, we cannot know if it has missed the seqno update. If it isn't
> >running yet, we cannot know if it is about to be run.
> 
> Oh, I read the logic as the complete opposite of what it is.
> 
> Since the idle worker runs 100ms after the last retirement, would
> that mean a really slow waiter, or what?

It is dubious. But the idle worker runs 100ms after the first time we
detect all engines are idle, and may be running as we detect all engines
are idle again. The only thing for sure is that in some cases bdw-u
is reaching the idle-worker with an unwoken engine (and that there is
a race here in declaring it as a missed interrupt). I wasn't that
concerned about the race because of that 100ms delay where everything
should have been idle, but on reflection that 100ms is not guaranteed.
-Chris
Tvrtko Ursulin July 22, 2016, 10:10 a.m. UTC | #5
On 21/07/16 12:04, Chris Wilson wrote:
> On Thu, Jul 21, 2016 at 11:28:02AM +0100, Tvrtko Ursulin wrote:
>> On 21/07/16 11:10, Chris Wilson wrote:
>>> On Thu, Jul 21, 2016 at 10:58:05AM +0100, Tvrtko Ursulin wrote:
>>>>
>>>> On 21/07/16 07:57, Chris Wilson wrote:
>>>>> During the idle-worker we disable the hangcheck and so kick any waiters
>>>>> that should have been completed (since the GPU is now idle). Unlike the
>>>>> hangcheck, we do not take any care to avoid the race between the irq
>>>>> handler and ourselves, and so it is possible for us to declare a missed
>>>>> interrupt even as the bottom-half is being scheduled to run. Let's
>>>>> ignore this race to stop a potential false-positive error.
>>>>
>>>> If the bottom half is scheduled to run then...
>>>>
>>>>> References: https://bugs.freedesktop.org/show_bug.cgi?id=96974
>>>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/i915_gem.c | 7 +++----
>>>>>   1 file changed, 3 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>>>> index 40047eb48826..9e826585edb2 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>>>> @@ -2706,10 +2706,9 @@ i915_gem_idle_work_handler(struct work_struct *work)
>>>>>   	rearm_hangcheck = false;
>>>>>
>>>>>   	stuck_engines = intel_kick_waiters(dev_priv);
>>>>
>>>> ... this will not return a stuck engine, since there is a bh
>>>> task assigned right up until the bh exits.
>>>
>>> It reports if it wakes up a waiter on any engine. If the bh is already
>>> running, we cannot know if it has missed the seqno update. If it isn't
>>> running yet, we cannot know if it is about to be run.
>>
>> Oh, I read the logic as the complete opposite of what it is.
>>
>> Since the idle worker runs 100ms after the last retirement, would
>> that mean a really slow waiter, or what?
>
> It is dubious. But the idle worker runs 100ms after the first time we
> detect all engines are idle, and may be running as we detect all engines
> are idle again. The only thing for sure is that in some cases bdw-u
> is reaching the idle-worker with an unwoken engine (and that there is
> a race here in declaring it as a missed interrupt). I wasn't that
> concerned about the race because of that 100ms delay where everything
> should have been idle, but on reflection that 100ms is not guaranteed.

Would canceling the idle worker be too expensive?

Either way, looks OK to me.

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
Chris Wilson July 22, 2016, 10:22 a.m. UTC | #6
On Fri, Jul 22, 2016 at 11:10:28AM +0100, Tvrtko Ursulin wrote:
> 
> Would canceling the idle worker be too expensive?

It wasn't so much that; I was trying to keep runtime suspend simple,
in that the GT takes the wakelock to prevent suspend as required, and
the runtime management callbacks need no knowledge of all the users of
the device. (It means the users then have to be conscious that if they
don't hold an explicit wakelock, they should check rpm first.)
-Chris
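
One possible reading of the wakelock convention described above, sketched as a user-space model rather than driver code. Everything here is an assumption made for illustration: `struct model_rpm` and the `model_*` helpers are hypothetical and do not correspond to any i915 or kernel interface. The suspend callback only looks at the wakeref count, and a user that holds no explicit wakelock checks the state before touching the hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical runtime-pm state for this model; not a kernel structure. */
struct model_rpm {
	int wakerefs;   /* explicit wakelocks currently held */
	bool suspended; /* true once the device has been runtime suspended */
};

/* Take an explicit wakelock, waking the device if it was suspended. */
static void model_rpm_get(struct model_rpm *rpm)
{
	rpm->suspended = false;
	rpm->wakerefs++;
}

/* Drop a wakelock previously taken. */
static void model_rpm_put(struct model_rpm *rpm)
{
	assert(rpm->wakerefs > 0);
	rpm->wakerefs--;
}

/* Suspend callback: consults only the wakeref count and knows nothing
 * about who the individual users of the device are. */
static bool model_rpm_suspend(struct model_rpm *rpm)
{
	if (rpm->wakerefs > 0)
		return false;
	rpm->suspended = true;
	return true;
}

/* For users without their own wakelock: take a reference only if
 * somebody else is already keeping the device awake. */
static bool model_rpm_get_if_in_use(struct model_rpm *rpm)
{
	if (rpm->suspended || rpm->wakerefs == 0)
		return false;
	rpm->wakerefs++;
	return true;
}

/* A background worker holding no explicit wakelock "checks rpm first".
 * Returns true if it was allowed to touch the (modelled) hardware. */
static bool model_worker(struct model_rpm *rpm)
{
	if (!model_rpm_get_if_in_use(rpm))
		return false; /* device may be asleep: skip hardware access */
	/* ... hardware access would go here ... */
	model_rpm_put(rpm);
	return true;
}
```

In this model the worker never has to be cancelled from the suspend path: once the last wakelock is dropped and the device suspends, a late-running worker simply finds the check failing and returns without touching the device.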

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 40047eb48826..9e826585edb2 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2706,10 +2706,9 @@  i915_gem_idle_work_handler(struct work_struct *work)
 	rearm_hangcheck = false;
 
 	stuck_engines = intel_kick_waiters(dev_priv);
-	if (unlikely(stuck_engines)) {
-		DRM_DEBUG_DRIVER("kicked stuck waiters...missed irq\n");
-		dev_priv->gpu_error.missed_irq_rings |= stuck_engines;
-	}
+	if (unlikely(stuck_engines))
+		DRM_DEBUG_DRIVER("kicked stuck waiters (%x)...missed irq?\n",
+				 stuck_engines);
 
 	if (INTEL_GEN(dev_priv) >= 6)
 		gen6_rps_idle(dev_priv);
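
The behavioural change in the hunk above can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the driver code: `struct model_device` and the `model_*` functions are hypothetical stand-ins for `dev_priv`, `intel_kick_waiters()` and the idle worker, and `NUM_ENGINES` is an arbitrary count. The only property taken from the thread is that the kick reports a mask of the engines on which it woke a waiter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_ENGINES 5 /* arbitrary engine count for this model */

/* Hypothetical stand-in for the per-device state touched by the patch. */
struct model_device {
	bool waiter_asleep[NUM_ENGINES]; /* a waiter is still sleeping here */
	unsigned int missed_irq_rings;   /* persistent missed-interrupt mask */
};

/* Model of the kick: wake every sleeping waiter and return a bitmask of
 * the engines on which a waiter had to be woken. */
static unsigned int model_kick_waiters(struct model_device *dev)
{
	unsigned int mask = 0;

	assert(dev);
	for (int i = 0; i < NUM_ENGINES; i++) {
		if (dev->waiter_asleep[i]) {
			dev->waiter_asleep[i] = false;
			mask |= 1u << i;
		}
	}
	return mask;
}

/* Before the patch: any kicked waiter is recorded as a missed interrupt,
 * even if the bottom-half was just about to run. */
static unsigned int model_idle_worker_old(struct model_device *dev)
{
	unsigned int stuck = model_kick_waiters(dev);

	if (stuck)
		dev->missed_irq_rings |= stuck;
	return stuck;
}

/* After the patch: the mask is only logged; no persistent state changes. */
static unsigned int model_idle_worker_new(struct model_device *dev)
{
	unsigned int stuck = model_kick_waiters(dev);

	if (stuck)
		printf("kicked stuck waiters (%x)...missed irq?\n", stuck);
	return stuck;
}
```

With a waiter asleep on engine 1, `model_idle_worker_old()` returns 0x2 and leaves `missed_irq_rings` at 0x2, whereas `model_idle_worker_new()` returns the same mask but leaves `missed_irq_rings` at 0, so a waiter that merely lost the race with the bottom-half no longer leaves a lasting mark.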