
[1/5] drm/i915: Remove stale asserts from i915_gem_find_active_request()

Message ID 20180529132922.6831-1-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson May 29, 2018, 1:29 p.m. UTC
Since we use i915_gem_find_active_request() from inside
intel_engine_dump() and may call that at any time, we do not guarantee
that the engine is paused nor that the signal kthreads and irq handler
are suspended, so we cannot assert that the breadcrumb doesn't advance
and that the irq hasn't happened on another CPU signaling the request we
believe to be idle.

Fixes: f636edb214a5 ("drm/i915: Make i915_engine_info pretty printer to standalone")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)
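
The race being described can be modelled outside the kernel. Below is a
minimal, hypothetical userspace sketch of it (plain C11 with pthreads,
built with "cc -pthread"; none of this is i915 code and every name in it
is invented for illustration). The "signaler" thread stands in for the
irq handler / signal kthread completing a request on another CPU, and the
window between the completion check and the old assert is exactly where
the removed GEM_BUG_ON could fire spuriously when reached via
intel_engine_dump().

/*
 * Hypothetical model of the race (invented names, not i915 code): a
 * concurrent signaler may complete and signal the request at any time,
 * so checking "not completed" and then asserting "not signalled" is
 * inherently racy unless the GPU and signalers are known to be stopped.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_request {
	atomic_bool completed;	/* stands in for __i915_request_completed() */
	atomic_bool signaled;	/* stands in for DMA_FENCE_FLAG_SIGNALED_BIT */
};

static struct fake_request req;

/* Stands in for the irq handler / signal kthread on another CPU. */
static void *signaler(void *arg)
{
	(void)arg;
	atomic_store(&req.completed, true);
	atomic_store(&req.signaled, true);
	return NULL;
}

int main(void)
{
	pthread_t thr;

	atomic_init(&req.completed, false);
	atomic_init(&req.signaled, false);

	pthread_create(&thr, NULL, signaler, NULL);

	/* Stands in for the timeline walk picking the "active" request. */
	if (!atomic_load(&req.completed)) {
		/*
		 * The signaler may run inside this window, so an assert
		 * on !signaled here can trip even though nothing is wrong.
		 */
		if (atomic_load(&req.signaled))
			printf("a stale assert would have fired here\n");
		else
			printf("snapshot: request still active\n");
	} else {
		printf("snapshot: request already completed\n");
	}

	pthread_join(thr, NULL);
	return 0;
}

In the error-capture and reset paths the GPU is already stopped, so the
old assumptions held; in the engine-dump path they do not, which is why
the patch only takes a momentary snapshot under the timeline lock.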

Comments

Chris Wilson May 29, 2018, 5:57 p.m. UTC | #1
Quoting Patchwork (2018-05-29 18:53:25)
> == Series Details ==
> 
> Series: series starting with [1/5] drm/i915: Remove stale asserts from i915_gem_find_active_request()
> URL   : https://patchwork.freedesktop.org/series/43892/
> State : failure
> 
> == Summary ==
> 
> = CI Bug Log - changes from CI_DRM_4253_full -> Patchwork_9139_full =
> 
> == Summary - FAILURE ==
> 
>   Serious unknown changes coming with Patchwork_9139_full absolutely need to be
>   verified manually.
>   
>   If you think the reported changes have nothing to do with the changes
>   introduced in Patchwork_9139_full, please notify your bug team to allow them
>   to document this new failure mode, which will reduce false positives in CI.
> 
>   External URL: https://patchwork.freedesktop.org/api/1.0/series/43892/revisions/1/mbox/
> 
> == Possible new issues ==
> 
>   Here are the unknown changes that may have been introduced in Patchwork_9139_full:
> 
>   === IGT changes ===
> 
>     ==== Possible regressions ====
> 
>     igt@perf_pmu@busy-hang-rcs0:
>       shard-kbl:          PASS -> FAIL +1
>       shard-glk:          PASS -> FAIL +1
> 
>     igt@perf_pmu@busy-hang-vecs0:
>       shard-apl:          PASS -> FAIL +2

These failures are an issue in the igt not coordinating its wait-until-idle
with the kernel. A simple patch is on the list.
-Chris
Tvrtko Ursulin May 30, 2018, 10:43 a.m. UTC | #2
On 29/05/2018 14:29, Chris Wilson wrote:
> Since we use i915_gem_find_active_request() from inside
> intel_engine_dump() and may call that at any time, we do not guarantee
> that the engine is paused nor that the signal kthreads and irq handler
> are suspended, so we cannot assert that the breadcrumb doesn't advance
> and that the irq hasn't happened on another CPU signaling the request we
> believe to be idle.
> 
> Fixes: f636edb214a5 ("drm/i915: Make i915_engine_info pretty printer to standalone")
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gem.c | 17 ++++++++---------
>   1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 05f44ca35a06..530d6d0109b4 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2972,23 +2972,22 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
>   	struct i915_request *request, *active = NULL;
>   	unsigned long flags;
>   
> -	/* We are called by the error capture and reset at a random
> -	 * point in time. In particular, note that neither is crucially
> -	 * ordered with an interrupt. After a hang, the GPU is dead and we
> -	 * assume that no more writes can happen (we waited long enough for
> -	 * all writes that were in transaction to be flushed) - adding an
> +	/*
> +	 * We are called by the error capture, reset and to dump engine
> +	 * state at random points in time. In particular, note that neither is
> +	 * crucially ordered with an interrupt. After a hang, the GPU is dead
> +	 * and we assume that no more writes can happen (we waited long enough
> +	 * for all writes that were in transaction to be flushed) - adding an
>   	 * extra delay for a recent interrupt is pointless. Hence, we do
>   	 * not need an engine->irq_seqno_barrier() before the seqno reads.
> +	 * At all other times, we must assume the GPU is still running, but
> +	 * we only care about the snapshot of this moment.
>   	 */
>   	spin_lock_irqsave(&engine->timeline.lock, flags);
>   	list_for_each_entry(request, &engine->timeline.requests, link) {
>   		if (__i915_request_completed(request, request->global_seqno))
>   			continue;
>   
> -		GEM_BUG_ON(request->engine != engine);

Removal of this one belongs to virtual engine perhaps?

Regards,

Tvrtko

> -		GEM_BUG_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
> -				    &request->fence.flags));
> -
>   		active = request;
>   		break;
>   	}
>
Chris Wilson May 30, 2018, 10:55 a.m. UTC | #3
Quoting Tvrtko Ursulin (2018-05-30 11:43:16)
> 
> On 29/05/2018 14:29, Chris Wilson wrote:
> > Since we use i915_gem_find_active_request() from inside
> > intel_engine_dump() and may call that at any time, we do not guarantee
> > that the engine is paused nor that the signal kthreads and irq handler
> > are suspended, so we cannot assert that the breadcrumb doesn't advance
> > and that the irq hasn't happened on another CPU signaling the request we
> > believe to be idle.
> > 
> > Fixes: f636edb214a5 ("drm/i915: Make i915_engine_info pretty printer to standalone")
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > ---
> >   drivers/gpu/drm/i915/i915_gem.c | 17 ++++++++---------
> >   1 file changed, 8 insertions(+), 9 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 05f44ca35a06..530d6d0109b4 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -2972,23 +2972,22 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
> >       struct i915_request *request, *active = NULL;
> >       unsigned long flags;
> >   
> > -     /* We are called by the error capture and reset at a random
> > -      * point in time. In particular, note that neither is crucially
> > -      * ordered with an interrupt. After a hang, the GPU is dead and we
> > -      * assume that no more writes can happen (we waited long enough for
> > -      * all writes that were in transaction to be flushed) - adding an
> > +     /*
> > +      * We are called by the error capture, reset and to dump engine
> > +      * state at random points in time. In particular, note that neither is
> > +      * crucially ordered with an interrupt. After a hang, the GPU is dead
> > +      * and we assume that no more writes can happen (we waited long enough
> > +      * for all writes that were in transaction to be flushed) - adding an
> >        * extra delay for a recent interrupt is pointless. Hence, we do
> >        * not need an engine->irq_seqno_barrier() before the seqno reads.
> > +      * At all other times, we must assume the GPU is still running, but
> > +      * we only care about the snapshot of this moment.
> >        */
> >       spin_lock_irqsave(&engine->timeline.lock, flags);
> >       list_for_each_entry(request, &engine->timeline.requests, link) {
> >               if (__i915_request_completed(request, request->global_seqno))
> >                       continue;
> >   
> > -             GEM_BUG_ON(request->engine != engine);
> 
> Removal of this one belongs to virtual engine perhaps?

Nah, even with the virtual engine, the engine timeline list still holds
true. I was just thinking it was too random to check here (e.g. in the
middle of error capture) and that our more recent addition of checking
during retirement was a little more rigorous (and timely).
-Chris
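
A hypothetical sketch of that distinction (invented names, not i915 or
virtual-engine code): retirement is a serialised, once-per-request point,
so an engine-consistency check there covers every request without racing
against a dump running concurrently on another CPU, whereas a check in
the dump path only ever sees whichever request the scan happens to stop
on, at a random moment.

#include <stdio.h>
#include <stdlib.h>

struct engine {
	const char *name;
};

struct request {
	struct engine *engine;
};

/* Serialised, per-request invariant check at retirement time. */
static void retire_request(struct engine *engine, struct request *rq)
{
	if (rq->engine != engine) {
		fprintf(stderr, "request retired on wrong engine (%s)\n",
			engine->name);
		abort();
	}
	free(rq);
}

int main(void)
{
	struct engine rcs0 = { .name = "rcs0" };
	struct request *rq = malloc(sizeof(*rq));

	if (!rq)
		return 1;
	rq->engine = &rcs0;
	retire_request(&rcs0, rq);	/* checked exactly once, here */
	printf("retired cleanly on %s\n", rcs0.name);
	return 0;
}
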
Tvrtko Ursulin May 30, 2018, 11:01 a.m. UTC | #4
On 30/05/2018 11:55, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2018-05-30 11:43:16)
>>
>> On 29/05/2018 14:29, Chris Wilson wrote:
>>> Since we use i915_gem_find_active_request() from inside
>>> intel_engine_dump() and may call that at any time, we do not guarantee
>>> that the engine is paused nor that the signal kthreads and irq handler
>>> are suspended, so we cannot assert that the breadcrumb doesn't advance
>>> and that the irq hasn't happened on another CPU signaling the request we
>>> believe to be idle.
>>>
>>> Fixes: f636edb214a5 ("drm/i915: Make i915_engine_info pretty printer to standalone")
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/i915_gem.c | 17 ++++++++---------
>>>    1 file changed, 8 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
>>> index 05f44ca35a06..530d6d0109b4 100644
>>> --- a/drivers/gpu/drm/i915/i915_gem.c
>>> +++ b/drivers/gpu/drm/i915/i915_gem.c
>>> @@ -2972,23 +2972,22 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
>>>        struct i915_request *request, *active = NULL;
>>>        unsigned long flags;
>>>    
>>> -     /* We are called by the error capture and reset at a random
>>> -      * point in time. In particular, note that neither is crucially
>>> -      * ordered with an interrupt. After a hang, the GPU is dead and we
>>> -      * assume that no more writes can happen (we waited long enough for
>>> -      * all writes that were in transaction to be flushed) - adding an
>>> +     /*
>>> +      * We are called by the error capture, reset and to dump engine
>>> +      * state at random points in time. In particular, note that neither is
>>> +      * crucially ordered with an interrupt. After a hang, the GPU is dead
>>> +      * and we assume that no more writes can happen (we waited long enough
>>> +      * for all writes that were in transaction to be flushed) - adding an
>>>         * extra delay for a recent interrupt is pointless. Hence, we do
>>>         * not need an engine->irq_seqno_barrier() before the seqno reads.
>>> +      * At all other times, we must assume the GPU is still running, but
>>> +      * we only care about the snapshot of this moment.
>>>         */
>>>        spin_lock_irqsave(&engine->timeline.lock, flags);
>>>        list_for_each_entry(request, &engine->timeline.requests, link) {
>>>                if (__i915_request_completed(request, request->global_seqno))
>>>                        continue;
>>>    
>>> -             GEM_BUG_ON(request->engine != engine);
>>
>> Removal of this one belongs to virtual engine perhaps?
> 
> Nah, even with virtual engine; the engine timeline list is still true. I
> was just thinking it was too random to check here (e.g. in the middle of
> error capture) and that our more recent addition of checking during
> retirement was a little more rigorous (and timely).

It is a random check indeed. Commit message append: "While at it remove 
a random assert on..."

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
Chris Wilson May 30, 2018, 11:14 a.m. UTC | #5
Quoting Tvrtko Ursulin (2018-05-30 12:01:47)
> 
> On 30/05/2018 11:55, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-05-30 11:43:16)
> >>
> >> On 29/05/2018 14:29, Chris Wilson wrote:
> >>> Since we use i915_gem_find_active_request() from inside
> >>> intel_engine_dump() and may call that at any time, we do not guarantee
> >>> that the engine is paused nor that the signal kthreads and irq handler
> >>> are suspended, so we cannot assert that the breadcrumb doesn't advance
> >>> and that the irq hasn't happened on another CPU signaling the request we
> >>> believe to be idle.
> >>>
> >>> Fixes: f636edb214a5 ("drm/i915: Make i915_engine_info pretty printer to standalone")
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Mika Kuoppala <mika.kuoppala@intel.com>
> >>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> >>> ---
> >>>    drivers/gpu/drm/i915/i915_gem.c | 17 ++++++++---------
> >>>    1 file changed, 8 insertions(+), 9 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> >>> index 05f44ca35a06..530d6d0109b4 100644
> >>> --- a/drivers/gpu/drm/i915/i915_gem.c
> >>> +++ b/drivers/gpu/drm/i915/i915_gem.c
> >>> @@ -2972,23 +2972,22 @@ i915_gem_find_active_request(struct intel_engine_cs *engine)
> >>>        struct i915_request *request, *active = NULL;
> >>>        unsigned long flags;
> >>>    
> >>> -     /* We are called by the error capture and reset at a random
> >>> -      * point in time. In particular, note that neither is crucially
> >>> -      * ordered with an interrupt. After a hang, the GPU is dead and we
> >>> -      * assume that no more writes can happen (we waited long enough for
> >>> -      * all writes that were in transaction to be flushed) - adding an
> >>> +     /*
> >>> +      * We are called by the error capture, reset and to dump engine
> >>> +      * state at random points in time. In particular, note that neither is
> >>> +      * crucially ordered with an interrupt. After a hang, the GPU is dead
> >>> +      * and we assume that no more writes can happen (we waited long enough
> >>> +      * for all writes that were in transaction to be flushed) - adding an
> >>>         * extra delay for a recent interrupt is pointless. Hence, we do
> >>>         * not need an engine->irq_seqno_barrier() before the seqno reads.
> >>> +      * At all other times, we must assume the GPU is still running, but
> >>> +      * we only care about the snapshot of this moment.
> >>>         */
> >>>        spin_lock_irqsave(&engine->timeline.lock, flags);
> >>>        list_for_each_entry(request, &engine->timeline.requests, link) {
> >>>                if (__i915_request_completed(request, request->global_seqno))
> >>>                        continue;
> >>>    
> >>> -             GEM_BUG_ON(request->engine != engine);
> >>
> >> Removal of this one belongs to virtual engine perhaps?
> > 
> > Nah, even with the virtual engine, the engine timeline list still holds
> > true. I was just thinking it was too random to check here (e.g. in the
> > middle of error capture) and that our more recent addition of checking
> > during retirement was a little more rigorous (and timely).
> 
> It is a random check indeed. Commit message append: "While at it remove 
> a random assert on..."
> 
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Pushed this patch by itself, just so I can have a new CI baseline :)
Thanks for the review,
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 05f44ca35a06..530d6d0109b4 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2972,23 +2972,22 @@  i915_gem_find_active_request(struct intel_engine_cs *engine)
 	struct i915_request *request, *active = NULL;
 	unsigned long flags;
 
-	/* We are called by the error capture and reset at a random
-	 * point in time. In particular, note that neither is crucially
-	 * ordered with an interrupt. After a hang, the GPU is dead and we
-	 * assume that no more writes can happen (we waited long enough for
-	 * all writes that were in transaction to be flushed) - adding an
+	/*
+	 * We are called by the error capture, reset and to dump engine
+	 * state at random points in time. In particular, note that neither is
+	 * crucially ordered with an interrupt. After a hang, the GPU is dead
+	 * and we assume that no more writes can happen (we waited long enough
+	 * for all writes that were in transaction to be flushed) - adding an
 	 * extra delay for a recent interrupt is pointless. Hence, we do
 	 * not need an engine->irq_seqno_barrier() before the seqno reads.
+	 * At all other times, we must assume the GPU is still running, but
+	 * we only care about the snapshot of this moment.
 	 */
 	spin_lock_irqsave(&engine->timeline.lock, flags);
 	list_for_each_entry(request, &engine->timeline.requests, link) {
 		if (__i915_request_completed(request, request->global_seqno))
 			continue;
 
-		GEM_BUG_ON(request->engine != engine);
-		GEM_BUG_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT,
-				    &request->fence.flags));
-
 		active = request;
 		break;
 	}