drm/i915/gt: Be defensive in the face of false CS events
diff mbox series

Message ID 20200710121609.6775-1-chris@chris-wilson.co.uk
State New
Headers show
Series
  • drm/i915/gt: Be defensive in the face of false CS events
Related show

Commit Message

Chris Wilson July 10, 2020, 12:16 p.m. UTC
If the HW throws a curve ball and reports either en event before it is
possible, or just a completely impossible event, we have to grin and
bear it. The first few events, we will likely not notice as we would be
expecting some event, but as soon as we stop expecting an event and yet
they still keep coming, then we enter into undefined state territory.
In which case, bail out, stop processing the events, and reset the
engine and our set of queued requests to recover.

The sporadic hangs and warnings will continue to plague CI, but at least
system stability should not be compromised.

Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_lrc.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Tvrtko Ursulin July 10, 2020, 12:30 p.m. UTC | #1
On 10/07/2020 13:16, Chris Wilson wrote:
> If the HW throws a curve ball and reports either en event before it is
> possible, or just a completely impossible event, we have to grin and
> bear it. The first few events, we will likely not notice as we would be
> expecting some event, but as soon as we stop expecting an event and yet
> they still keep coming, then we enter into undefined state territory.
> In which case, bail out, stop processing the events, and reset the
> engine and our set of queued requests to recover.
> 
> The sporadic hangs and warnings will continue to plague CI, but at least
> system stability should not be compromised.
> 
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_lrc.c | 8 ++++++--
>   1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> index fbcfeaed6441..c86324d2d2bb 100644
> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> @@ -2567,6 +2567,7 @@ static void process_csb(struct intel_engine_cs *engine)
>   	tail = READ_ONCE(*execlists->csb_write);
>   	if (unlikely(head == tail))
>   		return;
> +	execlists->csb_head = tail;

This deserves a comment...

>   
>   	/*
>   	 * Hopefully paired with a wmb() in HW!
> @@ -2613,6 +2614,9 @@ static void process_csb(struct intel_engine_cs *engine)
>   		if (promote) {
>   			struct i915_request * const *old = execlists->active;
>   
> +			if (GEM_WARN_ON(!*execlists->pending))
> +				break;
> +

... but why not continue? You think nothing good can come out of trying 
further and break simply expedites the hang? We have to be confident we 
can cope with any random i915 state caused by skipping maybe valid entries.

Conclusion will define what kind of comment to put above. "Assume we 
always consume all CSB entries, or things are really bad and we mark all 
as invalid upon finding first bad entry"?

Regards,

Tvrtko

>   			ring_set_paused(engine, 0);
>   
>   			/* Point active to the new ELSP; prevent overwriting */
> @@ -2635,7 +2639,8 @@ static void process_csb(struct intel_engine_cs *engine)
>   
>   			WRITE_ONCE(execlists->pending[0], NULL);
>   		} else {
> -			GEM_BUG_ON(!*execlists->active);
> +			if (GEM_WARN_ON(!*execlists->active))
> +				break;
>   
>   			/* port0 completed, advanced to port1 */
>   			trace_ports(execlists, "completed", execlists->active);
> @@ -2686,7 +2691,6 @@ static void process_csb(struct intel_engine_cs *engine)
>   		}
>   	} while (head != tail);
>   
> -	execlists->csb_head = head;
>   	set_timeslice(engine);
>   
>   	/*
>
Chris Wilson July 10, 2020, 12:35 p.m. UTC | #2
Quoting Tvrtko Ursulin (2020-07-10 13:30:09)
> 
> On 10/07/2020 13:16, Chris Wilson wrote:
> > If the HW throws a curve ball and reports either en event before it is
> > possible, or just a completely impossible event, we have to grin and
> > bear it. The first few events, we will likely not notice as we would be
> > expecting some event, but as soon as we stop expecting an event and yet
> > they still keep coming, then we enter into undefined state territory.
> > In which case, bail out, stop processing the events, and reset the
> > engine and our set of queued requests to recover.
> > 
> > The sporadic hangs and warnings will continue to plague CI, but at least
> > system stability should not be compromised.
> > 
> > Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_lrc.c | 8 ++++++--
> >   1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > index fbcfeaed6441..c86324d2d2bb 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > @@ -2567,6 +2567,7 @@ static void process_csb(struct intel_engine_cs *engine)
> >       tail = READ_ONCE(*execlists->csb_write);
> >       if (unlikely(head == tail))
> >               return;
> > +     execlists->csb_head = tail;
> 
> This deserves a comment...
> 
> >   
> >       /*
> >        * Hopefully paired with a wmb() in HW!
> > @@ -2613,6 +2614,9 @@ static void process_csb(struct intel_engine_cs *engine)
> >               if (promote) {
> >                       struct i915_request * const *old = execlists->active;
> >   
> > +                     if (GEM_WARN_ON(!*execlists->pending))
> > +                             break;
> > +
> 
> ... but why not continue? You think nothing good can come out of trying 
> further and break simply expedites the hang? We have to be confident we 
> can cope with any random i915 state caused by skipping maybe valid entries.

We are already past the point of no return as the events coming from HW
do not correspond to our events; continuing on cannot recover, we will
already have made mistakes.
 
> Conclusion will define what kind of comment to put above. "Assume we 
> always consume all CSB entries, or things are really bad and we mark all 
> as invalid upon finding first bad entry"?

It's dead, Jim.

We escape out, reset the engine/GPU, consign the port tracking to the bin,
and reload with the next set of requests.
-Chris
Tvrtko Ursulin July 10, 2020, 12:49 p.m. UTC | #3
On 10/07/2020 13:35, Chris Wilson wrote:
> Quoting Tvrtko Ursulin (2020-07-10 13:30:09)
>>
>> On 10/07/2020 13:16, Chris Wilson wrote:
>>> If the HW throws a curve ball and reports either en event before it is
>>> possible, or just a completely impossible event, we have to grin and
>>> bear it. The first few events, we will likely not notice as we would be
>>> expecting some event, but as soon as we stop expecting an event and yet
>>> they still keep coming, then we enter into undefined state territory.
>>> In which case, bail out, stop processing the events, and reset the
>>> engine and our set of queued requests to recover.
>>>
>>> The sporadic hangs and warnings will continue to plague CI, but at least
>>> system stability should not be compromised.
>>>
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
>>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> ---
>>>    drivers/gpu/drm/i915/gt/intel_lrc.c | 8 ++++++--
>>>    1 file changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> index fbcfeaed6441..c86324d2d2bb 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>> @@ -2567,6 +2567,7 @@ static void process_csb(struct intel_engine_cs *engine)
>>>        tail = READ_ONCE(*execlists->csb_write);
>>>        if (unlikely(head == tail))
>>>                return;
>>> +     execlists->csb_head = tail;
>>
>> This deserves a comment...
>>
>>>    
>>>        /*
>>>         * Hopefully paired with a wmb() in HW!
>>> @@ -2613,6 +2614,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>>                if (promote) {
>>>                        struct i915_request * const *old = execlists->active;
>>>    
>>> +                     if (GEM_WARN_ON(!*execlists->pending))
>>> +                             break;
>>> +
>>
>> ... but why not continue? You think nothing good can come out of trying
>> further and break simply expedites the hang? We have to be confident we
>> can cope with any random i915 state caused by skipping maybe valid entries.
> 
> We are already past the point of no return as the events coming from HW
> do not correspond to our events; continuing on cannot recover, we will
> already have made mistakes.

Yeah, I am just worried if between first error and reset, the fact we 
skipped possible valid entries, could cause hitting some other bug on or 
null ptr deref. I don't have anything concrete.. so maybe just FUD.

>> Conclusion will define what kind of comment to put above. "Assume we
>> always consume all CSB entries, or things are really bad and we mark all
>> as invalid upon finding first bad entry"?
> 
> It's dead, Jim.
> 
> We escape out, reset the engine/GPU, consign the port tracking to the bin,
> and reload with the next set of requests.

With a comment at the "execlists->csb_head = tail;" site explaining the 
plan for handling seriously unexpected HW events:

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko
Ruhl, Michael J July 10, 2020, 5:23 p.m. UTC | #4
>-----Original Message-----
>From: Intel-gfx <intel-gfx-bounces@lists.freedesktop.org> On Behalf Of Chris
>Wilson
>Sent: Friday, July 10, 2020 8:16 AM
>To: intel-gfx@lists.freedesktop.org
>Cc: Chris Wilson <chris@chris-wilson.co.uk>
>Subject: [Intel-gfx] [PATCH] drm/i915/gt: Be defensive in the face of false CS
>events
>
>If the HW throws a curve ball and reports either en event before it is
                                                                                         ^^
s/en/an/

?

m

>possible, or just a completely impossible event, we have to grin and
>bear it. The first few events, we will likely not notice as we would be
>expecting some event, but as soon as we stop expecting an event and yet
>they still keep coming, then we enter into undefined state territory.
>In which case, bail out, stop processing the events, and reset the
>engine and our set of queued requests to recover.
>
>The sporadic hangs and warnings will continue to plague CI, but at least
>system stability should not be compromised.
>
>Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/2045
>Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>---
> drivers/gpu/drm/i915/gt/intel_lrc.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c
>b/drivers/gpu/drm/i915/gt/intel_lrc.c
>index fbcfeaed6441..c86324d2d2bb 100644
>--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>@@ -2567,6 +2567,7 @@ static void process_csb(struct intel_engine_cs
>*engine)
> 	tail = READ_ONCE(*execlists->csb_write);
> 	if (unlikely(head == tail))
> 		return;
>+	execlists->csb_head = tail;
>
> 	/*
> 	 * Hopefully paired with a wmb() in HW!
>@@ -2613,6 +2614,9 @@ static void process_csb(struct intel_engine_cs
>*engine)
> 		if (promote) {
> 			struct i915_request * const *old = execlists->active;
>
>+			if (GEM_WARN_ON(!*execlists->pending))
>+				break;
>+
> 			ring_set_paused(engine, 0);
>
> 			/* Point active to the new ELSP; prevent overwriting
>*/
>@@ -2635,7 +2639,8 @@ static void process_csb(struct intel_engine_cs
>*engine)
>
> 			WRITE_ONCE(execlists->pending[0], NULL);
> 		} else {
>-			GEM_BUG_ON(!*execlists->active);
>+			if (GEM_WARN_ON(!*execlists->active))
>+				break;
>
> 			/* port0 completed, advanced to port1 */
> 			trace_ports(execlists, "completed", execlists->active);
>@@ -2686,7 +2691,6 @@ static void process_csb(struct intel_engine_cs
>*engine)
> 		}
> 	} while (head != tail);
>
>-	execlists->csb_head = head;
> 	set_timeslice(engine);
>
> 	/*
>--
>2.20.1
>
>_______________________________________________
>Intel-gfx mailing list
>Intel-gfx@lists.freedesktop.org
>https://lists.freedesktop.org/mailman/listinfo/intel-gfx

Patch
diff mbox series

diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
index fbcfeaed6441..c86324d2d2bb 100644
--- a/drivers/gpu/drm/i915/gt/intel_lrc.c
+++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
@@ -2567,6 +2567,7 @@  static void process_csb(struct intel_engine_cs *engine)
 	tail = READ_ONCE(*execlists->csb_write);
 	if (unlikely(head == tail))
 		return;
+	execlists->csb_head = tail;
 
 	/*
 	 * Hopefully paired with a wmb() in HW!
@@ -2613,6 +2614,9 @@  static void process_csb(struct intel_engine_cs *engine)
 		if (promote) {
 			struct i915_request * const *old = execlists->active;
 
+			if (GEM_WARN_ON(!*execlists->pending))
+				break;
+
 			ring_set_paused(engine, 0);
 
 			/* Point active to the new ELSP; prevent overwriting */
@@ -2635,7 +2639,8 @@  static void process_csb(struct intel_engine_cs *engine)
 
 			WRITE_ONCE(execlists->pending[0], NULL);
 		} else {
-			GEM_BUG_ON(!*execlists->active);
+			if (GEM_WARN_ON(!*execlists->active))
+				break;
 
 			/* port0 completed, advanced to port1 */
 			trace_ports(execlists, "completed", execlists->active);
@@ -2686,7 +2691,6 @@  static void process_csb(struct intel_engine_cs *engine)
 		}
 	} while (head != tail);
 
-	execlists->csb_head = head;
 	set_timeslice(engine);
 
 	/*