[5/5] drm/i915: Fix error capture on BYT/BDW

Message ID 1390616265-4329-5-git-send-email-benjamin.widawsky@intel.com (mailing list archive)
State New, archived

Commit Message

Ben Widawsky Jan. 25, 2014, 2:17 a.m. UTC
During error capture, the previous check of whether or not the current VM
should be scanned used gen < 7. That was more or less trying to
determine whether there was a full PPGTT. At the time, this was sort of what
I meant to do because I was more interested in working backwards from
hardware state. However, this is incorrect because it does not cover
platforms that are gen7 or greater and are not using PPGTT. Examples
are BYT, which is gen7 but doesn't have PPGTT, BDW, and any platform
of gen7 or greater with the PPGTT module parameter invoked.

I am /assuming/ BYT was broken, I have not actually checked.

While here, clean up the file a bit to avoid duplicate reads (now that
the PPGTT info is in the error state).

I think Mika/Chris may have been looking at this too.

Broken by:
commit 685987c6915222730f45141a89f1cd87fb092e9a
Author: Ben Widawsky <benjamin.widawsky@intel.com>
Date:   Fri Dec 6 14:10:54 2013 -0800

    drm/i915: Identify active VM for batchbuffer capture

Reported-by: Kenneth Graunke <kenneth.w.graunke@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)
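
As a rough illustration of the problem the commit message describes, the sketch below contrasts a generation-based test with a capability-based one. The `platform` struct, its fields, and both helper functions are hypothetical stand-ins, not the driver's real types or macros. The gen7-without-PPGTT case mirrors what the commit message says about BYT, which the author states was assumed rather than checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the driver's device info; not the real i915 type. */
struct platform {
	int gen;
	bool uses_ppgtt;	/* assumption: PPGTT is present and not disabled */
};

/* Old-style test: infers "batches run from the global GTT" from the generation. */
static bool scan_ggtt_by_gen(const struct platform *p)
{
	assert(p != NULL);
	return p->gen < 7;
}

/* Capability-style test: asks directly whether PPGTT is in use. */
static bool scan_ggtt_by_capability(const struct platform *p)
{
	assert(p != NULL);
	return !p->uses_ppgtt;
}

/* True when the two tests disagree, i.e. the generation check picks the wrong VM. */
static bool gen_check_is_wrong(const struct platform *p)
{
	return scan_ggtt_by_gen(p) != scan_ggtt_by_capability(p);
}
```

Under these assumptions, a gen7 platform with `uses_ppgtt` set to false is exactly the case where the two tests disagree, while gen6 without PPGTT and gen7 with PPGTT agree.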

Comments

Chris Wilson Jan. 26, 2014, 11:47 a.m. UTC | #1
On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> The previous check during error capture of whether or not the current VM
> should be scanned used, gen < 7. That was more or less trying to
> determine if there was a full PPGTT. At the time, this was sort of what
> I meant to do because I was more interested in working backwards from
> hardware state. However, this is incorrect because it will not include
> platforms that are greater than gen7, and not having PPGTT.  Example
> would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> greater than gen7 with the PPGTT module parameter invoked.
> 
> I am /assuming/ BYT was broken, I have not actually checked.
> 
> While here, clean up the file a bit to avoid duplicate reads (now that
> the PPGTT info is in the error state).
> 
> I think Mika/Chris may have been looking at this too.

Sure, we are looking (for identifying the guilty request/batch) by using
the older, simpler mechanism of finding the first incomplete request. I
think that search is now definite since we preallocate the request and no
longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
between seqno/batch/request).

That should also apply here and be much simpler.
-Chris
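
The "first incomplete request" search described above can be sketched roughly as follows. This is a simplified, standalone illustration: `struct request`, `seqno_passed` and `first_incomplete` are hypothetical names, the array stands in for the ring's request list, and the signed-difference comparison is the usual wraparound-safe idiom, assumed here to behave like the driver's `i915_seqno_passed`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified request: one seqno per submitted batch. */
struct request {
	uint32_t seqno;
	int batch_id;	/* stand-in for the batch object */
};

/* True if seq1 is at or after seq2, tolerating 32-bit wraparound. */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	return (int32_t)(seq1 - seq2) >= 0;
}

/*
 * Walk the requests in submission order and return the first one the
 * hardware has not completed, or NULL if every request has completed.
 */
static const struct request *
first_incomplete(const struct request *reqs, size_t n, uint32_t hw_seqno)
{
	assert(reqs != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (seqno_passed(hw_seqno, reqs[i].seqno))
			continue;	/* already completed */
		return &reqs[i];
	}
	return NULL;
}
```

With a 1:1 relationship between seqno, batch and request, the first request whose seqno the hardware has not yet passed identifies the batch that was executing.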
Ben Widawsky Jan. 26, 2014, 7:05 p.m. UTC | #2
On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > The previous check during error capture of whether or not the current VM
> > should be scanned used, gen < 7. That was more or less trying to
> > determine if there was a full PPGTT. At the time, this was sort of what
> > I meant to do because I was more interested in working backwards from
> > hardware state. However, this is incorrect because it will not include
> > platforms that are greater than gen7, and not having PPGTT.  Example
> > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > greater than gen7 with the PPGTT module parameter invoked.
> > 
> > I am /assuming/ BYT was broken, I have not actually checked.
> > 
> > While here, clean up the file a bit to avoid duplicate reads (now that
> > the PPGTT info is in the error state).
> > 
> > I think Mika/Chris may have been looking at this too.
> 
> Sure, we are looking (for identifying the guilty request/batch) by using
> the older, simpler mechanism of finding the first incomplete request. I
> think that search is now definite since we preallocate the request and no
> longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> between seqno/batch/request).
> 
> That should also apply here and be much simpler.
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre

How does that solve hangs which aren't caused by requests?
Chris Wilson Jan. 26, 2014, 7:55 p.m. UTC | #3
On Sun, Jan 26, 2014 at 11:05:40AM -0800, Ben Widawsky wrote:
> On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> > On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > > The previous check during error capture of whether or not the current VM
> > > should be scanned used, gen < 7. That was more or less trying to
> > > determine if there was a full PPGTT. At the time, this was sort of what
> > > I meant to do because I was more interested in working backwards from
> > > hardware state. However, this is incorrect because it will not include
> > > platforms that are greater than gen7, and not having PPGTT.  Example
> > > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > > greater than gen7 with the PPGTT module parameter invoked.
> > > 
> > > I am /assuming/ BYT was broken, I have not actually checked.
> > > 
> > > While here, clean up the file a bit to avoid duplicate reads (now that
> > > the PPGTT info is in the error state).
> > > 
> > > I think Mika/Chris may have been looking at this too.
> > 
> > Sure, we are looking (for identifying the guilty request/batch) by using
> > the older, simpler mechanism of finding the first incomplete request. I
> > think that search is now definite since we preallocate the request and no
> > longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> > between seqno/batch/request).
> > 
> > That should also apply here and be much simpler.
> 
> How does that solve hangs which aren't caused by requests?

Was that an intentional rhetorical question?

The code you touch here only deals with requests - finding the current
batchbuffer if any.
-Chris
Ben Widawsky Jan. 26, 2014, 9:47 p.m. UTC | #4
On Sun, Jan 26, 2014 at 07:55:59PM +0000, Chris Wilson wrote:
> On Sun, Jan 26, 2014 at 11:05:40AM -0800, Ben Widawsky wrote:
> > On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> > > On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > > > The previous check during error capture of whether or not the current VM
> > > > should be scanned used, gen < 7. That was more or less trying to
> > > > determine if there was a full PPGTT. At the time, this was sort of what
> > > > I meant to do because I was more interested in working backwards from
> > > > hardware state. However, this is incorrect because it will not include
> > > > platforms that are greater than gen7, and not having PPGTT.  Example
> > > > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > > > greater than gen7 with the PPGTT module parameter invoked.
> > > > 
> > > > I am /assuming/ BYT was broken, I have not actually checked.
> > > > 
> > > > While here, clean up the file a bit to avoid duplicate reads (now that
> > > > the PPGTT info is in the error state).
> > > > 
> > > > I think Mika/Chris may have been looking at this too.
> > > 
> > > Sure, we are looking (for identifying the guilty request/batch) by using
> > > the older, simpler mechanism of finding the first incomplete request. I
> > > think that search is now definite since we preallocate the request and no
> > > longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> > > between seqno/batch/request).
> > > 
> > > That should also apply here and be much simpler.
> > 
> > How does that solve hangs which aren't caused by requests?
> 
> Was that an intentional rhetorical question?
> 
> The code you touch here only deals with requests - finding the current
> batchbuffer if any.
> -Chris
> 

It wasn't rhetorical. I temporarily ignored that all batches are tied to
a request.

So what's the plan now? Just looking at the callers, we seem to have a
couple of callers that can't easily identify the bad request.
Chris Wilson Jan. 27, 2014, 1:45 p.m. UTC | #5
On Sun, Jan 26, 2014 at 01:47:29PM -0800, Ben Widawsky wrote:
> On Sun, Jan 26, 2014 at 07:55:59PM +0000, Chris Wilson wrote:
> > On Sun, Jan 26, 2014 at 11:05:40AM -0800, Ben Widawsky wrote:
> > > On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> > > > On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > > > > The previous check during error capture of whether or not the current VM
> > > > > should be scanned used, gen < 7. That was more or less trying to
> > > > > determine if there was a full PPGTT. At the time, this was sort of what
> > > > > I meant to do because I was more interested in working backwards from
> > > > > hardware state. However, this is incorrect because it will not include
> > > > > platforms that are greater than gen7, and not having PPGTT.  Example
> > > > > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > > > > greater than gen7 with the PPGTT module parameter invoked.
> > > > > 
> > > > > I am /assuming/ BYT was broken, I have not actually checked.
> > > > > 
> > > > > While here, clean up the file a bit to avoid duplicate reads (now that
> > > > > the PPGTT info is in the error state).
> > > > > 
> > > > > I think Mika/Chris may have been looking at this too.
> > > > 
> > > > Sure, we are looking (for identifying the guilty request/batch) by using
> > > > the older, simpler mechanism of finding the first incomplete request. I
> > > > think that search is now definite since we preallocate the request and no
> > > > longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> > > > between seqno/batch/request).
> > > > 
> > > > That should also apply here and be much simpler.
> > > 
> > > How does that solve hangs which aren't caused by requests?
> > 
> > Was that an intentional rhetorical question?
> > 
> > The code you touch here only deals with requests - finding the current
> > batchbuffer if any.
> > -Chris
> > 
> 
> It wasn't rhetorical. I temporarily ignored that all batches are tied to
> a request.
> 
> So what's the plan now? Just looking at the callers, we seem to have a
> couple of callers that can't easily identify the bad request.

I was thinking along the lines of:

@@ -737,31 +709,16 @@ i915_error_first_batchbuffer(struct drm_i915_private *dev_priv,
        }
 
        seqno = ring->get_seqno(ring, false);
-       list_for_each_entry(vm, &dev_priv->vm_list, global_link) {
-               if (!is_active_vm(vm, ring))
+       list_for_each_entry(request, &ring->request_list, list) {
+               if (i915_seqno_passed(seqno, request->seqno))
                        continue;
 
-               found_active = true;
-
-               list_for_each_entry(vma, &vm->active_list, mm_list) {
-                       obj = vma->obj;
-                       if (obj->ring != ring)
-                               continue;
-
-                       if (i915_seqno_passed(seqno, obj->last_read_seqno))
-                               continue;
-
-                       if ((obj->base.read_domains & I915_GEM_DOMAIN_COMMAND) == 0)
-                               continue;
-
-                       /* We need to copy these to an anonymous buffer as the simplest
-                        * method to avoid being overwritten by userspace.
-                        */
-                       return i915_error_object_create(dev_priv, obj, vm);
-               }
+               /* We need to copy these to an anonymous buffer as the simplest
+                * method to avoid being overwritten by userspace.
+                */
+               return i915_error_object_create(dev_priv, request->batch_obj, request->ctx->vm);
        }
 
-       WARN_ON(!found_active);
Ben Widawsky Jan. 27, 2014, 6:24 p.m. UTC | #6
On Mon, Jan 27, 2014 at 01:45:22PM +0000, Chris Wilson wrote:
> On Sun, Jan 26, 2014 at 01:47:29PM -0800, Ben Widawsky wrote:
> > On Sun, Jan 26, 2014 at 07:55:59PM +0000, Chris Wilson wrote:
> > > On Sun, Jan 26, 2014 at 11:05:40AM -0800, Ben Widawsky wrote:
> > > > On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> > > > > On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > > > > > The previous check during error capture of whether or not the current VM
> > > > > > should be scanned used, gen < 7. That was more or less trying to
> > > > > > determine if there was a full PPGTT. At the time, this was sort of what
> > > > > > I meant to do because I was more interested in working backwards from
> > > > > > hardware state. However, this is incorrect because it will not include
> > > > > > platforms that are greater than gen7, and not having PPGTT.  Example
> > > > > > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > > > > > greater than gen7 with the PPGTT module parameter invoked.
> > > > > > 
> > > > > > I am /assuming/ BYT was broken, I have not actually checked.
> > > > > > 
> > > > > > While here, clean up the file a bit to avoid duplicate reads (now that
> > > > > > the PPGTT info is in the error state).
> > > > > > 
> > > > > > I think Mika/Chris may have been looking at this too.
> > > > > 
> > > > > Sure, we are looking (for identifying the guilty request/batch) by using
> > > > > the older, simpler mechanism of finding the first incomplete request. I
> > > > > think that search is now definite since we preallocate the request and no
> > > > > longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> > > > > between seqno/batch/request).
> > > > > 
> > > > > That should also apply here and be much simpler.
> > > > 
> > > > How does that solve hangs which aren't caused by requests?
> > > 
> > > Was that an intentional rhetorical question?
> > > 
> > > The code you touch here only deals with requests - finding the current
> > > batchbuffer if any.
> > > -Chris
> > > 
> > 
> > It wasn't rhetorical. I temporarily ignored that all batches are tied to
> > a request.
> > 
> > So what's the plan now? Just looking at the callers, we seem to have a
> > couple of callers that can't easily identify the bad request.
> 
> I was thinking along the lines of:
> 
> @@ -737,31 +709,16 @@ i915_error_first_batchbuffer(struct drm_i915_private *dev_priv,
>         }
>  
>         seqno = ring->get_seqno(ring, false);
> -       list_for_each_entry(vm, &dev_priv->vm_list, global_link) {
> -               if (!is_active_vm(vm, ring))
> +       list_for_each_entry(request, &ring->request_list, list) {
> +               if (i915_seqno_passed(seqno, request->seqno))
>                         continue;
>  
> -               found_active = true;
> -
> -               list_for_each_entry(vma, &vm->active_list, mm_list) {
> -                       obj = vma->obj;
> -                       if (obj->ring != ring)
> -                               continue;
> -
> -                       if (i915_seqno_passed(seqno, obj->last_read_seqno))
> -                               continue;
> -
> -                       if ((obj->base.read_domains & I915_GEM_DOMAIN_COMMAND) == 0)
> -                               continue;
> -
> -                       /* We need to copy these to an anonymous buffer as the simplest
> -                        * method to avoid being overwritten by userspace.
> -                        */
> -                       return i915_error_object_create(dev_priv, obj, vm);
> -               }
> +               /* We need to copy these to an anonymous buffer as the simplest
> +                * method to avoid being overwritten by userspace.
> +                */
> +               return i915_error_object_create(dev_priv, request->batch_obj, request->ctx->vm);
>         }
>  
> -       WARN_ON(!found_active);
> 

So per-ring batchbuffers are okay with you (it's fine by me)?
Ben Widawsky Jan. 27, 2014, 8:31 p.m. UTC | #7
On Mon, Jan 27, 2014 at 01:45:22PM +0000, Chris Wilson wrote:
> On Sun, Jan 26, 2014 at 01:47:29PM -0800, Ben Widawsky wrote:
> > On Sun, Jan 26, 2014 at 07:55:59PM +0000, Chris Wilson wrote:
> > > On Sun, Jan 26, 2014 at 11:05:40AM -0800, Ben Widawsky wrote:
> > > > On Sun, Jan 26, 2014 at 11:47:40AM +0000, Chris Wilson wrote:
> > > > > On Fri, Jan 24, 2014 at 06:17:45PM -0800, Ben Widawsky wrote:
> > > > > > The previous check during error capture of whether or not the current VM
> > > > > > should be scanned used, gen < 7. That was more or less trying to
> > > > > > determine if there was a full PPGTT. At the time, this was sort of what
> > > > > > I meant to do because I was more interested in working backwards from
> > > > > > hardware state. However, this is incorrect because it will not include
> > > > > > platforms that are greater than gen7, and not having PPGTT.  Example
> > > > > > would be BYT which is gen7 but doesn't have PPGTT, BDW, or any platform
> > > > > > greater than gen7 with the PPGTT module parameter invoked.
> > > > > > 
> > > > > > I am /assuming/ BYT was broken, I have not actually checked.
> > > > > > 
> > > > > > While here, clean up the file a bit to avoid duplicate reads (now that
> > > > > > the PPGTT info is in the error state).
> > > > > > 
> > > > > > I think Mika/Chris may have been looking at this too.
> > > > > 
> > > > > Sure, we are looking (for identifying the guilty request/batch) by using
> > > > > the older, simpler mechanism of finding the first incomplete request. I
> > > > > think that search is now definite since we preallocate the request and no
> > > > > longer do request coalescing if ENOMEM (i.e. there is a 1:1 relationship
> > > > > between seqno/batch/request).
> > > > > 
> > > > > That should also apply here and be much simpler.
> > > > 
> > > > How does that solve hangs which aren't caused by requests?
> > > 
> > > Was that an intentional rhetorical question?
> > > 
> > > The code you touch here only deals with requests - finding the current
> > > batchbuffer if any.
> > > -Chris
> > > 
> > 
> > It wasn't rhetorical. I temporarily ignored that all batches are tied to
> > a request.
> > 
> > So what's the plan now? Just looking at the callers, we seem to have a
> > couple of callers that can't easily identify the bad request.
> 
> I was thinking along the lines of:
> 
> @@ -737,31 +709,16 @@ i915_error_first_batchbuffer(struct drm_i915_private *dev_priv,
>         }
>  
>         seqno = ring->get_seqno(ring, false);
> -       list_for_each_entry(vm, &dev_priv->vm_list, global_link) {
> -               if (!is_active_vm(vm, ring))
> +       list_for_each_entry(request, &ring->request_list, list) {
> +               if (i915_seqno_passed(seqno, request->seqno))
>                         continue;
>  
> -               found_active = true;
> -
> -               list_for_each_entry(vma, &vm->active_list, mm_list) {
> -                       obj = vma->obj;
> -                       if (obj->ring != ring)
> -                               continue;
> -
> -                       if (i915_seqno_passed(seqno, obj->last_read_seqno))
> -                               continue;
> -
> -                       if ((obj->base.read_domains & I915_GEM_DOMAIN_COMMAND) == 0)
> -                               continue;
> -
> -                       /* We need to copy these to an anonymous buffer as the simplest
> -                        * method to avoid being overwritten by userspace.
> -                        */
> -                       return i915_error_object_create(dev_priv, obj, vm);
> -               }
> +               /* We need to copy these to an anonymous buffer as the simplest
> +                * method to avoid being overwritten by userspace.
> +                */
> +               return i915_error_object_create(dev_priv, request->batch_obj, request->ctx->vm);
>         }
>  
> -       WARN_ON(!found_active);
> 

The other issue is the existing method doesn't rely as much on proper
request handling, i.e. this could be more resilient to driver bugs. I
kind of want to keep both...
Chris Wilson Jan. 27, 2014, 9:31 p.m. UTC | #8
On Mon, Jan 27, 2014 at 12:31:08PM -0800, Ben Widawsky wrote:
> The other issue is the existing method doesn't rely as much on proper
> request handling, ie. this could be more resilient to driver bugs. I
> kind of want to keep both...

Actually I think it is. Part of the process of reading an error dump is
tying together the registers with what is captured. If they are
inconsistent, we know that the driver/capture is buggy. What happens in
the real world is that the GPU executes something completely different
than the batch buffer anyway...
-Chris
Ben Widawsky Jan. 27, 2014, 9:54 p.m. UTC | #9
On Mon, Jan 27, 2014 at 09:31:04PM +0000, Chris Wilson wrote:
> On Mon, Jan 27, 2014 at 12:31:08PM -0800, Ben Widawsky wrote:
> > The other issue is the existing method doesn't rely as much on proper
> > request handling, ie. this could be more resilient to driver bugs. I
> > kind of want to keep both...
> 
> Actually I think it is. Part of the process of reading an error dump is
> tying together the registers with what is captured. If they are
> inconsistent, we know that the driver/capture is buggy. What happens in
> the real world is that the GPU executes something completely different
> than the batch buffer anyway...
> -Chris
> 
> -- 
> Chris Wilson, Intel Open Source Technology Centre

Recapping IRC conversation - Chris is sending a patch to fix this
problem with his solution.

Patch

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 76bb010..6a859f1 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -684,32 +684,32 @@  static void i915_gem_record_fences(struct drm_device *dev,
 
 /* This assumes all batchbuffers are executed from the PPGTT. It might have to
  * change in the future. */
-static bool is_active_vm(struct i915_address_space *vm,
+static bool is_active_vm(struct drm_i915_error_state *error,
+			 struct i915_address_space *vm,
 			 struct intel_ring_buffer *ring)
 {
 	struct drm_device *dev = vm->dev;
-	struct drm_i915_private *dev_priv = dev->dev_private;
 	struct i915_hw_ppgtt *ppgtt;
 
-	if (INTEL_INFO(dev)->gen < 7)
+	if (!HAS_PPGTT(dev))
 		return i915_is_ggtt(vm);
 
-	/* FIXME: This ignores that the global gtt vm is also on this list. */
+	if (i915_is_ggtt(vm) || !USES_PPGTT(dev))
+		return error->head[ring->id] &&
+			(error->acthd[ring->id] ==
+			(error->head[ring->id] & HEAD_ADDR));
+
 	ppgtt = container_of(vm, struct i915_hw_ppgtt, base);
 
-	if (INTEL_INFO(dev)->gen >= 8) {
-		u64 pdp0 = (u64)I915_READ(GEN8_RING_PDP_UDW(ring, 0)) << 32;
-		pdp0 |=  I915_READ(GEN8_RING_PDP_LDW(ring, 0));
-		return pdp0 == ppgtt->pd_dma_addr[0];
-	} else {
-		u32 pp_db;
-		pp_db = I915_READ(RING_PP_DIR_BASE(ring));
-		return (pp_db >> 10) == ppgtt->pd_offset;
-	}
+	if (INTEL_INFO(dev)->gen >= 8)
+		return error->vm_info[ring->id].pdp[0] == ppgtt->pd_dma_addr[0];
+	else
+		return error->vm_info[ring->id].pp_dir_base == ppgtt->pd_offset;
 }
 
 static struct drm_i915_error_object *
 i915_error_first_batchbuffer(struct drm_i915_private *dev_priv,
+			     struct drm_i915_error_state *error,
 			     struct intel_ring_buffer *ring)
 {
 	struct i915_address_space *vm;
@@ -735,7 +735,7 @@  i915_error_first_batchbuffer(struct drm_i915_private *dev_priv,
 
 	seqno = ring->get_seqno(ring, false);
 	list_for_each_entry(vm, &dev_priv->vm_list, global_link) {
-		if (!is_active_vm(vm, ring))
+		if (!is_active_vm(error, vm, ring))
 			continue;
 
 		found_active = true;
@@ -877,7 +877,7 @@  static void i915_gem_record_rings(struct drm_device *dev,
 		i915_record_ring_state(dev, error, ring);
 
 		error->ring[i].batchbuffer =
-			i915_error_first_batchbuffer(dev_priv, ring);
+			i915_error_first_batchbuffer(dev_priv, error, ring);
 
 		error->ring[i].ringbuffer =
 			i915_error_ggtt_object_create(dev_priv, ring->obj);
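
The cleanup in this patch replaces live register reads with values already stored in the error state. A minimal standalone sketch of that idea follows, assuming a hypothetical `ring_snapshot` struct and helper names that are not the driver's; it shows combining the upper and lower register halves once at capture time and then deciding from the snapshot alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of per-ring state, filled once at capture time. */
struct ring_snapshot {
	uint64_t pdp0;	/* page directory pointer 0, both halves combined */
};

/* Combine the upper and lower 32-bit register halves into one 64-bit value. */
static uint64_t combine_pdp(uint32_t udw, uint32_t ldw)
{
	return ((uint64_t)udw << 32) | ldw;
}

/* Decide whether an address space was active using only the snapshot. */
static bool vm_matches_snapshot(const struct ring_snapshot *snap,
				uint64_t pd_dma_addr)
{
	assert(snap != NULL);
	return snap->pdp0 == pd_dma_addr;
}
```

Comparing against the snapshot avoids reading the same registers a second time and keeps the decision consistent with what ends up in the error dump.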