
drm/i915/ppgtt: Limit guilty hunt inside of relevant vm

Message ID 1389953004-8830-1-git-send-email-mika.kuoppala@intel.com (mailing list archive)
State New, archived

Commit Message

Mika Kuoppala Jan. 17, 2014, 10:03 a.m. UTC
With full ppgtt, ACTHD is only relevant inside one context
(address space). Trying to find the guilty batch by relying
on ACTHD alone produces false positives, as ACTHD can point
inside batches in different address spaces.

Filter out unrelated contexts by checking which vm the
ring was running on when the hang happened. Only after
finding the relevant vm, use ACTHD to find the guilty
batch inside it.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73652
Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
---
 drivers/gpu/drm/i915/i915_gem.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

Comments

Chris Wilson Jan. 17, 2014, 10:26 a.m. UTC | #1
On Fri, Jan 17, 2014 at 12:03:24PM +0200, Mika Kuoppala wrote:
> With full ppgtt, ACTHD is only relevant inside one context
> (address space). Trying to find the guilty batch by relying
> on ACTHD alone produces false positives, as ACTHD can point
> inside batches in different address spaces.
> 
> Filter out unrelated contexts by checking which vm the
> ring was running on when the hang happened. Only after
> finding the relevant vm, use ACTHD to find the guilty
> batch inside it.

Alternatively (or in addition to) you could walk the request
list backwards and stop searching for guilty requests after
the first hit.
-Chris
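The reverse walk Chris suggests can be sketched in isolation. This is a minimal user-space model, not the actual i915 structures: `struct request`, its `seqno`/`prev` fields, and `find_guilty` are made-up stand-ins for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-ring request list entry. */
struct request {
	unsigned int seqno;	/* seqno signalled when this request completes */
	struct request *prev;	/* older request, submitted earlier */
};

/*
 * Walk from the newest request backwards. Every request whose seqno
 * the hardware has already signalled is complete, so the oldest
 * not-yet-signalled request is the one that hung: keep it as the
 * candidate and stop at the first completed request.
 */
static struct request *
find_guilty(struct request *newest, unsigned int completed_seqno)
{
	struct request *req, *guilty = NULL;

	for (req = newest; req; req = req->prev) {
		if (req->seqno <= completed_seqno)
			break;		/* this and all older ones completed */
		guilty = req;		/* not yet signalled: candidate */
	}
	return guilty;
}
```

Note that this needs neither ACTHD nor any knowledge of address spaces, which is what makes the approach attractive.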
Mika Kuoppala Jan. 17, 2014, 2:29 p.m. UTC | #2
Chris Wilson <chris@chris-wilson.co.uk> writes:

> On Fri, Jan 17, 2014 at 12:03:24PM +0200, Mika Kuoppala wrote:
>> With full ppgtt, ACTHD is only relevant inside one context
>> (address space). Trying to find the guilty batch by relying
>> on ACTHD alone produces false positives, as ACTHD can point
>> inside batches in different address spaces.
>> 
>> Filter out unrelated contexts by checking which vm the
>> ring was running on when the hang happened. Only after
>> finding the relevant vm, use ACTHD to find the guilty
>> batch inside it.
>
> Alternatively (or in addition to) you could walk the request
> list backwards and stop searching for guilty requests after
> the first hit.

I took this idea and posted a patchset as a separate thread.

The approach you suggested feels more 'right' as it is a lot
less complex and we need neither ACTHD nor knowledge of address
spaces to find the guilty batch.

The only drawback I can think of is that if the gpu hangs just
after writing the seqno to the hardware status page, we end up
blaming the wrong request. But if this is a problem we could
double check with ACTHD that they point to the same request.

-Mika
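The double check Mika describes could be as simple as testing whether ACTHD lies inside the blamed batch's address range. A sketch under made-up names (`acthd_inside_batch` and its parameters are not the driver's actual API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the proposed cross-check: after the reverse walk picks a
 * request, confirm the hardware's active head address actually falls
 * inside that request's batch buffer before blaming it.
 */
static bool acthd_inside_batch(uint32_t acthd,
			       uint32_t batch_start, uint32_t batch_len)
{
	return acthd >= batch_start && acthd < batch_start + batch_len;
}
```

With full ppgtt this check is only meaningful once the request's vm is known to match the ring's active vm, for the reason given in the commit message.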
Chris Wilson Jan. 17, 2014, 2:37 p.m. UTC | #3
On Fri, Jan 17, 2014 at 04:29:31PM +0200, Mika Kuoppala wrote:
> Chris Wilson <chris@chris-wilson.co.uk> writes:
> 
> > On Fri, Jan 17, 2014 at 12:03:24PM +0200, Mika Kuoppala wrote:
> >> With full ppgtt, ACTHD is only relevant inside one context
> >> (address space). Trying to find the guilty batch by relying
> >> on ACTHD alone produces false positives, as ACTHD can point
> >> inside batches in different address spaces.
> >> 
> >> Filter out unrelated contexts by checking which vm the
> >> ring was running on when the hang happened. Only after
> >> finding the relevant vm, use ACTHD to find the guilty
> >> batch inside it.
> >
> > Alternatively (or in addition to) you could walk the request
> > list backwards and stop searching for guilty requests after
> > the first hit.
> 
> I took this idea and posted a patchset as a separate thread.
> 
> The approach you suggested feels more 'right' as it is a lot
> less complex and we need neither ACTHD nor knowledge of address
> spaces to find the guilty batch.
> 
> The only drawback I can think of is that if the gpu hangs just
> after writing the seqno to the hardware status page, we end up
> blaming the wrong request. But if this is a problem we could
> double check with ACTHD that they point to the same request.

In that eventuality, neither batch is guilty. Instead it is the driver
that is at fault for not working around the broken hardware.

Not sure how we would debug that other than through random trial and
error.
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 5fcdb14..a7cc060 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2282,9 +2282,34 @@  request_to_vm(struct drm_i915_gem_request *request)
 	return vm;
 }
 
+static bool
+request_vm_active(struct drm_i915_gem_request *request)
+{
+	struct intel_ring_buffer *ring = request->ring;
+	struct drm_device *dev = ring->dev;
+	struct drm_i915_private *dev_priv = dev->dev_private;
+	struct i915_hw_ppgtt *ppgtt;
+	u32 pd_off;
+
+	if (!USES_FULL_PPGTT(dev))
+		return true;
+
+	if (WARN_ON(!request->ctx))
+		return false;
+
+	ppgtt = ctx_to_ppgtt(request->ctx);
+
+	pd_off = (I915_READ(RING_PP_DIR_BASE(ring)) >> 16) * 64;
+
+	return pd_off == ppgtt->pd_offset;
+}
+
 static bool i915_request_guilty(struct drm_i915_gem_request *request,
 				const u32 acthd, bool *inside)
 {
+	if (!request_vm_active(request))
+		return false;
+
 	/* There is a possibility that unmasked head address
 	 * pointing inside the ring, matches the batch_obj address range.
 	 * However this is extremely unlikely.
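The page-directory comparison at the heart of request_vm_active() can be exercised on its own. Per the patch, RING_PP_DIR_BASE holds the page-directory offset divided by 64 in its upper 16 bits; a stand-alone sketch of that decode (not the driver code, and without the MMIO read):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mirror of the decode in request_vm_active(): the upper 16 bits of
 * the ring's PP_DIR_BASE register hold the page-directory offset in
 * units of 64 bytes, so shift them down and scale back up.
 */
static uint32_t pp_dir_base_to_offset(uint32_t reg)
{
	return (reg >> 16) * 64;
}
```

The hang is then attributed only to requests whose context's ppgtt->pd_offset matches this decoded value, i.e. the vm the ring was actually running when it hung.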