
[2/2] drm/i915: Keep the per-object list of VMAs under control

Message ID 1454324408-29327-2-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Tvrtko Ursulin Feb. 1, 2016, 11 a.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Where objects are shared across contexts and heavy rendering
is in progress, the execlist retired-request queue will grow
unbounded until the GPU is idle enough for the retire worker
to run and call intel_execlists_retire_requests.

With some workloads, such as gem_close_race, that never
happens, causing the shared object's VMA list to grow to
epic proportions; this in turn causes retirement call sites
to spend linearly more and more time walking obj->vma_list.

The end result is the above-mentioned test case taking ten
minutes to complete and using more than a GiB of RAM just for
the VMA objects.

If we instead trigger the execlist housekeeping a bit more
often, obj->vma_list is kept in check by virtue of context
cleanup running and zapping the inactive VMAs.

This makes the test case an order of magnitude faster and brings
memory use back to normal.

It also makes the code more self-contained, since the
intel_execlists_retire_requests call site is now in a more
appropriate place and implementation leakage is somewhat
reduced.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Testcase: igt/gem_close_race/gem_close_race
---
 drivers/gpu/drm/i915/i915_gem.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
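The failure mode described above can be illustrated with a toy userspace model. This is not kernel code: the struct names and helpers below are invented for illustration only. Each "request" adds a VMA to an object's list and a retirement pass walks the whole list; if cleanup never runs, the list grows without bound and the total walk cost is quadratic, whereas zapping the list every few requests keeps its length, and therefore the per-pass cost, constant.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for obj->vma_list; not the real i915 structures. */
struct vma { struct vma *next; };
struct obj { struct vma *vma_list; size_t count; };

static void obj_add_vma(struct obj *o)
{
	struct vma *v = malloc(sizeof(*v));

	v->next = o->vma_list;
	o->vma_list = v;
	o->count++;
}

/* Cost (in nodes visited) of one retirement pass over the list. */
static size_t obj_walk(const struct obj *o)
{
	size_t n = 0;

	for (struct vma *v = o->vma_list; v; v = v->next)
		n++;
	return n;
}

/* Model of context cleanup zapping the inactive VMAs. */
static void obj_zap(struct obj *o)
{
	struct vma *v = o->vma_list;

	while (v) {
		struct vma *next = v->next;

		free(v);
		v = next;
	}
	o->vma_list = NULL;
	o->count = 0;
}

/*
 * Simulate r requests, running cleanup every k requests (k == 0
 * means cleanup never runs).  Returns the total walk cost across
 * all retirement passes; *max_len reports the peak list length.
 */
static size_t simulate(size_t r, size_t k, size_t *max_len)
{
	struct obj o = { NULL, 0 };
	size_t cost = 0, max = 0;

	for (size_t i = 1; i <= r; i++) {
		obj_add_vma(&o);
		cost += obj_walk(&o);
		if (o.count > max)
			max = o.count;
		if (k && i % k == 0)
			obj_zap(&o);
	}
	obj_zap(&o);
	if (max_len)
		*max_len = max;
	return cost;
}
```

With 1000 requests and no cleanup the list peaks at 1000 entries and the cumulative walk cost is 500500 visits; cleaning up every 10 requests caps the list at 10 entries and drops the total cost to 5500. The real fix trades this list growth for more frequent context unpins, which is the regression discussed below.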

Comments

Chris Wilson Feb. 1, 2016, 11:12 a.m. UTC | #1
On Mon, Feb 01, 2016 at 11:00:08AM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> Where objects are shared across contexts and heavy rendering
> is in progress, execlist retired request queue will grow
> unbound until the GPU is idle enough for the retire worker
> to run and call intel_execlists_retire_requests.
> 
> With some workloads, like for example gem_close_race, that
> never happens causing the shared object VMA list to grow to
> epic proportions, and in turn causes retirement call sites to
> spend linearly more and more time walking the obj->vma_list.
> 
> End result is the above mentioned test case taking ten minutes
> to complete and using up more than a GiB of RAM just for the VMA
> objects.
> 
> If we instead trigger the execlist house keeping a bit more
> often, obj->vma_list will be kept in check by the virtue of
> context cleanup running and zapping the inactive VMAs.
> 
> This makes the test case an order of magnitude faster and brings
> memory use back to normal.
> 
> This also makes the code more self-contained since the
> intel_execlists_retire_requests call-site is now in a more
> appropriate place and implementation leakage is somewhat
> reduced.

However, this then causes a perf regression since we unpin the contexts
too frequently and do not have any mitigation in place yet.
-Chris
Tvrtko Ursulin Feb. 1, 2016, 1:29 p.m. UTC | #2
On 01/02/16 11:12, Chris Wilson wrote:
> On Mon, Feb 01, 2016 at 11:00:08AM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> Where objects are shared across contexts and heavy rendering
>> is in progress, execlist retired request queue will grow
>> unbound until the GPU is idle enough for the retire worker
>> to run and call intel_execlists_retire_requests.
>>
>> With some workloads, like for example gem_close_race, that
>> never happens causing the shared object VMA list to grow to
>> epic proportions, and in turn causes retirement call sites to
>> spend linearly more and more time walking the obj->vma_list.
>>
>> End result is the above mentioned test case taking ten minutes
>> to complete and using up more than a GiB of RAM just for the VMA
>> objects.
>>
>> If we instead trigger the execlist house keeping a bit more
>> often, obj->vma_list will be kept in check by the virtue of
>> context cleanup running and zapping the inactive VMAs.
>>
>> This makes the test case an order of magnitude faster and brings
>> memory use back to normal.
>>
>> This also makes the code more self-contained since the
>> intel_execlists_retire_requests call-site is now in a more
>> appropriate place and implementation leakage is somewhat
>> reduced.
>
> However, this then causes a perf regression since we unpin the contexts
> too frequently and do not have any mitigation in place yet.

I suppose it is possible. What takes most time - page table clears on 
VMA unbinds? It is just that this looks so bad at the moment. :( Luckily 
it is just the IGT..

Regards,

Tvrtko
Chris Wilson Feb. 1, 2016, 1:41 p.m. UTC | #3
On Mon, Feb 01, 2016 at 01:29:16PM +0000, Tvrtko Ursulin wrote:
> 
> On 01/02/16 11:12, Chris Wilson wrote:
> >On Mon, Feb 01, 2016 at 11:00:08AM +0000, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >>Where objects are shared across contexts and heavy rendering
> >>is in progress, execlist retired request queue will grow
> >>unbound until the GPU is idle enough for the retire worker
> >>to run and call intel_execlists_retire_requests.
> >>
> >>With some workloads, like for example gem_close_race, that
> >>never happens causing the shared object VMA list to grow to
> >>epic proportions, and in turn causes retirement call sites to
> >>spend linearly more and more time walking the obj->vma_list.
> >>
> >>End result is the above mentioned test case taking ten minutes
> >>to complete and using up more than a GiB of RAM just for the VMA
> >>objects.
> >>
> >>If we instead trigger the execlist house keeping a bit more
> >>often, obj->vma_list will be kept in check by the virtue of
> >>context cleanup running and zapping the inactive VMAs.
> >>
> >>This makes the test case an order of magnitude faster and brings
> >>memory use back to normal.
> >>
> >>This also makes the code more self-contained since the
> >>intel_execlists_retire_requests call-site is now in a more
> >>appropriate place and implementation leakage is somewhat
> >>reduced.
> >
> >However, this then causes a perf regression since we unpin the contexts
> >too frequently and do not have any mitigation in place yet.
> 
> I suppose it is possible. What takes most time - page table clears
> on VMA unbinds? It is just that this looks so bad at the moment. :(
> Luckily it is just the IGT..

On Braswell in particular, where it is most noticeable, it is the ioremaps.
Note that we don't unbind the VMAs on unpin, just make them available for
reallocation. The basic mitigation strategy that's been sent in a couple
of different forms is to defer the remapping from the unpin to the
vma unbind (and along the vmap paths, from the unpin to the put_pages).
Then the context unpin becomes just a matter of dropping a few
individual pin-counts and ref-counts on the various objects used by the
context.
-Chris
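The mitigation Chris describes can also be sketched as a toy model (again, invented names, not the i915 API): in the eager scheme the expensive remap happens on every pin/unpin cycle, while in the deferred scheme unpin only drops a pin count and the mapping survives until an explicit unbind, so repeated pins of the same context pay the remap cost once.

```c
/* Toy model of a context-backing object; fields are illustrative only. */
struct ctx_obj {
	int pin_count;	/* outstanding pins */
	int mapped;	/* is the (expensive) mapping in place? */
	int remap_ops;	/* how many times we paid the remap cost */
};

static void pin(struct ctx_obj *o)
{
	if (!o->mapped) {
		o->mapped = 1;
		o->remap_ops++;	/* the costly ioremap-like step */
	}
	o->pin_count++;
}

/* Eager scheme: tear down the mapping on the last unpin. */
static void unpin_eager(struct ctx_obj *o)
{
	if (--o->pin_count == 0)
		o->mapped = 0;
}

/* Deferred scheme: unpin only drops the count ... */
static void unpin_deferred(struct ctx_obj *o)
{
	--o->pin_count;
}

/* ... and the teardown waits for the vma unbind. */
static void unbind(struct ctx_obj *o)
{
	o->mapped = 0;
}
```

Under the eager scheme, five pin/unpin cycles pay the remap cost five times; under the deferred scheme they pay it once, and the cost moves to the eventual unbind. This is why retiring (and hence unpinning) contexts more aggressively regresses performance until such a deferral is in place.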

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index faa9def96917..c558887b2084 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2955,6 +2955,9 @@  i915_gem_retire_requests_ring(struct intel_engine_cs *ring)
 		i915_gem_request_assign(&ring->trace_irq_req, NULL);
 	}
 
+	if (i915.enable_execlists)
+		intel_execlists_retire_requests(ring);
+
 	WARN_ON(i915_verify_lists(ring->dev));
 }
 
@@ -2973,8 +2976,6 @@  i915_gem_retire_requests(struct drm_device *dev)
 			spin_lock_irq(&ring->execlist_lock);
 			idle &= list_empty(&ring->execlist_queue);
 			spin_unlock_irq(&ring->execlist_lock);
-
-			intel_execlists_retire_requests(ring);
 		}
 	}