
[RFC] drm/i915: Prefault the entire object on first page fault

Message ID 1391520619-13295-1-git-send-email-chris@chris-wilson.co.uk (mailing list archive)
State New, archived

Commit Message

Chris Wilson Feb. 4, 2014, 1:30 p.m. UTC
Inserting additional PTEs has no side effects for us as the PFNs are fixed
for the entire time the object is resident in the global GTT. The
downside is that we pay the entire cost of faulting the object upon the
first hit, for which we in return receive the benefit of removing the
per-page faulting overhead.

On an Ivybridge i7-3720qm with 1600MHz DDR3, with 32 fences,
Upload rate for 2 linear surfaces:	8127MiB/s -> 8134MiB/s
Upload rate for 2 tiled surfaces:	8607MiB/s -> 8625MiB/s
Upload rate for 4 linear surfaces:	8127MiB/s -> 8127MiB/s
Upload rate for 4 tiled surfaces:	8611MiB/s -> 8602MiB/s
Upload rate for 8 linear surfaces:	8114MiB/s -> 8124MiB/s
Upload rate for 8 tiled surfaces:	8601MiB/s -> 8603MiB/s
Upload rate for 16 linear surfaces:	8110MiB/s -> 8123MiB/s
Upload rate for 16 tiled surfaces:	8595MiB/s -> 8606MiB/s
Upload rate for 32 linear surfaces:	8104MiB/s -> 8121MiB/s
Upload rate for 32 tiled surfaces:	8589MiB/s -> 8605MiB/s
Upload rate for 64 linear surfaces:	8107MiB/s -> 8121MiB/s
Upload rate for 64 tiled surfaces:	2013MiB/s -> 3017MiB/s

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Goel, Akash" <akash.goel@intel.com>
---

It survived light testing without noticeable performance degradation. Can
anyone think of how this will impact us negatively?
-Chris

---
 drivers/gpu/drm/i915/i915_gem.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)
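
In brief, the patch replaces the single vm_insert_pfn() call for the faulting
page with a loop that inserts a PTE for every page of the object on the first
fault. A condensed sketch of the resulting fault-handler logic, lifted from the
patch at the bottom of this page (surrounding error handling, pinning and
locking omitted):

	pfn = dev_priv->gtt.mappable_base + i915_gem_obj_ggtt_offset(obj);
	pfn >>= PAGE_SHIFT;

	if (!obj->fault_mappable) {
		int i;

		/* First fault: insert a PTE for every page of the object so
		 * that subsequent accesses take no further faults. */
		for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) {
			ret = vm_insert_pfn(vma,
					    (unsigned long)vma->vm_start + i * PAGE_SIZE,
					    pfn + i);
			if (ret)
				break;
		}

		obj->fault_mappable = true;
	} else {
		/* Object already marked fault_mappable: insert just the
		 * faulting page. */
		ret = vm_insert_pfn(vma,
				    (unsigned long)vmf->virtual_address,
				    pfn + page_offset);
	}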

Comments

Daniel Vetter Feb. 4, 2014, 2:12 p.m. UTC | #1
On Tue, Feb 04, 2014 at 01:30:19PM +0000, Chris Wilson wrote:
> Inserting additional PTEs has no side effects for us as the PFNs are fixed
> for the entire time the object is resident in the global GTT. The
> downside is that we pay the entire cost of faulting the object upon the
> first hit, for which we in return receive the benefit of removing the
> per-page faulting overhead.
> 
> On an Ivybridge i7-3720qm with 1600MHz DDR3, with 32 fences,
> Upload rate for 2 linear surfaces:	8127MiB/s -> 8134MiB/s
> Upload rate for 2 tiled surfaces:	8607MiB/s -> 8625MiB/s
> Upload rate for 4 linear surfaces:	8127MiB/s -> 8127MiB/s
> Upload rate for 4 tiled surfaces:	8611MiB/s -> 8602MiB/s
> Upload rate for 8 linear surfaces:	8114MiB/s -> 8124MiB/s
> Upload rate for 8 tiled surfaces:	8601MiB/s -> 8603MiB/s
> Upload rate for 16 linear surfaces:	8110MiB/s -> 8123MiB/s
> Upload rate for 16 tiled surfaces:	8595MiB/s -> 8606MiB/s
> Upload rate for 32 linear surfaces:	8104MiB/s -> 8121MiB/s
> Upload rate for 32 tiled surfaces:	8589MiB/s -> 8605MiB/s
> Upload rate for 64 linear surfaces:	8107MiB/s -> 8121MiB/s
> Upload rate for 64 tiled surfaces:	2013MiB/s -> 3017MiB/s
> 
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: "Goel, Akash" <akash.goel@intel.com>
> ---
> 
> It survived light testing without noticeable performance degradation. Can
> anyone think of how this will impact us negatively?

piglit does an awful lot of single-pixel readbacks iirc, that's about the
only thing I could think of. Maybe we should wait until we have the
vm_insert_pfn_frm_io_mapping to not adversely affect this. Or if the
overhead is negligible we could move ahead right away. Nothing else really
crosses my mind which would qualify as real-world usage.
-Daniel
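
For illustration, the access pattern in question is roughly the following
(hypothetical sketch, not actual piglit code; gtt_map and the stride are
stand-ins). A single-dword readback touches one page, but with the prefault
change the first such access pays vm_insert_pfn() for every page of the
object:

#include <stdint.h>

/* One dword read through a GTT mmap of the surface (hypothetical helper). */
static uint32_t read_pixel(const volatile uint32_t *gtt_map,
			   int x, int y, int stride_in_bytes)
{
	return gtt_map[y * (stride_in_bytes / 4) + x];
}
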
Daniel Vetter Feb. 4, 2014, 2:15 p.m. UTC | #2
On Tue, Feb 04, 2014 at 03:12:49PM +0100, Daniel Vetter wrote:
> On Tue, Feb 04, 2014 at 01:30:19PM +0000, Chris Wilson wrote:
> > Inserting additional PTEs has no side effects for us as the PFNs are fixed
> > for the entire time the object is resident in the global GTT. The
> > downside is that we pay the entire cost of faulting the object upon the
> > first hit, for which we in return receive the benefit of removing the
> > per-page faulting overhead.
> > 
> > On an Ivybridge i7-3720qm with 1600MHz DDR3, with 32 fences,
> > Upload rate for 2 linear surfaces:	8127MiB/s -> 8134MiB/s
> > Upload rate for 2 tiled surfaces:	8607MiB/s -> 8625MiB/s
> > Upload rate for 4 linear surfaces:	8127MiB/s -> 8127MiB/s
> > Upload rate for 4 tiled surfaces:	8611MiB/s -> 8602MiB/s
> > Upload rate for 8 linear surfaces:	8114MiB/s -> 8124MiB/s
> > Upload rate for 8 tiled surfaces:	8601MiB/s -> 8603MiB/s
> > Upload rate for 16 linear surfaces:	8110MiB/s -> 8123MiB/s
> > Upload rate for 16 tiled surfaces:	8595MiB/s -> 8606MiB/s
> > Upload rate for 32 linear surfaces:	8104MiB/s -> 8121MiB/s
> > Upload rate for 32 tiled surfaces:	8589MiB/s -> 8605MiB/s
> > Upload rate for 64 linear surfaces:	8107MiB/s -> 8121MiB/s
> > Upload rate for 64 tiled surfaces:	2013MiB/s -> 3017MiB/s
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: "Goel, Akash" <akash.goel@intel.com>
> > ---
> > 
> > It survived light testing without noticeable performance degradation. Can
> > anyone think of how this will impact us negatively?
> 
> piglit does an awful lot of single-pixel readbacks iirc, that's about the
> only thing I could think of. Maybe we should wait until we have the
> vm_insert_pfn_frm_io_mapping to not adversely affect this. Or if the
> overhead is negligible we could move ahead right away. Nothing else really
> crosses my mind which would qualify as real-world usage.

On that topic: What's the improvement of the optimized insert_pfn_pgprot
with the prefault patch applied when doing just single dword writes? I.e.
just to measure the insert_pfn performance so that we have some
impressive microbenchmark numbers justifying things? I'm thinking of a 2nd
mode in your test to measure pagefaults/s.
-Daniel
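
A rough sketch of what such a pagefaults/s mode could look like (hypothetical,
not the actual gem_fence_upload code; create_gtt_mapping()/destroy_gtt_mapping()
stand in for the usual buffer-object create plus GTT mmap setup, and taking a
fresh mapping on each iteration is what forces one fault per page on the
current per-page path):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define OBJECT_SIZE	(1024 * 1024)
#define PAGE_SIZE	4096

/* Hypothetical helpers: create and tear down a fresh GTT mmap of a buffer
 * object, so that every page written below takes a page fault. */
extern volatile uint32_t *create_gtt_mapping(size_t size);
extern void destroy_gtt_mapping(volatile uint32_t *ptr, size_t size);

static double elapsed(const struct timespec *a, const struct timespec *b)
{
	return (b->tv_sec - a->tv_sec) + 1e-9 * (b->tv_nsec - a->tv_nsec);
}

int main(void)
{
	const unsigned long npages = OBJECT_SIZE / PAGE_SIZE;
	unsigned long faults = 0;
	struct timespec start, end;
	int loop;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (loop = 0; loop < 100; loop++) {
		volatile uint32_t *ptr = create_gtt_mapping(OBJECT_SIZE);
		unsigned long page;

		/* One dword write per page: with the per-page fault path every
		 * write faults; with the prefault patch only the first does. */
		for (page = 0; page < npages; page++)
			ptr[page * (PAGE_SIZE / sizeof(*ptr))] = page;

		faults += npages;
		destroy_gtt_mapping(ptr, OBJECT_SIZE);
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("~%.0f pagefaults/s (counting one fault per dword write)\n",
	       faults / elapsed(&start, &end));
	return 0;
}

Comparing the reported rate with and without the prefault patch should isolate
the per-fault entry/exit overhead that the patch removes.
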
Chris Wilson Feb. 4, 2014, 2:33 p.m. UTC | #3
On Tue, Feb 04, 2014 at 03:15:26PM +0100, Daniel Vetter wrote:
> On Tue, Feb 04, 2014 at 03:12:49PM +0100, Daniel Vetter wrote:
> > On Tue, Feb 04, 2014 at 01:30:19PM +0000, Chris Wilson wrote:
> > > Inserting additional PTEs has no side effects for us as the PFNs are fixed
> > > for the entire time the object is resident in the global GTT. The
> > > downside is that we pay the entire cost of faulting the object upon the
> > > first hit, for which we in return receive the benefit of removing the
> > > per-page faulting overhead.
> > > 
> > > On an Ivybridge i7-3720qm with 1600MHz DDR3, with 32 fences,
> > > Upload rate for 2 linear surfaces:	8127MiB/s -> 8134MiB/s
> > > Upload rate for 2 tiled surfaces:	8607MiB/s -> 8625MiB/s
> > > Upload rate for 4 linear surfaces:	8127MiB/s -> 8127MiB/s
> > > Upload rate for 4 tiled surfaces:	8611MiB/s -> 8602MiB/s
> > > Upload rate for 8 linear surfaces:	8114MiB/s -> 8124MiB/s
> > > Upload rate for 8 tiled surfaces:	8601MiB/s -> 8603MiB/s
> > > Upload rate for 16 linear surfaces:	8110MiB/s -> 8123MiB/s
> > > Upload rate for 16 tiled surfaces:	8595MiB/s -> 8606MiB/s
> > > Upload rate for 32 linear surfaces:	8104MiB/s -> 8121MiB/s
> > > Upload rate for 32 tiled surfaces:	8589MiB/s -> 8605MiB/s
> > > Upload rate for 64 linear surfaces:	8107MiB/s -> 8121MiB/s
> > > Upload rate for 64 tiled surfaces:	2013MiB/s -> 3017MiB/s
> > > 
> > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: "Goel, Akash" <akash.goel@intel.com>
> > > ---
> > > 
> > > It survived light testing without noticeable performance degradation. Can
> > > anyone think of how this will impact us negatively?
> > 
> > piglit does an awful lot of single-pixel readbacks iirc, that's about the
> > only thing I could think of. Maybe we should wait until we have the
> > vm_insert_pfn_frm_io_mapping to not adversely affect this. Or if the
> > overhead is negligible we could move ahead right away. Nothing else really
> > crosses my mind which would qualify as real-world usage.
> 
> On that topic: What's the improvement of the optimized insert_pfn_pgprot
> with the prefault patch applied when doing just single dword writes? I.e.
> just to measure the insert_pfn performance so that we have some
> impressive microbenchmark numbers justifying things? I'm thinking of a 2nd
> mode in your test to measure pagefaults/s.

Not pagefault/s yet, but varying object/write sizes is interesting.

IGT-Version: 1.5-g906b862 (x86_64) (Linux: 3.13.0+ x86_64)
4/4096: Upload rate for 2 linear surfaces:	651.042MiB/s
4/4096: Upload rate for 2 tiled surfaces:	1302.083MiB/s
4/4096: Upload rate for 4 linear surfaces:	1116.071MiB/s
4/4096: Upload rate for 4 tiled surfaces:	1736.111MiB/s
4/4096: Upload rate for 8 linear surfaces:	892.857MiB/s
4/4096: Upload rate for 8 tiled surfaces:	1420.455MiB/s
4/4096: Upload rate for 16 linear surfaces:	 57.710MiB/s
4/4096: Upload rate for 16 tiled surfaces:	 58.685MiB/s
4/4096: Upload rate for 32 linear surfaces:	 59.018MiB/s
4/4096: Upload rate for 32 tiled surfaces:	 59.780MiB/s
4/4096: Upload rate for 64 linear surfaces:	 59.060MiB/s
4/4096: Upload rate for 64 tiled surfaces:	  2.021MiB/s
Test assertion failure function performance, file
gem_fence_upload.c:108:
Last errno: 0, Success
Failed assertion: linear[1] > 0.75 * linear[0]
Subtest 4KiB (single dword): FAIL
4096/4096: Upload rate for 2 linear surfaces:	9259.259MiB/s
4096/4096: Upload rate for 2 tiled surfaces:	9153.318MiB/s
4096/4096: Upload rate for 4 linear surfaces:	9237.875MiB/s
4096/4096: Upload rate for 4 tiled surfaces:	9190.121MiB/s
4096/4096: Upload rate for 8 linear surfaces:	9235.209MiB/s
4096/4096: Upload rate for 8 tiled surfaces:	9280.742MiB/s
4096/4096: Upload rate for 16 linear surfaces:	9300.974MiB/s
4096/4096: Upload rate for 16 tiled surfaces:	9284.782MiB/s
4096/4096: Upload rate for 32 linear surfaces:	9311.122MiB/s
4096/4096: Upload rate for 32 tiled surfaces:	9311.122MiB/s
4096/4096: Upload rate for 64 linear surfaces:	9291.184MiB/s
4096/4096: Upload rate for 64 tiled surfaces:	1685.708MiB/s
Test assertion failure function performance, file
gem_fence_upload.c:109:
Last errno: 0, Success
Failed assertion: tiled[1] > 0.75 * tiled[0]
Subtest 4KiB: FAIL
4/1048576: Upload rate for 2 linear surfaces:	 21.945MiB/s
4/1048576: Upload rate for 2 tiled surfaces:	411.184MiB/s
4/1048576: Upload rate for 4 linear surfaces:	 24.529MiB/s
4/1048576: Upload rate for 4 tiled surfaces:	434.028MiB/s
4/1048576: Upload rate for 8 linear surfaces:	 21.448MiB/s
4/1048576: Upload rate for 8 tiled surfaces:	195.313MiB/s
4/1048576: Upload rate for 16 linear surfaces:	 16.644MiB/s
4/1048576: Upload rate for 16 tiled surfaces:	 53.373MiB/s
4/1048576: Upload rate for 32 linear surfaces:	 16.563MiB/s
4/1048576: Upload rate for 32 tiled surfaces:	 55.285MiB/s
4/1048576: Upload rate for 64 linear surfaces:	 15.486MiB/s
4/1048576: Upload rate for 64 tiled surfaces:	  0.107MiB/s
Test assertion failure function performance, file
gem_fence_upload.c:108:
Last errno: 0, Success
Failed assertion: linear[1] > 0.75 * linear[0]
Subtest 1MiB (single dword): FAIL
1048576/1048576: Upload rate for 2 linear surfaces:	8136.153MiB/s
1048576/1048576: Upload rate for 2 tiled surfaces:	8633.445MiB/s
1048576/1048576: Upload rate for 4 linear surfaces:	8128.936MiB/s
1048576/1048576: Upload rate for 4 tiled surfaces:	8614.996MiB/s
1048576/1048576: Upload rate for 8 linear surfaces:	8126.130MiB/s
1048576/1048576: Upload rate for 8 tiled surfaces:	8615.187MiB/s
1048576/1048576: Upload rate for 16 linear surfaces:	8127.811MiB/s
1048576/1048576: Upload rate for 16 tiled surfaces:	8617.108MiB/s
1048576/1048576: Upload rate for 32 linear surfaces:	8125.888MiB/s
1048576/1048576: Upload rate for 32 tiled surfaces:	8612.528MiB/s
1048576/1048576: Upload rate for 64 linear surfaces:	8128.412MiB/s
1048576/1048576: Upload rate for 64 tiled surfaces:	4522.448MiB/s
Test assertion failure function performance, file
gem_fence_upload.c:109:
Last errno: 0, Success
Failed assertion: tiled[1] > 0.75 * tiled[0]
Subtest 1MiB: FAIL

There's still the obvious cliff >32 fences, but also the interesting
transition at 8 objects, and the odd effect of tiled vs linear.
-Chris

Patch

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 14722f48fcd8..e0b3cda19363 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1514,14 +1514,26 @@  int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
 	if (ret)
 		goto unpin;
 
-	obj->fault_mappable = true;
-
+	/* Finally, remap it using the new GTT offset */
 	pfn = dev_priv->gtt.mappable_base + i915_gem_obj_ggtt_offset(obj);
 	pfn >>= PAGE_SHIFT;
-	pfn += page_offset;
 
-	/* Finally, remap it using the new GTT offset */
-	ret = vm_insert_pfn(vma, (unsigned long)vmf->virtual_address, pfn);
+	if (!obj->fault_mappable) {
+		int i;
+
+		for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) {
+			ret = vm_insert_pfn(vma,
+					    (unsigned long)vma->vm_start + i * PAGE_SIZE,
+					    pfn + i);
+			if (ret)
+				break;
+		}
+
+		obj->fault_mappable = true;
+	} else
+		ret = vm_insert_pfn(vma,
+				    (unsigned long)vmf->virtual_address,
+				    pfn + page_offset);
 unpin:
 	i915_gem_object_ggtt_unpin(obj);
 unlock: