drm/i915: Reduce locking in command submission

Message ID 1421320890-29713-1-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Tvrtko Ursulin Jan. 15, 2015, 11:21 a.m. UTC
From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

This eliminates six needless spin lock/unlock pairs when writing out ELSP. Apart
from tidier code, the main benefit is a speedup of between 0.51% and 0.73% on some
OGL tests under CHV (bench_OglBatch4 and bench_OglDeferred respectively).

Kindly benchmarked by Ben Widawsky.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Dave Gordon <david.s.gordon@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Ben Widawsky <ben@bwidawsk.net>
---
 drivers/gpu/drm/i915/i915_drv.h     | 15 +++++++++++++++
 drivers/gpu/drm/i915/intel_lrc.c    | 13 ++++++-------
 drivers/gpu/drm/i915/intel_uncore.c | 14 --------------
 3 files changed, 21 insertions(+), 21 deletions(-)
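
For readers who want to see where the figure of six comes from, here is a
minimal user-space sketch, not driver code. It assumes that each I915_WRITE and
POSTING_READ takes uncore.lock internally, which is what the commit message's
count implies. Every name in it (model_lock, model_dev, model_submit_before,
model_submit_after and the helpers) is hypothetical and exists only for this
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for dev_priv->uncore.lock that counts acquisitions. */
struct model_lock {
	int held;
	unsigned int acquisitions;
};

/* Hypothetical device: ELSP is modelled as a log of the last four writes. */
struct model_dev {
	struct model_lock lock;
	uint32_t elsp_log[4];
	unsigned int nwrites;
};

static void model_lock_acquire(struct model_lock *l)
{
	assert(!l->held);	/* a spinlock must not be taken recursively */
	l->held = 1;
	l->acquisitions++;
}

static void model_lock_release(struct model_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/* Raw accessor: touches the register with no locking of its own. */
static void model_raw_write(struct model_dev *d, uint32_t val)
{
	d->elsp_log[d->nwrites++ % 4] = val;
}

/* Locked accessor: assumed to take the lock around every single access. */
static void model_locked_write(struct model_dev *d, uint32_t val)
{
	model_lock_acquire(&d->lock);
	model_raw_write(d, val);
	model_lock_release(&d->lock);
}

static void model_locked_posting_read(struct model_dev *d)
{
	model_lock_acquire(&d->lock);
	(void)d->elsp_log[0];
	model_lock_release(&d->lock);
}

/* Old flow: lock for forcewake get, five locked accesses, lock for put. */
unsigned int model_submit_before(struct model_dev *d, const uint32_t desc[4])
{
	model_lock_acquire(&d->lock);	/* forcewake get would happen here */
	model_lock_release(&d->lock);

	model_locked_write(d, desc[1]);
	model_locked_write(d, desc[0]);
	model_locked_write(d, desc[3]);
	model_locked_write(d, desc[2]);
	model_locked_posting_read(d);

	model_lock_acquire(&d->lock);	/* forcewake put would happen here */
	model_lock_release(&d->lock);

	return d->lock.acquisitions;
}

/* New flow: one critical section around raw accesses. */
unsigned int model_submit_after(struct model_dev *d, const uint32_t desc[4])
{
	model_lock_acquire(&d->lock);
	model_raw_write(d, desc[1]);
	model_raw_write(d, desc[0]);
	model_raw_write(d, desc[3]);
	model_raw_write(d, desc[2]);
	(void)d->elsp_log[0];		/* raw posting read */
	model_lock_release(&d->lock);

	return d->lock.acquisitions;
}
```

Called on a zero-initialised model_dev, model_submit_before() returns 7 and
model_submit_after() returns 1, so six lock/unlock pairs disappear while the
register write order stays the same.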

Comments

Ben Widawsky Jan. 15, 2015, 4:54 p.m. UTC | #1
On Thu, Jan 15, 2015 at 11:21:30AM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> This eliminates six needless spin lock/unlock pairs when writing out ELSP. Apart
> from tidier code main benefit is between 0.51% and 0.73% speedup on some OGL
> tests under CHV (bench_OglBatch4 bench_OglDeferred respectively).

With 95% confidence t-test on n=5

> 
> Kindly benchmarked by Ben Widawsky.

FWIW, as I mentioned on IRC, I think the reduction of the unnecessary forcewake
(someone should fix the shadow register list) is probably more beneficial than
removing the spin on an uncontested lock. I was tempted to try that myself, but
I didn't have time or much interest since your patch accomplishes the same
thing.

The sucky thing, which I actually care about since I've been doing a lot of
profiling, is the raw MMIO doesn't show up with our i915 trace functions. It's
obtainable still, but then I get a mess of other stuff I don't want.

> 
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Ben Widawsky <ben@bwidawsk.net>
[snip]
Tvrtko Ursulin Jan. 15, 2015, 5:05 p.m. UTC | #2
On 01/15/2015 04:54 PM, Ben Widawsky wrote:
> On Thu, Jan 15, 2015 at 11:21:30AM +0000, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> This eliminates six needless spin lock/unlock pairs when writing out ELSP. Apart
>> from tidier code main benefit is between 0.51% and 0.73% speedup on some OGL
>> tests under CHV (bench_OglBatch4 bench_OglDeferred respectively).
>
> With 95% confidence t-test on n=5
>
>>
>> Kindly benchmarked by Ben Widawsky.
>
> FWIW, as I mentioned on IRC, I think the reduction of the unnecessary forcewake
> (someone should fix the shadow register list) is probably more beneficial than
> removing the spin on an uncontested lock. I was tempted to try that myself, but
> I didn't have time or much interest since your patch accomplishes the same
> thing.

I missed that IRC discussion, but I don't think it was doing forcewakes 
since the outer block in execlists_elsp_write bumps the counters which 
made I915_WRITE & co skip them.

Regards,

Tvrtko
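
The point about the outer block bumping the counters can be sketched roughly as
below. This is a simplified user-space model, not the driver's accessor code.
It assumes that a register write wakes the hardware only when the forcewake
reference count is zero. All names (model_uncore, model_write_reg and so on)
are hypothetical.

```c
#include <assert.h>

/* Hypothetical model of the forcewake bookkeeping. */
struct model_uncore {
	unsigned int fw_count;	/* plays the role of fw_rendercount */
	unsigned int hw_wakes;	/* how often the hardware was asked to wake */
	unsigned int hw_sleeps;	/* how often the wake request was dropped */
};

/* Take a reference; only the 0 -> 1 transition touches the hardware. */
static void model_fw_get(struct model_uncore *u)
{
	if (u->fw_count++ == 0)
		u->hw_wakes++;
}

/* Drop a reference; only the 1 -> 0 transition touches the hardware. */
static void model_fw_put(struct model_uncore *u)
{
	assert(u->fw_count > 0);
	if (--u->fw_count == 0)
		u->hw_sleeps++;
}

/*
 * Assumed accessor behaviour: wake the hardware around the write only if
 * nobody holds a forcewake reference already.
 */
static void model_write_reg(struct model_uncore *u)
{
	int need_wake = (u->fw_count == 0);

	if (need_wake)
		u->hw_wakes++;
	/* the register write itself would go here */
	if (need_wake)
		u->hw_sleeps++;
}

/* Writes bracketed by an outer reference, as in execlists_elsp_write. */
unsigned int model_writes_with_outer_ref(struct model_uncore *u, int n)
{
	int i;

	model_fw_get(u);
	for (i = 0; i < n; i++)
		model_write_reg(u);
	model_fw_put(u);

	return u->hw_wakes;
}

/* The same writes with no outer reference held. */
unsigned int model_writes_without_outer_ref(struct model_uncore *u, int n)
{
	int i;

	for (i = 0; i < n; i++)
		model_write_reg(u);

	return u->hw_wakes;
}
```

With n = 4 and a zero-initialised model_uncore, the bracketed variant reports a
single hardware wake and the unbracketed one reports four. Under this
assumption the accessors inside the outer block skip the forcewake, as the
reply above describes.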
Ben Widawsky Jan. 15, 2015, 11:42 p.m. UTC | #3
On Thu, Jan 15, 2015 at 05:05:30PM +0000, Tvrtko Ursulin wrote:
> 
> On 01/15/2015 04:54 PM, Ben Widawsky wrote:
> >On Thu, Jan 15, 2015 at 11:21:30AM +0000, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >>This eliminates six needless spin lock/unlock pairs when writing out ELSP. Apart
> >>from tidier code main benefit is between 0.51% and 0.73% speedup on some OGL
> >>tests under CHV (bench_OglBatch4 bench_OglDeferred respectively).
> >
> >With 95% confidence t-test on n=5
> >
> >>
> >>Kindly benchmarked by Ben Widawsky.
> >
> >FWIW, as I mentioned on IRC, I think the reduction of the unnecessary forcewake
> >(someone should fix the shadow register list) is probably more beneficial than
> >removing the spin on an uncontested lock. I was tempted to try that myself, but
> >I didn't have time or much interest since your patch accomplishes the same
> >thing.
> 
> I missed that IRC discussion, but I don't think it was doing forcewakes
> since the outer block in execlists_elsp_write bumps the counters which made
> I915_WRITE & co skip them.
> 
> Regards,
> 
> Tvrtko

I didn't check the locking but it looks like it could actually get decremented
once the spinlock is released. Probably never happens, but I think it's
possible.

I completely missed that block somehow. I think my eyes skipped over it because
how could getting forcewake take like 10+ lines :D
Shuang He Jan. 16, 2015, 12:19 a.m. UTC | #4
Tested-By: PRC QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 5585
-------------------------------------Summary-------------------------------------
Platform    Delta    drm-intel-nightly    Series Applied
PNV                  353/353              353/353
ILK         -1       200/200              199/200
SNB                  400/422              400/422
IVB                  487/487              487/487
BYT                  296/296              296/296
HSW         +22-1    486/508              507/508
BDW         -1       402/402              401/402
-------------------------------------Detailed-------------------------------------
Platform  Test                                drm-intel-nightly          Series Applied
*ILK  igt_gem_concurrent_blit_gtt-bcs-overwrite-source      PASS(2, M37)      NO_RESULT(1, M37)
 HSW  igt_kms_cursor_crc_cursor-size-change      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_kms_fence_pin_leak      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_kms_flip_event_leak      NSPT(2, M40)PASS(3, M20)      PASS(1, M20)
 HSW  igt_kms_flip_flip-vs-dpms-off-vs-modeset      DMESG_WARN(2, M20M40)PASS(1, M40)      DMESG_WARN(1, M20)
 HSW  igt_kms_mmio_vs_cs_flip_setcrtc_vs_cs_flip      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_kms_mmio_vs_cs_flip_setplane_vs_cs_flip      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_lpsp_non-edp      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_cursor      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_cursor-dpms      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_dpms-mode-unset-non-lpsp      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_dpms-non-lpsp      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_drm-resources-equal      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_fences      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_fences-dpms      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_gem-execbuf      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_gem-mmap-cpu      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_gem-mmap-gtt      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_gem-pread      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_i2c      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_modeset-non-lpsp      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_modeset-non-lpsp-stress-no-wait      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_pci-d3-state      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
 HSW  igt_pm_rpm_rte      NSPT(1, M40)PASS(4, M20M40)      PASS(1, M20)
*BDW  igt_gem_concurrent_blit_gtt-rcs-early-read-interruptible      PASS(7, M30M28)      DMESG_WARN(1, M30)
Note: Pay particular attention to lines starting with '*'.
Patch

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 66f0c60..33d577a 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -3197,6 +3197,21 @@  int vlv_freq_opcode(struct drm_i915_private *dev_priv, int val);
 #define POSTING_READ(reg)	(void)I915_READ_NOTRACE(reg)
 #define POSTING_READ16(reg)	(void)I915_READ16_NOTRACE(reg)
 
+/* Raw MMIO access with no forcewake handling, use with care. */
+#define __raw_i915_read8(dev_priv__, reg__) readb((dev_priv__)->regs + (reg__))
+#define __raw_i915_write8(dev_priv__, reg__, val__) writeb(val__, (dev_priv__)->regs + (reg__))
+
+#define __raw_i915_read16(dev_priv__, reg__) readw((dev_priv__)->regs + (reg__))
+#define __raw_i915_write16(dev_priv__, reg__, val__) writew(val__, (dev_priv__)->regs + (reg__))
+
+#define __raw_i915_read32(dev_priv__, reg__) readl((dev_priv__)->regs + (reg__))
+#define __raw_i915_write32(dev_priv__, reg__, val__) writel(val__, (dev_priv__)->regs + (reg__))
+
+#define __raw_i915_read64(dev_priv__, reg__) readq((dev_priv__)->regs + (reg__))
+#define __raw_i915_write64(dev_priv__, reg__, val__) writeq(val__, (dev_priv__)->regs + (reg__))
+
+#define __raw_posting_read(dev_priv__, reg__) (void)__raw_i915_read32(dev_priv__, reg__)
+
 /* "Broadcast RGB" property */
 #define INTEL_BROADCAST_RGB_AUTO 0
 #define INTEL_BROADCAST_RGB_FULL 1
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index e405b61..e22b866 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -305,6 +305,7 @@  static void execlists_elsp_write(struct intel_engine_cs *ring,
 	 * Instead, we do the runtime_pm_get/put when creating/destroying requests.
 	 */
 	spin_lock_irqsave(&dev_priv->uncore.lock, flags);
+
 	if (IS_CHERRYVIEW(dev) || INTEL_INFO(dev)->gen >= 9) {
 		if (dev_priv->uncore.fw_rendercount++ == 0)
 			dev_priv->uncore.funcs.force_wake_get(dev_priv,
@@ -322,19 +323,17 @@  static void execlists_elsp_write(struct intel_engine_cs *ring,
 			dev_priv->uncore.funcs.force_wake_get(dev_priv,
 							      FORCEWAKE_ALL);
 	}
-	spin_unlock_irqrestore(&dev_priv->uncore.lock, flags);
 
-	I915_WRITE(RING_ELSP(ring), desc[1]);
-	I915_WRITE(RING_ELSP(ring), desc[0]);
-	I915_WRITE(RING_ELSP(ring), desc[3]);
+	__raw_i915_write32(dev_priv, RING_ELSP(ring), desc[1]);
+	__raw_i915_write32(dev_priv, RING_ELSP(ring), desc[0]);
+	__raw_i915_write32(dev_priv, RING_ELSP(ring), desc[3]);
 	/* The context is automatically loaded after the following */
-	I915_WRITE(RING_ELSP(ring), desc[2]);
+	__raw_i915_write32(dev_priv, RING_ELSP(ring), desc[2]);
 
 	/* ELSP is a wo register, so use another nearby reg for posting instead */
-	POSTING_READ(RING_EXECLIST_STATUS(ring));
+	__raw_posting_read(dev_priv, RING_EXECLIST_STATUS(ring));
 
 	/* Release Force Wakeup (see the big comment above). */
-	spin_lock_irqsave(&dev_priv->uncore.lock, flags);
 	if (IS_CHERRYVIEW(dev) || INTEL_INFO(dev)->gen >= 9) {
 		if (--dev_priv->uncore.fw_rendercount == 0)
 			dev_priv->uncore.funcs.force_wake_put(dev_priv,
diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
index e9561de..9a31932 100644
--- a/drivers/gpu/drm/i915/intel_uncore.c
+++ b/drivers/gpu/drm/i915/intel_uncore.c
@@ -26,20 +26,6 @@ 
 
 #define FORCEWAKE_ACK_TIMEOUT_MS 2
 
-#define __raw_i915_read8(dev_priv__, reg__) readb((dev_priv__)->regs + (reg__))
-#define __raw_i915_write8(dev_priv__, reg__, val__) writeb(val__, (dev_priv__)->regs + (reg__))
-
-#define __raw_i915_read16(dev_priv__, reg__) readw((dev_priv__)->regs + (reg__))
-#define __raw_i915_write16(dev_priv__, reg__, val__) writew(val__, (dev_priv__)->regs + (reg__))
-
-#define __raw_i915_read32(dev_priv__, reg__) readl((dev_priv__)->regs + (reg__))
-#define __raw_i915_write32(dev_priv__, reg__, val__) writel(val__, (dev_priv__)->regs + (reg__))
-
-#define __raw_i915_read64(dev_priv__, reg__) readq((dev_priv__)->regs + (reg__))
-#define __raw_i915_write64(dev_priv__, reg__, val__) writeq(val__, (dev_priv__)->regs + (reg__))
-
-#define __raw_posting_read(dev_priv__, reg__) (void)__raw_i915_read32(dev_priv__, reg__)
-
 static void
 assert_device_not_suspended(struct drm_i915_private *dev_priv)
 {