Message ID | 1421320890-29713-1-git-send-email-tvrtko.ursulin@linux.intel.com |
---|---|
State | New, archived |
On Thu, Jan 15, 2015 at 11:21:30AM +0000, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>
> This eliminates six needless spin lock/unlock pairs when writing out ELSP. Apart
> from tidier code, the main benefit is a speedup of between 0.51% and 0.73% on some
> OGL tests under CHV (bench_OglBatch4 and bench_OglDeferred respectively).

With 95% confidence, t-test on n=5.

>
> Kindly benchmarked by Ben Widawsky.

FWIW, as I mentioned on IRC, I think reducing the unnecessary forcewake (someone should fix the shadow register list) is probably more beneficial than removing the spin on an uncontested lock. I was tempted to try that myself, but I didn't have the time or much interest, since your patch accomplishes the same thing.

The sucky thing, which I actually care about since I've been doing a lot of profiling, is that raw MMIO accesses don't show up in our i915 trace functions. The information is still obtainable, but then I get a mess of other stuff I don't want.

>
> Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Dave Gordon <david.s.gordon@intel.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Ben Widawsky <ben@bwidawsk.net>

[snip]
On 01/15/2015 04:54 PM, Ben Widawsky wrote:
> On Thu, Jan 15, 2015 at 11:21:30AM +0000, Tvrtko Ursulin wrote:
> [snip]
>
> FWIW, as I mentioned on IRC, I think the reduction of the unnecessary forcewake
> (someone should fix the shadow register list) is probably more beneficial than
> removing the spin on an uncontested lock. I was tempted to try that myself, but
> I didn't have time or much interest since your patch accomplishes the same
> thing.

I missed that IRC discussion, but I don't think it was doing forcewakes, since the outer block in execlists_elsp_write bumps the counters, which makes I915_WRITE & co skip them.

Regards,

Tvrtko
On Thu, Jan 15, 2015 at 05:05:30PM +0000, Tvrtko Ursulin wrote:
> On 01/15/2015 04:54 PM, Ben Widawsky wrote:
> [snip]
>
> I missed that IRC discussion, but I don't think it was doing forcewakes
> since the outer block in execlists_elsp_write bumps the counters which made
> I915_WRITE & co skip them.
>
> Regards,
>
> Tvrtko

I didn't check the locking, but it looks like the counter could actually get decremented once the spinlock is released. It probably never happens, but I think it's possible.

I completely missed that block somehow. I think my eyes skipped over it because how could getting forcewake take like 10+ lines :D
Tested-By: PRC QA PRTS (Patch Regression Test System; contact: shuang.he@intel.com)
Task id: 5585

-------------------------------------Summary-------------------------------------

Platform | Delta | drm-intel-nightly | Series Applied |
---|---|---|---|
PNV | | 353/353 | 353/353 |
ILK | -1 | 200/200 | 199/200 |
SNB | | 400/422 | 400/422 |
IVB | | 487/487 | 487/487 |
BYT | | 296/296 | 296/296 |
HSW | +22-1 | 486/508 | 507/508 |
BDW | -1 | 402/402 | 401/402 |
-------------------------------------Detailed-------------------------------------

Platform | Test | drm-intel-nightly | Series Applied |
---|---|---|---|
*ILK | igt_gem_concurrent_blit_gtt-bcs-overwrite-source | PASS(2, M37) | NO_RESULT(1, M37) |
HSW | igt_kms_cursor_crc_cursor-size-change | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_kms_fence_pin_leak | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_kms_flip_event_leak | NSPT(2, M40) PASS(3, M20) | PASS(1, M20) |
HSW | igt_kms_flip_flip-vs-dpms-off-vs-modeset | DMESG_WARN(2, M20M40) PASS(1, M40) | DMESG_WARN(1, M20) |
HSW | igt_kms_mmio_vs_cs_flip_setcrtc_vs_cs_flip | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_kms_mmio_vs_cs_flip_setplane_vs_cs_flip | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_lpsp_non-edp | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_cursor | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_cursor-dpms | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_dpms-mode-unset-non-lpsp | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_dpms-non-lpsp | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_drm-resources-equal | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_fences | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_fences-dpms | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_gem-execbuf | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_gem-mmap-cpu | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_gem-mmap-gtt | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_gem-pread | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_i2c | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_modeset-non-lpsp | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_modeset-non-lpsp-stress-no-wait | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_pci-d3-state | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
HSW | igt_pm_rpm_rte | NSPT(1, M40) PASS(4, M20M40) | PASS(1, M20) |
*BDW | igt_gem_concurrent_blit_gtt-rcs-early-read-interruptible | PASS(7, M30M28) | DMESG_WARN(1, M30) |

Note: Pay particular attention to lines starting with '*'.
    diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
    index 66f0c60..33d577a 100644
    --- a/drivers/gpu/drm/i915/i915_drv.h
    +++ b/drivers/gpu/drm/i915/i915_drv.h
    @@ -3197,6 +3197,21 @@ int vlv_freq_opcode(struct drm_i915_private *dev_priv, int val);
     #define POSTING_READ(reg)	(void)I915_READ_NOTRACE(reg)
     #define POSTING_READ16(reg)	(void)I915_READ16_NOTRACE(reg)
     
    +/* Raw MMIO access with no forcewake handling, use with care. */
    +#define __raw_i915_read8(dev_priv__, reg__) readb((dev_priv__)->regs + (reg__))
    +#define __raw_i915_write8(dev_priv__, reg__, val__) writeb(val__, (dev_priv__)->regs + (reg__))
    +
    +#define __raw_i915_read16(dev_priv__, reg__) readw((dev_priv__)->regs + (reg__))
    +#define __raw_i915_write16(dev_priv__, reg__, val__) writew(val__, (dev_priv__)->regs + (reg__))
    +
    +#define __raw_i915_read32(dev_priv__, reg__) readl((dev_priv__)->regs + (reg__))
    +#define __raw_i915_write32(dev_priv__, reg__, val__) writel(val__, (dev_priv__)->regs + (reg__))
    +
    +#define __raw_i915_read64(dev_priv__, reg__) readq((dev_priv__)->regs + (reg__))
    +#define __raw_i915_write64(dev_priv__, reg__, val__) writeq(val__, (dev_priv__)->regs + (reg__))
    +
    +#define __raw_posting_read(dev_priv__, reg__) (void)__raw_i915_read32(dev_priv__, reg__)
    +
     /* "Broadcast RGB" property */
     #define INTEL_BROADCAST_RGB_AUTO 0
     #define INTEL_BROADCAST_RGB_FULL 1
    diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
    index e405b61..e22b866 100644
    --- a/drivers/gpu/drm/i915/intel_lrc.c
    +++ b/drivers/gpu/drm/i915/intel_lrc.c
    @@ -305,6 +305,7 @@ static void execlists_elsp_write(struct intel_engine_cs *ring,
     	 * Instead, we do the runtime_pm_get/put when creating/destroying requests.
     	 */
     	spin_lock_irqsave(&dev_priv->uncore.lock, flags);
    +
     	if (IS_CHERRYVIEW(dev) || INTEL_INFO(dev)->gen >= 9) {
     		if (dev_priv->uncore.fw_rendercount++ == 0)
     			dev_priv->uncore.funcs.force_wake_get(dev_priv,
    @@ -322,19 +323,17 @@ static void execlists_elsp_write(struct intel_engine_cs *ring,
     			dev_priv->uncore.funcs.force_wake_get(dev_priv,
     							      FORCEWAKE_ALL);
     	}
    -	spin_unlock_irqrestore(&dev_priv->uncore.lock, flags);
     
    -	I915_WRITE(RING_ELSP(ring), desc[1]);
    -	I915_WRITE(RING_ELSP(ring), desc[0]);
    -	I915_WRITE(RING_ELSP(ring), desc[3]);
    +	__raw_i915_write32(dev_priv, RING_ELSP(ring), desc[1]);
    +	__raw_i915_write32(dev_priv, RING_ELSP(ring), desc[0]);
    +	__raw_i915_write32(dev_priv, RING_ELSP(ring), desc[3]);
     	/* The context is automatically loaded after the following */
    -	I915_WRITE(RING_ELSP(ring), desc[2]);
    +	__raw_i915_write32(dev_priv, RING_ELSP(ring), desc[2]);
     
     	/* ELSP is a wo register, so use another nearby reg for posting instead */
    -	POSTING_READ(RING_EXECLIST_STATUS(ring));
    +	__raw_posting_read(dev_priv, RING_EXECLIST_STATUS(ring));
     
     	/* Release Force Wakeup (see the big comment above). */
    -	spin_lock_irqsave(&dev_priv->uncore.lock, flags);
     	if (IS_CHERRYVIEW(dev) || INTEL_INFO(dev)->gen >= 9) {
     		if (--dev_priv->uncore.fw_rendercount == 0)
     			dev_priv->uncore.funcs.force_wake_put(dev_priv,
    diff --git a/drivers/gpu/drm/i915/intel_uncore.c b/drivers/gpu/drm/i915/intel_uncore.c
    index e9561de..9a31932 100644
    --- a/drivers/gpu/drm/i915/intel_uncore.c
    +++ b/drivers/gpu/drm/i915/intel_uncore.c
    @@ -26,20 +26,6 @@
     
     #define FORCEWAKE_ACK_TIMEOUT_MS 2
     
    -#define __raw_i915_read8(dev_priv__, reg__) readb((dev_priv__)->regs + (reg__))
    -#define __raw_i915_write8(dev_priv__, reg__, val__) writeb(val__, (dev_priv__)->regs + (reg__))
    -
    -#define __raw_i915_read16(dev_priv__, reg__) readw((dev_priv__)->regs + (reg__))
    -#define __raw_i915_write16(dev_priv__, reg__, val__) writew(val__, (dev_priv__)->regs + (reg__))
    -
    -#define __raw_i915_read32(dev_priv__, reg__) readl((dev_priv__)->regs + (reg__))
    -#define __raw_i915_write32(dev_priv__, reg__, val__) writel(val__, (dev_priv__)->regs + (reg__))
    -
    -#define __raw_i915_read64(dev_priv__, reg__) readq((dev_priv__)->regs + (reg__))
    -#define __raw_i915_write64(dev_priv__, reg__, val__) writeq(val__, (dev_priv__)->regs + (reg__))
    -
    -#define __raw_posting_read(dev_priv__, reg__) (void)__raw_i915_read32(dev_priv__, reg__)
    -
     static void assert_device_not_suspended(struct drm_i915_private *dev_priv)
     {