[1/3] drm/i915: Use readl/writel for ring buffer access

Message ID	1460631571-29230-1-git-send-email-tvrtko.ursulin@linux.intel.com (mailing list archive)
State	New, archived
Headers
Return-Path: <intel-gfx-bounces@lists.freedesktop.org>
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Intel-gfx@lists.freedesktop.org
Date: Thu, 14 Apr 2016 11:59:29 +0100
Message-Id: <1460631571-29230-1-git-send-email-tvrtko.ursulin@linux.intel.com>
Subject: [Intel-gfx] [PATCH 1/3] drm/i915: Use readl/writel for ring buffer access
Precedence: list
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: intel-gfx-bounces@lists.freedesktop.org
Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Tvrtko Ursulin April 14, 2016, 10:59 a.m. UTC

From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

We know ringbuffers are memory and not ports so if we use readl
and writel instead of ioread32 and iowrite32 (which dispatch to
the very same functions after checking the address range) we
avoid generating function calls and branching on every access.

This generates smaller code and potentially also improves
performance. Brief testing with gem_latency (ten runs of both
-n 0 and -n 100) shows potential 3% better throughput and 1%
better latency, although more runs would be required to be
certain.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_irq.c         | 8 ++++----
 drivers/gpu/drm/i915/intel_lrc.c        | 2 +-
 drivers/gpu/drm/i915/intel_lrc.h        | 2 +-
 drivers/gpu/drm/i915/intel_ringbuffer.c | 2 +-
 drivers/gpu/drm/i915/intel_ringbuffer.h | 2 +-
 5 files changed, 8 insertions(+), 8 deletions(-)
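
The dispatch cost described in the commit message can be sketched in
userspace C. This is only a model under stated assumptions, not the
kernel's implementation: model_readl, model_writel, model_ioread32,
MODEL_PIO_LIMIT and model_port_space are hypothetical names invented for
illustration, and the range check merely stands in for whatever test the
real ioread32 performs before dispatching.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cutoff: cookies below this value stand for I/O ports.
 * Assumes ordinary memory in this process lives above it. */
#define MODEL_PIO_LIMIT ((uintptr_t)0x40000)

/* Stand-in for port I/O space; real port access needs in/out instructions. */
static uint32_t model_port_space[4];

/* readl-style accessor: a single volatile 32-bit load, easily inlined. */
static inline uint32_t model_readl(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

/* writel-style accessor: a single volatile 32-bit store. */
static inline void model_writel(uint32_t val, volatile void *addr)
{
	*(volatile uint32_t *)addr = val;
}

/* ioread32-style accessor: an out-of-line function that first checks
 * the address range and only then dispatches to the right access. */
uint32_t model_ioread32(const volatile void *addr)
{
	uintptr_t cookie = (uintptr_t)addr;

	if (cookie < MODEL_PIO_LIMIT)
		return model_port_space[cookie % 4];	/* "port" path */
	return model_readl(addr);			/* memory path */
}
```

When the caller already knows the target is memory, the branch and the
function call in model_ioread32 are pure overhead, which is the saving
the patch is after.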

Chris Wilson April 14, 2016, 11:16 a.m. UTC | #1

On Thu, Apr 14, 2016 at 11:59:29AM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> 
> We know ringbuffers are memory and not ports so if we use readl
> and writel instead of ioread32 and iowrite32 (which dispatch to
> the very same functions after checking the address range) we
> avoid generating functions calls and branching on every access.

We don't need to use readl/writel at all, since they are normal memory
on llc, and on x86 we can pretend that iomaps (!llc/stolen) are as well.

This patch is in the queue along with killing the incorrect sparse iomem
annotation.
-Chris

Tvrtko Ursulin April 14, 2016, 11:24 a.m. UTC | #2

On 14/04/16 12:16, Chris Wilson wrote:
> On Thu, Apr 14, 2016 at 11:59:29AM +0100, Tvrtko Ursulin wrote:
>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>
>> We know ringbuffers are memory and not ports so if we use readl
>> and writel instead of ioread32 and iowrite32 (which dispatch to
>> the very same functions after checking the address range) we
>> avoid generating functions calls and branching on every access.
>
> We don't need to use readl/write at all, since they are normal memory
> on llc, and on x86 we can pretend that iomaps (!llc/stolen) are as well.

It is fine to use readl/writel since it translates to a single mov 
instruction anyway on x86.

> This patch is in the queue along with killing the incorrect spare iomem
> annotation.

Ok did not spot them. Don't mind either way, thought this is quick, easy 
and obvious improvement when I spotted the ugly code generated for ring 
buffer writing.

Mind you it is still not completely pretty with this patch since it is 
full of reloads and adds for ringbuf->virtual_start and tail which I 
can't figure how to help GCC optimize. Unless we make begin, emit and 
advance functions return the current tail pointer and also accept it. In 
that case it all shrinks by half.

Regards,

Tvrtko

Chris Wilson April 14, 2016, 11:30 a.m. UTC | #3

On Thu, Apr 14, 2016 at 12:24:20PM +0100, Tvrtko Ursulin wrote:
> 
> On 14/04/16 12:16, Chris Wilson wrote:
> >On Thu, Apr 14, 2016 at 11:59:29AM +0100, Tvrtko Ursulin wrote:
> >>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>
> >>We know ringbuffers are memory and not ports so if we use readl
> >>and writel instead of ioread32 and iowrite32 (which dispatch to
> >>the very same functions after checking the address range) we
> >>avoid generating functions calls and branching on every access.
> >
> >We don't need to use readl/write at all, since they are normal memory
> >on llc, and on x86 we can pretend that iomaps (!llc/stolen) are as well.
> 
> It is fine to use readl/writel since it translates to a single mov
> instruction anyway on x86.
> 
> >This patch is in the queue along with killing the incorrect spare iomem
> >annotation.
> 
> Ok did not spot them. Don't mind either way, thought this is quick,
> easy and obvious improvement when I spotted the ugly code generated
> for ring buffer writing.
> 
> Mind you it is still not completely pretty with this patch since it
> is full of reloads and adds for ringbuf->virtual_start and tail
> which I can't figure how to help GCC optimize. Unless we make being,
> emit and advance functions return the current tail pointer and also
> accept it. In that case it all shrinks by half.

We figured out how to help gcc with that in userspace using:

out = ring_begin(num_dwords);
out[0] = cmd;
out[N] = dwN;

GCC will then do

mov $imm0, 0x0(%eax)
mov $imm1, 0x4(%eax)
mov %edx, 0x8(%eax)
etc

Forgive the clumsy rebasing:

https://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=tasklet&id=a5c7b28441af0cb0e640f4ba86facba69e8f6c37
 drivers/gpu/drm/i915/i915_gem.c            |   4 -
 drivers/gpu/drm/i915/i915_gem_context.c    |  76 +--
 drivers/gpu/drm/i915/i915_gem_execbuffer.c |  37 +-
 drivers/gpu/drm/i915/i915_gem_gtt.c        |  61 +-
 drivers/gpu/drm/i915/i915_gem_request.c    | 121 +++-
 drivers/gpu/drm/i915/i915_gem_request.h    |   3 +
 drivers/gpu/drm/i915/i915_guc_submission.c |   2 +-
 drivers/gpu/drm/i915/intel_display.c       | 134 ++---
 drivers/gpu/drm/i915/intel_lrc.c           | 239 ++++----
 drivers/gpu/drm/i915/intel_mocs.c          |  50 +-
 drivers/gpu/drm/i915/intel_overlay.c       |  77 +--
 drivers/gpu/drm/i915/intel_ringbuffer.c    | 922 ++++++++++++-----------------
 drivers/gpu/drm/i915/intel_ringbuffer.h    |  28 +-
 13 files changed, 793 insertions(+), 961 deletions(-)
-Chris
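
The indexed-store pattern above can be sketched as a small userspace
model. Everything here is an assumption made for illustration
(struct idx_ring, idx_ring_begin, idx_emit_store, IDX_RING_DWORDS); it is
not the code from the linked commit, and it ignores the head pointer,
NOOP padding and waiting for space that a real ring needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IDX_RING_DWORDS 16u

/* Hypothetical ring: a dword array plus the index of the next free slot. */
struct idx_ring {
	uint32_t buf[IDX_RING_DWORDS];
	uint32_t tail;
};

/* Reserve n contiguous dwords and return a pointer to the first one.
 * If the request does not fit before the end, restart at slot 0
 * (a real driver would pad the remainder and check for free space). */
static inline uint32_t *idx_ring_begin(struct idx_ring *ring, uint32_t n)
{
	uint32_t *out;

	if (n > IDX_RING_DWORDS)
		return NULL;
	if (ring->tail + n > IDX_RING_DWORDS)
		ring->tail = 0;

	out = &ring->buf[ring->tail];
	ring->tail = (ring->tail + n) % IDX_RING_DWORDS;
	return out;
}

/* Emit a three-dword command through constant offsets from one base
 * pointer, so the compiler has no need to reload the ring state
 * between stores. */
static inline int idx_emit_store(struct idx_ring *ring, uint32_t cmd,
				 uint32_t addr, uint32_t value)
{
	uint32_t *out = idx_ring_begin(ring, 3);

	if (!out)
		return -1;
	out[0] = cmd;
	out[1] = addr;
	out[2] = value;
	return 0;
}
```

Because the tail is updated once in idx_ring_begin and the stores go
through a local pointer, each dword can become a single store at a fixed
offset, matching the mov sequence shown in the message.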

Tvrtko Ursulin April 14, 2016, 11:58 a.m. UTC | #4

On 14/04/16 12:30, Chris Wilson wrote:
> On Thu, Apr 14, 2016 at 12:24:20PM +0100, Tvrtko Ursulin wrote:
>>
>> On 14/04/16 12:16, Chris Wilson wrote:
>>> On Thu, Apr 14, 2016 at 11:59:29AM +0100, Tvrtko Ursulin wrote:
>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>
>>>> We know ringbuffers are memory and not ports so if we use readl
>>>> and writel instead of ioread32 and iowrite32 (which dispatch to
>>>> the very same functions after checking the address range) we
>>>> avoid generating functions calls and branching on every access.
>>>
>>> We don't need to use readl/write at all, since they are normal memory
>>> on llc, and on x86 we can pretend that iomaps (!llc/stolen) are as well.
>>
>> It is fine to use readl/writel since it translates to a single mov
>> instruction anyway on x86.
>>
>>> This patch is in the queue along with killing the incorrect spare iomem
>>> annotation.
>>
>> Ok did not spot them. Don't mind either way, thought this is quick,
>> easy and obvious improvement when I spotted the ugly code generated
>> for ring buffer writing.
>>
>> Mind you it is still not completely pretty with this patch since it
>> is full of reloads and adds for ringbuf->virtual_start and tail
>> which I can't figure how to help GCC optimize. Unless we make being,
>> emit and advance functions return the current tail pointer and also
>> accept it. In that case it all shrinks by half.
>
> We figured out how to help gcc with that in userspace using:
>
> out = ring_begin(num_dwords);
> out[0] = cmd;
> out[N] = dwN
>
> GCC will then do
>
> mov $imm0, 0x0($eax)
> mov $imm1, 0x4($eax)
> mov $edx, 0x8($eax)
> etc

Would be nice, hope it happens soon. :)

Regards,

Tvrtko

Dave Gordon April 14, 2016, 3:07 p.m. UTC | #5

On 14/04/16 12:58, Tvrtko Ursulin wrote:
>
> On 14/04/16 12:30, Chris Wilson wrote:
>> On Thu, Apr 14, 2016 at 12:24:20PM +0100, Tvrtko Ursulin wrote:
>>>
>>> On 14/04/16 12:16, Chris Wilson wrote:
>>>> On Thu, Apr 14, 2016 at 11:59:29AM +0100, Tvrtko Ursulin wrote:
>>>>> From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>>>
>>>>> We know ringbuffers are memory and not ports so if we use readl
>>>>> and writel instead of ioread32 and iowrite32 (which dispatch to
>>>>> the very same functions after checking the address range) we
>>>>> avoid generating functions calls and branching on every access.
>>>>
>>>> We don't need to use readl/write at all, since they are normal memory
>>>> on llc, and on x86 we can pretend that iomaps (!llc/stolen) are as
>>>> well.
>>>
>>> It is fine to use readl/writel since it translates to a single mov
>>> instruction anyway on x86.
>>>
>>>> This patch is in the queue along with killing the incorrect spare iomem
>>>> annotation.
>>>
>>> Ok did not spot them. Don't mind either way, thought this is quick,
>>> easy and obvious improvement when I spotted the ugly code generated
>>> for ring buffer writing.
>>>
>>> Mind you it is still not completely pretty with this patch since it
>>> is full of reloads and adds for ringbuf->virtual_start and tail
>>> which I can't figure how to help GCC optimize. Unless we make being,
>>> emit and advance functions return the current tail pointer and also
>>> accept it. In that case it all shrinks by half.
>>
>> We figured out how to help gcc with that in userspace using:
>>
>> out = ring_begin(num_dwords);
>> out[0] = cmd;
>> out[N] = dwN
>>
>> GCC will then do
>>
>> mov $imm0, 0x0($eax)
>> mov $imm1, 0x4($eax)
>> mov $edx, 0x8($eax)
>> etc
>
> Would be nice, hope it happens soon. :)
>
> Regards,
> Tvrtko

Another couple of alternative styles:

	DWORD* ptr = ring_begin(ring, nwords);
	*ptr++ = MI_WHATEVER;
	*ptr++ = param1;
	...
	ring_advance(ring, ptr);
	// this call checks that 'ptr' has not gone
	// beyond the nwords reserved above

Or collapse it all into one call:

	DWORD insns[OP_NWORDS] = {
		MI_WHATEVER,
		param1,
		...
	};
	ring_append(ring, nwords, insns);

which combines the check-and-wrap with a block copy to add all the 
instructions in one go :)

.Dave.
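
A minimal userspace sketch of the one-call variant, assuming a ring that
is just a dword array with a tail index. The names (struct append_ring,
append_ring_append, APPEND_RING_DWORDS) are hypothetical, zero is used as
a stand-in for a NOOP, and free-space tracking against the hardware head
is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define APPEND_RING_DWORDS 8u

/* Hypothetical ring: dword storage plus the index of the next free slot. */
struct append_ring {
	uint32_t buf[APPEND_RING_DWORDS];
	uint32_t tail;
};

/* Check-and-wrap followed by one block copy of the whole command. */
static inline int append_ring_append(struct append_ring *ring,
				     uint32_t nwords, const uint32_t *insns)
{
	if (nwords > APPEND_RING_DWORDS)
		return -1;

	/* Not enough room before the end: pad the remainder and wrap. */
	if (nwords > APPEND_RING_DWORDS - ring->tail) {
		memset(&ring->buf[ring->tail], 0,
		       (APPEND_RING_DWORDS - ring->tail) * sizeof(uint32_t));
		ring->tail = 0;
	}

	memcpy(&ring->buf[ring->tail], insns, nwords * sizeof(uint32_t));
	ring->tail = (ring->tail + nwords) % APPEND_RING_DWORDS;
	return 0;
}
```

The caller builds the command in a local array and hands it over in one
go; whether the extra copy costs more than it saves is not something this
sketch can answer.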

Chris Wilson April 14, 2016, 3:55 p.m. UTC | #6

On Thu, Apr 14, 2016 at 04:07:58PM +0100, Dave Gordon wrote:
> On 14/04/16 12:58, Tvrtko Ursulin wrote:
> >
> >On 14/04/16 12:30, Chris Wilson wrote:
> >>On Thu, Apr 14, 2016 at 12:24:20PM +0100, Tvrtko Ursulin wrote:
> >>>
> >>>On 14/04/16 12:16, Chris Wilson wrote:
> >>>>On Thu, Apr 14, 2016 at 11:59:29AM +0100, Tvrtko Ursulin wrote:
> >>>>>From: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> >>>>>
> >>>>>We know ringbuffers are memory and not ports so if we use readl
> >>>>>and writel instead of ioread32 and iowrite32 (which dispatch to
> >>>>>the very same functions after checking the address range) we
> >>>>>avoid generating functions calls and branching on every access.
> >>>>
> >>>>We don't need to use readl/write at all, since they are normal memory
> >>>>on llc, and on x86 we can pretend that iomaps (!llc/stolen) are as
> >>>>well.
> >>>
> >>>It is fine to use readl/writel since it translates to a single mov
> >>>instruction anyway on x86.
> >>>
> >>>>This patch is in the queue along with killing the incorrect spare iomem
> >>>>annotation.
> >>>
> >>>Ok did not spot them. Don't mind either way, thought this is quick,
> >>>easy and obvious improvement when I spotted the ugly code generated
> >>>for ring buffer writing.
> >>>
> >>>Mind you it is still not completely pretty with this patch since it
> >>>is full of reloads and adds for ringbuf->virtual_start and tail
> >>>which I can't figure how to help GCC optimize. Unless we make being,
> >>>emit and advance functions return the current tail pointer and also
> >>>accept it. In that case it all shrinks by half.
> >>
> >>We figured out how to help gcc with that in userspace using:
> >>
> >>out = ring_begin(num_dwords);
> >>out[0] = cmd;
> >>out[N] = dwN
> >>
> >>GCC will then do
> >>
> >>mov $imm0, 0x0($eax)
> >>mov $imm1, 0x4($eax)
> >>mov $edx, 0x8($eax)
> >>etc
> >
> >Would be nice, hope it happens soon. :)
> >
> >Regards,
> >Tvrtko
> 
> Another couple of alternative styles:
> 
> 	DWORD* ptr = ring_begin(ring, nwords);
> 	*ptr++ = MI_WHATEVER;
> 	*ptr++ = param1;

GCC turns *ptr++ into mov $val, (%eax); add $4, %eax

It's convenient for translating the for loops, but not as compact.

> 	...
> 	ring_advance(ring, ptr);
> 	// this call checks that 'ptr' has not gone
> 	// beyond the nwords reserved above
> 
> Or collapse it all into one call:
> 
> 	DWORD insns[OP_NWORDS] = {
> 		MI_WHATEVER,
> 		param1,
> 		...
> 	}
> 	ring_append(ring, nwords, insns);
> 
> which combines the check-and-wrap with a block copy to add all the
> instructions in one go :)

Considered that, but not seriously looked into what gcc does there -
mainly because that would involve even more complicated changes to the
code.
-Chris
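
For comparison, the post-increment style with a checking advance call
might look like the sketch below. It is a userspace model with
hypothetical names (struct cursor_ring, cursor_ring_begin,
cursor_ring_advance, CURSOR_RING_DWORDS) and, to stay short, it does not
handle wrapping at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CURSOR_RING_DWORDS 16u

/* Hypothetical ring that remembers how much the last begin reserved. */
struct cursor_ring {
	uint32_t buf[CURSOR_RING_DWORDS];
	uint32_t tail;		/* dwords committed so far */
	uint32_t reserved;	/* dwords promised by the last begin */
};

/* Reserve nwords and return a cursor the caller advances with *ptr++. */
static inline uint32_t *cursor_ring_begin(struct cursor_ring *ring,
					  uint32_t nwords)
{
	if (nwords > CURSOR_RING_DWORDS - ring->tail)
		return NULL;	/* no wrap handling in this sketch */
	ring->reserved = nwords;
	return &ring->buf[ring->tail];
}

/* Commit what was written, rejecting a cursor that moved backwards
 * or beyond the reservation made by cursor_ring_begin. */
static inline int cursor_ring_advance(struct cursor_ring *ring,
				      const uint32_t *ptr)
{
	const uint32_t *start = &ring->buf[ring->tail];

	if (ptr < start || (uint32_t)(ptr - start) > ring->reserved)
		return -1;
	ring->tail += (uint32_t)(ptr - start);
	ring->reserved = 0;
	return 0;
}
```

The check in cursor_ring_advance is what the moving cursor buys; the
indexed style gives the compiler fixed offsets instead, which is the
compactness trade-off noted in the reply.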

Tvrtko Ursulin April 15, 2016, 8:54 a.m. UTC | #7

On 14/04/16 17:37, Patchwork wrote:
> == Series Details ==
>
> Series: series starting with [1/3] drm/i915: Use readl/writel for ring buffer access
> URL   : https://patchwork.freedesktop.org/series/5714/
> State : failure
>
> == Summary ==
>
> Series 5714v1 Series without cover letter
> http://patchwork.freedesktop.org/api/1.0/series/5714/revisions/1/mbox/
>
> Test drv_hangman:
>          Subgroup error-state-basic:
>                  pass       -> FAIL       (ilk-hp8440p)
> Test gem_busy:
>          Subgroup basic-render:
>                  pass       -> SKIP       (ilk-hp8440p)
> Test gem_exec_basic:
>          Subgroup basic-render:
>                  pass       -> FAIL       (ilk-hp8440p)
>          Subgroup gtt-bsd:
>                  pass       -> FAIL       (ilk-hp8440p)
>          Subgroup readonly-default:
>                  pass       -> FAIL       (ilk-hp8440p)
>          Subgroup readonly-render:
>                  pass       -> FAIL       (ilk-hp8440p)
> Test gem_exec_create:
>          Subgroup basic:
>                  pass       -> FAIL       (ilk-hp8440p)
> Test gem_exec_store:
>          Subgroup basic-all:
>                  pass       -> FAIL       (ilk-hp8440p)
>          Subgroup basic-bsd:
>                  pass       -> FAIL       (ilk-hp8440p)
>          Subgroup basic-default:
>                  pass       -> FAIL       (ilk-hp8440p)
> Test gem_exec_whisper:
>          Subgroup basic:
>                  pass       -> SKIP       (ilk-hp8440p)
> Test gem_linear_blits:
>          Subgroup basic:
>                  pass       -> FAIL       (ilk-hp8440p)
> Test gem_mmap_gtt:
>          Subgroup basic-small-copy-xy:
>                  pass       -> DMESG-WARN (ilk-hp8440p)
> Test gem_ringfill:
>          Subgroup basic-default-hang:
>                  pass       -> DMESG-FAIL (ilk-hp8440p)
> Test gem_sync:
>          Subgroup basic-all:
>                  pass       -> FAIL       (ilk-hp8440p)
>          Subgroup basic-bsd:
>                  pass       -> FAIL       (ilk-hp8440p)
>          Subgroup basic-render:
>                  pass       -> DMESG-FAIL (ilk-hp8440p)
> Test gem_tiled_fence_blits:
>          Subgroup basic:
>                  pass       -> FAIL       (ilk-hp8440p)
> Test kms_addfb_basic:
>          Subgroup addfb25-modifier-no-flag:
>                  pass       -> DMESG-WARN (ilk-hp8440p)
> Test kms_flip:
>          Subgroup basic-flip-vs-dpms:
>                  pass       -> DMESG-FAIL (ilk-hp8440p) UNSTABLE
>          Subgroup basic-flip-vs-wf_vblank:
>                  pass       -> FAIL       (ilk-hp8440p) UNSTABLE
>          Subgroup basic-plain-flip:
>                  pass       -> DMESG-WARN (ilk-hp8440p)
> Test kms_force_connector_basic:
>          Subgroup force-edid:
>                  skip       -> PASS       (hsw-gt2)
>          Subgroup force-load-detect:
>                  skip       -> DMESG-WARN (ilk-hp8440p)
> Test kms_pipe_crc_basic:
>          Subgroup bad-nb-words-1:
>                  pass       -> DMESG-WARN (ilk-hp8440p)
>          Subgroup bad-nb-words-3:
>                  pass       -> DMESG-WARN (ilk-hp8440p)
>          Subgroup bad-pipe:
>                  pass       -> DMESG-WARN (ilk-hp8440p)
>          Subgroup hang-read-crc-pipe-a:
>                  pass       -> DMESG-FAIL (ilk-hp8440p)
>          Subgroup nonblocking-crc-pipe-b:
>                  pass       -> DMESG-FAIL (ilk-hp8440p)
>          Subgroup read-crc-pipe-a-frame-sequence:
>                  pass       -> DMESG-FAIL (ilk-hp8440p)
>          Subgroup read-crc-pipe-b:
>                  pass       -> DMESG-FAIL (ilk-hp8440p)
>          Subgroup read-crc-pipe-b-frame-sequence:
>                  pass       -> DMESG-FAIL (ilk-hp8440p)
>          Subgroup suspend-read-crc-pipe-c:
>                  skip       -> DMESG-FAIL (ilk-hp8440p)
> Test kms_sink_crc_basic:
>                  skip       -> DMESG-FAIL (ilk-hp8440p)
>
> bdw-ultra        total:203  pass:179  dwarn:0   dfail:0   fail:1   skip:23
> bsw-nuc-2        total:202  pass:162  dwarn:0   dfail:0   fail:1   skip:39
> byt-nuc          total:202  pass:164  dwarn:0   dfail:0   fail:0   skip:38
> hsw-brixbox      total:203  pass:178  dwarn:0   dfail:0   fail:1   skip:24
> hsw-gt2          total:203  pass:183  dwarn:0   dfail:0   fail:1   skip:19
> ilk-hp8440p      total:203  pass:103  dwarn:7   dfail:10  fail:15  skip:68
> ivb-t430s        total:203  pass:174  dwarn:0   dfail:0   fail:1   skip:28
> skl-i7k-2        total:203  pass:177  dwarn:0   dfail:0   fail:1   skip:25
> skl-nuci5        total:203  pass:191  dwarn:0   dfail:0   fail:1   skip:11
> snb-dellxps      total:203  pass:164  dwarn:0   dfail:0   fail:1   skip:38
> snb-x220t        total:203  pass:164  dwarn:0   dfail:0   fail:2   skip:37
> BOOT FAILED for bdw-nuci7
>
> Results at /archive/results/CI_IGT_test/Patchwork_1901/
>
> c7583aec08ba04e2336bd9879a10f30d4e0cdc60 drm-intel-nightly: 2016y-04m-14d-14h-53m-34s UTC integration manifest
> b704cc8 drm/i915: Use writel instead of iowrite32 when programming page table entries
> 6a748d4 drm/i915: Use writel instead of iowrite32 when doing GTT relocations
> 8c397f2 drm/i915: Use readl/writel for ring buffer access

Oh this breaks ILK - what is special about it?

Regards,

Tvrtko
