
drm/radeon: Inline r100_mm_rreg

Message ID 20140410160817.5275493d.cand@gmx.com (mailing list archive)
State New, archived

Commit Message

Lauri Kasanen April 10, 2014, 1:08 p.m. UTC
This was originally un-inlined by Andi Kleen in 2011, citing size concerns.
Indeed, inlining it grows radeon.ko by 7%.

However, 2% of CPU time is spent in this function. Inlining it gives 1% more
fps in Urban Terror.

Signed-off-by: Lauri Kasanen <cand@gmx.com>
---
 drivers/gpu/drm/radeon/r100.c   | 18 ------------------
 drivers/gpu/drm/radeon/radeon.h | 20 ++++++++++++++++++--
 2 files changed, 18 insertions(+), 20 deletions(-)

Comments

Ilia Mirkin April 10, 2014, 4:19 p.m. UTC | #1
On Thu, Apr 10, 2014 at 9:08 AM, Lauri Kasanen <cand@gmx.com> wrote:
> This was originally un-inlined by Andi Kleen in 2011 citing size concerns.
> Indeed, inlining it grows radeon.ko by 7%.
>
> However, 2% of cpu is spent in this function. Inlining it gives 1% more fps
> in Urban Terror.
>
> Signed-off-by: Lauri Kasanen <cand@gmx.com>
> ---
>  drivers/gpu/drm/radeon/r100.c   | 18 ------------------
>  drivers/gpu/drm/radeon/radeon.h | 20 ++++++++++++++++++--
>  2 files changed, 18 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
> index b6c3264..8169e82 100644
> --- a/drivers/gpu/drm/radeon/r100.c
> +++ b/drivers/gpu/drm/radeon/r100.c
> @@ -4086,24 +4086,6 @@ int r100_init(struct radeon_device *rdev)
>         return 0;
>  }
>
> -uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
> -                     bool always_indirect)
> -{
> -       if (reg < rdev->rmmio_size && !always_indirect)
> -               return readl(((void __iomem *)rdev->rmmio) + reg);
> -       else {
> -               unsigned long flags;
> -               uint32_t ret;
> -
> -               spin_lock_irqsave(&rdev->mmio_idx_lock, flags);
> -               writel(reg, ((void __iomem *)rdev->rmmio) + RADEON_MM_INDEX);
> -               ret = readl(((void __iomem *)rdev->rmmio) + RADEON_MM_DATA);
> -               spin_unlock_irqrestore(&rdev->mmio_idx_lock, flags);
> -
> -               return ret;
> -       }
> -}
> -
>  void r100_mm_wreg(struct radeon_device *rdev, uint32_t reg, uint32_t v,
>                   bool always_indirect)
>  {
> diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
> index 5cf10a7..9231100 100644
> --- a/drivers/gpu/drm/radeon/radeon.h
> +++ b/drivers/gpu/drm/radeon/radeon.h
> @@ -2330,8 +2330,24 @@ int radeon_device_init(struct radeon_device *rdev,
>  void radeon_device_fini(struct radeon_device *rdev);
>  int radeon_gpu_wait_for_idle(struct radeon_device *rdev);
>
> -uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
> -                     bool always_indirect);
> +static inline uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
> +                                   bool always_indirect)
> +{
> +       if (reg < rdev->rmmio_size && !always_indirect)
> +               return readl(((void __iomem *)rdev->rmmio) + reg);

Quick thought from someone entirely unfamiliar with the hardware:
perhaps you can get the performance benefit without the size increase
by moving the else portion into a non-inline function? I'm guessing
that most accesses happen in the "if" branch.

> +       else {
> +               unsigned long flags;
> +               uint32_t ret;
> +
> +               spin_lock_irqsave(&rdev->mmio_idx_lock, flags);
> +               writel(reg, ((void __iomem *)rdev->rmmio) + RADEON_MM_INDEX);
> +               ret = readl(((void __iomem *)rdev->rmmio) + RADEON_MM_DATA);
> +               spin_unlock_irqrestore(&rdev->mmio_idx_lock, flags);
> +
> +               return ret;
> +       }
> +}
> +
>  void r100_mm_wreg(struct radeon_device *rdev, uint32_t reg, uint32_t v,
>                   bool always_indirect);
>  u32 r100_io_rreg(struct radeon_device *rdev, u32 reg);
> --
> 1.8.3.1
>
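For illustration, a minimal sketch of the split being suggested here, built from the
function as posted in this patch: the direct readl() stays inline, while the rarely
taken indirect path moves into an out-of-line helper. The r100_mm_rreg_slow name and
its placement are illustrative only, not part of the patch under discussion.

/* radeon.h: fast path stays inline, slow path is only declared here. */
uint32_t r100_mm_rreg_slow(struct radeon_device *rdev, uint32_t reg);

static inline uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
				    bool always_indirect)
{
	/* Common case: the register lies inside the direct MMIO window. */
	if (reg < rdev->rmmio_size && !always_indirect)
		return readl(((void __iomem *)rdev->rmmio) + reg);

	/* Rare case: go through the indirect index/data pair, out of line. */
	return r100_mm_rreg_slow(rdev, reg);
}

/* r100.c (or similar): not inlined, so the spinlocked sequence is emitted once. */
uint32_t r100_mm_rreg_slow(struct radeon_device *rdev, uint32_t reg)
{
	unsigned long flags;
	uint32_t ret;

	spin_lock_irqsave(&rdev->mmio_idx_lock, flags);
	writel(reg, ((void __iomem *)rdev->rmmio) + RADEON_MM_INDEX);
	ret = readl(((void __iomem *)rdev->rmmio) + RADEON_MM_DATA);
	spin_unlock_irqrestore(&rdev->mmio_idx_lock, flags);

	return ret;
}

This keeps a single copy of the spinlocked sequence in the module while still letting
callers that hit the direct window avoid a function call entirely.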
Lauri Kasanen April 10, 2014, 6:46 p.m. UTC | #2
On Thu, 10 Apr 2014 12:19:10 -0400
Ilia Mirkin <imirkin@alum.mit.edu> wrote:

> > +static inline uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
> > +                                   bool always_indirect)
> > +{
> > +       if (reg < rdev->rmmio_size && !always_indirect)
> > +               return readl(((void __iomem *)rdev->rmmio) + reg);
> 
> Quick thought from someone entirely unfamiliar with the hardware:
> perhaps you can get the performance benefit without the size increase
> by moving the else portion into a non-inline function? I'm guessing
> that most accesses happen in the "if" branch.

The function call overhead is about equal to the branching overhead, so
splitting it would only help by about half as much. It's called from many
places, with a lot of calls per second.

Of course, future kernel LTO will make all of this go away, but that's
probably two years away from being stable.

- Lauri
Ilia Mirkin April 10, 2014, 6:52 p.m. UTC | #3
On Thu, Apr 10, 2014 at 2:46 PM, Lauri Kasanen <cand@gmx.com> wrote:
> On Thu, 10 Apr 2014 12:19:10 -0400
> Ilia Mirkin <imirkin@alum.mit.edu> wrote:
>
>> > +static inline uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
>> > +                                   bool always_indirect)
>> > +{
>> > +       if (reg < rdev->rmmio_size && !always_indirect)
>> > +               return readl(((void __iomem *)rdev->rmmio) + reg);
>>
>> Quick thought from someone entirely unfamiliar with the hardware:
>> perhaps you can get the performance benefit without the size increase
>> by moving the else portion into a non-inline function? I'm guessing
>> that most accesses happen in the "if" branch.
>
> The function call overhead is about equal to branching overhead, so
> splitting it would only help about half that. It's called from many
> places, and a lot of calls per sec.

Is that really true? I haven't profiled it, but a function call has to
save/restore registers, set up a new stack frame, etc. And the jump is
to some far-away code, so perhaps less likely to be in the i-cache? (But
maybe not, I'm not sure about the details of how all that works.) And I'm
guessing most of the size increase is coming from the spinlock/unlock,
which, I'm guessing again, is the much rarer case.

And the branch would happen either way... so that's a sunk cost.
(Except I bet that the always_indirect param is always constant, so
that bit of the if can be resolved at compile time once the function
is inlined.)

Anyways, it was just a thought.

  -ilia
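To make that parenthetical point concrete: the register-access macros in radeon.h pass a
literal true/false for always_indirect (shown approximately below), so once the body is
inlined at the call site the compiler can evaluate that half of the condition at compile
time and discard the dead branch.

/* Call-site macros from radeon.h (approximate; the flag is a constant). */
#define RREG32(reg)      r100_mm_rreg(rdev, (reg), false)
#define RREG32_IDX(reg)  r100_mm_rreg(rdev, (reg), true)

/*
 * With the inline definition from this patch, RREG32(reg) therefore
 * reduces to the range check plus a single readl():
 *
 *	if (reg < rdev->rmmio_size)
 *		return readl(((void __iomem *)rdev->rmmio) + reg);
 *	... indirect MM_INDEX/MM_DATA path only if the check fails ...
 *
 * and RREG32_IDX(reg) reduces to just the indirect path, with no
 * run-time test of always_indirect in either case.
 */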
Christian König April 10, 2014, 7:30 p.m. UTC | #4
On 10.04.2014 20:52, Ilia Mirkin wrote:
> On Thu, Apr 10, 2014 at 2:46 PM, Lauri Kasanen <cand@gmx.com> wrote:
>> On Thu, 10 Apr 2014 12:19:10 -0400
>> Ilia Mirkin <imirkin@alum.mit.edu> wrote:
>>
>>>> +static inline uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
>>>> +                                   bool always_indirect)
>>>> +{
>>>> +       if (reg < rdev->rmmio_size && !always_indirect)
>>>> +               return readl(((void __iomem *)rdev->rmmio) + reg);
>>> Quick thought from someone entirely unfamiliar with the hardware:
>>> perhaps you can get the performance benefit without the size increase
>>> by moving the else portion into a non-inline function? I'm guessing
>>> that most accesses happen in the "if" branch.
>> The function call overhead is about equal to branching overhead, so
>> splitting it would only help about half that. It's called from many
>> places, and a lot of calls per sec.

Actually, direct register access shouldn't be necessary that often. Apart
from page flips, write/read pointer updates, and IRQ processing, there
shouldn't be that many of them. Could you clarify a bit more what issue
you are seeing here?

> Is that really true? I haven't profiled it, but a function call has to
> save/restore registers, set up new stack, etc. And the jump is to some
> far-away code, so perhaps less likely to be in i-cache? (But maybe
> not, not sure on the details of how all that works.) And I'm guessing
> most of the size increase is coming from the spinlock/unlock, which,
> I'm guessing again, is the much-rarer case.
>
> And the branch would happen either way... so that's a sunk cost.
> (Except I bet that the always_indirect param is always constant and
> that bit of the if can be resolved at compile time with that part
> being inlined.)
>
> Anyways, it was just a thought.

And a pretty much correct one. The "else" case shouldn't be necessary on
modern hardware any more, and so it is nearly never taken.

Christian.

>
>    -ilia
Lauri Kasanen April 11, 2014, 7:52 a.m. UTC | #5
On Thu, 10 Apr 2014 21:30:03 +0200
Christian König <deathsimple@vodafone.de> wrote:

> >>> Quick thought from someone entirely unfamiliar with the hardware:
> >>> perhaps you can get the performance benefit without the size increase
> >>> by moving the else portion into a non-inline function? I'm guessing
> >>> that most accesses happen in the "if" branch.
> >> The function call overhead is about equal to branching overhead, so
> >> splitting it would only help about half that. It's called from many
> >> places, and a lot of calls per sec.
> 
> Actually direct register access shouldn't be necessary so often. Apart 
> from page flips, write/read pointer updates and irq processing there 
> shouldn't be so many of them. Could you clarify a bit more what issue 
> you are seeing here?

Too much CPU usage for such a simple function: 2% makes it #2 in the top 10
radeon.ko functions, right after evergreen_cs_parse. For reference, #3
(radeon_cs_packet_parse) is at only 0.5%, a quarter of this function's
usage.

As shown by the perf increase, it's called often enough that getting
rid of the function call overhead (and compiling the if out at
compile time) helps measurably.

- Lauri
Christian König April 11, 2014, 8:33 a.m. UTC | #6
On 11.04.2014 09:52, Lauri Kasanen wrote:
> On Thu, 10 Apr 2014 21:30:03 +0200
> Christian König <deathsimple@vodafone.de> wrote:
>
>>>>> Quick thought from someone entirely unfamiliar with the hardware:
>>>>> perhaps you can get the performance benefit without the size increase
>>>>> by moving the else portion into a non-inline function? I'm guessing
>>>>> that most accesses happen in the "if" branch.
>>>> The function call overhead is about equal to branching overhead, so
>>>> splitting it would only help about half that. It's called from many
>>>> places, and a lot of calls per sec.
>> Actually direct register access shouldn't be necessary so often. Apart
>> from page flips, write/read pointer updates and irq processing there
>> shouldn't be so many of them. Could you clarify a bit more what issue
>> you are seeing here?
> Too much cpu usage for such a simple function. 2% makes it #2 in top-10
> radeon.ko functions, right after evergreen_cs_parse. For reference, #3
> (radeon_cs_packet_parse) is only 0.5%, one fourth of this function's
> usage.

I think you misunderstood me here. I do believe your numbers that it 
makes a noticeable difference.

But I did a couple of perf tests recently on SI and CIK while hacking
on VM support, and IIRC r100_mm_rreg didn't show up in the top 10 on
those systems.

So what puzzles me is who the heck is calling r100_mm_rreg so often that
it makes a noticeable difference on evergreen/NI?

Christian.

>
> As proved by the perf increase, it's called often enough that getting
> rid of the function call overhead (and compiling the if out
> compile-time) helps measurably.
>
> - Lauri
Lauri Kasanen April 11, 2014, 9:54 a.m. UTC | #7
On Fri, 11 Apr 2014 10:33:08 +0200
Christian König <deathsimple@vodafone.de> wrote:

> >> Actually direct register access shouldn't be necessary so often. Apart
> >> from page flips, write/read pointer updates and irq processing there
> >> shouldn't be so many of them. Could you clarify a bit more what issue
> >> you are seeing here?
> > Too much cpu usage for such a simple function. 2% makes it #2 in top-10
> > radeon.ko functions, right after evergreen_cs_parse. For reference, #3
> > (radeon_cs_packet_parse) is only 0.5%, one fourth of this function's
> > usage.
> 
> I think you misunderstood me here. I do believe your numbers that it 
> makes a noticeable difference.
> 
> But I did a couple of perf tests recently on SI and CIK while hacking
> on VM support, and IIRC r100_mm_rreg didn't show up in the top 10 on
> those systems.
> 
> So what puzzles me is who the heck is calling r100_mm_rreg so often that
> it makes a noticeable difference on evergreen/NI?

The biggest caller is cayman_cp_int_cntl_setup. Before inlining it took
0.0013%; after, it takes 1%.

This is on a Richland APU, so Aruba/Cayman. Urban Terror is an ioq3
game with a lot of CPU-side vertex submissions.

- Lauri
Christian König April 11, 2014, 12:32 p.m. UTC | #8
On 11.04.2014 11:54, Lauri Kasanen wrote:
> On Fri, 11 Apr 2014 10:33:08 +0200
> Christian König <deathsimple@vodafone.de> wrote:
>
>>>> Actually direct register access shouldn't be necessary so often. Apart
>>>> from page flips, write/read pointer updates and irq processing there
>>>> shouldn't be so many of them. Could you clarify a bit more what issue
>>>> you are seeing here?
>>> Too much cpu usage for such a simple function. 2% makes it #2 in top-10
>>> radeon.ko functions, right after evergreen_cs_parse. For reference, #3
>>> (radeon_cs_packet_parse) is only 0.5%, one fourth of this function's
>>> usage.
>> I think you misunderstood me here. I do believe your numbers that it
>> makes a noticeable difference.
>>
>> But I did a couple of perf tests recently on SI and CIK while hacking
>> on VM support, and IIRC r100_mm_rreg didn't show up in the top 10 on
>> those systems.
>>
>> So what puzzles me is who the heck is calling r100_mm_rreg so often that
>> it makes a noticeable difference on evergreen/NI?
> The biggest caller is cayman_cp_int_cntl_setup. Before inlining it took
> 0.0013%, after it takes 1%.

Sounds like somebody is constantly turning interrupts on and off.

> This is on a Richland APU, so Aruba/Cayman. Urban Terror is an ioq3
> game with a lot of cpu-side vertex submissions.

That will probably be the difference; I only tested Lightsmark.

Anyway, I would do as Ilia suggested and only put the else branch into 
a separate, non-inlined function.

BTW: It's probably a good idea to do the same for the write function as 
well.

Christian.

> - Lauri
Lauri Kasanen April 11, 2014, 4:47 p.m. UTC | #9
On Fri, 11 Apr 2014 14:32:20 +0200
Christian König <deathsimple@vodafone.de> wrote:

> Anyway, I would do as Ilia suggested and only put the else branch into 
> a separate, non-inlined function.
> 
> BTW: It's probably a good idea to do the same for the write function as 
> well.

I tested it. The majority of the size increase stayed: with the
else/spinlock part as non-inlined functions, radeon.ko was still 5%
larger instead of 7%.

- Lauri
Lauri Kasanen April 11, 2014, 5:38 p.m. UTC | #10
On Fri, 11 Apr 2014 14:32:20 +0200
Christian König <deathsimple@vodafone.de> wrote:

> On 11.04.2014 11:54, Lauri Kasanen wrote:
> > On Fri, 11 Apr 2014 10:33:08 +0200
> > Christian König <deathsimple@vodafone.de> wrote:
> >
> >>>> Actually direct register access shouldn't be necessary so often. Apart
> >>>> from page flips, write/read pointer updates and irq processing there
> >>>> shouldn't be so many of them. Could you clarify a bit more what issue
> >>>> you are seeing here?
> >>> Too much cpu usage for such a simple function. 2% makes it #2 in top-10
> >>> radeon.ko functions, right after evergreen_cs_parse. For reference, #3
> >>> (radeon_cs_packet_parse) is only 0.5%, one fourth of this function's
> >>> usage.
> >> I think you misunderstood me here. I do believe your numbers that it
> >> makes a noticeable difference.
> >>
> >> But I did a couple of perf tests recently on SI and CIK while hacking
> >> on VM support, and IIRC r100_mm_rreg didn't show up in the top 10 on
> >> those systems.
> >>
> >> So what puzzles me is who the heck is calling r100_mm_rreg so often that
> >> it makes a noticeable difference on evergreen/NI?
> > The biggest caller is cayman_cp_int_cntl_setup. Before inlining it took
> > 0.0013%, after it takes 1%.
> 
> Sounds like somebody is constantly turning interrupts on and off.

I instrumented radeon_irq_set. There are about 12 calls to it per second,
most coming from radeon_irq_kms_sw_irq_get and
radeon_irq_kms_sw_irq_put.

So from your earlier mail I gather this amount of IRQ setting is
normal. It's just taking a lot of CPU for some reason.

- Lauri
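For context, a sketch of the refcounting behind those two helpers (names follow
radeon_irq_kms.c of this era, but treat the details as approximate): the full
radeon_irq_set() reprogramming, and with it the cayman_cp_int_cntl_setup() register
writes showing up in the profile, only runs when a ring's interrupt refcount crosses
zero, which matches the roughly 12 calls per second measured above.

/* Approximate shape of the helpers in radeon_irq_kms.c: */
void radeon_irq_kms_sw_irq_get(struct radeon_device *rdev, int ring)
{
	unsigned long irqflags;

	/* Reprogram the hardware only on the 0 -> 1 transition. */
	if (atomic_inc_return(&rdev->irq.ring_int[ring]) == 1) {
		spin_lock_irqsave(&rdev->irq.lock, irqflags);
		radeon_irq_set(rdev);	/* reaches cayman_cp_int_cntl_setup() on NI */
		spin_unlock_irqrestore(&rdev->irq.lock, irqflags);
	}
}

void radeon_irq_kms_sw_irq_put(struct radeon_device *rdev, int ring)
{
	unsigned long irqflags;

	/* ...and again when the last reference goes away (1 -> 0). */
	if (atomic_dec_and_test(&rdev->irq.ring_int[ring])) {
		spin_lock_irqsave(&rdev->irq.lock, irqflags);
		radeon_irq_set(rdev);
		spin_unlock_irqrestore(&rdev->irq.lock, irqflags);
	}
}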

Patch

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index b6c3264..8169e82 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -4086,24 +4086,6 @@  int r100_init(struct radeon_device *rdev)
 	return 0;
 }
 
-uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
-		      bool always_indirect)
-{
-	if (reg < rdev->rmmio_size && !always_indirect)
-		return readl(((void __iomem *)rdev->rmmio) + reg);
-	else {
-		unsigned long flags;
-		uint32_t ret;
-
-		spin_lock_irqsave(&rdev->mmio_idx_lock, flags);
-		writel(reg, ((void __iomem *)rdev->rmmio) + RADEON_MM_INDEX);
-		ret = readl(((void __iomem *)rdev->rmmio) + RADEON_MM_DATA);
-		spin_unlock_irqrestore(&rdev->mmio_idx_lock, flags);
-
-		return ret;
-	}
-}
-
 void r100_mm_wreg(struct radeon_device *rdev, uint32_t reg, uint32_t v,
 		  bool always_indirect)
 {
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 5cf10a7..9231100 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -2330,8 +2330,24 @@  int radeon_device_init(struct radeon_device *rdev,
 void radeon_device_fini(struct radeon_device *rdev);
 int radeon_gpu_wait_for_idle(struct radeon_device *rdev);
 
-uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
-		      bool always_indirect);
+static inline uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
+				    bool always_indirect)
+{
+	if (reg < rdev->rmmio_size && !always_indirect)
+		return readl(((void __iomem *)rdev->rmmio) + reg);
+	else {
+		unsigned long flags;
+		uint32_t ret;
+
+		spin_lock_irqsave(&rdev->mmio_idx_lock, flags);
+		writel(reg, ((void __iomem *)rdev->rmmio) + RADEON_MM_INDEX);
+		ret = readl(((void __iomem *)rdev->rmmio) + RADEON_MM_DATA);
+		spin_unlock_irqrestore(&rdev->mmio_idx_lock, flags);
+
+		return ret;
+	}
+}
+
 void r100_mm_wreg(struct radeon_device *rdev, uint32_t reg, uint32_t v,
 		  bool always_indirect);
 u32 r100_io_rreg(struct radeon_device *rdev, u32 reg);