Message ID: 20140410160817.5275493d.cand@gmx.com (mailing list archive)
State: New, archived
On Thu, Apr 10, 2014 at 9:08 AM, Lauri Kasanen <cand@gmx.com> wrote:
> This was originally un-inlined by Andi Kleen in 2011 citing size concerns.
> Indeed, inlining it grows radeon.ko by 7%.
>
> However, 2% of cpu is spent in this function. Inlining it gives 1% more fps
> in Urban Terror.
>
> Signed-off-by: Lauri Kasanen <cand@gmx.com>
> ---
>  drivers/gpu/drm/radeon/r100.c   | 18 ------------------
>  drivers/gpu/drm/radeon/radeon.h | 20 ++++++++++++++++++--
>  2 files changed, 18 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
> index b6c3264..8169e82 100644
> --- a/drivers/gpu/drm/radeon/r100.c
> +++ b/drivers/gpu/drm/radeon/r100.c
> @@ -4086,24 +4086,6 @@ int r100_init(struct radeon_device *rdev)
>  	return 0;
>  }
>
> -uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
> -		      bool always_indirect)
> -{
> -	if (reg < rdev->rmmio_size && !always_indirect)
> -		return readl(((void __iomem *)rdev->rmmio) + reg);
> -	else {
> -		unsigned long flags;
> -		uint32_t ret;
> -
> -		spin_lock_irqsave(&rdev->mmio_idx_lock, flags);
> -		writel(reg, ((void __iomem *)rdev->rmmio) + RADEON_MM_INDEX);
> -		ret = readl(((void __iomem *)rdev->rmmio) + RADEON_MM_DATA);
> -		spin_unlock_irqrestore(&rdev->mmio_idx_lock, flags);
> -
> -		return ret;
> -	}
> -}
> -
>  void r100_mm_wreg(struct radeon_device *rdev, uint32_t reg, uint32_t v,
>  		  bool always_indirect)
>  {
> diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
> index 5cf10a7..9231100 100644
> --- a/drivers/gpu/drm/radeon/radeon.h
> +++ b/drivers/gpu/drm/radeon/radeon.h
> @@ -2330,8 +2330,24 @@ int radeon_device_init(struct radeon_device *rdev,
>  void radeon_device_fini(struct radeon_device *rdev);
>  int radeon_gpu_wait_for_idle(struct radeon_device *rdev);
>
> -uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
> -		      bool always_indirect);
> +static inline uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
> +				    bool always_indirect)
> +{
> +	if (reg < rdev->rmmio_size && !always_indirect)
> +		return readl(((void __iomem *)rdev->rmmio) + reg);

Quick thought from someone entirely unfamiliar with the hardware:
perhaps you can get the performance benefit without the size increase
by moving the else portion into a non-inline function? I'm guessing
that most accesses happen in the "if" branch.

> +	else {
> +		unsigned long flags;
> +		uint32_t ret;
> +
> +		spin_lock_irqsave(&rdev->mmio_idx_lock, flags);
> +		writel(reg, ((void __iomem *)rdev->rmmio) + RADEON_MM_INDEX);
> +		ret = readl(((void __iomem *)rdev->rmmio) + RADEON_MM_DATA);
> +		spin_unlock_irqrestore(&rdev->mmio_idx_lock, flags);
> +
> +		return ret;
> +	}
> +}
> +
>  void r100_mm_wreg(struct radeon_device *rdev, uint32_t reg, uint32_t v,
>  		  bool always_indirect);
>  u32 r100_io_rreg(struct radeon_device *rdev, u32 reg);
> --
> 1.8.3.1
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel
On Thu, 10 Apr 2014 12:19:10 -0400
Ilia Mirkin <imirkin@alum.mit.edu> wrote:

> > +static inline uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
> > +				    bool always_indirect)
> > +{
> > +	if (reg < rdev->rmmio_size && !always_indirect)
> > +		return readl(((void __iomem *)rdev->rmmio) + reg);
>
> Quick thought from someone entirely unfamiliar with the hardware:
> perhaps you can get the performance benefit without the size increase
> by moving the else portion into a non-inline function? I'm guessing
> that most accesses happen in the "if" branch.

The function call overhead is about equal to branching overhead, so
splitting it would only help about half that. It's called from many
places, and a lot of calls per sec.

Of course the future kernel LTO will make all of this go away, but that's
probably two years in the future before it's stable.

- Lauri
On Thu, Apr 10, 2014 at 2:46 PM, Lauri Kasanen <cand@gmx.com> wrote:
> On Thu, 10 Apr 2014 12:19:10 -0400
> Ilia Mirkin <imirkin@alum.mit.edu> wrote:
>
>> > +static inline uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
>> > +				    bool always_indirect)
>> > +{
>> > +	if (reg < rdev->rmmio_size && !always_indirect)
>> > +		return readl(((void __iomem *)rdev->rmmio) + reg);
>>
>> Quick thought from someone entirely unfamiliar with the hardware:
>> perhaps you can get the performance benefit without the size increase
>> by moving the else portion into a non-inline function? I'm guessing
>> that most accesses happen in the "if" branch.
>
> The function call overhead is about equal to branching overhead, so
> splitting it would only help about half that. It's called from many
> places, and a lot of calls per sec.

Is that really true? I haven't profiled it, but a function call has to
save/restore registers, set up a new stack frame, etc. And the jump is to
some far-away code, so perhaps less likely to be in the i-cache? (But
maybe not, not sure of the details of how all that works.) And I'm
guessing most of the size increase is coming from the spinlock/unlock,
which, I'm guessing again, is the much-rarer case.

And the branch would happen either way... so that's a sunk cost. (Except
I bet that the always_indirect param is always constant, and that bit of
the if can be resolved at compile time with that part being inlined.)

Anyways, it was just a thought.

  -ilia
On 10.04.2014 20:52, Ilia Mirkin wrote:
> On Thu, Apr 10, 2014 at 2:46 PM, Lauri Kasanen <cand@gmx.com> wrote:
>> On Thu, 10 Apr 2014 12:19:10 -0400
>> Ilia Mirkin <imirkin@alum.mit.edu> wrote:
>>
>>>> +static inline uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
>>>> +				    bool always_indirect)
>>>> +{
>>>> +	if (reg < rdev->rmmio_size && !always_indirect)
>>>> +		return readl(((void __iomem *)rdev->rmmio) + reg);
>>> Quick thought from someone entirely unfamiliar with the hardware:
>>> perhaps you can get the performance benefit without the size increase
>>> by moving the else portion into a non-inline function? I'm guessing
>>> that most accesses happen in the "if" branch.
>> The function call overhead is about equal to branching overhead, so
>> splitting it would only help about half that. It's called from many
>> places, and a lot of calls per sec.

Actually direct register access shouldn't be necessary so often. Apart
from page flips, write/read pointer updates and irq processing there
shouldn't be so many of them. Could you clarify a bit more what issue
you are seeing here?

> Is that really true? I haven't profiled it, but a function call has to
> save/restore registers, set up new stack, etc. And the jump is to some
> far-away code, so perhaps less likely to be in i-cache? (But maybe
> not, not sure on the details of how all that works.) And I'm guessing
> most of the size increase is coming from the spinlock/unlock, which,
> I'm guessing again, is the much-rarer case.
>
> And the branch would happen either way... so that's a sunk cost.
> (Except I bet that the always_indirect param is always constant and
> that bit of the if can be resolved at compile time with that part
> being inlined.)
>
> Anyways, it was just a thought.

And a pretty much correct one. The "else" case shouldn't be necessary on
modern hardware any more, and so is nearly never taken.

Christian.
On Thu, 10 Apr 2014 21:30:03 +0200
Christian König <deathsimple@vodafone.de> wrote:

> >>> Quick thought from someone entirely unfamiliar with the hardware:
> >>> perhaps you can get the performance benefit without the size increase
> >>> by moving the else portion into a non-inline function? I'm guessing
> >>> that most accesses happen in the "if" branch.
> >> The function call overhead is about equal to branching overhead, so
> >> splitting it would only help about half that. It's called from many
> >> places, and a lot of calls per sec.
>
> Actually direct register access shouldn't be necessary so often. Apart
> from page flips, write/read pointer updates and irq processing there
> shouldn't be so many of them. Could you clarify a bit more what issue
> you are seeing here?

Too much cpu usage for such a simple function. 2% makes it #2 in top-10
radeon.ko functions, right after evergreen_cs_parse. For reference, #3
(radeon_cs_packet_parse) is only 0.5%, one fourth of this function's
usage.

As shown by the perf increase, it's called often enough that getting rid
of the function call overhead (and resolving the if at compile time)
helps measurably.

- Lauri
On 11.04.2014 09:52, Lauri Kasanen wrote:
> On Thu, 10 Apr 2014 21:30:03 +0200
> Christian König <deathsimple@vodafone.de> wrote:
>
>>>>> Quick thought from someone entirely unfamiliar with the hardware:
>>>>> perhaps you can get the performance benefit without the size increase
>>>>> by moving the else portion into a non-inline function? I'm guessing
>>>>> that most accesses happen in the "if" branch.
>>>> The function call overhead is about equal to branching overhead, so
>>>> splitting it would only help about half that. It's called from many
>>>> places, and a lot of calls per sec.
>> Actually direct register access shouldn't be necessary so often. Apart
>> from page flips, write/read pointer updates and irq processing there
>> shouldn't be so many of them. Could you clarify a bit more what issue
>> you are seeing here?
> Too much cpu usage for such a simple function. 2% makes it #2 in top-10
> radeon.ko functions, right after evergreen_cs_parse. For reference, #3
> (radeon_cs_packet_parse) is only 0.5%, one fourth of this function's
> usage.

I think you misunderstood me here. I do believe your numbers that it
makes a noticeable difference.

But I did a couple of perf tests recently on SI and CIK while hacking on
VM support, and IIRC r100_mm_rreg didn't show up in the top 10 on those
systems.

So what puzzles me is who the heck is calling r100_mm_rreg so often that
it makes a noticeable difference on evergreen/NI?

Christian.

> As shown by the perf increase, it's called often enough that getting
> rid of the function call overhead (and resolving the if at compile
> time) helps measurably.
>
> - Lauri
On Fri, 11 Apr 2014 10:33:08 +0200
Christian König <deathsimple@vodafone.de> wrote:

> >> Actually direct register access shouldn't be necessary so often. Apart
> >> from page flips, write/read pointer updates and irq processing there
> >> shouldn't be so many of them. Could you clarify a bit more what issue
> >> you are seeing here?
> > Too much cpu usage for such a simple function. 2% makes it #2 in top-10
> > radeon.ko functions, right after evergreen_cs_parse. For reference, #3
> > (radeon_cs_packet_parse) is only 0.5%, one fourth of this function's
> > usage.
>
> I think you misunderstood me here. I do believe your numbers that it
> makes a noticeable difference.
>
> But I did a couple of perf tests recently on SI and CIK while hacking
> on VM support, and IIRC r100_mm_rreg didn't show up in the top 10 on
> those systems.
>
> So what puzzles me is who the heck is calling r100_mm_rreg so often that
> it makes a noticeable difference on evergreen/NI?

The biggest caller is cayman_cp_int_cntl_setup. Before inlining it took
0.0013%, after it takes 1%.

This is on a Richland APU, so Aruba/Cayman. Urban Terror is an ioq3 game
with a lot of cpu-side vertex submissions.

- Lauri
On 11.04.2014 11:54, Lauri Kasanen wrote:
> On Fri, 11 Apr 2014 10:33:08 +0200
> Christian König <deathsimple@vodafone.de> wrote:
>
>>>> Actually direct register access shouldn't be necessary so often. Apart
>>>> from page flips, write/read pointer updates and irq processing there
>>>> shouldn't be so many of them. Could you clarify a bit more what issue
>>>> you are seeing here?
>>> Too much cpu usage for such a simple function. 2% makes it #2 in top-10
>>> radeon.ko functions, right after evergreen_cs_parse. For reference, #3
>>> (radeon_cs_packet_parse) is only 0.5%, one fourth of this function's
>>> usage.
>> I think you misunderstood me here. I do believe your numbers that it
>> makes a noticeable difference.
>>
>> But I did a couple of perf tests recently on SI and CIK while hacking
>> on VM support, and IIRC r100_mm_rreg didn't show up in the top 10 on
>> those systems.
>>
>> So what puzzles me is who the heck is calling r100_mm_rreg so often that
>> it makes a noticeable difference on evergreen/NI?
> The biggest caller is cayman_cp_int_cntl_setup. Before inlining it took
> 0.0013%, after it takes 1%.

Sounds like somebody is constantly turning interrupts on and off.

> This is on a Richland APU, so Aruba/Cayman. Urban Terror is an ioq3
> game with a lot of cpu-side vertex submissions.

That will probably be the difference; I only tested Lightsmark.

Anyway, I would do as Ilia suggested and only put the else branch into a
separate, non-inlined function.

BTW: It's probably a good idea to do the same for the write function as
well.

Christian.

> - Lauri
On Fri, 11 Apr 2014 14:32:20 +0200
Christian König <deathsimple@vodafone.de> wrote:

> Anyway, I would do as Ilia suggested and only put the else branch into
> a separate, non-inlined function.
>
> BTW: It's probably a good idea to do the same for the write function as
> well.

I tested it. The majority of the size increase stayed: with the
else/spinlock part as non-inlined functions, radeon.ko was still 5%
larger instead of 7%.

- Lauri
On Fri, 11 Apr 2014 14:32:20 +0200
Christian König <deathsimple@vodafone.de> wrote:

> On 11.04.2014 11:54, Lauri Kasanen wrote:
> > On Fri, 11 Apr 2014 10:33:08 +0200
> > Christian König <deathsimple@vodafone.de> wrote:
> >
> >>>> Actually direct register access shouldn't be necessary so often. Apart
> >>>> from page flips, write/read pointer updates and irq processing there
> >>>> shouldn't be so many of them. Could you clarify a bit more what issue
> >>>> you are seeing here?
> >>> Too much cpu usage for such a simple function. 2% makes it #2 in top-10
> >>> radeon.ko functions, right after evergreen_cs_parse. For reference, #3
> >>> (radeon_cs_packet_parse) is only 0.5%, one fourth of this function's
> >>> usage.
> >> I think you misunderstood me here. I do believe your numbers that it
> >> makes a noticeable difference.
> >>
> >> But I did a couple of perf tests recently on SI and CIK while hacking
> >> on VM support, and IIRC r100_mm_rreg didn't show up in the top 10 on
> >> those systems.
> >>
> >> So what puzzles me is who the heck is calling r100_mm_rreg so often that
> >> it makes a noticeable difference on evergreen/NI?
> > The biggest caller is cayman_cp_int_cntl_setup. Before inlining it took
> > 0.0013%, after it takes 1%.
>
> Sounds like somebody is constantly turning interrupts on and off.

I instrumented radeon_irq_set. There are about 12 calls to that per sec,
most coming from radeon_irq_kms_sw_irq_get and radeon_irq_kms_sw_irq_put.

So from your earlier mail I gather this amount of IRQ setting is normal.
It's just taking a lot of cpu for some reason.

- Lauri
diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c
index b6c3264..8169e82 100644
--- a/drivers/gpu/drm/radeon/r100.c
+++ b/drivers/gpu/drm/radeon/r100.c
@@ -4086,24 +4086,6 @@ int r100_init(struct radeon_device *rdev)
 	return 0;
 }
 
-uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
-		      bool always_indirect)
-{
-	if (reg < rdev->rmmio_size && !always_indirect)
-		return readl(((void __iomem *)rdev->rmmio) + reg);
-	else {
-		unsigned long flags;
-		uint32_t ret;
-
-		spin_lock_irqsave(&rdev->mmio_idx_lock, flags);
-		writel(reg, ((void __iomem *)rdev->rmmio) + RADEON_MM_INDEX);
-		ret = readl(((void __iomem *)rdev->rmmio) + RADEON_MM_DATA);
-		spin_unlock_irqrestore(&rdev->mmio_idx_lock, flags);
-
-		return ret;
-	}
-}
-
 void r100_mm_wreg(struct radeon_device *rdev, uint32_t reg, uint32_t v,
 		  bool always_indirect)
 {
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 5cf10a7..9231100 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -2330,8 +2330,24 @@ int radeon_device_init(struct radeon_device *rdev,
 void radeon_device_fini(struct radeon_device *rdev);
 int radeon_gpu_wait_for_idle(struct radeon_device *rdev);
 
-uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
-		      bool always_indirect);
+static inline uint32_t r100_mm_rreg(struct radeon_device *rdev, uint32_t reg,
+				    bool always_indirect)
+{
+	if (reg < rdev->rmmio_size && !always_indirect)
+		return readl(((void __iomem *)rdev->rmmio) + reg);
+	else {
+		unsigned long flags;
+		uint32_t ret;
+
+		spin_lock_irqsave(&rdev->mmio_idx_lock, flags);
+		writel(reg, ((void __iomem *)rdev->rmmio) + RADEON_MM_INDEX);
+		ret = readl(((void __iomem *)rdev->rmmio) + RADEON_MM_DATA);
+		spin_unlock_irqrestore(&rdev->mmio_idx_lock, flags);
+
+		return ret;
+	}
+}
+
 void r100_mm_wreg(struct radeon_device *rdev, uint32_t reg, uint32_t v,
 		  bool always_indirect);
 u32 r100_io_rreg(struct radeon_device *rdev, u32 reg);
This was originally un-inlined by Andi Kleen in 2011 citing size concerns.
Indeed, inlining it grows radeon.ko by 7%.

However, 2% of cpu is spent in this function. Inlining it gives 1% more fps
in Urban Terror.

Signed-off-by: Lauri Kasanen <cand@gmx.com>
---
 drivers/gpu/drm/radeon/r100.c   | 18 ------------------
 drivers/gpu/drm/radeon/radeon.h | 20 ++++++++++++++++++--
 2 files changed, 18 insertions(+), 20 deletions(-)