diff mbox

[RFC] arm: use built-in byte swap function

Message ID 20130205210436.670c62e26d2121330e87af35@freescale.com (mailing list archive)
State New, archived
Headers show

Commit Message

Kim Phillips Feb. 6, 2013, 3:04 a.m. UTC
On Fri, 1 Feb 2013 07:33:17 +0000
"Woodhouse, David" <david.woodhouse@intel.com> wrote:

> On Fri, 2013-02-01 at 01:17 +0000, Russell King - ARM Linux wrote:
> > 
> > > I've tried both gcc 4.6.3 [1] and 4.6.4 [2].  If you can point me to
> > > a 4.5.x, I'll try that, too, but as it stands now, if one moves the
> > > code added to swab.h below outside of its armv6 protection,
> > > gcc adds calls to __bswapsi2.
> > 
> > Take a look at the message I sent on the 29th towards the beginning of
> > this thread for details of gcc 4.5.4 behaviour.
> 
> I'd like to see a comment (with PR# if appropriate) explaining clearly
> *why* it isn't enabled for <ARMv6 even with a bleeding-edge compiler.

ok I think I've figured it out: the difference in the defconfigs that
fail (at91_dt, at91sam9g45, and lpc32xx) is that they are ARM9's
(armv4/5), have CC_OPTIMIZE_FOR_SIZE set, and have code with
multiple swaps ready for space optimization: gcc -Os emits calls
to __bswapsi2 on those platforms to save space because they don't
have the single rev byte swap instruction.

> Russell's test also seemed to indicate that the 32-bit and 64-bit swap
> support was present and functional in GCC 4.5.4 (as indeed it should
> have been since 4.4), so I'm still not quite sure why you require 4.6
> for that.

initially it was based at looking at gcc commit history for the
'rev' instruction implementation, but now I've got 4.4, 4.5, 4.6 and
4.7 compilers to perform Russell's test:

$ for cc in 4.4 4.5 4.6 4.7; do \
    arm-linux-gnueabi-gcc-$cc --version | grep gcc ; \
    for a in armv3 armv4 armv4t armv5t armv5te armv6k armv6 armv7-a; do \
      echo -n $a:; \
      for f in 16 32 64; do \
         echo 'unsigned foo(unsigned val) { return __builtin_bswap'$f'(val); }' | arm-linux-gnueabi-gcc-$cc -w -x c -S -o - - -march=$a | grep 'bl'; \
      done; \
    done; \
done

whose output is:

arm-linux-gnueabi-gcc-4.4 (Ubuntu/Linaro 4.4.7-1ubuntu2) 4.4.7
armv3:   	bl	__builtin_bswap16
	bl	__bswapsi2
	bl	__bswapdi2
armv4:   	bl	__builtin_bswap16
	bl	__bswapsi2
	bl	__bswapdi2
armv4t:   	bl	__builtin_bswap16
	bl	__bswapsi2
	bl	__bswapdi2
armv5t:   	bl	__builtin_bswap16
	bl	__bswapsi2
	bl	__bswapdi2
armv5te:   	bl	__builtin_bswap16
	bl	__bswapsi2
	bl	__bswapdi2
armv6k:   	bl	__builtin_bswap16
	bl	__bswapsi2
	bl	__bswapdi2
armv6:   	bl	__builtin_bswap16
	bl	__bswapsi2
	bl	__bswapdi2
armv7-a:   	bl	__builtin_bswap16
	bl	__bswapsi2
	bl	__bswapdi2
arm-linux-gnueabi-gcc-4.5 (Ubuntu/Linaro 4.5.3-12ubuntu2) 4.5.3
armv3:   	bl	__builtin_bswap16
armv4:   	bl	__builtin_bswap16
armv4t:   	bl	__builtin_bswap16
armv5t:   	bl	__builtin_bswap16
armv5te:   	bl	__builtin_bswap16
armv6k:   	bl	__builtin_bswap16
armv6:   	bl	__builtin_bswap16
armv7-a:   	bl	__builtin_bswap16
arm-linux-gnueabi-gcc-4.6 (Ubuntu/Linaro 4.6.3-8ubuntu1) 4.6.3 20120624 (prerelease)
armv3:   	bl	__builtin_bswap16
armv4:   	bl	__builtin_bswap16
armv4t:   	bl	__builtin_bswap16
armv5t:   	bl	__builtin_bswap16
armv5te:   	bl	__builtin_bswap16
armv6k:   	bl	__builtin_bswap16
armv6:   	bl	__builtin_bswap16
armv7-a:   	bl	__builtin_bswap16
arm-linux-gnueabi-gcc-4.7 (Ubuntu/Linaro 4.7.2-1ubuntu1) 4.7.2
armv3:   	bl	__builtin_bswap16
armv4:   	bl	__builtin_bswap16
armv4t:   	bl	__builtin_bswap16
armv5t:   	bl	__builtin_bswap16
armv5te:   	bl	__builtin_bswap16
armv6k:   	bl	__builtin_bswap16
armv6:   	bl	__builtin_bswap16
armv7-a:   	bl	__builtin_bswap16

So 4.4 should be exempt from using the built-ins because it always
emits __bswapsi2 calls: it doesn't matter whether or not -Os or -O2
are added as options in the test.

gcc 4.5, 4.6, and 4.7 all support 32 & 64-bit versions, so we
should check for gcc >= 4.5 instead of gcc >= 4.6.

I've added a new check for !CC_OPTIMIZE_FOR_SIZE and build-tested all
defconfigs with gcc 4.6.3 - here's v5:

From 11aa942a84fe94d204424a19b6b13fdb2b359ee6 Mon Sep 17 00:00:00 2001
From: Kim Phillips <kim.phillips@freescale.com>
Date: Mon, 28 Jan 2013 19:30:33 -0600
Subject: [PATCH] arm: use built-in byte swap function

Enable the compiler intrinsic for byte swapping on arch ARM.  This
allows the compiler to detect and be able to optimize out byte
swappings.

__builtin_bswap{32,64} support was added in gcc 4.4, but until
gcc 4.5, it emitted calls to libgcc's __bswap[sd]i2 (even
with -O2).

All gcc versions tested (4.[4567]) emit calls to __bswap[sd]i2 when
optimizing for size, so we add the !CC_OPTIMIZE_FOR_SIZE check.

Support for 16-bit built-ins will be in gcc version 4.8.

This has a tiny benefit on vmlinux text size (gcc 4.6.4):

multi_v7_defconfig:
   text    data     bss     dec     hex filename
3135208  188396  203344 3526948  35d124 vmlinux
multi_v7_defconfig with builtin_bswap:
   text    data     bss     dec     hex filename
3135112  188396  203344 3526852  35d0c4 vmlinux

exynos_defconfig:
   text    data     bss     dec     hex filename
4286605  360564  223172 4870341  4a50c5 vmlinux
exynos_defconfig with builtin_bswap:
   text    data     bss     dec     hex filename
4286405  360564  223172 4870141  4a4ffd vmlinux

The savings come mostly from device-tree related code, and some
from drivers.

Signed-off-by: Kim Phillips <kim.phillips@freescale.com>
---
akin to: http://comments.gmane.org/gmane.linux.kernel.cross-arch/16016

based on linux-next-20130128.  Depends on commit "compiler-gcc{3,4}.h:
Use GCC_VERSION macro" by Daniel Santos <daniel.santos@pobox.com>,
currently in the akpm branch.

v5: re-work based on new gcc version test data:
  - moved outside armv6 protection
  - check for gcc 4.6+ demoted to gcc 4.5+ with:
    !defined(CONFIG_CC_OPTIMIZE_FOR_SIZE)

v4:
- undo v2-2's addition of ARCH_DEFINES_BUILTIN_BSWAP per Boris
  and David - object is to find arches that define _HAVE_BSWAP
  and clean it up in the future: patch is much less intrusive. :)

v3:
- moved out of uapi swab.h into arch/arm/include/asm/swab.h
- moved ARCH_DEFINES_BUILTIN_BSWAP help text into commit message
- moved GCC_VERSION >= 40800 ifdef into GCC_VERSION >= 40600 block

v2:
- at91 and lpd270 builds fixed by limiting to ARMv6 and above
  (i.e., ARM cores that have support for the 'rev' instruction).
  Otherwise, the compiler emits calls to libgcc's __bswapsi2 on
  these ARMv4/v5 builds (and arch ARM doesn't link with libgcc).
  All ARM defconfigs now have the same build status as they did
  without this patch (some are broken on linux-next).

- move ARM check from generic compiler.h to arch ARM's swab.h.
  - pretty sure it should be limited to __KERNEL__ builds

- add new ARCH_DEFINES_BUILTIN_BSWAP (see Kconfig help).
  - if set, generic compiler header does not set HAVE_BUILTIN_BSWAPxx
  - not too sure about this having to be a new CONFIG_, but it's hard
    to find a place for it given linux/compiler.h doesn't include any
    arch-specific files.

- move new selects to end of CONFIG_ARM's Kconfig select list,
  as is done in David Woodhouse's original patchseries for ppc/x86.

 arch/arm/include/asm/swab.h |    9 +++++++++
 1 file changed, 9 insertions(+)

Comments

David Woodhouse Feb. 6, 2013, 9:02 a.m. UTC | #1
On Tue, 2013-02-05 at 21:04 -0600, Kim Phillips wrote:
> gcc -Os emits calls to __bswapsi2 on those platforms to save space
> because they don't have the single rev byte swap instruction.

Is that the right thing for GCC to do in that situation?

If so, perhaps we should be *providing* __bswap[sd]i2 functions for it
to use?

If not, perhaps there should be a PR filed?

Or is our use case justifiably different to the general case of '-Os'?
If so, why?
Kim Phillips Feb. 7, 2013, 1:19 a.m. UTC | #2
On Wed, 6 Feb 2013 09:02:04 +0000
"Woodhouse, David" <david.woodhouse@intel.com> wrote:

> On Tue, 2013-02-05 at 21:04 -0600, Kim Phillips wrote:
> > gcc -Os emits calls to __bswapsi2 on those platforms to save space
> > because they don't have the single rev byte swap instruction.
> 
> Is that the right thing for GCC to do in that situation?

if it saves space, why wouldn't it be?

"Many of these functions are only optimized in certain cases; if they
are not optimized in a particular case, a call to the library
function is emitted." [1]

I see "(arm_arch6 || !optimize_size)" in gcc's define_expand
"bswapsi2" source, so GCC considers size optimization as a
legitimate one of those cases.

> If so, perhaps we should be *providing* __bswap[sd]i2 functions for it
> to use?

either that, or link with libgcc - why does arch/arm64 do this and
arch/arm not?  It's not obvious from git log.

> If not, perhaps there should be a PR filed?
> 
> Or is our use case justifiably different to the general case of '-Os'?
> If so, why?

shouldn't be - a patch, such as this, that claims to reduce code
size, and that only turns on the new built-in when
CC_OPTIMIZE_FOR_SIZE is off, is generally not good :)

OTOH, the target here is armv6+ performance - not armv4,5 code
density - the OPTIMIZE_FOR_SIZE protection prevents armv4,5 build
breakage.

Kim

[1] http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#Other-Builtins
Will Newton Feb. 7, 2013, 10:19 a.m. UTC | #3
On Thu, Feb 7, 2013 at 1:19 AM, Kim Phillips <kim.phillips@freescale.com> wrote:
> On Wed, 6 Feb 2013 09:02:04 +0000
> "Woodhouse, David" <david.woodhouse@intel.com> wrote:
>
>> On Tue, 2013-02-05 at 21:04 -0600, Kim Phillips wrote:
>> > gcc -Os emits calls to __bswapsi2 on those platforms to save space
>> > because they don't have the single rev byte swap instruction.
>>
>> Is that the right thing for GCC to do in that situation?
>
> if it saves space, why wouldn't it be?
>
> "Many of these functions are only optimized in certain cases; if they
> are not optimized in a particular case, a call to the library
> function is emitted." [1]
>
> I see "(arm_arch6 || !optimize_size)" in gcc's define_expand
> "bswapsi2" source, so GCC considers size optimization as a
> legitimate one of those cases.
>
>> If so, perhaps we should be *providing* __bswap[sd]i2 functions for it
>> to use?
>
> either that, or link with libgcc - why does arch/arm64 do this and
> arch/arm not?  It's not obvious from git log.

One reason I have found, I don't know if it is the canonical one, is
that linking with libgcc allows people to use all intrinsics e.g. soft
float routines in the kernel without noticing it. If you limit the
intrinsics to the ones linked into the kernel explicitly then this
cannot happen.

I have also seen cases where the libgcc intrinsics are improved over
time, having the code in the kernel allows these improvements to be
rolled into the kernel even if the user has an older toolchain.

A number of ports link against libgcc and a roughly equal number do
not, so it isn't clear that there's any consensus on the issue.
Catalin Marinas Feb. 7, 2013, 10:43 a.m. UTC | #4
On 7 February 2013 10:19, Will Newton <will.newton@gmail.com> wrote:
> On Thu, Feb 7, 2013 at 1:19 AM, Kim Phillips <kim.phillips@freescale.com> wrote:
>> On Wed, 6 Feb 2013 09:02:04 +0000
>> "Woodhouse, David" <david.woodhouse@intel.com> wrote:
>>
>>> On Tue, 2013-02-05 at 21:04 -0600, Kim Phillips wrote:
>>> > gcc -Os emits calls to __bswapsi2 on those platforms to save space
>>> > because they don't have the single rev byte swap instruction.
>>>
>>> Is that the right thing for GCC to do in that situation?
>>
>> if it saves space, why wouldn't it be?
>>
>> "Many of these functions are only optimized in certain cases; if they
>> are not optimized in a particular case, a call to the library
>> function is emitted." [1]
>>
>> I see "(arm_arch6 || !optimize_size)" in gcc's define_expand
>> "bswapsi2" source, so GCC considers size optimization as a
>> legitimate one of those cases.
>>
>>> If so, perhaps we should be *providing* __bswap[sd]i2 functions for it
>>> to use?
>>
>> either that, or link with libgcc - why does arch/arm64 do this and
>> arch/arm not?  It's not obvious from git log.
>
> One reason I have found, I don't know if it is the canonical one, is
> that linking with libgcc allows people to use all intrinsics e.g. soft
> float routines in the kernel without noticing it. If you limit the
> intrinsics to the ones linked into the kernel explicitly then this
> cannot happen.

For arm64 we explicitly pass -mgeneral-regs-only to avoid any floating
point generation. Soft-float is excluded by the ABI automatically. But
we use other compiler intrinsics like __ffs and while they are
currently generated inline, you can't guarantee, hence the linking
with libgcc.

> I have also seen cases where the libgcc intrinsics are improved over
> time, having the code in the kernel allows these improvements to be
> rolled into the kernel even if the user has an older toolchain.

Indeed, the gcc guys do a lot benchmarking/optimisations on a wide
range of processors, so we can take advantage of that in the kernel.
But it's much easier on arm64 since the architecture is stable. On
32-bit arm we have to cope with a range of architecture versions with
variations to the instruction set.
Russell King - ARM Linux Feb. 7, 2013, 6:13 p.m. UTC | #5
On Wed, Feb 06, 2013 at 07:19:05PM -0600, Kim Phillips wrote:
> either that, or link with libgcc - why does arch/arm64 do this and
> arch/arm not?  It's not obvious from git log.

We want to be in control of what code the kernel runs, and that means
not using libgcc.  libgcc can do all sorts of stuff because it makes
assumptions about the environment which it is executing in (glibc).

For example, it assumes there may be a GOT, and that it can issue SWI
calls...

However, the biggest reason not to use libgcc is that we want to control
what gets used in the kernel - for example, no floating point, and no
use of 64 x 64bit division.

So all round, using libgcc in AArch32 would bring us a number of issues
that we don't get with the current approach.

AArch64 uses it because it's maintained entirely separately and the
decision has been made by other people - and some of these concerns do
not exist on a 64-bit architecture.
David Woodhouse Feb. 8, 2013, 5:25 p.m. UTC | #6
On Thu, 2013-02-07 at 18:13 +0000, Russell King - ARM Linux wrote:
> 
> However, the biggest reason not to use libgcc is that we want to control
> what gets used in the kernel - for example, no floating point, and no
> use of 64 x 64bit division.

Which is all very sensible. But there's no particular reason we couldn't
add a __bswap[sd]i2 to the kernel's version of libgcc if we wanted to.

If we're compiling with -Os  on platforms without 'rev' and that's the
thing that makes most sense, then we should make it work.
Nicolas Pitre Feb. 8, 2013, 8:04 p.m. UTC | #7
On Fri, 8 Feb 2013, Woodhouse, David wrote:

> On Thu, 2013-02-07 at 18:13 +0000, Russell King - ARM Linux wrote:
> > 
> > However, the biggest reason not to use libgcc is that we want to control
> > what gets used in the kernel - for example, no floating point, and no
> > use of 64 x 64bit division.
> 
> Which is all very sensible. But there's no particular reason we couldn't
> add a __bswap[sd]i2 to the kernel's version of libgcc if we wanted to.

Absolutely.


Nicolas
David Woodhouse Feb. 8, 2013, 10:40 p.m. UTC | #8
On Fri, 2013-02-08 at 15:04 -0500, Nicolas Pitre wrote:
> On Fri, 8 Feb 2013, Woodhouse, David wrote:
> 
> > On Thu, 2013-02-07 at 18:13 +0000, Russell King - ARM Linux wrote:
> > > 
> > > However, the biggest reason not to use libgcc is that we want to control
> > > what gets used in the kernel - for example, no floating point, and no
> > > use of 64 x 64bit division.
> > 
> > Which is all very sensible. But there's no particular reason we couldn't
> > add a __bswap[sd]i2 to the kernel's version of libgcc if we wanted to.
> 
> Absolutely.

And then ARM can just set ARCH_USE_BUILTIN_BSWAP like other
architectures do, right?
Nicolas Pitre Feb. 8, 2013, 10:47 p.m. UTC | #9
On Fri, 8 Feb 2013, Woodhouse, David wrote:

> On Fri, 2013-02-08 at 15:04 -0500, Nicolas Pitre wrote:
> > On Fri, 8 Feb 2013, Woodhouse, David wrote:
> > 
> > > On Thu, 2013-02-07 at 18:13 +0000, Russell King - ARM Linux wrote:
> > > > 
> > > > However, the biggest reason not to use libgcc is that we want to control
> > > > what gets used in the kernel - for example, no floating point, and no
> > > > use of 64 x 64bit division.
> > > 
> > > Which is all very sensible. But there's no particular reason we couldn't
> > > add a __bswap[sd]i2 to the kernel's version of libgcc if we wanted to.
> > 
> > Absolutely.
> 
> And then ARM can just set ARCH_USE_BUILTIN_BSWAP like other
> architectures do, right?

If that turns out to be beneficial over what we have now, then yes.
I didn't read back the whole thread to form an opinion though.


Nicolas
diff mbox

Patch

diff --git a/arch/arm/include/asm/swab.h b/arch/arm/include/asm/swab.h
index 537fc9b..159ab16 100644
--- a/arch/arm/include/asm/swab.h
+++ b/arch/arm/include/asm/swab.h
@@ -35,4 +35,13 @@  static inline __attribute_const__ __u32 __arch_swab32(__u32 x)
 #define __arch_swab32 __arch_swab32
 
 #endif
+
+#if !defined(CONFIG_CC_OPTIMIZE_FOR_SIZE) && GCC_VERSION >= 40500
+#define __HAVE_BUILTIN_BSWAP32__
+#define __HAVE_BUILTIN_BSWAP64__
+#if GCC_VERSION >= 40800
+#define __HAVE_BUILTIN_BSWAP16__
+#endif
+#endif
+
 #endif