diff mbox series

arm: lib: implement aeabi_uldivmod via div64_u64_rem

Message ID 20220716001616.4052225-1-ndesaulniers@google.com (mailing list archive)
State New, archived
Headers show
Series arm: lib: implement aeabi_uldivmod via div64_u64_rem | expand

Commit Message

Nick Desaulniers July 16, 2022, 12:16 a.m. UTC
Compilers frequently need to defer 64b division to a libcall with this
symbol name. It essentially is div64_u64_rem, just with a different
signature. Kernel developers know to call div64_u64_rem, but compilers
don't.

Link: https://lore.kernel.org/lkml/20220524004156.0000790e@garyguo.net/
Suggested-by: Gary Guo <gary@garyguo.net>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
---
 arch/arm/lib/Makefile         |  1 +
 arch/arm/lib/aeabi_uldivmod.c | 23 +++++++++++++++++++++++
 2 files changed, 24 insertions(+)
 create mode 100644 arch/arm/lib/aeabi_uldivmod.c

Comments

Arnd Bergmann July 16, 2022, 9:46 a.m. UTC | #1
On Sat, Jul 16, 2022 at 2:16 AM Nick Desaulniers
<ndesaulniers@google.com> wrote:
>
> Compilers frequently need to defer 64b division to a libcall with this
> symbol name. It essentially is div64_u64_rem, just with a different
> signature. Kernel developers know to call div64_u64_rem, but compilers
> don't.
>
> Link: https://lore.kernel.org/lkml/20220524004156.0000790e@garyguo.net/
> Suggested-by: Gary Guo <gary@garyguo.net>
> Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

This has historically been strongly NAK'd, and I don't think that position
has changed in the meantime. A variable-argument 64-bit division is
really expensive, especially on 32-bit machines that lack a native
32-bit division instruction, and we don't want developers to accidentally
insert one in their driver code.

Explicitly calling one of the division helpers in linux/math64.h is the
established way for driver writers to declare that a particular division
cannot be turned into a cheaper operation and is never run in a
performance critical code path. The compiler of course cannot know
about either of those.

        Arnd
kernel test robot July 18, 2022, 4:50 p.m. UTC | #2
Hi Nick,

I love your patch! Perhaps something to improve:

[auto build test WARNING on soc/for-next]
[also build test WARNING on arm/for-next linus/master v5.19-rc7 next-20220718]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nick-Desaulniers/arm-lib-implement-aeabi_uldivmod-via-div64_u64_rem/20220716-231205
base:   https://git.kernel.org/pub/scm/linux/kernel/git/soc/soc.git for-next
config: arm-ixp4xx_defconfig (https://download.01.org/0day-ci/archive/20220719/202207190057.1gn0emAr-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project d74b88c69dc2644bd0dc5d64e2d7413a0d4040e5)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/intel-lab-lkp/linux/commit/f0441d7a00899e433705accb0da58c94f4ff8808
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Nick-Desaulniers/arm-lib-implement-aeabi_uldivmod-via-div64_u64_rem/20220716-231205
        git checkout f0441d7a00899e433705accb0da58c94f4ff8808
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> arch/arm/lib/aeabi_uldivmod.c:17:15: warning: no previous prototype for function '__aeabi_uldivmod' [-Wmissing-prototypes]
   struct result __aeabi_uldivmod(u64 numerator, u64 denominator)
                 ^
   arch/arm/lib/aeabi_uldivmod.c:17:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   struct result __aeabi_uldivmod(u64 numerator, u64 denominator)
   ^
   static 
   1 warning generated.


vim +/__aeabi_uldivmod +17 arch/arm/lib/aeabi_uldivmod.c

    16	
  > 17	struct result __aeabi_uldivmod(u64 numerator, u64 denominator)
Nick Desaulniers Oct. 10, 2022, 9:23 p.m. UTC | #3
On Sat, Jul 16, 2022 at 2:47 AM Arnd Bergmann <arnd@kernel.org> wrote:
>
> On Sat, Jul 16, 2022 at 2:16 AM Nick Desaulniers
> <ndesaulniers@google.com> wrote:
> >
> > Compilers frequently need to defer 64b division to a libcall with this
> > symbol name. It essentially is div64_u64_rem, just with a different
> > signature. Kernel developers know to call div64_u64_rem, but compilers
> > don't.
> >
> > Link: https://lore.kernel.org/lkml/20220524004156.0000790e@garyguo.net/
> > Suggested-by: Gary Guo <gary@garyguo.net>
> > Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>

So the existing division by constant issues went away, and Craig was
able to improve division by double-word constants in LLVM
1. https://reviews.llvm.org/D130862
2. https://reviews.llvm.org/D135541
But we still have one instance left that's not div/rem by constant via
CONFIG_FPE_NWFPE=y that's now blocking Android's compiler upgrade.
https://github.com/ClangBuiltLinux/linux/issues/1666

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/nwfpe/softfloat.c#n2312
Any creative ideas on how to avoid this? Perhaps putting the `aSig -=
bSig;` in inline asm? Inserting a `barrier()` or empty asm statement
into the loops also seems to work.

Otherwise I'd define __aeabi_uldivmod as static in
arch/arm/nwfpe/softfloat.c (with the body from this patch) only for
clang.

I see this function seems to be based on Berkeley Softfloat
http://www.jhauser.us/arithmetic/SoftFloat.html
v2.  v3 looks like a total rewrite.  Looking at v3e, it looks like
float64_rem() is now called f64_rem() and defined in f64_rem.c.  It
doesn't look like there's anything from v3 that we could backport to
the kernel's v2 to avoid this.

Otherwise perhaps we just disable OABI_COMPAT for clang. Quite a few
defconfigs explicitly enable FPE_NWFPE though.  Are there really a lot
of OABI binaries still in use?

There's also the hidden llvm flag:
`-mllvm -replexitval=never` that seems to work here, though FWICT it's
disabling 3 such loop elisions (I think all three statements in that
do while).  That's probably the best way forward here...

https://reviews.llvm.org/D9800 made the decision to do such a
transformation when a loop can be fully elided ("deleted").

>
> This has historically been strongly NAK'd, and I don't think that position
> has changed in the meantime. A variable-argument 64-bit division is
> really expensive, especially on 32-bit machines that lack a native
> 32-bit division instruction, and we don't want developers to accidentally
> insert one in their driver code.
>
> Explicitly calling one of the division helpers in linux/math64.h is the
> established way for driver writers to declare that a particular division
> cannot be turned into a cheaper operation and is never run in a
> performance critical code path. The compiler of course cannot know
> about either of those.
>
>         Arnd
Arnd Bergmann Oct. 10, 2022, 10:13 p.m. UTC | #4
On Mon, Oct 10, 2022, at 11:23 PM, Nick Desaulniers wrote:
> On Sat, Jul 16, 2022 at 2:47 AM Arnd Bergmann <arnd@kernel.org> wrote:
>> On Sat, Jul 16, 2022 at 2:16 AM Nick Desaulniers <ndesaulniers@google.com> wrote:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/nwfpe/softfloat.c#n2312
> Any creative ideas on how to avoid this? Perhaps putting the `aSig -=
> bSig;` in inline asm? Inserting a `barrier()` or empty asm statement
> into the loops also seems to work.

I was going to suggest adding a barrier() as well, should have
read on first ;-)

> I see this function seems to be based on Berkeley Softfloat
> http://www.jhauser.us/arithmetic/SoftFloat.html
> v2.  v3 looks like a total rewrite.  Looking at v3e, it looks like
> float64_rem() is now called f64_rem() and defined in f64_rem.c.  It
> doesn't look like there's anything from v3 that we could backport to
> the kernel's v2 to avoid this.
>
> Otherwise perhaps we just disable OABI_COMPAT for clang. Quite a few
> defconfigs explicitly enable FPE_NWFPE though.  Are there really a lot
> of OABI binaries still in use?

I'd do the minimal thing here, neither NWFPE nor OABI_COMPAT should
be too relevant here. Russell still uses some machines with old OABI
binaries, but I have not heard from anyone else using those in a
very long time.

Note that while gcc can still produce kernels that are OABI binaries,
gcc-4.6 marked the arm-linux-gnu target as obsolete and gcc-4.8
removed it, so any existing binaries were built with toolchains
predating that.

     Arnd
Nick Desaulniers Oct. 10, 2022, 10:34 p.m. UTC | #5
On Mon, Oct 10, 2022 at 3:14 PM Arnd Bergmann <arnd@kernel.org> wrote:
>
> On Mon, Oct 10, 2022, at 11:23 PM, Nick Desaulniers wrote:
> > On Sat, Jul 16, 2022 at 2:47 AM Arnd Bergmann <arnd@kernel.org> wrote:
> >> On Sat, Jul 16, 2022 at 2:16 AM Nick Desaulniers <ndesaulniers@google.com> wrote:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/nwfpe/softfloat.c#n2312
> > Any creative ideas on how to avoid this? Perhaps putting the `aSig -=
> > bSig;` in inline asm? Inserting a `barrier()` or empty asm statement
> > into the loops also seems to work.
>
> I was going to suggest adding a barrier() as well, should have
> read on first ;-)

barrier() forces reloads+spills in the loop.  The output with `-mllvm
-replexitval=never` is optimal (assuming the loop is faster than
__aeabi_uldivmod (which I think is unprovable).
https://godbolt.org/z/7dMabYYcM

As much I hate relying on compiler-internal flags, I think this is optimal:
```
diff --git a/arch/arm/nwfpe/Makefile b/arch/arm/nwfpe/Makefile
index 303400fa2cdf..2aec85ab1e8b 100644
--- a/arch/arm/nwfpe/Makefile
+++ b/arch/arm/nwfpe/Makefile
@@ -11,3 +11,9 @@ nwfpe-y                               += fpa11.o
fpa11_cpdo.o fpa11_cpdt.o \
                                   entry.o

 nwfpe-$(CONFIG_FPE_NWFPE_XP)   += extended_cpdo.o
+
+# Try really hard to avoid generating calls to __aeabi_uldivmod() from
+# float64_rem() due to loop elision.
+ifdef CONFIG_CC_IS_CLANG
+CFLAGS_softfloat.o     += -mllvm -replexitval=never
+endif
```

Part of me is tempted to move float64_rem() to its own file for that
flag, but indvars+loop-utils isn't eliding other loops in that file
(comparing the full disassembly before+after the above diff).

Long term, it might be nice for us to have `--rtlib` recognize
`--rtlib=linux-kernel@version` or something so that we could better
describe the effective compiler runtime to the compiler.  There are
already differences in compiler-rt and libgcc where we could make
better codegen decisions if we were to consider the target rtlib.
These libraries also change over time though...
diff mbox series

Patch

diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
index 6d2ba454f25b..3fa273219312 100644
--- a/arch/arm/lib/Makefile
+++ b/arch/arm/lib/Makefile
@@ -29,6 +29,7 @@  endif
 obj-$(CONFIG_UACCESS_WITH_MEMCPY) += uaccess_with_memcpy.o
 
 lib-$(CONFIG_MMU) += $(mmu-y)
+lib-$(CONFIG_AEABI) += aeabi_uldivmod.o
 
 ifeq ($(CONFIG_CPU_32v3),y)
   lib-y	+= io-readsw-armv3.o io-writesw-armv3.o
diff --git a/arch/arm/lib/aeabi_uldivmod.c b/arch/arm/lib/aeabi_uldivmod.c
new file mode 100644
index 000000000000..310427893195
--- /dev/null
+++ b/arch/arm/lib/aeabi_uldivmod.c
@@ -0,0 +1,23 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Unsigned 64b division with remainder, as is typically provided by libgcc or
+ * compiler-rt.
+ *
+ * Copyright (C) 2023 Google, LLC.
+ *
+ * Author: Nick Desaulniers <ndesaulniers@google.com>
+ */
+
+#include <linux/math64.h>
+
+struct result {
+	u64 quot, rem;
+};
+
+struct result __aeabi_uldivmod(u64 numerator, u64 denominator)
+{
+	struct result res;
+
+	res.quot = div64_u64_rem(numerator, denominator, &res.rem);
+	return res;
+}