mbox series

[0/4] Reinstate and improve MIPS `do_div' implementation

Message ID alpine.DEB.2.21.2104200044060.44318@angie.orcam.me.uk (mailing list archive)
Headers show
Series Reinstate and improve MIPS `do_div' implementation | expand

Message

Maciej W. Rozycki April 20, 2021, 2:50 a.m. UTC
Hi,

 As Huacai has recently discovered the MIPS backend for `do_div' has been 
broken and inadvertently disabled with commit c21004cd5b4c ("MIPS: Rewrite 
<asm/div64.h> to work with gcc 4.4.0.").  As it is code I have originally 
written myself and Huacai had issues bringing it back to life leading to a 
request to discard it even I have decided to step in.

 In the end I have fixed the code and measured its performance to be ~100% 
better on average than our generic code.  I have decided it would be worth 
having the test module I have prepared for correctness evaluation as well 
as benchmarking, so I have included it with the series, also so that I can 
refer to the results easily.

 In the end I have included four patches on this occasion: 1/4 is the test 
module, 2/4 is an inline documentation fix/clarification for the `do_div' 
wrapper, 3/4 enables the MIPS `__div64_32' backend and 4/4 adds a small 
performance improvement to it.

 I have investigated a fifth change as a potential improvement where I 
replaced the call to `do_div64_32' with a DIVU instruction for cases where 
the high part of the intermediate divident is zero, but it has turned out 
to regress performance a little, so I have discarded it.

 Also a follow-up change might be worth having to reduce the code size and 
place `__div64_32' out of line for CC_OPTIMIZE_FOR_SIZE configurations, 
but I have not fully prepared such a change at this time.  I did use the 
WIP form I have for performance evaluation however; see the figures quoted 
with 4/4.

 These changes have been verified with a DECstation system with an R3400 
MIPS I processor @40MHz and a MTI Malta system with a 5Kc MIPS64 processor 
@160MHz.

 See individual change descriptions and any additional discussions for
further details.

 Questions, comments or concerns?  Otherwise please apply.

  Maciej

Comments

Thomas Bogendoerfer April 21, 2021, 12:01 p.m. UTC | #1
On Tue, Apr 20, 2021 at 04:50:22AM +0200, Maciej W. Rozycki wrote:
> Hi,
> 
>  As Huacai has recently discovered the MIPS backend for `do_div' has been 
> broken and inadvertently disabled with commit c21004cd5b4c ("MIPS: Rewrite 
> <asm/div64.h> to work with gcc 4.4.0.").  As it is code I have originally 
> written myself and Huacai had issues bringing it back to life leading to a 
> request to discard it even I have decided to step in.
> 
>  In the end I have fixed the code and measured its performance to be ~100% 
> better on average than our generic code.  I have decided it would be worth 
> having the test module I have prepared for correctness evaluation as well 
> as benchmarking, so I have included it with the series, also so that I can 
> refer to the results easily.
> 
>  In the end I have included four patches on this occasion: 1/4 is the test 
> module, 2/4 is an inline documentation fix/clarification for the `do_div' 
> wrapper, 3/4 enables the MIPS `__div64_32' backend and 4/4 adds a small 
> performance improvement to it.
> 
>  I have investigated a fifth change as a potential improvement where I 
> replaced the call to `do_div64_32' with a DIVU instruction for cases where 
> the high part of the intermediate divident is zero, but it has turned out 
> to regress performance a little, so I have discarded it.
> 
>  Also a follow-up change might be worth having to reduce the code size and 
> place `__div64_32' out of line for CC_OPTIMIZE_FOR_SIZE configurations, 
> but I have not fully prepared such a change at this time.  I did use the 
> WIP form I have for performance evaluation however; see the figures quoted 
> with 4/4.
> 
>  These changes have been verified with a DECstation system with an R3400 
> MIPS I processor @40MHz and a MTI Malta system with a 5Kc MIPS64 processor 
> @160MHz.
> 
>  See individual change descriptions and any additional discussions for
> further details.
> 
>  Questions, comments or concerns?  Otherwise please apply.

series applied to mips-next.

Thomas.
Maciej W. Rozycki April 21, 2021, 1:12 p.m. UTC | #2
On Wed, 21 Apr 2021, Thomas Bogendoerfer wrote:

> series applied to mips-next.

 Thanks!  I was about to send v2 with a small fix for an issue discovered 
with message formatting in the failure path of the test_div64 module, 
which was supposed to align the expected result with one actually produced 
for easier visual comparison.  Somehow I failed to realise that format 
specifiers will expand to text of a different length and therefore source 
string alignment won't do.

 I'll send an incremental change then.

  Maciej
H. Nikolaus Schaller April 21, 2021, 4 p.m. UTC | #3
Hi,

> Am 20.04.2021 um 04:50 schrieb Maciej W. Rozycki <macro@orcam.me.uk>:
> 
> Hi,
> 
> As Huacai has recently discovered the MIPS backend for `do_div' has been 
> broken and inadvertently disabled with commit c21004cd5b4c ("MIPS: Rewrite 
> <asm/div64.h> to work with gcc 4.4.0.").  As it is code I have originally 
> written myself and Huacai had issues bringing it back to life leading to a 
> request to discard it even I have decided to step in.
> 
> In the end I have fixed the code and measured its performance to be ~100% 
> better on average than our generic code.

That would be good.

>  I have decided it would be worth 
> having the test module I have prepared for correctness evaluation as well 
> as benchmarking, so I have included it with the series, also so that I can 
> refer to the results easily.
> 
> In the end I have included four patches on this occasion: 1/4 is the test 
> module, 2/4 is an inline documentation fix/clarification for the `do_div' 
> wrapper, 3/4 enables the MIPS `__div64_32' backend and 4/4 adds a small 
> performance improvement to it.

How can I apply them to the kernel? There is something wrong which makes
git am fail.

> 
> I have investigated a fifth change as a potential improvement where I 
> replaced the call to `do_div64_32' with a DIVU instruction for cases where 
> the high part of the intermediate divident is zero, but it has turned out 
> to regress performance a little, so I have discarded it.
> 
> Also a follow-up change might be worth having to reduce the code size and 
> place `__div64_32' out of line for CC_OPTIMIZE_FOR_SIZE configurations, 
> but I have not fully prepared such a change at this time.  I did use the 
> WIP form I have for performance evaluation however; see the figures quoted 
> with 4/4.
> 
> These changes have been verified with a DECstation system with an R3400 
> MIPS I processor @40MHz and a MTI Malta system with a 5Kc MIPS64 processor 
> @160MHz.

I'd like to test on ~320 MHz JZ4730.

> 
> See individual change descriptions and any additional discussions for
> further details.
> 
> Questions, comments or concerns?  Otherwise please apply.
> 
>  Maciej

BR and thanks,
Nikolaus Schaller
Maciej W. Rozycki April 21, 2021, 7:04 p.m. UTC | #4
On Wed, 21 Apr 2021, H. Nikolaus Schaller wrote:

> > In the end I have included four patches on this occasion: 1/4 is the test 
> > module, 2/4 is an inline documentation fix/clarification for the `do_div' 
> > wrapper, 3/4 enables the MIPS `__div64_32' backend and 4/4 adds a small 
> > performance improvement to it.
> 
> How can I apply them to the kernel? There is something wrong which makes
> git am fail.

 I don't know.  The changes were made against vanilla 5.12-rc7, but then 
the pieces affected have not changed for ages.  FWIW I can `git am' the 
series as received back just fine.

> > These changes have been verified with a DECstation system with an R3400 
> > MIPS I processor @40MHz and a MTI Malta system with a 5Kc MIPS64 processor 
> > @160MHz.
> 
> I'd like to test on ~320 MHz JZ4730.

 Sure, I'd love to hear how this code performs with other implementations.  

  Maciej
H. Nikolaus Schaller April 22, 2021, 5:53 a.m. UTC | #5
Hi,

> Am 21.04.2021 um 21:04 schrieb Maciej W. Rozycki <macro@orcam.me.uk>:
> 
> On Wed, 21 Apr 2021, H. Nikolaus Schaller wrote:
> 
>>> In the end I have included four patches on this occasion: 1/4 is the test 
>>> module, 2/4 is an inline documentation fix/clarification for the `do_div' 
>>> wrapper, 3/4 enables the MIPS `__div64_32' backend and 4/4 adds a small 
>>> performance improvement to it.
>> 
>> How can I apply them to the kernel? There is something wrong which makes
>> git am fail.
> 
> I don't know.  The changes were made against vanilla 5.12-rc7, but then 
> the pieces affected have not changed for ages.  FWIW I can `git am' the 
> series as received back just fine.

Please can you point me to some download/pull/gitweb? It seems as if the series
also did not appear at https://patchwork.kernel.org/project/linux-mips/list/

> 
>>> These changes have been verified with a DECstation system with an R3400 
>>> MIPS I processor @40MHz and a MTI Malta system with a 5Kc MIPS64 processor 
>>> @160MHz.
>> 
>> I'd like to test on ~320 MHz JZ4730.
> 
> Sure, I'd love to hear how this code performs with other implementations.  
> 
>  Maciej

BR and thanks,
Nikolaus Schaller
Jiaxun Yang April 22, 2021, 1:39 p.m. UTC | #6
On Thu, Apr 22, 2021, at 1:53 PM, H. Nikolaus Schaller wrote:
> Hi,
> 
> > Am 21.04.2021 um 21:04 schrieb Maciej W. Rozycki <macro@orcam.me.uk>:
> > 
> > On Wed, 21 Apr 2021, H. Nikolaus Schaller wrote:
> > 
> >>> In the end I have included four patches on this occasion: 1/4 is the test 
> >>> module, 2/4 is an inline documentation fix/clarification for the `do_div' 
> >>> wrapper, 3/4 enables the MIPS `__div64_32' backend and 4/4 adds a small 
> >>> performance improvement to it.
> >> 
> >> How can I apply them to the kernel? There is something wrong which makes
> >> git am fail.
> > 
> > I don't know.  The changes were made against vanilla 5.12-rc7, but then 
> > the pieces affected have not changed for ages.  FWIW I can `git am' the 
> > series as received back just fine.
> 
> Please can you point me to some download/pull/gitweb? It seems as if the series
> also did not appear at https://patchwork.kernel.org/project/linux-mips/list/
> 

You may try download raw from:

https://lore.kernel.org/linux-mips/E6326E8A-50DA-4F81-9865-F29EE0E298A9@goldelico.com/T/#t
Maciej W. Rozycki April 22, 2021, 3:58 p.m. UTC | #7
On Thu, 22 Apr 2021, Jiaxun Yang wrote:

> > Please can you point me to some download/pull/gitweb? It seems as if the series
> > also did not appear at https://patchwork.kernel.org/project/linux-mips/list/
> > 
> 
> You may try download raw from:
> 
> https://lore.kernel.org/linux-mips/E6326E8A-50DA-4F81-9865-F29EE0E298A9@goldelico.com/T/#t

 Or you can cherry-pick the commits directly from mips-next: 
<git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux.git>.

  Maciej
H. Nikolaus Schaller April 22, 2021, 4 p.m. UTC | #8
Hi,

> Am 22.04.2021 um 15:39 schrieb Jiaxun Yang <jiaxun.yang@flygoat.com>:
> 
> 
> 
> On Thu, Apr 22, 2021, at 1:53 PM, H. Nikolaus Schaller wrote:
>> Hi,
>> 
>>> Am 21.04.2021 um 21:04 schrieb Maciej W. Rozycki <macro@orcam.me.uk>:
>>> 
>>> On Wed, 21 Apr 2021, H. Nikolaus Schaller wrote:
>>> 
>>>>> In the end I have included four patches on this occasion: 1/4 is the test 
>>>>> module, 2/4 is an inline documentation fix/clarification for the `do_div' 
>>>>> wrapper, 3/4 enables the MIPS `__div64_32' backend and 4/4 adds a small 
>>>>> performance improvement to it.
>>>> 
>>>> How can I apply them to the kernel? There is something wrong which makes
>>>> git am fail.
>>> 
>>> I don't know.  The changes were made against vanilla 5.12-rc7, but then 
>>> the pieces affected have not changed for ages.  FWIW I can `git am' the 
>>> series as received back just fine.
>> 
>> Please can you point me to some download/pull/gitweb? It seems as if the series
>> also did not appear at https://patchwork.kernel.org/project/linux-mips/list/
>> 
> 
> You may try download raw from:
> 
> https://lore.kernel.org/linux-mips/E6326E8A-50DA-4F81-9865-F29EE0E298A9@goldelico.com/T/#t

I simply tried again and it seems that I had tried to git am it on top of the
patches from Huacai which of course fails.

Now I could run the tests:

from [PATCH 4/4]:

> This has passed correctness verification with test_div64 and reduced the
> module's average execution time down to 1.0445s and 0.2619s from 1.0668s
> and 0.2629s respectively for an R3400 CPU @40MHz and a 5Kc CPU @160MHz.

test only [PATCH 1/4 and 2/4]:

[  256.301140] test_div64: Completed 64bit/32bit division and modulo test, 0.291154944s elapsed

+ [PATCH 3/4]

[ 1698.698920] test_div64: Completed 64bit/32bit division and modulo test, 0.132142865s elapsed

+ [PATCH 4/4]

[  466.818349] test_div64: Completed 64bit/32bit division and modulo test, 0.134429075s elapsed

So the new code is indeed faster than the default implementation.
[PATCH 4/4] has no significant influence (wouldn't say it is slower because timer resolution
isn't very high on this machine and the kernel has some scheduling issue [1]).

Anyways the JZ4730 can boot and works with these patches included.

BR and thanks,
Nikolaus Schaller


[1] we are preparing full support for the JZ4730 based Skytone Alpha machine. Most features
are working except sound/I2S. I2C is a little unreliable and Ethernet has hickups. And scheduling
which indicates some fundamental IRQ or timer issue we could not yet identify.

So if anyone happens to have such a machine and can help debugging, please contact me.
Maciej W. Rozycki April 22, 2021, 4:55 p.m. UTC | #9
On Thu, 22 Apr 2021, H. Nikolaus Schaller wrote:

> > This has passed correctness verification with test_div64 and reduced the
> > module's average execution time down to 1.0445s and 0.2619s from 1.0668s
> > and 0.2629s respectively for an R3400 CPU @40MHz and a 5Kc CPU @160MHz.
> 
> test only [PATCH 1/4 and 2/4]:
> 
> [  256.301140] test_div64: Completed 64bit/32bit division and modulo test, 0.291154944s elapsed
> 
> + [PATCH 3/4]
> 
> [ 1698.698920] test_div64: Completed 64bit/32bit division and modulo test, 0.132142865s elapsed
> 
> + [PATCH 4/4]
> 
> [  466.818349] test_div64: Completed 64bit/32bit division and modulo test, 0.134429075s elapsed
> 
> So the new code is indeed faster than the default implementation.
> [PATCH 4/4] has no significant influence (wouldn't say it is slower because timer resolution
> isn't very high on this machine and the kernel has some scheduling issue [1]).

 Have you used it as a module or at bootstrap?  I have noticed that at 
bootstrap the initialisation of the random number generator sometimes 
interferes with the benchmark, which happens when there's an intervening 
message produced, e.g.:

test_div64: Starting 64bit/32bit division and modulo test
random: fast init done
test_div64: Completed 64bit/32bit division and modulo test, 1.069906272s elapsed

I think it can be worked around by configuration changes so that more 
stuff is run between the RNG and the test module, but instead I have 
simply inserted:

	mdelay(5000);

at the beginning of `test_div64_init' instead, as for historical reasons I 
haven't got the systems involved set up for modules (beyond Linux 2.4) at 
this time.

 NB I have run the benchmark five times with each change and system and 
with the RNG taken out of the picture results were very stable as any 
fluctuation only started at the fifth decimal digit.  Both the DECstation 
(the model I used anyway) and the Malta have a high-resolution clock 
source though, the I/O ASIC free-running counter register at 25MHz (used 
by David L. Mills, the original author of the NTP suite, for his reference 
implementation) and the CP0 Count register at 80MHz respectively.

 I would expect your JZ4730 device to have the CP0 Count register as well, 
as it has been architectural ever since MIPS III really, or is your system 
SMP with CP0 Count registers out of sync across CPUs due to sleep modes or 
whatever?

 Thanks for sharing your figures.

> [1] we are preparing full support for the JZ4730 based Skytone Alpha machine. Most features
> are working except sound/I2S. I2C is a little unreliable and Ethernet has hickups. And scheduling
> which indicates some fundamental IRQ or timer issue we could not yet identify.

 Good luck with that!

  Maciej
H. Nikolaus Schaller April 22, 2021, 5:06 p.m. UTC | #10
> Am 22.04.2021 um 18:55 schrieb Maciej W. Rozycki <macro@orcam.me.uk>:
> 
> 
> Have you used it as a module or at bootstrap?

I did load it by insmod.

> I would expect your JZ4730 device to have the CP0 Count register as well, 
> as it has been architectural ever since MIPS III really, or is your system 
> SMP with CP0 Count registers out of sync across CPUs due to sleep modes or 
> whatever?

It switches clocksource to some operating system timers on the SoC which
may have an influence on the resolution (or precision).

> Thanks for sharing your figures.

It was a pleasure towards better MIPS support...

> 
>> [1] we are preparing full support for the JZ4730 based Skytone Alpha machine. Most features
>> are working except sound/I2S. I2C is a little unreliable and Ethernet has hickups. And scheduling
>> which indicates some fundamental IRQ or timer issue we could not yet identify.
> 
> Good luck with that!

BR and thanks,
Niklaus