mbox series

[RFC,v4,00/11] powerpc: switch VDSO to C implementation.

Message ID cover.1579196675.git.christophe.leroy@c-s.fr (mailing list archive)
Headers show
Series powerpc: switch VDSO to C implementation. | expand

Message

Christophe Leroy Jan. 16, 2020, 5:58 p.m. UTC
This is a fourth tentative to switch powerpc VDSO to generic C implementation.

This version should work on PPC64 (untested). VDSO32 for PPC64 is
impossible to build and has been de-activated, because the powerpc
ASM header files for C are not prepared to build 32 bits code with CONFIG_PPC64.

powerpc is a bit special for VDSO as well as system calls in the
way that it requires setting CR SO bit which cannot be done in C.
Therefore, entry/exit and fallback need to be performed in ASM.

Note that on previous patches, a buggy version of vdsotest was used which was
underestimating the time in gettimeofday compared to clock-get... functions.

This series applies on a merge of powerpc/merge and tip/timers/core branches,
series "lib/vdso: Bugfix and consolidation"
(https://lore.kernel.org/patchwork/project/lkml/list/?series=425784)
applied after the above merge.

On a powerpc8xx, with current powerpc/32 ASM VDSO:

gettimeofday:    vdso: 907 nsec/call
clock-getres-realtime:    vdso: 484 nsec/call
clock-gettime-realtime:    vdso: 899 nsec/call

The first patch adds VDSO generic C support without any changes to common code.
Performance is as follows:

gettimeofday:    vdso: 1211 nsec/call
clock-getres-realtime:    vdso: 722 nsec/call
clock-gettime-realtime:    vdso: 1216 nsec/call

Then a few changes in the common code have allowed performance improvement. At
the end of the series we have:

gettimeofday:    vdso: 974 nsec/call
clock-getres-realtime:    vdso: 545 nsec/call
clock-gettime-realtime:    vdso: 941 nsec/call

The final result is rather close to pure ASM VDSO:
* 7% more on gettimeofday (9 cycles)
* 5% more on clock-gettime-realtime (6 cycles)
* 12% more on clock-getres-realtime (8 cycles)

Due to the unavoidable ASM trampoline, we won't get much closer but that should be
acceptable for a port from ASM to a generic C code (here, 1 cycle is about 7,5 ns)

Christophe Leroy (11):
  powerpc/64: Don't provide time functions in compat VDSO32
  powerpc/vdso: Switch VDSO to generic C implementation.
  lib: vdso: only read hrtimer_res when needed in __cvdso_clock_getres()
  powerpc/vdso: simplify __get_datapage()
  lib: vdso: allow arches to provide vdso data pointer
  powerpc/vdso: provide inline alternative to __get_datapage()
  powerpc/vdso: provide vdso data pointer from the ASM caller.
  lib: vdso: allow fixed clock mode
  powerpc/vdso: override __arch_vdso_capable()
  lib: vdso: Allow arches to override the ns shift operation
  powerpc/32: provide vdso_shift_ns()

 arch/powerpc/Kconfig                         |   2 +
 arch/powerpc/include/asm/clocksource.h       |   6 +
 arch/powerpc/include/asm/vdso/gettimeofday.h | 117 ++++++++++++
 arch/powerpc/include/asm/vdso/vsyscall.h     |  25 +++
 arch/powerpc/include/asm/vdso_datapage.h     |  52 +++---
 arch/powerpc/kernel/asm-offsets.c            |  46 +----
 arch/powerpc/kernel/time.c                   |  91 +---------
 arch/powerpc/kernel/vdso.c                   |  58 ++----
 arch/powerpc/kernel/vdso32/Makefile          |  27 ++-
 arch/powerpc/kernel/vdso32/datapage.S        |  10 +-
 arch/powerpc/kernel/vdso32/gettimeofday.S    | 258 ++++-----------------------
 arch/powerpc/kernel/vdso32/vdso32.lds.S      |   9 +-
 arch/powerpc/kernel/vdso32/vgettimeofday.c   |  29 +++
 arch/powerpc/kernel/vdso64/Makefile          |  23 ++-
 arch/powerpc/kernel/vdso64/datapage.S        |  13 +-
 arch/powerpc/kernel/vdso64/gettimeofday.S    | 257 ++++----------------------
 arch/powerpc/kernel/vdso64/vdso64.lds.S      |   7 +-
 arch/powerpc/kernel/vdso64/vgettimeofday.c   |  29 +++
 lib/vdso/gettimeofday.c                      | 107 ++++++++---
 19 files changed, 457 insertions(+), 709 deletions(-)
 create mode 100644 arch/powerpc/include/asm/clocksource.h
 create mode 100644 arch/powerpc/include/asm/vdso/gettimeofday.h
 create mode 100644 arch/powerpc/include/asm/vdso/vsyscall.h
 create mode 100644 arch/powerpc/kernel/vdso32/vgettimeofday.c
 create mode 100644 arch/powerpc/kernel/vdso64/vgettimeofday.c

Comments

Segher Boessenkool Jan. 17, 2020, 8:58 a.m. UTC | #1
Hi!

On Thu, Jan 16, 2020 at 05:58:24PM +0000, Christophe Leroy wrote:
> On a powerpc8xx, with current powerpc/32 ASM VDSO:
> 
> gettimeofday:    vdso: 907 nsec/call
> clock-getres-realtime:    vdso: 484 nsec/call
> clock-gettime-realtime:    vdso: 899 nsec/call
> 
> The first patch adds VDSO generic C support without any changes to common code.
> Performance is as follows:
> 
> gettimeofday:    vdso: 1211 nsec/call
> clock-getres-realtime:    vdso: 722 nsec/call
> clock-gettime-realtime:    vdso: 1216 nsec/call
> 
> Then a few changes in the common code have allowed performance improvement. At
> the end of the series we have:
> 
> gettimeofday:    vdso: 974 nsec/call
> clock-getres-realtime:    vdso: 545 nsec/call
> clock-gettime-realtime:    vdso: 941 nsec/call
> 
> The final result is rather close to pure ASM VDSO:
> * 7% more on gettimeofday (9 cycles)
> * 5% more on clock-gettime-realtime (6 cycles)
> * 12% more on clock-getres-realtime (8 cycles)

Nice!  Much better.

It should be tested on more representative hardware, too, but this looks
promising alright :-)


Segher
Christophe Leroy Jan. 17, 2020, 9:26 a.m. UTC | #2
Le 17/01/2020 à 09:58, Segher Boessenkool a écrit :
> Hi!
> 
> On Thu, Jan 16, 2020 at 05:58:24PM +0000, Christophe Leroy wrote:
>> On a powerpc8xx, with current powerpc/32 ASM VDSO:
>>
>> gettimeofday:    vdso: 907 nsec/call
>> clock-getres-realtime:    vdso: 484 nsec/call
>> clock-gettime-realtime:    vdso: 899 nsec/call
>>
>> The first patch adds VDSO generic C support without any changes to common code.
>> Performance is as follows:
>>
>> gettimeofday:    vdso: 1211 nsec/call
>> clock-getres-realtime:    vdso: 722 nsec/call
>> clock-gettime-realtime:    vdso: 1216 nsec/call
>>
>> Then a few changes in the common code have allowed performance improvement. At
>> the end of the series we have:
>>
>> gettimeofday:    vdso: 974 nsec/call
>> clock-getres-realtime:    vdso: 545 nsec/call
>> clock-gettime-realtime:    vdso: 941 nsec/call
>>
>> The final result is rather close to pure ASM VDSO:
>> * 7% more on gettimeofday (9 cycles)
>> * 5% more on clock-gettime-realtime (6 cycles)
>> * 12% more on clock-getres-realtime (8 cycles)
> 
> Nice!  Much better.
> 
> It should be tested on more representative hardware, too, but this looks
> promising alright :-)
> 

Yes.

Now the challenge is to get VDSO32 buildable on PPC64. The big issue is 
that in most powerpc/include/asm/*.h , CONFIG_PPC64 is used to know if 
the build is a 64 bits build or a 32 bits build, so VDSO32 build fails.

I don't know how this could be easily fixed.

Christophe
Christophe Leroy Jan. 20, 2020, 2:56 p.m. UTC | #3
Hi

On 01/17/2020 08:58 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Jan 16, 2020 at 05:58:24PM +0000, Christophe Leroy wrote:
>> On a powerpc8xx, with current powerpc/32 ASM VDSO:
>>
>> gettimeofday:    vdso: 907 nsec/call
>> clock-getres-realtime:    vdso: 484 nsec/call
>> clock-gettime-realtime:    vdso: 899 nsec/call
>>
>> The first patch adds VDSO generic C support without any changes to common code.
>> Performance is as follows:
>>
>> gettimeofday:    vdso: 1211 nsec/call
>> clock-getres-realtime:    vdso: 722 nsec/call
>> clock-gettime-realtime:    vdso: 1216 nsec/call
>>
>> Then a few changes in the common code have allowed performance improvement. At
>> the end of the series we have:
>>
>> gettimeofday:    vdso: 974 nsec/call
>> clock-getres-realtime:    vdso: 545 nsec/call
>> clock-gettime-realtime:    vdso: 941 nsec/call
>>
>> The final result is rather close to pure ASM VDSO:
>> * 7% more on gettimeofday (9 cycles)
>> * 5% more on clock-gettime-realtime (6 cycles)
>> * 12% more on clock-getres-realtime (8 cycles)
> 
> Nice!  Much better.
> 
> It should be tested on more representative hardware, too, but this looks
> promising alright :-)
> 

mpc832x (e300c2 core) at 333 MHz:

Before:

gettimeofday:    vdso: 235 nsec/call
clock-getres-realtime-coarse:    vdso: 1668 nsec/call
clock-gettime-realtime-coarse:    vdso: 1338 nsec/call
clock-getres-realtime:    vdso: 135 nsec/call
clock-gettime-realtime:    vdso: 244 nsec/call
clock-getres-boottime:    vdso: 1232 nsec/call
clock-gettime-boottime:    vdso: 1935 nsec/call
clock-getres-tai:    vdso: 1257 nsec/call
clock-gettime-tai:    vdso: 1898 nsec/call
clock-getres-monotonic-raw:    vdso: 1229 nsec/call
clock-gettime-monotonic-raw:    vdso: 1541 nsec/call
clock-getres-monotonic-coarse:    vdso: 1699 nsec/call
clock-gettime-monotonic-coarse:    vdso: 1477 nsec/call
clock-getres-monotonic:    vdso: 135 nsec/call
clock-gettime-monotonic:    vdso: 283 nsec/call

With the series:

gettimeofday:    vdso: 271 nsec/call
clock-getres-realtime-coarse:    vdso: 159 nsec/call
clock-gettime-realtime-coarse:    vdso: 184 nsec/call
clock-getres-realtime:    vdso: 163 nsec/call
clock-gettime-realtime:    vdso: 281 nsec/call
clock-getres-boottime:    vdso: 169 nsec/call
clock-gettime-boottime:    vdso: 274 nsec/call
clock-getres-tai:    vdso: 163 nsec/call
clock-gettime-tai:    vdso: 277 nsec/call
clock-getres-monotonic-raw:    vdso: 166 nsec/call
clock-gettime-monotonic-raw:    vdso: 302 nsec/call
clock-getres-monotonic-coarse:    vdso: 159 nsec/call
clock-gettime-monotonic-coarse:    vdso: 184 nsec/call
clock-getres-monotonic:    vdso: 166 nsec/call
clock-gettime-monotonic:    vdso: 274 nsec/call

Christophe
Segher Boessenkool Jan. 20, 2020, 3:19 p.m. UTC | #4
On Mon, Jan 20, 2020 at 02:56:00PM +0000, Christophe Leroy wrote:
> >Nice!  Much better.
> >
> >It should be tested on more representative hardware, too, but this looks
> >promising alright :-)
> 
> mpc832x (e300c2 core) at 333 MHz:
> 
> Before:
> 
> gettimeofday:    vdso: 235 nsec/call
> clock-gettime-realtime:    vdso: 244 nsec/call
> 
> With the series:
> 
> gettimeofday:    vdso: 271 nsec/call
> clock-gettime-realtime:    vdso: 281 nsec/call

Those are important, and degrade ~15%.  That is acceptable IMO, but do
you see a way to optimise this (later)?

Anyway, excellent results, thanks for your persistence!


Segher
Christophe Leroy Jan. 20, 2020, 5:08 p.m. UTC | #5
Le 20/01/2020 à 16:19, Segher Boessenkool a écrit :
> On Mon, Jan 20, 2020 at 02:56:00PM +0000, Christophe Leroy wrote:
>>> Nice!  Much better.
>>>
>>> It should be tested on more representative hardware, too, but this looks
>>> promising alright :-)
>>
>> mpc832x (e300c2 core) at 333 MHz:
>>
>> Before:
>>
>> gettimeofday:    vdso: 235 nsec/call
>> clock-gettime-realtime:    vdso: 244 nsec/call
>>
>> With the series:
>>
>> gettimeofday:    vdso: 271 nsec/call
>> clock-gettime-realtime:    vdso: 281 nsec/call
> 
> Those are important, and degrade ~15%.  That is acceptable IMO, but do
> you see a way to optimise this (later)?

Not easy I think.

First we have the unavoidable ASM entry function that can't be dropped 
because of the CR[SO] bit the set on error or clear on no error and that 
can't be done in C.

In our ASM VDSO, fixed shifts are used, while in generic C VDSO, shifts 
are generic and read from the VDSO data.

And there is still some funny code generated by GCC (8.1), like:

  620:	7d 29 3c 30 	srw     r9,r9,r7
  624:	21 87 00 20 	subfic  r12,r7,32
  628:	7d 07 3c 31 	srw.    r7,r8,r7
  62c:	7d 08 60 30 	slw     r8,r8,r12
  630:	7d 0b 4b 78 	or      r11,r8,r9
  634:	39 40 00 00 	li      r10,0
  638:	40 82 00 84 	bne     6bc <__c_kernel_clock_gettime+0x114>
  63c:	81 23 00 24 	lwz     r9,36(r3)
  640:	81 05 00 00 	lwz     r8,0(r5)
...
  6bc:	7d 69 5b 78 	mr      r9,r11
  6c0:	7c ea 3b 78 	mr      r10,r7
  6c4:	7d 2b 4b 78 	mr      r11,r9
  6c8:	4b ff ff 74 	b       63c <__c_kernel_clock_gettime+0x94>

This branch to 6bc is totally useless:
- copying r11 into r9 is pointless as r9 is overwritten in 63c
- copying back r9 into r11 is pointless as r11 has not been modified 
inbetween.
- loading r10 with 0 then overwritting r10 with r7 when r7 is not 0 is 
pointless as well, could have directly put the result of srw. in r10.

Christophe
Segher Boessenkool Jan. 20, 2020, 5:27 p.m. UTC | #6
On Mon, Jan 20, 2020 at 06:08:23PM +0100, Christophe Leroy wrote:
> Not easy I think.
> 
> First we have the unavoidable ASM entry function that can't be dropped 
> because of the CR[SO] bit the set on error or clear on no error and that 
> can't be done in C.

Yup.

> In our ASM VDSO, fixed shifts are used, while in generic C VDSO, shifts 
> are generic and read from the VDSO data.

Does that cost more than just a few cycles?

> And there is still some funny code generated by GCC (8.1), like:
> 
>  620:	7d 29 3c 30 	srw     r9,r9,r7
>  624:	21 87 00 20 	subfic  r12,r7,32
>  628:	7d 07 3c 31 	srw.    r7,r8,r7
>  62c:	7d 08 60 30 	slw     r8,r8,r12
>  630:	7d 0b 4b 78 	or      r11,r8,r9

(This can be done cheaper for fixed shifts, you can use rlwimi then).

>  634:	39 40 00 00 	li      r10,0
>  638:	40 82 00 84 	bne     6bc <__c_kernel_clock_gettime+0x114>
>  63c:	81 23 00 24 	lwz     r9,36(r3)
>  640:	81 05 00 00 	lwz     r8,0(r5)
> ...
>  6bc:	7d 69 5b 78 	mr      r9,r11
>  6c0:	7c ea 3b 78 	mr      r10,r7
>  6c4:	7d 2b 4b 78 	mr      r11,r9
>  6c8:	4b ff ff 74 	b       63c <__c_kernel_clock_gettime+0x94>
> 
> This branch to 6bc is totally useless:
> - copying r11 into r9 is pointless as r9 is overwritten in 63c
> - copying back r9 into r11 is pointless as r11 has not been modified 
> inbetween.

Yeah, huh, how did that happen.

> - loading r10 with 0 then overwritting r10 with r7 when r7 is not 0 is 
> pointless as well, could have directly put the result of srw. in r10.

This may be harder to make the compiler do.

But the r9/r11 thing suggests you are preventing optimisation somewhere,
maybe with some asm?  Do you have some small testcase I can compile?


Segher