[RFC] arm: decompressor: initialize PIC offset base register for uClinux tools

Message ID 510BF0B3.3030608@arm.com (mailing list archive)
State New, archived

Commit Message

Jonathan Austin Feb. 1, 2013, 4:43 p.m. UTC
Hi Nicolas, thanks for the comments,

On 29/01/13 20:13, Nicolas Pitre wrote:
> On Tue, 29 Jan 2013, Jonathan Austin wrote:
> 
>> Before jumping to (position independent) C-code from the decompressor's
>> assembler world we set-up the C environment. This setup currently does not
>> set r9, which for arm-none-uclinux-uclibceabi should be the PIC offset base
>> register (IE should point to the beginning of the GOT).
>>
>> Currently, therefore, in order to build working kernels that use the
>> decompressor it is necessary to use an arm-linux-gnueabi toolchain, or
>> similar. uClinux toolchains cause a Prefetch Abort to occur at the beginning
>> of the decompress_kernel function.
>>
>> This patch allows uClinux toolchains to build bootable zImages by setting r9
>> to the beginning of the GOT when __uClinux__ is defined, allowing the
>> decompressor's C functions to work correctly.
>>
>> Signed-off-by: Jonathan Austin <jonathan.austin@arm.com>
>> ---
>>
>> One other possibility would be to specify -mno-single-pic-base when building
>> the decompressor. This works around the problem, but forces the compiler to
>> generate less optimal code.
> 
> How "less optimal"?  How much bigger/slower is it?
> If not significant enough then going with -mno-single-pic-base might be
> fine.

Code that needs to access anything global will need to derive the location
of the GOT for itself, but there's a possible upside there that there's an
extra free register (r9 can be used as a general purpose register...)

The patch would look like:
-----8<-------
  *   r5  = appended dtb size (0 if not present)
  *   r7  = architecture ID
  *   r8  = atags pointer
+ *   r9  = GOT start (for uClinux ABI), relocated
  *   r11 = GOT start
  *   r12 = GOT end
  *   sp  = stack pointer
@@ -470,6 +474,7 @@ not_relocated:      mov     r0, #0
  *   r4  = kernel execution address
  *   r7  = architecture ID
  *   r8  = atags pointer
+ *   r9  = GOT start (for uClinux ABI)
  */
                mov     r0, r4
                mov     r1, sp                  @ malloc space above stack
------->8-----------


The question that now occurs is whether we should just set r9 whether or not
we're using a uClinux toolchain - I don't think it is going to hurt as the
arm-linux-gnueabi world can happily clobber it with no bad consequences...

But after all this, it seems that just using -mno-single-pic-base as in the patch
above is best...

Thoughts?

Jonny

Comments

Nicolas Pitre Feb. 1, 2013, 6:07 p.m. UTC | #1
On Fri, 1 Feb 2013, Jonathan Austin wrote:

> Hi Nicolas, thanks for the comments,
> 
> On 29/01/13 20:13, Nicolas Pitre wrote:
> > On Tue, 29 Jan 2013, Jonathan Austin wrote:
> > 
> >> Before jumping to (position independent) C-code from the decompressor's
> >> assembler world we set-up the C environment. This setup currently does not
> >> set r9, which for arm-none-uclinux-uclibceabi should be the PIC offset base
> >> register (IE should point to the beginning of the GOT).
> >>
> >> Currently, therefore, in order to build working kernels that use the
> >> decompressor it is necessary to use an arm-linux-gnueabi toolchain, or
> >> similar. uClinux toolchains cause a Prefetch Abort to occur at the beginning
> >> of the decompress_kernel function.
> >>
> >> This patch allows uClinux toolchains to build bootable zImages by setting r9
> >> to the beginning of the GOT when __uClinux__ is defined, allowing the
> >> decompressor's C functions to work correctly.
> >>
> >> Signed-off-by: Jonathan Austin <jonathan.austin@arm.com>
> >> ---
> >>
> >> One other possibility would be to specify -mno-single-pic-base when building
> >> the decompressor. This works around the problem, but forces the compiler to
> >> generate less optimal code.
> > 
> > How "less optimal"?  How much bigger/slower is it?
> > If not significant enough then going with -mno-single-pic-base might be
> > fine.
> 
> Code that needs to access anything global will need to derive the location
> of the GOT for itself, but there's a possible upside there that there's an
> extra free register (r9 can be used as a general purpose register...)

We try to minimize those in order to perform the easy relocation trick 
which requires no reference to global initialized data.  Hence this in 
the linker script:

  /DISCARD/ : {
    *(.ARM.exidx*)
    *(.ARM.extab*)
    /*
     * Discard any r/w data - this produces a link error if we have any,
     * which is required for PIC decompression.  Local data generates
     * GOTOFF relocations, which prevents it being relocated independently
     * of the text/got segments.
     */
    *(.data)
  }

> The patch would look like:
> -----8<-------
> diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
> index 5cad8a6..afed28e 100644
> --- a/arch/arm/boot/compressed/Makefile
> +++ b/arch/arm/boot/compressed/Makefile
> @@ -120,7 +120,7 @@ ORIG_CFLAGS := $(KBUILD_CFLAGS)
>  KBUILD_CFLAGS = $(subst -pg, , $(ORIG_CFLAGS))
>  endif
>  -ccflags-y := -fpic -fno-builtin -I$(obj)
> +ccflags-y := -fpic -mno-single-pic-base -fno-builtin -I$(obj)
>  asflags-y := -Wa,-march=all -DZIMAGE
>   # Supply kernel BSS size to the decompressor via a linker symbol.
> ------>8---------
> 
> 
> I did a fairly crude benchmark - count how many instructions we need in
> order to finish decompressing the kernel...
> 
> Setup r9 correctly:       129,976,282
> Use -mno-single-pic-base: 124,826,778
> 
> (this was done using an R-class model and a magic semi-hosting call to pause
> the model at the end of the decompress_kernel function)
> 
> So, it seems like the extra register means there's actually a 4% *win* 
> in instruction terms from using -mno-single-pic-base

Looks like you have a winner.

Acked-by: Nicolas Pitre <nico@linaro.org>

> That said, I've still made some comments/amendments below...
> 
> > 
> >>   arch/arm/boot/compressed/head.S |    4 ++++
> >>   1 file changed, 4 insertions(+)
> >>
> >> diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
> >> index fe4d9c3..4491e75 100644
> >> --- a/arch/arm/boot/compressed/head.S
> >> +++ b/arch/arm/boot/compressed/head.S
> >> @@ -410,6 +410,10 @@ wont_overwrite:
> >>    *   sp  = stack pointer
> >>    */
> >>   		orrs	r1, r0, r5
> >> +#ifdef __uClinux__
> >> +		mov	r9, r11			@ PIC offset base register
> >> +		addne	r9, r9, r0		@ Also needs relocating
> >> +#endif
> >>   		beq	not_relocated
> > 
> > Please don't insert your code between the orrs and the beq as those two
> > go logically together.
> 
> I'd initially done this in order to change only one site - as we need to
> set r9 and then add the offset I was using the condition code to test r0...
> 
> However, this was silly - I think I can just do it in one instruction:
> 
> add   r9, r11, r0
> 
> In the case that we're not relocated, r0 should be 0 anyway...
> 
> > 
> > In fact, the best location for this would probably be between the
> > wont_overwrite label and the comment that immediately follows it. And
> > then, those comments that follow until the branch into C code should be
> > updated accordingly.
> 
> 
> Okay, assuming I've understood you correctly, you're suggesting something
> like this:
> 
> -----8<-------
> 
> diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
> index fe4d9c3..d81efbd 100644
> --- a/arch/arm/boot/compressed/head.S
> +++ b/arch/arm/boot/compressed/head.S
> @@ -396,6 +396,9 @@ dtb_check_done:
>                 mov     pc, r0
>   wont_overwrite:
> +#ifdef __uClinux__
> +               add     r9, r11, r0             @ uClinux PIC offset base register
> +#endif
>  /*
>   * If delta is zero, we are running at the address we were linked at.
>   *   r0  = delta
> @@ -405,6 +408,7 @@ wont_overwrite:
>   *   r5  = appended dtb size (0 if not present)
>   *   r7  = architecture ID
>   *   r8  = atags pointer
> + *   r9  = GOT start (for uClinux ABI), relocated
>   *   r11 = GOT start
>   *   r12 = GOT end
>   *   sp  = stack pointer
> @@ -470,6 +474,7 @@ not_relocated:      mov     r0, #0
>   *   r4  = kernel execution address
>   *   r7  = architecture ID
>   *   r8  = atags pointer
> + *   r9  = GOT start (for uClinux ABI)
>   */
>                 mov     r0, r4
>                 mov     r1, sp                  @ malloc space above stack
> ------->8-----------

Yes, that's what I was suggesting.

> The question that now occurs is whether we should just set r9 whether or not
> we're using a uClinux toolchain - I don't think it is going to hurt as the
> arm-linux-gnueabi world can happily clobber it with no bad consequences...
> 
> But after all this, it seems that just using -mno-single-pic-base as in the patch
> above is best...

Indeed.  As long as this option is compatible with all toolchains.


Nicolas
Russell King - ARM Linux Feb. 1, 2013, 6:18 p.m. UTC | #2
On Fri, Feb 01, 2013 at 04:43:31PM +0000, Jonathan Austin wrote:
> Code that needs to access anything global will need to derive the location
> of the GOT for itself, but there's a possible upside there that there's an
> extra free register (r9 can be used as a general purpose register...)
> 
> The patch would look like:
> -----8<-------
> diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
> index 5cad8a6..afed28e 100644
> --- a/arch/arm/boot/compressed/Makefile
> +++ b/arch/arm/boot/compressed/Makefile
> @@ -120,7 +120,7 @@ ORIG_CFLAGS := $(KBUILD_CFLAGS)
>  KBUILD_CFLAGS = $(subst -pg, , $(ORIG_CFLAGS))
>  endif
>  -ccflags-y := -fpic -fno-builtin -I$(obj)
> +ccflags-y := -fpic -mno-single-pic-base -fno-builtin -I$(obj)
>  asflags-y := -Wa,-march=all -DZIMAGE
>   # Supply kernel BSS size to the decompressor via a linker symbol.
> ------>8---------
> 
> 
> I did a fairly crude benchmark - count how many instructions we need in
> order to finish decompressing the kernel...
> 
> Setup r9 correctly:       129,976,282
> Use -mno-single-pic-base: 124,826,778
> 
> (this was done using an R-class model and a magic semi-hosting call to pause
> the model at the end of the decompress_kernel function)
> 
> So, it seems like the extra register means there's actually a 4% *win* 
> in instruction terms from using -mno-single-pic-base

Hmm.  This is the opposite of what I'd expect.  -msingle-pic-base says:

     Treat the register used for PIC addressing as read-only, rather
     than loading it in the prologue for each function.  The run-time
     system is responsible for initializing this register with an
     appropriate value before execution begins.

which implies that we should be able to load it before calling the C
code (as you're doing) and then the compiler won't issue instructions
to reload that register.

Giving -mno-single-pic-base suggests that it would turn _off_ this
behaviour (which afaik - sensibly - is not by default enabled.)

So, I'm not sure I fully understand what's going on here.
Jonathan Austin Feb. 4, 2013, noon UTC | #3
Hi Russell,

On 01/02/13 18:18, Russell King - ARM Linux wrote:
> On Fri, Feb 01, 2013 at 04:43:31PM +0000, Jonathan Austin wrote:
>> Code that needs to access anything global will need to derive the location
>> of the GOT for itself, but there's a possible upside there that there's an
>> extra free register (r9 can be used as a general purpose register...)
>>
>> The patch would look like:
>> -----8<-------
>> diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
>> index 5cad8a6..afed28e 100644
>> --- a/arch/arm/boot/compressed/Makefile
>> +++ b/arch/arm/boot/compressed/Makefile
>> @@ -120,7 +120,7 @@ ORIG_CFLAGS := $(KBUILD_CFLAGS)
>>   KBUILD_CFLAGS = $(subst -pg, , $(ORIG_CFLAGS))
>>   endif
>>   -ccflags-y := -fpic -fno-builtin -I$(obj)
>> +ccflags-y := -fpic -mno-single-pic-base -fno-builtin -I$(obj)
>>   asflags-y := -Wa,-march=all -DZIMAGE
>>    # Supply kernel BSS size to the decompressor via a linker symbol.
>> ------>8---------
>>
>>
>> I did a fairly crude benchmark - count how many instructions we need in
>> order to finish decompressing the kernel...
>>
>> Setup r9 correctly:       129,976,282
>> Use -mno-single-pic-base: 124,826,778
>>
>> (this was done using an R-class model and a magic semi-hosting call to pause
>> the model at the end of the decompress_kernel function)
>>
>> So, it seems like the extra register means there's actually a 4% *win*
>> in instruction terms from using -mno-single-pic-base
> 
> Hmm.  This is the opposite of what I'd expect.  -msingle-pic-base says:
> 
>       Treat the register used for PIC addressing as read-only, rather
>       than loading it in the prologue for each function.  The run-time
>       system is responsible for initializing this register with an
>       appropriate value before execution begins.
> 
> which implies that we should be able to load it before calling the C
> code (as you're doing) and then the compiler won't issue instructions
> to reload that register.
> 
> Giving -mno-single-pic-base suggests that it would turn _off_ this
> behaviour (which afaik - sensibly - is not by default enabled.)
> 
> So, I'm not sure I fully understand what's going on here.

You seem to have understood! Specifying -mno-single-pic-base
means the compiler *won't* expect r9 to point to the GOT, but also
means r9 is free as a general purpose register, the effect that I
believe gives the performance improvement in the decompressor.

Perhaps the context missing is that these are two independent patch
suggestions that achieve the same thing in different ways (that is,
they stop the decompressor running off to some incorrect memory location
because r9 isn't set-up). The -mno-single-pic-base patch does it by not
using r9 as a PIC offset, and the 'initialise r9' patch does what it says
on the tin.

As you see, I benchmarked them and got the opposite result to what I
expected (IE -mno-single-pic-base is quicker), so, based also on
Nicolas's Ack, would now champion a different patch to the original one
that I posted...

This is probably overkill, but here's a simple C example for comparison:
$cat pic.c
----------------
int foo;
int ret_foo()
{
	return foo;
}
----------------
$arm-none-uclinux-uclibceabi-gcc -O2 -fPIC -S pic.c -o pic.s
$cat pic.s
-------------
[...]
ret_foo:
	ldr	r3, .L2
	ldr	r3, [r9, r3]
	ldr	r0, [r3, #0]
	bx	lr
.L3:
	.align	2
.L2:
	.word	foo(GOT)
	.size	ret_foo, .-ret_foo
[...]
-------------
$arm-none-uclinux-uclibceabi-gcc -O2 -fPIC -mno-single-pic-base -S pic.c -o no-single-pic-base.s
$cat no-single-pic-base.s
-------------
[...]
ret_foo:
	ldr	r3, .L2
	ldr	r2, .L2+4
.LPIC0:
	add	r3, pc, r3
	ldr	r3, [r3, r2]
	ldr	r0, [r3, #0]
	bx	lr
.L3:
	.align	2
.L2:
	.word	_GLOBAL_OFFSET_TABLE_-(.LPIC0+8)
	.word	foo(GOT)
[...]
-----------

So we have a 'penalty' of an extra ldr and add when we don't use a read-only
PIC base, but the win of a free register seems to trump that in the decompressor.
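
The difference between the two listings can be modelled in plain C. This is
only a sketch: the function names, the toy table, and the use of an array
index in place of the byte offset that foo(GOT) resolves to are illustrative
assumptions, not compiler output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of the default uClinux code: r9 already holds the GOT base,
 * so one load fetches the GOT entry and one load fetches the value.
 * 'slot' is an array index standing in for the foo(GOT) byte offset.
 */
int load_with_pic_base(int **r9, size_t slot)
{
	assert(r9 != NULL);
	int *entry = r9[slot];		/* ldr r3, [r9, r3] */
	return *entry;			/* ldr r0, [r3, #0] */
}

/*
 * Toy model of -mno-single-pic-base: the GOT base is first rebuilt from
 * the current pc plus a link-time constant (the extra ldr and add).
 */
int load_without_pic_base(uintptr_t pc, uintptr_t got_minus_pc, size_t slot)
{
	int **got = (int **)(pc + got_minus_pc);	/* add r3, pc, r3 */
	int *entry = got[slot];				/* ldr r3, [r3, r2] */
	return *entry;					/* ldr r0, [r3, #0] */
}
```

Both functions return the same value as long as r9 really does point at the
table; the first one reads through whatever happens to be in r9, which is
the failure seen when nothing initialises it.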

Does that clear things up, or did I miss the point of what wasn't clear to you?

Jonny
Russell King - ARM Linux Feb. 4, 2013, 12:07 p.m. UTC | #4
On Mon, Feb 04, 2013 at 12:00:00PM +0000, Jonathan Austin wrote:
> You seem to have understood! Specifying -mno-single-pic-base
> means the compiler *won't* expect r9 to point to the GOT, but also
> means r9 is free as a general purpose register, the effect that I
> believe gives the performance improvement in the decompressor.
> 
> Perhaps the context missing is that these are two independent patch
> suggestions that achieve the same thing in different ways (that is,
> they stop the decompressor running off to some incorrect memory location
> because r9 isn't set-up). The -mno-single-pic-base patch does it by not
> using r9 as a PIC offset, and the 'initialise r9' patch does what it says
> on the tin.
> 
> As you see, I benchmarked them and got the opposite result to what I
> expected (IE -mno-single-pic-base is quicker), so, based also on
> Nicolas's Ack, would now champion a different patch to the original one
> that I posted...
> 
> This is probably overkill, but here's a simple C example for comparison:
> $cat pic.c
> ----------------
> int foo;
> int ret_foo()
> {
> 	return foo;
> }
> ----------------
> $arm-none-uclinux-uclibceabi-gcc -O2 -fPIC -S pic.c -o pic.s
> $cat pic.s
> -------------
> [...]
> ret_foo:
> 	ldr	r3, .L2
> 	ldr	r3, [r9, r3]
> 	ldr	r0, [r3, #0]
> 	bx	lr
> .L3:
> 	.align	2
> .L2:
> 	.word	foo(GOT)
> 	.size	ret_foo, .-ret_foo

Ah, so the problem is that the default for single-pic-base is different
with uclinux compilers from other compilers.  Other compilers will
default to -mno-single-pic-base, but what your build above shows is that
for your compiler, your default is -msingle-pic-base.

So, passing -mno-single-pic-base means that you're actually _restoring_
the compiler behaviour that we're expecting for the decompressor.
Jonathan Austin Feb. 4, 2013, 12:20 p.m. UTC | #5
On 04/02/13 12:07, Russell King - ARM Linux wrote:
> On Mon, Feb 04, 2013 at 12:00:00PM +0000, Jonathan Austin wrote:
>> You seem to have understood! Specifying -mno-single-pic-base
>> means the compiler *won't* expect r9 to point to the GOT, but also
>> means r9 is free as a general purpose register, the effect that I
>> believe gives the performance improvement in the decompressor.
>>
>> Perhaps the context missing is that these are two independent patch
>> suggestions that achieve the same thing in different ways (that is,
>> they stop the decompressor running off to some incorrect memory location
>> because r9 isn't set-up). The -mno-single-pic-base patch does it by not
>> using r9 as a PIC offset, and the 'initialise r9' patch does what it says
>> on the tin.
>>
>> As you see, I benchmarked them and got the opposite result to what I
>> expected (IE -mno-single-pic-base is quicker), so, based also on
>> Nicolas's Ack, would now champion a different patch to the original one
>> that I posted...
>>
>> This is probably overkill, but here's a simple C example for comparison:
>> $cat pic.c
>> ----------------
>> int foo;
>> int ret_foo()
>> {
>> 	return foo;
>> }
>> ----------------
>> $arm-none-uclinux-uclibceabi-gcc -O2 -fPIC -S pic.c -o pic.s
>> $cat pic.s
>> -------------
>> [...]
>> ret_foo:
>> 	ldr	r3, .L2
>> 	ldr	r3, [r9, r3]
>> 	ldr	r0, [r3, #0]
>> 	bx	lr
>> .L3:
>> 	.align	2
>> .L2:
>> 	.word	foo(GOT)
>> 	.size	ret_foo, .-ret_foo
>
> Ah, so the problem is that the default for single-pic-base is different
> with uclinux compilers from other compilers.  Other compilers will
> default to -mno-single-pic-base, but what your build above shows is that
> for your compiler, your default is -msingle-pic-base.
>
> So, passing -mno-single-pic-base means that you're actually _restoring_
> the compiler behaviour that we're expecting for the decompressor.
>

Ahh, I see.

My experience is that my toolchain behaves much like most other uclinux 
toolchains - I've just checked:
- Codesourcery
- A Pengutronix one for the M3
- An ARM one

And they all default to using r9.

So shall I put the -m*no*-single-pic-base one in to the patch system?

Jonny
Russell King - ARM Linux Feb. 4, 2013, 12:23 p.m. UTC | #6
On Mon, Feb 04, 2013 at 12:20:52PM +0000, Jonathan Austin wrote:
> My experience is that my toolchain behaves much like most other uclinux  
> toolchains - I've just checked:
> - Codesourcery
> - A Pengutronix one for the M3
> - An ARM one
>
> And they all default to using r9.
>
> So shall I put the -m*no*-single-pic-base one in to the patch system?

Yup, though I don't think I'll be pushing it until the next merge window.

Patch

diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
index 5cad8a6..afed28e 100644
--- a/arch/arm/boot/compressed/Makefile
+++ b/arch/arm/boot/compressed/Makefile
@@ -120,7 +120,7 @@  ORIG_CFLAGS := $(KBUILD_CFLAGS)
 KBUILD_CFLAGS = $(subst -pg, , $(ORIG_CFLAGS))
 endif
-ccflags-y := -fpic -fno-builtin -I$(obj)
+ccflags-y := -fpic -mno-single-pic-base -fno-builtin -I$(obj)
 asflags-y := -Wa,-march=all -DZIMAGE
 # Supply kernel BSS size to the decompressor via a linker symbol.
------>8---------


I did a fairly crude benchmark - count how many instructions we need in
order to finish decompressing the kernel...

Setup r9 correctly:       129,976,282
Use -mno-single-pic-base: 124,826,778

(this was done using an R-class model and a magic semi-hosting call to pause
the model at the end of the decompress_kernel function)

So, it seems like the extra register means there's actually a 4% *win* 
in instruction terms from using -mno-single-pic-base
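
For reference, the arithmetic behind the 4% figure, as a small C sketch. The
helper name percent_saved is made up for illustration; the two instruction
counts are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Relative saving, in percent, of 'candidate' over 'baseline'.
 * percent_saved is a hypothetical helper, not part of any kernel code.
 */
double percent_saved(uint64_t baseline, uint64_t candidate)
{
	assert(baseline > 0);
	return 100.0 * ((double)baseline - (double)candidate) / (double)baseline;
}
```

Calling percent_saved(129976282, 124826778) gives about 3.96, i.e. a
difference of 5,149,504 instructions out of 129,976,282.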

That said, I've still made some comments/amendments below...

> 
>>   arch/arm/boot/compressed/head.S |    4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
>> index fe4d9c3..4491e75 100644
>> --- a/arch/arm/boot/compressed/head.S
>> +++ b/arch/arm/boot/compressed/head.S
>> @@ -410,6 +410,10 @@ wont_overwrite:
>>    *   sp  = stack pointer
>>    */
>>   		orrs	r1, r0, r5
>> +#ifdef __uClinux__
>> +		mov	r9, r11			@ PIC offset base register
>> +		addne	r9, r9, r0		@ Also needs relocating
>> +#endif
>>   		beq	not_relocated
> 
> Please don't insert your code between the orrs and the beq as those two
> go logically together.

I'd initially done this in order to change only one site - as we need to
set r9 and then add the offset I was using the condition code to test r0...

However, this was silly - I think I can just do it in one instruction:

add   r9, r11, r0

In the case that we're not relocated, r0 should be 0 anyway...
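
A quick way to convince yourself the two forms agree is to model them in C.
This is only a sketch under the assumption that the flags come from the
preceding "orrs r1, r0, r5"; the function names are invented for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Original two-instruction form:
 *     mov   r9, r11
 *     addne r9, r9, r0
 * NE holds when the result of "orrs r1, r0, r5" is non-zero.
 */
uint32_t r9_two_insn(uint32_t r0, uint32_t r5, uint32_t r11)
{
	uint32_t r9 = r11;		/* mov   r9, r11 */
	if ((r0 | r5) != 0)		/* NE after orrs r1, r0, r5 */
		r9 += r0;		/* addne r9, r9, r0 */
	return r9;
}

/* Simplified form: add r9, r11, r0 */
uint32_t r9_one_insn(uint32_t r0, uint32_t r11)
{
	return r11 + r0;
}

/* The add is only skipped when r0 is 0, so both forms must match. */
void check_equivalent(uint32_t r0, uint32_t r5, uint32_t r11)
{
	assert(r9_two_insn(r0, r5, r11) == r9_one_insn(r0, r11));
}
```

The conditional add is skipped only when both r0 and r5 are zero, and in
that case adding r0 would have been a no-op anyway.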

> 
> In fact, the best location for this would probably be between the
> wont_overwrite label and the comment that immediately follows it. And
> then, those comments that follow until the branch into C code should be
> updated accordingly.


Okay, assuming I've understood you correctly, you're suggesting something
like this:

-----8<-------

diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S
index fe4d9c3..d81efbd 100644
--- a/arch/arm/boot/compressed/head.S
+++ b/arch/arm/boot/compressed/head.S
@@ -396,6 +396,9 @@  dtb_check_done:
                mov     pc, r0
  wont_overwrite:
+#ifdef __uClinux__
+               add     r9, r11, r0             @ uClinux PIC offset base register
+#endif
 /*
  * If delta is zero, we are running at the address we were linked at.
  *   r0  = delta
@@ -405,6 +408,7 @@  wont_overwrite: