Message ID | 510BF0B3.3030608@arm.com (mailing list archive) |
---|---|
State | New, archived |
On Fri, 1 Feb 2013, Jonathan Austin wrote: > Hi Nicolas, thanks for the comments, > > On 29/01/13 20:13, Nicolas Pitre wrote: > > On Tue, 29 Jan 2013, Jonathan Austin wrote: > > > >> Before jumping to (position independent) C-code from the decompressor's > >> assembler world we set-up the C environment. This setup currently does not > >> set r9, which for arm-none-uclinux-uclibceabi should be the PIC offset base > >> register (IE should point to the beginning of the GOT). > >> > >> Currently, therefore, in order to build working kernels that use the > >> decompressor it is necessary to use an arm-linux-gnueabi toolchain, or > >> similar. uClinux toolchains cause a Prefetch Abort to occur at the beginning > >> of the decompress_kernel function. > >> > >> This patch allows uClinux toolchains to build bootable zImages by setting r9 > >> to the beginning of the GOT when __uClinux__ is defined, allowing the > >> decompressor's C functions to work correctly. > >> > >> Signed-off-by: Jonathan Austin <jonathan.austin@arm.com> > >> --- > >> > >> One other possibility would be to specify -mno-single-pic-base when building > >> the decompressor. This works around the problem, but forces the compiler to > >> generate less optimal code. > > > > How "less optimal"? How much bigger/slower is it? > > If not significant enough then going with -mno-single-pic-base might be > > fine. > > Code that needs to access anything global will need to derive the location > of the GOT for itself, but there's a possible upside there that there's an > extra free register (r9 can be used as a general purpose register...) We try to minimize those in order to perform the easy relocation trick which requires no reference to global initialized data. Hence this in the linker script: /DISCARD/ : { *(.ARM.exidx*) *(.ARM.extab*) /* * Discard any r/w data - this produces a link error if we have any, * which is required for PIC decompression. 
Local data generates * GOTOFF relocations, which prevents it being relocated independently * of the text/got segments. */ *(.data) } > The patch would look like: > -----8<------- > diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile > index 5cad8a6..afed28e 100644 > --- a/arch/arm/boot/compressed/Makefile > +++ b/arch/arm/boot/compressed/Makefile > @@ -120,7 +120,7 @@ ORIG_CFLAGS := $(KBUILD_CFLAGS) > KBUILD_CFLAGS = $(subst -pg, , $(ORIG_CFLAGS)) > endif > -ccflags-y := -fpic -fno-builtin -I$(obj) > +ccflags-y := -fpic -mno-single-pic-base -fno-builtin -I$(obj) > asflags-y := -Wa,-march=all -DZIMAGE > # Supply kernel BSS size to the decompressor via a linker symbol. > ------>8--------- > > > I did a fairly crude benchmark - count how many instructions we need in > order to finish decompressing the kernel... > > Setup r9 correctly: 129,976,282 > Use -mno-single-pic-base: 124,826,778 > > (this was done using an R-class model and a magic semi-hosting call to pause > the model at the end of the decompress_kernel function) > > So, it seems like the extra register means there's actually a 4% *win* > in instruction terms from using -mno-single-pic-base Looks like you have a winner. Acked-by: Nicolas Pitre <nico@linaro.org> > That said, I've still made some comments/amendments below... > > > > >> arch/arm/boot/compressed/head.S | 4 ++++ > >> 1 file changed, 4 insertions(+) > >> > >> diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S > >> index fe4d9c3..4491e75 100644 > >> --- a/arch/arm/boot/compressed/head.S > >> +++ b/arch/arm/boot/compressed/head.S > >> @@ -410,6 +410,10 @@ wont_overwrite: > >> * sp = stack pointer > >> */ > >> orrs r1, r0, r5 > >> +#ifdef __uClinux__ > >> + mov r9, r11 @ PIC offset base register > >> + addne r9, r9, r0 @ Also needs relocating > >> +#endif > >> beq not_relocated > > > > Please don't insert your code between the orrs and the beq as those two > > go logically together. 
> > I'd initially done this in order to change only one site - as we need to > set r9 and then add the offset I was using the condition code to test r0... > > However, this was silly - I think I can just do it in one instruction: > > add r9, r11, r0 > > In the case that we're not relocated, r0 should be 0 anyway... > > > > > In fact, the best location for this would probably be between the > > wont_overwrite label and the comment that immediately follows it. And > > then, those comments that follow until the branch into C code should be > > updated accordingly. > > > Okay, assuming I've understood you correctly, you're suggesting something > like this: > > -----8<------- > > diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S > index fe4d9c3..d81efbd 100644 > --- a/arch/arm/boot/compressed/head.S > +++ b/arch/arm/boot/compressed/head.S > @@ -396,6 +396,9 @@ dtb_check_done: > mov pc, r0 > wont_overwrite: > +#ifdef __uClinux__ > + add r9, r11, r0 @ uClinux PIC offset base register > +#endif > /* > * If delta is zero, we are running at the address we were linked at. > * r0 = delta > @@ -405,6 +408,7 @@ wont_overwrite: > * r5 = appended dtb size (0 if not present) > * r7 = architecture ID > * r8 = atags pointer > + * r9 = GOT start (for uClinux ABI), relocated > * r11 = GOT start > * r12 = GOT end > * sp = stack pointer > @@ -470,6 +474,7 @@ not_relocated: mov r0, #0 > * r4 = kernel execution address > * r7 = architecture ID > * r8 = atags pointer > + * r9 = GOT start (for uClinux ABI) > */ > mov r0, r4 > mov r1, sp @ malloc space above stack > ------->8----------- Yes, that's what I was suggesting. > The question that now occurs is whether we should just set r9 whether or not > we're using a uClinux toolchain - I don't think it is going to hurt as the > arm-linux-gnueabi world can happily clobber it with no bad consequences... > > But after all this, it seems that just using -mno-single-pic base as in the patch > above is best... Indeed. 
As long as this option is compatible with all toolchains. Nicolas
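For illustration, the two ways of setting r9 discussed above can be modelled in C. This is only a sketch: the function and parameter names are hypothetical, and it assumes (as the head.S comments state) that r11 holds the link-time GOT start, r0 the load delta and r5 the appended dtb size.

```c
#include <stdint.h>

/*
 * Hypothetical model of the single-instruction form "add r9, r11, r0".
 * got_start_linked stands in for r11 and delta for r0; the names are
 * illustrative and do not appear in head.S.
 */
static inline uint32_t pic_base_after_relocation(uint32_t got_start_linked,
                                                 uint32_t delta)
{
    /* Unsigned addition wraps modulo 2^32, like the ARM add instruction. */
    return got_start_linked + delta;
}

/*
 * Hypothetical model of the earlier two-instruction form:
 *   orrs  r1, r0, r5     (Z flag set only when delta and dtb size are 0)
 *   mov   r9, r11
 *   addne r9, r9, r0     (executed only when Z is clear)
 */
static inline uint32_t pic_base_conditional(uint32_t got_start_linked,
                                            uint32_t delta,
                                            uint32_t dtb_size)
{
    uint32_t r9 = got_start_linked;

    if ((delta | dtb_size) != 0)
        r9 += delta;
    return r9;
}
```

Whenever the conditional add is skipped, delta is zero, so both helpers return the same value for any input. That is why the unconditional add is sufficient.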
On Fri, Feb 01, 2013 at 04:43:31PM +0000, Jonathan Austin wrote:
> Code that needs to access anything global will need to derive the location
> of the GOT for itself, but there's a possible upside there that there's an
> extra free register (r9 can be used as a general purpose register...)
>
> The patch would look like:
> -----8<-------
> diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
> index 5cad8a6..afed28e 100644
> --- a/arch/arm/boot/compressed/Makefile
> +++ b/arch/arm/boot/compressed/Makefile
> @@ -120,7 +120,7 @@ ORIG_CFLAGS := $(KBUILD_CFLAGS)
>  KBUILD_CFLAGS = $(subst -pg, , $(ORIG_CFLAGS))
>  endif
>
> -ccflags-y := -fpic -fno-builtin -I$(obj)
> +ccflags-y := -fpic -mno-single-pic-base -fno-builtin -I$(obj)
>  asflags-y := -Wa,-march=all -DZIMAGE
>
>  # Supply kernel BSS size to the decompressor via a linker symbol.
> ------>8---------
>
>
> I did a fairly crude benchmark - count how many instructions we need in
> order to finish decompressing the kernel...
>
> Setup r9 correctly: 129,976,282
> Use -mno-single-pic-base: 124,826,778
>
> (this was done using an R-class model and a magic semi-hosting call to pause
> the model at the end of the decompress_kernel function)
>
> So, it seems like the extra register means there's actually a 4% *win*
> in instruction terms from using -mno-single-pic-base

Hmm. This is the opposite of what I'd expect. -msingle-pic-base says:

    Treat the register used for PIC addressing as read-only, rather
    than loading it in the prologue for each function. The run-time
    system is responsible for initializing this register with an
    appropriate value before execution begins.

which implies that we should be able to load it before calling the C
code (as you're doing) and then the compiler won't issue instructions
to reload that register.

Giving -mno-single-pic-base suggests that it would turn _off_ this
behaviour (which afaik - sensibly - is not by default enabled.)

So, I'm not sure I fully understand what's going on here.
Hi Russell,

On 01/02/13 18:18, Russell King - ARM Linux wrote:
> On Fri, Feb 01, 2013 at 04:43:31PM +0000, Jonathan Austin wrote:
>> Code that needs to access anything global will need to derive the location
>> of the GOT for itself, but there's a possible upside there that there's an
>> extra free register (r9 can be used as a general purpose register...)
>>
>> The patch would look like:
>> -----8<-------
>> diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
>> index 5cad8a6..afed28e 100644
>> --- a/arch/arm/boot/compressed/Makefile
>> +++ b/arch/arm/boot/compressed/Makefile
>> @@ -120,7 +120,7 @@ ORIG_CFLAGS := $(KBUILD_CFLAGS)
>>  KBUILD_CFLAGS = $(subst -pg, , $(ORIG_CFLAGS))
>>  endif
>>
>> -ccflags-y := -fpic -fno-builtin -I$(obj)
>> +ccflags-y := -fpic -mno-single-pic-base -fno-builtin -I$(obj)
>>  asflags-y := -Wa,-march=all -DZIMAGE
>>
>>  # Supply kernel BSS size to the decompressor via a linker symbol.
>> ------>8---------
>>
>>
>> I did a fairly crude benchmark - count how many instructions we need in
>> order to finish decompressing the kernel...
>>
>> Setup r9 correctly: 129,976,282
>> Use -mno-single-pic-base: 124,826,778
>>
>> (this was done using an R-class model and a magic semi-hosting call to pause
>> the model at the end of the decompress_kernel function)
>>
>> So, it seems like the extra register means there's actually a 4% *win*
>> in instruction terms from using -mno-single-pic-base
>
> Hmm. This is the opposite of what I'd expect. -msingle-pic-base says:
>
>     Treat the register used for PIC addressing as read-only, rather
>     than loading it in the prologue for each function. The run-time
>     system is responsible for initializing this register with an
>     appropriate value before execution begins.
>
> which implies that we should be able to load it before calling the C
> code (as you're doing) and then the compiler won't issue instructions
> to reload that register.
>
> Giving -mno-single-pic-base suggests that it would turn _off_ this
> behaviour (which afaik - sensibly - is not by default enabled.)
>
> So, I'm not sure I fully understand what's going on here.

You seem to have understood! Specifying -mno-single-pic-base
means the compiler *won't* expect r9 to point to the GOT, but also
means r9 is free as a general purpose register, the effect that I
believe gives the performance improvement in the decompressor.

Perhaps the context missing is that these are two independent patch
suggestions that achieve the same thing in different ways (that is,
they stop the decompressor running off to some incorrect memory location
because r9 isn't set up). The -mno-single-pic-base patch does it by not
using r9 as a PIC offset, and the 'initialise r9' patch does what it says
on the tin.

As you see, I benchmarked them and got the opposite result to what I
expected (i.e. -mno-single-pic-base is quicker), so, based also on
Nicolas's Ack, would now champion a different patch to the original one
that I posted...

This is probably overkill, but here's a simple C example for comparison:
$ cat pic.c
----------------
int foo;
int ret_foo()
{
        return foo;
}
----------------
$ arm-none-uclinux-uclibceabi-gcc -O2 -fPIC -S pic.c -o pic.s
$ cat pic.s
-------------
[...]
ret_foo:
        ldr     r3, .L2
        ldr     r3, [r9, r3]
        ldr     r0, [r3, #0]
        bx      lr
.L3:
        .align  2
.L2:
        .word   foo(GOT)
        .size   ret_foo, .-ret_foo
[...]
-------------
$ arm-none-uclinux-uclibceabi-gcc -O2 -fPIC -mno-single-pic-base -S pic.c -o no-single-pic-base.s
$ cat no-single-pic-base.s
-------------
[...]
ret_foo:
        ldr     r3, .L2
        ldr     r2, .L2+4
.LPIC0:
        add     r3, pc, r3
        ldr     r3, [r3, r2]
        ldr     r0, [r3, #0]
        bx      lr
.L3:
        .align  2
.L2:
        .word   _GLOBAL_OFFSET_TABLE_-(.LPIC0+8)
        .word   foo(GOT)
[...]
-----------

So we have a 'penalty' of an extra ldr and add when we don't use a
read-only PIC base, but the win of a free register seems to trump that
in the decompressor.

Does that clear things up, or did I miss the point of what wasn't clear to you? Jonny
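
As a quick check of the "4%" figure quoted in this thread, the relative saving can be worked out from the two instruction counts reported above. Nothing else is assumed; the result applies only to that one crude model run and to instruction counts, not cycles.

```latex
\[
\frac{129{,}976{,}282 - 124{,}826{,}778}{129{,}976{,}282}
  = \frac{5{,}149{,}504}{129{,}976{,}282}
  \approx 0.0396 \approx 4\%
\]
```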
On Mon, Feb 04, 2013 at 12:00:00PM +0000, Jonathan Austin wrote:
> You seem to have understood! Specifying -mno-single-pic-base
> means the compiler *won't* expect r9 to point to the GOT, but also
> means r9 is free as a general purpose register, the effect that I
> believe gives the performance improvement in the decompressor.
>
> Perhaps the context missing is that these are two independent patch
> suggestions that achieve the same thing in different ways (that is,
> they stop the decompressor running off to some incorrect memory location
> because r9 isn't set up). The -mno-single-pic-base patch does it by not
> using r9 as a PIC offset, and the 'initialise r9' patch does what it says
> on the tin.
>
> As you see, I benchmarked them and got the opposite result to what I
> expected (i.e. -mno-single-pic-base is quicker), so, based also on
> Nicolas's Ack, would now champion a different patch to the original one
> that I posted...
>
> This is probably overkill, but here's a simple C example for comparison:
> $ cat pic.c
> ----------------
> int foo;
> int ret_foo()
> {
>         return foo;
> }
> ----------------
> $ arm-none-uclinux-uclibceabi-gcc -O2 -fPIC -S pic.c -o pic.s
> $ cat pic.s
> -------------
> [...]
> ret_foo:
>         ldr     r3, .L2
>         ldr     r3, [r9, r3]
>         ldr     r0, [r3, #0]
>         bx      lr
> .L3:
>         .align  2
> .L2:
>         .word   foo(GOT)
>         .size   ret_foo, .-ret_foo

Ah, so the problem is that the default for single-pic-base is different
with uclinux compilers from other compilers. Other compilers will
default to -mno-single-pic-base, but what your build above shows is that
for your compiler, your default is -msingle-pic-base.

So, passing -mno-single-pic-base means that you're actually _restoring_
the compiler behaviour that we're expecting for the decompressor.

On 04/02/13 12:07, Russell King - ARM Linux wrote:
> On Mon, Feb 04, 2013 at 12:00:00PM +0000, Jonathan Austin wrote:
>> You seem to have understood! Specifying -mno-single-pic-base
>> means the compiler *won't* expect r9 to point to the GOT, but also
>> means r9 is free as a general purpose register, the effect that I
>> believe gives the performance improvement in the decompressor.
>>
>> Perhaps the context missing is that these are two independent patch
>> suggestions that achieve the same thing in different ways (that is,
>> they stop the decompressor running off to some incorrect memory location
>> because r9 isn't set up). The -mno-single-pic-base patch does it by not
>> using r9 as a PIC offset, and the 'initialise r9' patch does what it says
>> on the tin.
>>
>> As you see, I benchmarked them and got the opposite result to what I
>> expected (i.e. -mno-single-pic-base is quicker), so, based also on
>> Nicolas's Ack, would now champion a different patch to the original one
>> that I posted...
>>
>> This is probably overkill, but here's a simple C example for comparison:
>> $ cat pic.c
>> ----------------
>> int foo;
>> int ret_foo()
>> {
>>         return foo;
>> }
>> ----------------
>> $ arm-none-uclinux-uclibceabi-gcc -O2 -fPIC -S pic.c -o pic.s
>> $ cat pic.s
>> -------------
>> [...]
>> ret_foo:
>>         ldr     r3, .L2
>>         ldr     r3, [r9, r3]
>>         ldr     r0, [r3, #0]
>>         bx      lr
>> .L3:
>>         .align  2
>> .L2:
>>         .word   foo(GOT)
>>         .size   ret_foo, .-ret_foo
>
> Ah, so the problem is that the default for single-pic-base is different
> with uclinux compilers from other compilers. Other compilers will
> default to -mno-single-pic-base, but what your build above shows is that
> for your compiler, your default is -msingle-pic-base.
>
> So, passing -mno-single-pic-base means that you're actually _restoring_
> the compiler behaviour that we're expecting for the decompressor.
>

Ahh, I see.
My experience is that my toolchain behaves much like most other uclinux
toolchains - I've just checked:
- Codesourcery
- A Pengutronix one for the M3
- An ARM one

And they all default to using r9.

So shall I put the -m*no*-single-pic-base one in to the patch system?

Jonny
On Mon, Feb 04, 2013 at 12:20:52PM +0000, Jonathan Austin wrote:
> My experience is that my toolchain behaves much like most other uclinux
> toolchains - I've just checked:
> - Codesourcery
> - A Pengutronix one for the M3
> - An ARM one
>
> And they all default to using r9.
>
> So shall I put the -m*no*-single-pic-base one in to the patch system?

Yup, though I don't think I'll be pushing it until the next merge window.
diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile index 5cad8a6..afed28e 100644 --- a/arch/arm/boot/compressed/Makefile +++ b/arch/arm/boot/compressed/Makefile @@ -120,7 +120,7 @@ ORIG_CFLAGS := $(KBUILD_CFLAGS) KBUILD_CFLAGS = $(subst -pg, , $(ORIG_CFLAGS)) endif -ccflags-y := -fpic -fno-builtin -I$(obj) +ccflags-y := -fpic -mno-single-pic-base -fno-builtin -I$(obj) asflags-y := -Wa,-march=all -DZIMAGE # Supply kernel BSS size to the decompressor via a linker symbol. ------>8--------- I did a fairly crude benchmark - count how many instructions we need in order to finish decompressing the kernel... Setup r9 correctly: 129,976,282 Use -mno-single-pic-base: 124,826,778 (this was done using an R-class model and a magic semi-hosting call to pause the model at the end of the decompress_kernel function) So, it seems like the extra register means there's actually a 4% *win* in instruction terms from using -mno-single-pic-base That said, I've still made some comments/amendments below... > >> arch/arm/boot/compressed/head.S | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S >> index fe4d9c3..4491e75 100644 >> --- a/arch/arm/boot/compressed/head.S >> +++ b/arch/arm/boot/compressed/head.S >> @@ -410,6 +410,10 @@ wont_overwrite: >> * sp = stack pointer >> */ >> orrs r1, r0, r5 >> +#ifdef __uClinux__ >> + mov r9, r11 @ PIC offset base register >> + addne r9, r9, r0 @ Also needs relocating >> +#endif >> beq not_relocated > > Please don't insert your code between the orrs and the beq as those two > go logically together. I'd initially done this in order to change only one site - as we need to set r9 and then add the offset I was using the condition code to test r0... However, this was silly - I think I can just do it in one instruction: add r9, r11, r0 In the case that we're not relocated, r0 should be 0 anyway... 
> > In fact, the best location for this would probably be between the > wont_overwrite label and the comment that immediately follows it. And > then, those comments that follow until the branch into C code should be > updated accordingly. Okay, assuming I've understood you correctly, you're suggesting something like this: -----8<------- diff --git a/arch/arm/boot/compressed/head.S b/arch/arm/boot/compressed/head.S index fe4d9c3..d81efbd 100644 --- a/arch/arm/boot/compressed/head.S +++ b/arch/arm/boot/compressed/head.S @@ -396,6 +396,9 @@ dtb_check_done: mov pc, r0 wont_overwrite: +#ifdef __uClinux__ + add r9, r11, r0 @ uClinux PIC offset base register +#endif /* * If delta is zero, we are running at the address we were linked at. * r0 = delta @@ -405,6 +408,7 @@ wont_overwrite: