Message ID | 20220427171241.2426592-3-ardb@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | arm64: use PIE code generation for KASLR kernel |
On 2022-04-27, Ard Biesheuvel wrote:
>We currently use ordinary, position dependent code generation for the
>core kernel, which happens to default to the 'small' code model on both
>GCC and Clang. This is the code model that relies on ADRP/ADD or
>ADRP/LDR pairs for symbol references, which are PC-relative with a range
>of -/+ 4 GiB, and therefore happen to be position independent in
>practice.
>
>This means that the fact that we can link the relocatable KASLR kernel
>using the -pie linker flag (which generates the runtime relocations and
>inserts them into the binary) is somewhat of a coincidence, and not
>something which is explicitly supported by the toolchains.

Agree. The current -fno-PIE + -shared -Bsymbolic combo works as a
coincidence, not guaranteed by the toolchain.

-shared needs -fpic object files. -shared -Bsymbolic is very similar to
-pie and therefore works with -fpie object files, but the usage is not
recommended from the toolchain perspective.

>The reason we have not used -fpie for code generation so far (which is
>the compiler flag that should be used to generate code that is to be
>linked with -pie) is that by default, it generates code based on
>assumptions that only hold for shared libraries and PIE executables,
>i.e., that gathering all relocatable quantities into a Global Offset
>Table (GOT) is desirable because it reduces the CoW footprint, and
>because it permits ELF symbol preemption (which lets an executable
>override symbols defined in a shared library, in a way that forces the
>shared library to update all of its internal references as well).
>Ironically, this means we end up with many more absolute references that
>all need to be fixed up at boot.

This is not about symbol preemption (when the executable and a shared
object define the same symbol, which one wins). An executable using a GOT
which will be resolved to a shared object => this is regular relocation
resolving and there is no preemption.

It is that the compiler prefers code generation which can avoid text
relocations / copy relocations / canonical PLT entries
(https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#summary).

>Fortunately, we can convince the compiler to handle this in a way that
>is a bit more suitable for freestanding binaries such as the kernel, by
>setting the 'hidden' visibility #pragma, which informs the compiler that
>symbol preemption or CoW footprint are of no concern to us, and so
>PC-relative references that are resolved at link time are perfectly
>fine.

Agree

>So let's enable this #pragma and build with -fpie when building a
>relocatable kernel. This also means that all constant data items that
>carry statically initialized pointer variables are now emitted into the
>.data.rel.ro* sections, so move these into .rodata where they belong.

LGTM, except: is ".rodata" a typo? The patch doesn't reference .rodata

>Code size impact (GCC):
>
>Before:
>
>      text      data     bss     total filename
>  16712396  18659064  534556  35906016 vmlinux
>
>After:
>
>      text      data     bss     total filename
>  16804400  18612876  534556  35951832 vmlinux
>
>Code size impact (Clang):
>
>Before:
>
>      text      data     bss     total filename
>  17194584  13335060  535268  31064912 vmlinux
>
>After:
>
>      text      data     bss     total filename
>  17194536  13310032  535268  31039836 vmlinux
>
>Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
>---
> arch/arm64/Makefile             | 4 ++++
> arch/arm64/kernel/vmlinux.lds.S | 9 ++++-----
> 2 files changed, 8 insertions(+), 5 deletions(-)
>
>diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
>index 2f1de88651e6..94b6c51f5de6 100644
>--- a/arch/arm64/Makefile
>+++ b/arch/arm64/Makefile
>@@ -18,6 +18,10 @@ ifeq ($(CONFIG_RELOCATABLE), y)
> # with the relocation offsets always being zero.
> LDFLAGS_vmlinux += -shared -Bsymbolic -z notext \
>                    $(call ld-option, --no-apply-dynamic-relocs)
>+
>+# Generate position independent code without relying on a Global Offset Table
>+KBUILD_CFLAGS_KERNEL += -fpie -include $(srctree)/include/linux/hidden.h
>+
> endif
>
> ifeq ($(CONFIG_ARM64_ERRATUM_843419),y)
>diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
>index edaf0faf766f..b1e071ac1acf 100644
>--- a/arch/arm64/kernel/vmlinux.lds.S
>+++ b/arch/arm64/kernel/vmlinux.lds.S
>@@ -174,8 +174,6 @@ SECTIONS
>             KEXEC_TEXT
>             TRAMP_TEXT
>             *(.gnu.warning)
>-        . = ALIGN(16);
>-        *(.got)         /* Global offset table */
>     }
>
>     /*
>@@ -192,6 +190,8 @@ SECTIONS
>     /* everything from this point to __init_begin will be marked RO NX */
>     RO_DATA(PAGE_SIZE)
>
>+    .data.rel.ro : ALIGN(8) { *(.got) *(.data.rel.ro*) }
>+
>     HYPERVISOR_DATA_SECTIONS
>
>     idmap_pg_dir = .;
>@@ -273,6 +273,8 @@ SECTIONS
>     _sdata = .;
>     RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
>
>+    .data.rel : ALIGN(8) { *(.data.rel*) }
>+
>     /*
>      * Data written with the MMU off but read with the MMU on requires
>      * cache lines to be invalidated, discarding up to a Cache Writeback
>@@ -320,9 +322,6 @@ SECTIONS
>         *(.plt) *(.plt.*) *(.iplt) *(.igot .igot.plt)
>     }
>     ASSERT(SIZEOF(.plt) == 0, "Unexpected run-time procedure linkages detected!")
>-
>-    .data.rel.ro : { *(.data.rel.ro) }
>-    ASSERT(SIZEOF(.data.rel.ro) == 0, "Unexpected RELRO detected!")
> }
>
> #include "image-vars.h"
>--
>2.30.2
>
>--
>You received this message because you are subscribed to the Google Groups "Clang Built Linux" group.
>To unsubscribe from this group and stop receiving emails from it, send an email to clang-built-linux+unsubscribe@googlegroups.com.
>To view this discussion on the web visit https://groups.google.com/d/msgid/clang-built-linux/20220427171241.2426592-3-ardb%40kernel.org.
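As a rough illustration of the "-/+ 4 GiB" range mentioned in the commit message: ADRP encodes a signed 21-bit count of 4 KiB pages, relative to the page that contains the instruction. The sketch below only does that arithmetic. The helper name `adrp_can_reach` and the two macros are made up for this example and are not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* ADRP carries a signed 21-bit immediate that counts 4 KiB pages. */
#define ADRP_IMM_BITS 21
#define PAGE_SHIFT_4K 12

/*
 * Hypothetical helper: returns true if an ADRP located at address pc can
 * form the page address of target. Both addresses are first rounded down
 * to their 4 KiB page, as the instruction itself does.
 */
static inline bool adrp_can_reach(uint64_t pc, uint64_t target)
{
	int64_t pc_page = (int64_t)(pc >> PAGE_SHIFT_4K);
	int64_t target_page = (int64_t)(target >> PAGE_SHIFT_4K);
	int64_t delta = target_page - pc_page;
	int64_t limit = (int64_t)1 << (ADRP_IMM_BITS - 1); /* 2^20 pages = 4 GiB */

	return delta >= -limit && delta < limit;
}
```

The range is asymmetric by one page: a target exactly 4 GiB below the instruction's page is reachable, while one exactly 4 GiB above is not. The low 12 bits are supplied by the ADD or LDR that follows the ADRP.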
On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
>
> On 2022-04-27, Ard Biesheuvel wrote:
> >We currently use ordinary, position dependent code generation for the
> >core kernel, which happens to default to the 'small' code model on both
> >GCC and Clang. This is the code model that relies on ADRP/ADD or
> >ADRP/LDR pairs for symbol references, which are PC-relative with a range
> >of -/+ 4 GiB, and therefore happen to be position independent in
> >practice.
> >
> >This means that the fact that we can link the relocatable KASLR kernel
> >using the -pie linker flag (which generates the runtime relocations and
> >inserts them into the binary) is somewhat of a coincidence, and not
> >something which is explicitly supported by the toolchains.
>
> Agree. The current -fno-PIE + -shared -Bsymbolic combo works as a
> coincidence, not guaranteed by the toolchain.
>
> -shared needs -fpic object files. -shared -Bsymbolic is very similar to
> -pie and therefore works with -fpie object files, but the usage is not
> recommended from the toolchain perspective.
>

So are you suggesting we should also switch from -shared -Bsymbolic
to -pie if we can? I don't remember the details, but IIRC ld.bfd
didn't set the ELF binary type correctly, but perhaps this has now
been fixed.

> >The reason we have not used -fpie for code generation so far (which is
> >the compiler flag that should be used to generate code that is to be
> >linked with -pie) is that by default, it generates code based on
> >assumptions that only hold for shared libraries and PIE executables,
> >i.e., that gathering all relocatable quantities into a Global Offset
> >Table (GOT) is desirable because it reduces the CoW footprint, and
> >because it permits ELF symbol preemption (which lets an executable
> >override symbols defined in a shared library, in a way that forces the
> >shared library to update all of its internal references as well).
> >Ironically, this means we end up with many more absolute references that
> >all need to be fixed up at boot.
>
> This is not about symbol preemption (when the executable and a shared
> object define the same symbol, which one wins). An executable using a GOT
> which will be resolved to a shared object => this is regular relocation
> resolving and there is no preemption.
>
> It is that the compiler prefers code generation which can avoid text
> relocations / copy relocations / canonical PLT entries
> (https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#summary).
>

Fair enough. So the compiler cannot generate relative references to
undefined external symbols since it doesn't know at codegen time
whether the symbol reference will be satisfied by the executable
itself or by a shared library, and in the latter case, the relative
distance to the symbol is not known at build time, and so a runtime
relocation is required. But how about references to symbols with
external visibility that are defined in the same compilation unit? I
don't quite understand why those references need to go via the GOT as
well.

> >Fortunately, we can convince the compiler to handle this in a way that
> >is a bit more suitable for freestanding binaries such as the kernel, by
> >setting the 'hidden' visibility #pragma, which informs the compiler that
> >symbol preemption or CoW footprint are of no concern to us, and so
> >PC-relative references that are resolved at link time are perfectly
> >fine.
>
> Agree
>

The only unfortunate thing is that -fvisibility=hidden does not give
us the behavior we want, and we are forced to use the #pragma instead.

> >So let's enable this #pragma and build with -fpie when building a
> >relocatable kernel. This also means that all constant data items that
> >carry statically initialized pointer variables are now emitted into the
> >.data.rel.ro* sections, so move these into .rodata where they belong.
>
> LGTM, except: is ".rodata" a typo? The patch doesn't reference .rodata
>

I am referring to the .rodata pseudo-segment that we have in the
kernel, which runs from _etext to __inittext_begin.
On 2022-04-28, Ard Biesheuvel wrote:
>On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
>>
>> On 2022-04-27, Ard Biesheuvel wrote:
>> >We currently use ordinary, position dependent code generation for the
>> >core kernel, which happens to default to the 'small' code model on both
>> >GCC and Clang. This is the code model that relies on ADRP/ADD or
>> >ADRP/LDR pairs for symbol references, which are PC-relative with a range
>> >of -/+ 4 GiB, and therefore happen to be position independent in
>> >practice.
>> >
>> >This means that the fact that we can link the relocatable KASLR kernel
>> >using the -pie linker flag (which generates the runtime relocations and
>> >inserts them into the binary) is somewhat of a coincidence, and not
>> >something which is explicitly supported by the toolchains.
>>
>> Agree. The current -fno-PIE + -shared -Bsymbolic combo works as a
>> coincidence, not guaranteed by the toolchain.
>>
>> -shared needs -fpic object files. -shared -Bsymbolic is very similar to
>> -pie and therefore works with -fpie object files, but the usage is not
>> recommended from the toolchain perspective.
>>
>
>So are you suggesting we should also switch from -shared -Bsymbolic
>to -pie if we can? I don't remember the details, but IIRC ld.bfd
>didn't set the ELF binary type correctly, but perhaps this has now
>been fixed.

Yes, -shared -Bsymbolic => -pie, but that can be done later.

As for e_type (ET_DYN), I think it is unlikely that there was a bug.
-pie was added to binutils in 2003: it's close to -shared but doesn't
allow its definitions to be preempted/interposed. Code earlier than that
might use -shared -Bsymbolic before -pie was available.

>> >The reason we have not used -fpie for code generation so far (which is
>> >the compiler flag that should be used to generate code that is to be
>> >linked with -pie) is that by default, it generates code based on
>> >assumptions that only hold for shared libraries and PIE executables,
>> >i.e., that gathering all relocatable quantities into a Global Offset
>> >Table (GOT) is desirable because it reduces the CoW footprint, and
>> >because it permits ELF symbol preemption (which lets an executable
>> >override symbols defined in a shared library, in a way that forces the
>> >shared library to update all of its internal references as well).
>> >Ironically, this means we end up with many more absolute references that
>> >all need to be fixed up at boot.
>>
>> This is not about symbol preemption (when the executable and a shared
>> object define the same symbol, which one wins). An executable using a GOT
>> which will be resolved to a shared object => this is regular relocation
>> resolving and there is no preemption.
>>
>> It is that the compiler prefers code generation which can avoid text
>> relocations / copy relocations / canonical PLT entries
>> (https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#summary).
>>
>
>Fair enough. So the compiler cannot generate relative references to
>undefined external symbols since it doesn't know at codegen time
>whether the symbol reference will be satisfied by the executable
>itself or by a shared library, and in the latter case, the relative
>distance to the symbol is not known at build time, and so a runtime
>relocation is required.

Right.

>But how about references to symbols with
>external visibility that are defined in the same compilation unit? I
>don't quite understand why those references need to go via the GOT as
>well.

If you mean references to a non-local STV_DEFAULT (default visibility) definition =>

* -fpic: use GOT because the definition may be replaced by another at run time.
  Conservatively use a GOT-generating code sequence to allow potential symbol
  preemption (interposition). The linker may optimize out the GOT (x86-64
  GOTPCRELX, recent ld.lld for aarch64, powerpc64 TOC-indirect to
  TOC-relative optimization).
* -fpie or -fno-pie: the definition cannot be replaced. GOT is unneeded.

-fpie is an optimization on top of -fpic: (a) non-local STV_DEFAULT
definitions can be assumed non-interposable (b) (irrelevant to the
kernel) TLS can use more optimized models.

>> >Fortunately, we can convince the compiler to handle this in a way that
>> >is a bit more suitable for freestanding binaries such as the kernel, by
>> >setting the 'hidden' visibility #pragma, which informs the compiler that
>> >symbol preemption or CoW footprint are of no concern to us, and so
>> >PC-relative references that are resolved at link time are perfectly
>> >fine.
>>
>> Agree
>>
>
>The only unfortunate thing is that -fvisibility=hidden does not give
>us the behavior we want, and we are forced to use the #pragma instead.

Right. For a very long time there had been no option controlling the
access mode for undefined symbols (-fvisibility= is for defined
symbols).

I added -fdirect-access-external-data to Clang which supports
many architectures (x86, aarch64, arm, riscv, ...).
GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).

The use of `#pragma GCC visibility push(hidden)` looks good as a
portable solution.

>> >So let's enable this #pragma and build with -fpie when building a
>> >relocatable kernel. This also means that all constant data items that
>> >carry statically initialized pointer variables are now emitted into the
>> >.data.rel.ro* sections, so move these into .rodata where they belong.
>>
>> LGTM, except: is ".rodata" a typo? The patch doesn't reference .rodata
>>
>
>I am referring to the .rodata pseudo-segment that we have in the
>kernel, which runs from _etext to __inittext_begin.

OK

>> >Code size impact (GCC):
>> >
>> >Before:
>> >
>> >      text      data     bss     total filename
>> >  16712396  18659064  534556  35906016 vmlinux
>> >
>> >After:
>> >
>> >      text      data     bss     total filename
>> >  16804400  18612876  534556  35951832 vmlinux
>> >
>> >Code size impact (Clang):
>> >
>> >Before:
>> >
>> >      text      data     bss     total filename
>> >  17194584  13335060  535268  31064912 vmlinux
>> >
>> >After:
>> >
>> >      text      data     bss     total filename
>> >  17194536  13310032  535268  31039836 vmlinux

The size difference for Clang matches my expectation :)
I am somewhat surprised that data is smaller, though...

I wonder how GCC makes code bloated so much...
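The -fpic versus -fpie distinction for a non-local STV_DEFAULT definition, as described in this reply, can be tried on a small translation unit. This is only a sketch: `tunable`, `tunable_hidden` and the two accessor functions are hypothetical names, and the comments describe the code sequences one would typically expect rather than guaranteed output. Comparing the assembly from `-O2 -fpic -S` and `-O2 -fpie -S` on the toolchain at hand is the way to confirm.

```c
/* Non-local definition with default (STV_DEFAULT) visibility. */
int tunable = 1;

/*
 * -fpic: the compiler has to assume that `tunable` may be interposed at
 *        run time, so this access typically goes through the GOT.
 * -fpie or -fno-pie: the definition cannot be replaced, so a direct
 *        PC-relative access is sufficient.
 */
int read_tunable(void)
{
	return tunable;
}

/* Hidden visibility rules out interposition, even under -fpic. */
__attribute__((visibility("hidden"))) int tunable_hidden = 2;

int read_tunable_hidden(void)
{
	return tunable_hidden;
}
```

Both accessors return the same values under every flag combination; only the generated addressing sequence is expected to differ.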
On Thu, 28 Apr 2022 at 08:57, Fangrui Song <maskray@google.com> wrote:
>
> On 2022-04-28, Ard Biesheuvel wrote:
> >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
> >>
> >> On 2022-04-27, Ard Biesheuvel wrote:
> >> >We currently use ordinary, position dependent code generation for the
> >> >core kernel, which happens to default to the 'small' code model on both
> >> >GCC and Clang. This is the code model that relies on ADRP/ADD or
> >> >ADRP/LDR pairs for symbol references, which are PC-relative with a range
> >> >of -/+ 4 GiB, and therefore happen to be position independent in
> >> >practice.
> >> >
> >> >This means that the fact that we can link the relocatable KASLR kernel
> >> >using the -pie linker flag (which generates the runtime relocations and
> >> >inserts them into the binary) is somewhat of a coincidence, and not
> >> >something which is explicitly supported by the toolchains.
> >>
> >> Agree. The current -fno-PIE + -shared -Bsymbolic combo works as a
> >> coincidence, not guaranteed by the toolchain.
> >>
> >> -shared needs -fpic object files. -shared -Bsymbolic is very similar to
> >> -pie and therefore works with -fpie object files, but the usage is not
> >> recommended from the toolchain perspective.
> >>
> >
> >So are you suggesting we should also switch from -shared -Bsymbolic
> >to -pie if we can? I don't remember the details, but IIRC ld.bfd
> >didn't set the ELF binary type correctly, but perhaps this has now
> >been fixed.
>
> Yes, -shared -Bsymbolic => -pie, but that can be done later.
>
> As for e_type (ET_DYN), I think it is unlikely that there was a bug.
> -pie was added to binutils in 2003: it's close to -shared but doesn't
> allow its definitions to be preempted/interposed. Code earlier than that
> might use -shared -Bsymbolic before -pie was available.
>
> >> >The reason we have not used -fpie for code generation so far (which is
> >> >the compiler flag that should be used to generate code that is to be
> >> >linked with -pie) is that by default, it generates code based on
> >> >assumptions that only hold for shared libraries and PIE executables,
> >> >i.e., that gathering all relocatable quantities into a Global Offset
> >> >Table (GOT) is desirable because it reduces the CoW footprint, and
> >> >because it permits ELF symbol preemption (which lets an executable
> >> >override symbols defined in a shared library, in a way that forces the
> >> >shared library to update all of its internal references as well).
> >> >Ironically, this means we end up with many more absolute references that
> >> >all need to be fixed up at boot.
> >>
> >> This is not about symbol preemption (when the executable and a shared
> >> object define the same symbol, which one wins). An executable using a GOT
> >> which will be resolved to a shared object => this is regular relocation
> >> resolving and there is no preemption.
> >>
> >> It is that the compiler prefers code generation which can avoid text
> >> relocations / copy relocations / canonical PLT entries
> >> (https://maskray.me/blog/2021-01-09-copy-relocations-canonical-plt-entries-and-protected#summary).
> >>
> >
> >Fair enough. So the compiler cannot generate relative references to
> >undefined external symbols since it doesn't know at codegen time
> >whether the symbol reference will be satisfied by the executable
> >itself or by a shared library, and in the latter case, the relative
> >distance to the symbol is not known at build time, and so a runtime
> >relocation is required.
>
> Right.
>
> >But how about references to symbols with
> >external visibility that are defined in the same compilation unit? I
> >don't quite understand why those references need to go via the GOT as
> >well.
>
> If you mean references to a non-local STV_DEFAULT (default visibility) definition =>
>
> * -fpic: use GOT because the definition may be replaced by another at run time.
>   Conservatively use a GOT-generating code sequence to allow potential symbol
>   preemption (interposition). The linker may optimize out the GOT (x86-64
>   GOTPCRELX, recent ld.lld for aarch64, powerpc64 TOC-indirect to
>   TOC-relative optimization).
> * -fpie or -fno-pie: the definition cannot be replaced. GOT is unneeded.
>
> -fpie is an optimization on top of -fpic: (a) non-local STV_DEFAULT
> definitions can be assumed non-interposable (b) (irrelevant to the
> kernel) TLS can use more optimized models.
>
> >> >Fortunately, we can convince the compiler to handle this in a way that
> >> >is a bit more suitable for freestanding binaries such as the kernel, by
> >> >setting the 'hidden' visibility #pragma, which informs the compiler that
> >> >symbol preemption or CoW footprint are of no concern to us, and so
> >> >PC-relative references that are resolved at link time are perfectly
> >> >fine.
> >>
> >> Agree
> >>
> >
> >The only unfortunate thing is that -fvisibility=hidden does not give
> >us the behavior we want, and we are forced to use the #pragma instead.
>
> Right. For a very long time there had been no option controlling the
> access mode for undefined symbols (-fvisibility= is for defined
> symbols).
>
> I added -fdirect-access-external-data to Clang which supports
> many architectures (x86, aarch64, arm, riscv, ...).
> GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
>
> The use of `#pragma GCC visibility push(hidden)` looks good as a
> portable solution.
>

OK

> >> >So let's enable this #pragma and build with -fpie when building a
> >> >relocatable kernel. This also means that all constant data items that
> >> >carry statically initialized pointer variables are now emitted into the
> >> >.data.rel.ro* sections, so move these into .rodata where they belong.
> >>
> >> LGTM, except: is ".rodata" a typo? The patch doesn't reference .rodata
> >>
> >
> >I am referring to the .rodata pseudo-segment that we have in the
> >kernel, which runs from _etext to __inittext_begin.
>
> OK
>
> >> >Code size impact (GCC):
> >> >
> >> >Before:
> >> >
> >> >      text      data     bss     total filename
> >> >  16712396  18659064  534556  35906016 vmlinux
> >> >
> >> >After:
> >> >
> >> >      text      data     bss     total filename
> >> >  16804400  18612876  534556  35951832 vmlinux
> >> >
> >> >Code size impact (Clang):
> >> >
> >> >Before:
> >> >
> >> >      text      data     bss     total filename
> >> >  17194584  13335060  535268  31064912 vmlinux
> >> >
> >> >After:
> >> >
> >> >      text      data     bss     total filename
> >> >  17194536  13310032  535268  31039836 vmlinux
>
> The size difference for Clang matches my expectation :)
> I am somewhat surprised that data is smaller, though...
>
> I wonder how GCC makes code bloated so much...
>

This is caused by the use of RELA format instead of RELR.
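For a sense of scale behind the RELA versus RELR remark: a 64-bit RELA entry has three 8-byte fields, whereas RELR stores 8-byte words, where an address word can be followed by bitmap words that each cover up to 63 further word-sized slots. The sketch below only does this arithmetic for the best case of contiguous relative relocations. The struct and function names are illustrative, nothing here parses an ELF file, and it makes no claim about the specific figures quoted in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Same layout as a 64-bit ELF RELA entry (Elf64_Rela): 24 bytes. */
struct rela64_entry {
	uint64_t r_offset;
	uint64_t r_info;
	int64_t  r_addend;
};

/* Bytes needed to describe n relocations in RELA format. */
static inline size_t rela_table_bytes(size_t n)
{
	return n * sizeof(struct rela64_entry);
}

/*
 * Bytes needed in RELR format when all n relocated words are contiguous
 * (the best case): one address word covers the first slot, and each
 * following bitmap word covers up to 63 more slots.
 */
static inline size_t relr_table_bytes_contiguous(size_t n)
{
	size_t bitmap_words;

	if (n == 0)
		return 0;
	bitmap_words = (n - 1 + 62) / 63; /* ceil((n - 1) / 63) */
	return (1 + bitmap_words) * sizeof(uint64_t);
}
```

For 64 contiguous pointers, such as a statically initialized table of function pointers, this gives 1536 bytes of RELA against 16 bytes of RELR. Scattered relocations compress less well.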
On Wed, Apr 27, 2022 at 11:57 PM Fangrui Song <maskray@google.com> wrote:
>
> On 2022-04-28, Ard Biesheuvel wrote:
> >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
> >>
> >> On 2022-04-27, Ard Biesheuvel wrote:
> >> >Fortunately, we can convince the compiler to handle this in a way that
> >> >is a bit more suitable for freestanding binaries such as the kernel, by
> >> >setting the 'hidden' visibility #pragma, which informs the compiler that
> >> >symbol preemption or CoW footprint are of no concern to us, and so
> >> >PC-relative references that are resolved at link time are perfectly
> >> >fine.
> >>
> >> Agree
> >>
> >
> >The only unfortunate thing is that -fvisibility=hidden does not give
> >us the behavior we want, and we are forced to use the #pragma instead.
>
> Right. For a very long time there had been no option controlling the
> access mode for undefined symbols (-fvisibility= is for defined
> symbols).
>
> I added -fdirect-access-external-data to Clang which supports
> many architectures (x86, aarch64, arm, riscv, ...).
> GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
>
> The use of `#pragma GCC visibility push(hidden)` looks good as a
> portable solution.

Portable, sure, which is fine for now.

But there's just something about injecting a header into every TU via
-include in order to set a pragma and that there's such pragmas
affecting codegen that makes my skin crawl.

Perhaps we can come up with a formal feature request for toolchain
vendors for an actual command line flag?

Does the pragma have the same effect as
`-fdirect-access-external-data`/`-mdirect-extern-access`, or would
this feature request look like yet another distinct flag?
On Thu, 28 Apr 2022 at 20:53, Nick Desaulniers <ndesaulniers@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 11:57 PM Fangrui Song <maskray@google.com> wrote:
> >
> > On 2022-04-28, Ard Biesheuvel wrote:
> > >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
> > >>
> > >> On 2022-04-27, Ard Biesheuvel wrote:
> > >> >Fortunately, we can convince the compiler to handle this in a way that
> > >> >is a bit more suitable for freestanding binaries such as the kernel, by
> > >> >setting the 'hidden' visibility #pragma, which informs the compiler that
> > >> >symbol preemption or CoW footprint are of no concern to us, and so
> > >> >PC-relative references that are resolved at link time are perfectly
> > >> >fine.
> > >>
> > >> Agree
> > >>
> > >
> > >The only unfortunate thing is that -fvisibility=hidden does not give
> > >us the behavior we want, and we are forced to use the #pragma instead.
> >
> > Right. For a very long time there had been no option controlling the
> > access mode for undefined symbols (-fvisibility= is for defined
> > symbols).
> >
> > I added -fdirect-access-external-data to Clang which supports
> > many architectures (x86, aarch64, arm, riscv, ...).
> > GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
> >
> > The use of `#pragma GCC visibility push(hidden)` looks good as a
> > portable solution.
>
> Portable, sure, which is fine for now.
>
> But there's just something about injecting a header into every TU via
> -include in order to set a pragma and that there's such pragmas
> affecting codegen that makes my skin crawl.
>
> Perhaps we can come up with a formal feature request for toolchain
> vendors for an actual command line flag?
>
> Does the pragma have the same effect as
> `-fdirect-access-external-data`/`-mdirect-extern-access`, or would
> this feature request look like yet another distinct flag?

I agree that this is rather nasty. What I don't understand is why
-fvisibility=hidden gives different behavior to begin with, or why
-ffreestanding -fpie builds don't default to hidden visibility for
symbol declarations as well as definitions.
On 2022-04-28, Ard Biesheuvel wrote:
>On Thu, 28 Apr 2022 at 20:53, Nick Desaulniers <ndesaulniers@google.com> wrote:
>>
>> On Wed, Apr 27, 2022 at 11:57 PM Fangrui Song <maskray@google.com> wrote:
>> >
>> > On 2022-04-28, Ard Biesheuvel wrote:
>> > >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
>> > >>
>> > >> On 2022-04-27, Ard Biesheuvel wrote:
>> > >> >Fortunately, we can convince the compiler to handle this in a way that
>> > >> >is a bit more suitable for freestanding binaries such as the kernel, by
>> > >> >setting the 'hidden' visibility #pragma, which informs the compiler that
>> > >> >symbol preemption or CoW footprint are of no concern to us, and so
>> > >> >PC-relative references that are resolved at link time are perfectly
>> > >> >fine.
>> > >>
>> > >> Agree
>> > >>
>> > >
>> > >The only unfortunate thing is that -fvisibility=hidden does not give
>> > >us the behavior we want, and we are forced to use the #pragma instead.
>> >
>> > Right. For a very long time there had been no option controlling the
>> > access mode for undefined symbols (-fvisibility= is for defined
>> > symbols).
>> >
>> > I added -fdirect-access-external-data to Clang which supports
>> > many architectures (x86, aarch64, arm, riscv, ...).
>> > GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
>> >
>> > The use of `#pragma GCC visibility push(hidden)` looks good as a
>> > portable solution.
>>
>> Portable, sure, which is fine for now.
>>
>> But there's just something about injecting a header into every TU via
>> -include in order to set a pragma and that there's such pragmas
>> affecting codegen that makes my skin crawl.
>>
>> Perhaps we can come up with a formal feature request for toolchain
>> vendors for an actual command line flag?
>>
>> Does the pragma have the same effect as
>> `-fdirect-access-external-data`/`-mdirect-extern-access`, or would
>> this feature request look like yet another distinct flag?

`#pragma GCC visibility push(hidden)` is very similar to
-fvisibility=hidden -fdirect-access-external-data with Clang.
In Clang there are only two differences:

// TLS initial-exec model with -fdirect-access-external-data;
// TLS local-exec model with `#pragma GCC visibility push(hidden)`
extern __thread int var;
int foo() { return var; }

// hidden visibility suppresses -fno-plt.
// -fdirect-access-external-data / GCC -mdirect-extern-access doesn't suppress -fno-plt.
extern int bar();
int foo() { return bar() + 2; }


The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
-fdirect-access-external-data can replace `#pragma GCC visibility
push(hidden)`.

>I agree that this is rather nasty. What I don't understand is why
>-fvisibility=hidden gives different behavior to begin with, or why
>-ffreestanding -fpie builds don't default to hidden visibility for
>symbol declarations as well as definitions.

-ffreestanding doesn't mean there is no DSO. A libc implementation (e.g.
musl) may use -ffreestanding to avoid libc dependencies from the host
environment. It may ship several shared objects and export multiple symbols.
Implied -fvisibility=hidden will get in the way.

There is merit in keeping options orthogonal.
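The distinction between declarations and definitions raised above can be
illustrated per symbol with the GCC/Clang visibility attribute, which the
pragma applies wholesale. This is only a sketch and not something proposed in
the thread: the symbol names are hypothetical, and both definitions are kept in
the same file purely so that the example links on its own.

```c
/*
 * Illustration only: boot_flags, debug_level and sum_settings are
 * hypothetical names, not kernel symbols.
 */

/*
 * Declaration with explicit hidden visibility: the compiler may assume the
 * symbol is defined in the same linked image and can reference it
 * PC-relative instead of loading its address from the GOT.
 */
extern int boot_flags __attribute__((visibility("hidden")));

/*
 * Plain declaration: under -fpie this keeps default visibility, and
 * -fvisibility=hidden does not change that because the option only
 * applies to definitions.
 */
extern int debug_level;

int sum_settings(void)
{
	return boot_flags + debug_level;
}

/*
 * Definitions placed here so the sketch is self-contained. To compare the
 * generated code for the two references, move them to a separate file and
 * inspect the output of the compiler with -fpie -S.
 */
int boot_flags = 3;
int debug_level = 4;
```

Annotating every declaration this way does not scale to a whole kernel tree,
which is presumably why the pragma in a force-included header is attractive
despite its ugliness.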
On Fri, 29 Apr 2022 at 09:03, Fangrui Song <maskray@google.com> wrote:
>
> On 2022-04-28, Ard Biesheuvel wrote:
> >On Thu, 28 Apr 2022 at 20:53, Nick Desaulniers <ndesaulniers@google.com> wrote:
> >>
> >> On Wed, Apr 27, 2022 at 11:57 PM Fangrui Song <maskray@google.com> wrote:
> >> >
> >> > On 2022-04-28, Ard Biesheuvel wrote:
> >> > >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
> >> > >>
> >> > >> On 2022-04-27, Ard Biesheuvel wrote:
> >> > >> >Fortunately, we can convince the compiler to handle this in a way that
> >> > >> >is a bit more suitable for freestanding binaries such as the kernel, by
> >> > >> >setting the 'hidden' visibility #pragma, which informs the compiler that
> >> > >> >symbol preemption or CoW footprint are of no concern to us, and so
> >> > >> >PC-relative references that are resolved at link time are perfectly
> >> > >> >fine.
> >> > >>
> >> > >> Agree
> >> > >>
> >> > >
> >> > >The only unfortunate thing is that -fvisibility=hidden does not give
> >> > >us the behavior we want, and we are forced to use the #pragma instead.
> >> >
> >> > Right. For a very long time there had been no option controlling the
> >> > access mode for undefined symbols (-fvisibility= is for defined
> >> > symbols).
> >> >
> >> > I added -fdirect-access-external-data to Clang which supports
> >> > many architectures (x86, aarch64, arm, riscv, ...).
> >> > GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
> >> >
> >> > The use of `#pragma GCC visibility push(hidden)` looks good as a
> >> > portable solution.
> >>
> >> Portable, sure, which is fine for now.
> >>
> >> But there's just something about injecting a header into every TU via
> >> -include in order to set a pragma and that there's such pragmas
> >> affecting codegen that makes my skin crawl.
> >>
> >> Perhaps we can come up with a formal feature request for toolchain
> >> vendors for an actual command line flag?
> >>
> >> Does the pragma have the same effect as
> >> `-fdirect-access-external-data`/`-mdirect-extern-access`, or would
> >> this feature request look like yet another distinct flag?
>
> `#pragma GCC visibility push(hidden)` is very similar to
> -fvisibility=hidden -fdirect-access-external-data with Clang.
> In Clang there are only two differences:
>
> // TLS initial-exec model with -fdirect-access-external-data;
> // TLS local-exec model with `#pragma GCC visibility push(hidden)`
> extern __thread int var;
> int foo() { return var; }
>
> // hidden visibility suppresses -fno-plt.
> // -fdirect-access-external-data / GCC -mdirect-extern-access doesn't suppress -fno-plt.
> extern int bar();
> int foo() { return bar() + 2; }
>
>
> The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
> -fdirect-access-external-data can replace `#pragma GCC visibility
> push(hidden)`.
>

OK. But you mentioned that GCC does not implement
-mdirect-extern-access for AArch64, right? So for now, the pragma is
the only portable option we have.

> >I agree that this is rather nasty. What I don't understand is why
> >-fvisibility=hidden gives different behavior to begin with, or why
> >-ffreestanding -fpie builds don't default to hidden visibility for
> >symbol declarations as well as definitions.
>
> -ffreestanding doesn't mean there is no DSO. A libc implementation (e.g.
> musl) may use -ffreestanding to avoid libc dependencies from the host
> environment. It may ship several shared objects and export multiple symbols.
> Implied -fvisibility=hidden will get in the way.
>
> There is merit in keeping options orthogonal.

Fair enough.
On 2022-04-29, Ard Biesheuvel wrote:
>On Fri, 29 Apr 2022 at 09:03, Fangrui Song <maskray@google.com> wrote:
>>
>> On 2022-04-28, Ard Biesheuvel wrote:
>> >On Thu, 28 Apr 2022 at 20:53, Nick Desaulniers <ndesaulniers@google.com> wrote:
>> >>
>> >> On Wed, Apr 27, 2022 at 11:57 PM Fangrui Song <maskray@google.com> wrote:
>> >> >
>> >> > On 2022-04-28, Ard Biesheuvel wrote:
>> >> > >On Thu, 28 Apr 2022 at 04:40, Fangrui Song <maskray@google.com> wrote:
>> >> > >>
>> >> > >> On 2022-04-27, Ard Biesheuvel wrote:
>> >> > >> >Fortunately, we can convince the compiler to handle this in a way that
>> >> > >> >is a bit more suitable for freestanding binaries such as the kernel, by
>> >> > >> >setting the 'hidden' visibility #pragma, which informs the compiler that
>> >> > >> >symbol preemption or CoW footprint are of no concern to us, and so
>> >> > >> >PC-relative references that are resolved at link time are perfectly
>> >> > >> >fine.
>> >> > >>
>> >> > >> Agree
>> >> > >>
>> >> > >
>> >> > >The only unfortunate thing is that -fvisibility=hidden does not give
>> >> > >us the behavior we want, and we are forced to use the #pragma instead.
>> >> >
>> >> > Right. For a very long time there had been no option controlling the
>> >> > access mode for undefined symbols (-fvisibility= is for defined
>> >> > symbols).
>> >> >
>> >> > I added -fdirect-access-external-data to Clang which supports
>> >> > many architectures (x86, aarch64, arm, riscv, ...).
>> >> > GCC's x86 port added -mdirect-extern-access in 2022-02 (not available on aarch64).
>> >> >
>> >> > The use of `#pragma GCC visibility push(hidden)` looks good as a
>> >> > portable solution.
>> >>
>> >> Portable, sure, which is fine for now.
>> >>
>> >> But there's just something about injecting a header into every TU via
>> >> -include in order to set a pragma and that there's such pragmas
>> >> affecting codegen that makes my skin crawl.
>> >>
>> >> Perhaps we can come up with a formal feature request for toolchain
>> >> vendors for an actual command line flag?
>> >>
>> >> Does the pragma have the same effect as
>> >> `-fdirect-access-external-data`/`-mdirect-extern-access`, or would
>> >> this feature request look like yet another distinct flag?
>>
>> `#pragma GCC visibility push(hidden)` is very similar to
>> -fvisibility=hidden -fdirect-access-external-data with Clang.
>> In Clang there are only two differences:
>>
>> // TLS initial-exec model with -fdirect-access-external-data;
>> // TLS local-exec model with `#pragma GCC visibility push(hidden)`
>> extern __thread int var;
>> int foo() { return var; }
>>
>> // hidden visibility suppresses -fno-plt.
>> // -fdirect-access-external-data / GCC -mdirect-extern-access doesn't suppress -fno-plt.
>> extern int bar();
>> int foo() { return bar() + 2; }
>>
>>
>> The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
>> -fdirect-access-external-data can replace `#pragma GCC visibility
>> push(hidden)`.
>>
>
>OK. But you mentioned that GCC does not implement
>-mdirect-extern-access for AArch64, right? So for now, the pragma is
>the only portable option we have.

Right.

>> >I agree that this is rather nasty. What I don't understand is why
>> >-fvisibility=hidden gives different behavior to begin with, or why
>> >-ffreestanding -fpie builds don't default to hidden visibility for
>> >symbol declarations as well as definitions.
>>
>> -ffreestanding doesn't mean there is no DSO. A libc implementation (e.g.
>> musl) may use -ffreestanding to avoid libc dependencies from the host
>> environment. It may ship several shared objects and export multiple symbols.
>> Implied -fvisibility=hidden will get in the way.
>>
>> There is merit in keeping options orthogonal.
>
>Fair enough.
diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile
index 2f1de88651e6..94b6c51f5de6 100644
--- a/arch/arm64/Makefile
+++ b/arch/arm64/Makefile
@@ -18,6 +18,10 @@ ifeq ($(CONFIG_RELOCATABLE), y)
 # with the relocation offsets always being zero.
 LDFLAGS_vmlinux		+= -shared -Bsymbolic -z notext \
 			$(call ld-option, --no-apply-dynamic-relocs)
+
+# Generate position independent code without relying on a Global Offset Table
+KBUILD_CFLAGS_KERNEL	+= -fpie -include $(srctree)/include/linux/hidden.h
+
 endif
 
 ifeq ($(CONFIG_ARM64_ERRATUM_843419),y)
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index edaf0faf766f..b1e071ac1acf 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -174,8 +174,6 @@ SECTIONS
 			KEXEC_TEXT
 			TRAMP_TEXT
 			*(.gnu.warning)
-		. = ALIGN(16);
-		*(.got)			/* Global offset table		*/
 	}
 
 	/*
@@ -192,6 +190,8 @@ SECTIONS
 	/* everything from this point to __init_begin will be marked RO NX */
 	RO_DATA(PAGE_SIZE)
 
+	.data.rel.ro : ALIGN(8) { *(.got) *(.data.rel.ro*) }
+
 	HYPERVISOR_DATA_SECTIONS
 
 	idmap_pg_dir = .;
@@ -273,6 +273,8 @@ SECTIONS
 	_sdata = .;
 	RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_ALIGN)
 
+	.data.rel : ALIGN(8) { *(.data.rel*) }
+
 	/*
 	 * Data written with the MMU off but read with the MMU on requires
 	 * cache lines to be invalidated, discarding up to a Cache Writeback
@@ -320,9 +322,6 @@ SECTIONS
 		*(.plt) *(.plt.*) *(.iplt) *(.igot .igot.plt)
 	}
 	ASSERT(SIZEOF(.plt) == 0, "Unexpected run-time procedure linkages detected!")
-
-	.data.rel.ro : { *(.data.rel.ro) }
-	ASSERT(SIZEOF(.data.rel.ro) == 0, "Unexpected RELRO detected!")
 }
 
 #include "image-vars.h"
We currently use ordinary, position dependent code generation for the
core kernel, which happens to default to the 'small' code model on both
GCC and Clang. This is the code model that relies on ADRP/ADD or
ADRP/LDR pairs for symbol references, which are PC-relative with a range
of -/+ 4 GiB, and therefore happen to be position independent in
practice.

This means that the fact that we can link the relocatable KASLR kernel
using the -pie linker flag (which generates the runtime relocations and
inserts them into the binary) is somewhat of a coincidence, and not
something which is explicitly supported by the toolchains.

The reason we have not used -fpie for code generation so far (which is
the compiler flag that should be used to generate code that is to be
linked with -pie) is that by default, it generates code based on
assumptions that only hold for shared libraries and PIE executables,
i.e., that gathering all relocatable quantities into a Global Offset
Table (GOT) is desirable because it reduces the CoW footprint, and
because it permits ELF symbol preemption (which lets an executable
override symbols defined in a shared library, in a way that forces the
shared library to update all of its internal references as well).
Ironically, this means we end up with many more absolute references that
all need to be fixed up at boot.

Fortunately, we can convince the compiler to handle this in a way that
is a bit more suitable for freestanding binaries such as the kernel, by
setting the 'hidden' visibility #pragma, which informs the compiler that
symbol preemption or CoW footprint are of no concern to us, and so
PC-relative references that are resolved at link time are perfectly
fine.

So let's enable this #pragma and build with -fpie when building a
relocatable kernel. This also means that all constant data items that
carry statically initialized pointer variables are now emitted into the
.data.rel.ro* sections, so move these into .rodata where they belong.
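As a rough illustration of the mechanism described above, the sketch below
shows what a translation unit looks like once the pragma is in effect, as it
would be after a header containing it is force-included with -include. The
content of include/linux/hidden.h is not part of this patch, so the pragma
line stands in for it as an assumption, and kernel_offset and
kernel_offset_ptr are hypothetical names.

```c
/*
 * Assumed effect of the force-included header: every declaration that
 * follows gets hidden visibility.
 */
#pragma GCC visibility push(hidden)

/* Hypothetical symbol, normally defined in some other object file. */
extern unsigned long kernel_offset;

unsigned long *kernel_offset_ptr(void)
{
	/*
	 * With hidden visibility and -fpie, an AArch64 compiler can form this
	 * address with an ADRP/ADD pair that the linker resolves statically.
	 * With default visibility it would typically emit ADRP/LDR from a GOT
	 * slot, and that slot needs a relocation applied at boot.
	 */
	return &kernel_offset;
}

/* Defined here only so that the sketch links as a single file. */
unsigned long kernel_offset = 0x80000;

/*
 * A force-included header would never pop; the pop is here so that code
 * following this sketch is unaffected.
 */
#pragma GCC visibility pop
```

Building a variant of this with the definition moved to another file, once
with and once without the pragma, and comparing the relocations reported by
`readelf -r` on the object file is one way to see the difference.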
Code size impact (GCC):

Before:

      text       data        bss      total filename
  16712396   18659064     534556   35906016 vmlinux

After:

      text       data        bss      total filename
  16804400   18612876     534556   35951832 vmlinux

Code size impact (Clang):

Before:

      text       data        bss      total filename
  17194584   13335060     535268   31064912 vmlinux

After:

      text       data        bss      total filename
  17194536   13310032     535268   31039836 vmlinux

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 arch/arm64/Makefile             | 4 ++++
 arch/arm64/kernel/vmlinux.lds.S | 9 ++++-----
 2 files changed, 8 insertions(+), 5 deletions(-)