[PATCH-next,v2] arm32: enable HAVE_LD_DEAD_CODE_DATA_ELIMINATION

Message ID 20240307151231.654025-1-liuyuntao12@huawei.com (mailing list archive)
State New, archived

Commit Message

liuyuntao (F) March 7, 2024, 3:12 p.m. UTC
The current arm32 architecture does not yet support the
HAVE_LD_DEAD_CODE_DATA_ELIMINATION feature. arm32 is widely used in
embedded scenarios, and enabling this feature would be beneficial for
reducing the size of the kernel image.

To make this work, the necessary tables are annotated with KEEP in the
linker scripts, and wildcard patterns are added so that compiler-generated
sections end up in the right place.

It boots normally with defconfig, vexpress_defconfig and tinyconfig.

The size comparison of zImage is as follows:
defconfig       vexpress_defconfig      tinyconfig
5137712         5138024                 424192          no dce
5032560         4997824                 298384          dce
2.0%            2.7%                    29.7%           shrink

With a smaller config, the reduction in zImage size is much more
significant (29.7% for tinyconfig).

We also tested this patch on a commercially available single-board
computer, and the comparison is as follows:
a15eb_config
2161384         no dce
2092240         dce
3.2%            shrink

The zImage size has been reduced by approximately 3.2%, which is 70KB on
2.1M.

Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>
---
v2:
   - Support config XIP_KERNEL.
   - Support LLVM compilation.

v1: https://lore.kernel.org/all/20240220081527.23408-1-liuyuntao12@huawei.com/
---
 arch/arm/Kconfig                       |  1 +
 arch/arm/boot/compressed/vmlinux.lds.S |  4 ++--
 arch/arm/include/asm/vmlinux.lds.h     | 18 +++++++++++++++++-
 arch/arm/kernel/vmlinux-xip.lds.S      |  8 ++++++--
 arch/arm/kernel/vmlinux.lds.S          | 10 +++++++---
 5 files changed, 33 insertions(+), 8 deletions(-)
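
The shrink percentages in the commit message can be re-derived from the
raw zImage sizes. The following is only a small sketch of that arithmetic;
the helper names (bytes_saved, shrink_percent, print_row) are made up for
illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>

/* Bytes removed by dead code/data elimination (hypothetical helper). */
unsigned long bytes_saved(unsigned long before, unsigned long after)
{
	assert(after <= before);
	return before - after;
}

/* Relative reduction in percent, computed against the "no dce" size. */
double shrink_percent(unsigned long before, unsigned long after)
{
	assert(before > 0);
	return 100.0 * (double)bytes_saved(before, after) / (double)before;
}

/* Print one row: config name, both sizes, bytes saved, relative shrink. */
void print_row(const char *config, unsigned long before, unsigned long after)
{
	printf("%-20s %8lu %8lu %7lu %5.1f%%\n", config, before, after,
	       bytes_saved(before, after), shrink_percent(before, after));
}
```

Calling print_row("a15eb_config", 2161384UL, 2092240UL) reports 69144 bytes
saved, which is where the "approximately 3.2%" and "70KB" figures above come
from.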

Comments

Arnd Bergmann March 8, 2024, 1:15 p.m. UTC | #1
On Thu, Mar 7, 2024, at 16:12, Yuntao Liu wrote:
> The current arm32 architecture does not yet support the
> HAVE_LD_DEAD_CODE_DATA_ELIMINATION feature. arm32 is widely used in
> embedded scenarios, and enabling this feature would be beneficial for
> reducing the size of the kernel image.
>
> In order to make this work, we keep the necessary tables by annotating
> them with KEEP, also it requires further changes to linker script to KEEP
> some tables and wildcard compiler generated sections into the right place.
>
> It boots normally with defconfig, vexpress_defconfig and tinyconfig.
>
> The size comparison of zImage is as follows:
> defconfig       vexpress_defconfig      tinyconfig
> 5137712         5138024                 424192          no dce
> 5032560         4997824                 298384          dce
> 2.0%            2.7%                    29.7%           shrink
>
> When using smaller config file, there is a significant reduction in the
> size of the zImage.
>
> We also tested this patch on a commercially available single-board
> computer, and the comparison is as follows:
> a15eb_config
> 2161384         no dce
> 2092240         dce
> 3.2%            shrink
>
> The zImage size has been reduced by approximately 3.2%, which is 70KB on
> 2.1M.
>
> Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>

I've retested with both gcc-13 and clang-18, and saw no
more build issues. Your previous version already worked
fine for me.

I did some tests combining this with CONFIG_TRIM_UNUSED_KSYMS,
which showed a significant improvement as expected. I also
tried combining it with an experimental CONFIG_LTO_CLANG
patch, but that did not show any further improvements.

Tested-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>

Adding Ard Biesheuvel and Fangrui Song to Cc, so they can comment
on the ARM_VECTORS_TEXT workaround. I don't understand enough of
the details of what is going on here.

Full quote of the patch below so they can see the whole thing.

If they are also happy with the patch, I think you can send it
into Russell's patch tracker at
https://www.armlinux.org.uk/developer/patches/info.php

> ---
> v2:
>    - Support config XIP_KERNEL.
>    - Support LLVM compilation.
>
> v1: https://lore.kernel.org/all/20240220081527.23408-1-liuyuntao12@huawei.com/
> ---
>  arch/arm/Kconfig                       |  1 +
>  arch/arm/boot/compressed/vmlinux.lds.S |  4 ++--
>  arch/arm/include/asm/vmlinux.lds.h     | 18 +++++++++++++++++-
>  arch/arm/kernel/vmlinux-xip.lds.S      |  8 ++++++--
>  arch/arm/kernel/vmlinux.lds.S          | 10 +++++++---
>  5 files changed, 33 insertions(+), 8 deletions(-)
>
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 0af6709570d1..de78ceb821df 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -113,6 +113,7 @@ config ARM
>  	select HAVE_KERNEL_XZ
>  	select HAVE_KPROBES if !XIP_KERNEL && !CPU_ENDIAN_BE32 && !CPU_V7M
>  	select HAVE_KRETPROBES if HAVE_KPROBES
> +	select HAVE_LD_DEAD_CODE_DATA_ELIMINATION
>  	select HAVE_MOD_ARCH_SPECIFIC
>  	select HAVE_NMI
>  	select HAVE_OPTPROBES if !THUMB2_KERNEL
> diff --git a/arch/arm/boot/compressed/vmlinux.lds.S b/arch/arm/boot/compressed/vmlinux.lds.S
> index 3fcb3e62dc56..da21244aa892 100644
> --- a/arch/arm/boot/compressed/vmlinux.lds.S
> +++ b/arch/arm/boot/compressed/vmlinux.lds.S
> @@ -89,7 +89,7 @@ SECTIONS
>       * The EFI stub always executes from RAM, and runs strictly before the
>       * decompressor, so we can make an exception for its r/w data, and keep it
>       */
> -    *(.data.efistub .bss.efistub)
> +    *(.data.* .bss.*)
>      __pecoff_data_end = .;
> 
>      /*
> @@ -125,7 +125,7 @@ SECTIONS
> 
>    . = BSS_START;
>    __bss_start = .;
> -  .bss			: { *(.bss) }
> +  .bss			: { *(.bss .bss.*) }
>    _end = .;
> 
>    . = ALIGN(8);		/* the stack must be 64-bit aligned */
> diff --git a/arch/arm/include/asm/vmlinux.lds.h b/arch/arm/include/asm/vmlinux.lds.h
> index 4c8632d5c432..dfe2b6ad6b51 100644
> --- a/arch/arm/include/asm/vmlinux.lds.h
> +++ b/arch/arm/include/asm/vmlinux.lds.h
> @@ -42,7 +42,7 @@
>  #define PROC_INFO							\
>  		. = ALIGN(4);						\
>  		__proc_info_begin = .;					\
> -		*(.proc.info.init)					\
> +		KEEP(*(.proc.info.init))				\
>  		__proc_info_end = .;
> 
>  #define IDMAP_TEXT							\
> @@ -87,6 +87,22 @@
>  		*(.vfp11_veneer)                                        \
>  		*(.v4_bx)
> 
> +/*
> +When CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is enabled, it is important to
> +annotate .vectors sections with KEEP. While linking with ld, it is
> +acceptable to directly use KEEP with .vectors sections in ARM_VECTORS.
> +However, when using ld.lld for linking, KEEP is not recognized within the
> +OVERLAY command; it is treated as a regular string. Hence, it is advisable
> +to define a distinct section here that explicitly retains the .vectors
> +sections when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is turned on.
> +*/
> +#define ARM_VECTORS_TEXT						\
> +	.vectors.text : {						\
> +		KEEP(*(.vectors))					\
> +		KEEP(*(.vectors.bhb.loop8))				\
> +		KEEP(*(.vectors.bhb.bpiall))				\
> +       }
> +
>  #define ARM_TEXT							\
>  		IDMAP_TEXT						\
>  		__entry_text_start = .;					\
> diff --git a/arch/arm/kernel/vmlinux-xip.lds.S b/arch/arm/kernel/vmlinux-xip.lds.S
> index c16d196b5aad..035fa18060b3 100644
> --- a/arch/arm/kernel/vmlinux-xip.lds.S
> +++ b/arch/arm/kernel/vmlinux-xip.lds.S
> @@ -63,7 +63,7 @@ SECTIONS
>  	. = ALIGN(4);
>  	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
>  		__start___ex_table = .;
> -		ARM_MMU_KEEP(*(__ex_table))
> +		ARM_MMU_KEEP(KEEP(*(__ex_table)))
>  		__stop___ex_table = .;
>  	}
> 
> @@ -83,7 +83,7 @@ SECTIONS
>  	}
>  	.init.arch.info : {
>  		__arch_info_begin = .;
> -		*(.arch.info.init)
> +		KEEP(*(.arch.info.init))
>  		__arch_info_end = .;
>  	}
>  	.init.tagtable : {
> @@ -135,6 +135,10 @@ SECTIONS
>  	ARM_TCM
>  #endif
> 
> +#ifdef LD_DEAD_CODE_DATA_ELIMINATION
> +	ARM_VECTORS_TEXT
> +#endif
> +
>  	/*
>  	 * End of copied data. We need a dummy section to get its LMA.
>  	 * Also located before final ALIGN() as trailing padding is not stored
> diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
> index bd9127c4b451..2cfb890c93fb 100644
> --- a/arch/arm/kernel/vmlinux.lds.S
> +++ b/arch/arm/kernel/vmlinux.lds.S
> @@ -74,7 +74,7 @@ SECTIONS
>  	. = ALIGN(4);
>  	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
>  		__start___ex_table = .;
> -		ARM_MMU_KEEP(*(__ex_table))
> +		ARM_MMU_KEEP(KEEP(*(__ex_table)))
>  		__stop___ex_table = .;
>  	}
> 
> @@ -99,7 +99,7 @@ SECTIONS
>  	}
>  	.init.arch.info : {
>  		__arch_info_begin = .;
> -		*(.arch.info.init)
> +		KEEP(*(.arch.info.init))
>  		__arch_info_end = .;
>  	}
>  	.init.tagtable : {
> @@ -116,7 +116,7 @@ SECTIONS
>  #endif
>  	.init.pv_table : {
>  		__pv_table_begin = .;
> -		*(.pv_table)
> +		KEEP(*(.pv_table))
>  		__pv_table_end = .;
>  	}

I previously asked about this bit, since it appeared that this
might prevent a lot of code from being discarded when
CONFIG_ARM_PATCH_PHYS_VIRT is set. I tested this again now,
and found this makes very little practical difference, so
it's all good.

> @@ -134,6 +134,10 @@ SECTIONS
>  	ARM_TCM
>  #endif
> 
> +#ifdef LD_DEAD_CODE_DATA_ELIMINATION
> +	ARM_VECTORS_TEXT
> +#endif
> +
>  #ifdef CONFIG_STRICT_KERNEL_RWX
>  	. = ALIGN(1<<SECTION_SHIFT);
>  #else
Ard Biesheuvel March 8, 2024, 2:27 p.m. UTC | #2
On Fri, 8 Mar 2024 at 14:16, Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Thu, Mar 7, 2024, at 16:12, Yuntao Liu wrote:
> > The current arm32 architecture does not yet support the
> > HAVE_LD_DEAD_CODE_DATA_ELIMINATION feature. arm32 is widely used in
> > embedded scenarios, and enabling this feature would be beneficial for
> > reducing the size of the kernel image.
> >
> > In order to make this work, we keep the necessary tables by annotating
> > them with KEEP, also it requires further changes to linker script to KEEP
> > some tables and wildcard compiler generated sections into the right place.
> >
> > It boots normally with defconfig, vexpress_defconfig and tinyconfig.
> >
> > The size comparison of zImage is as follows:
> > defconfig       vexpress_defconfig      tinyconfig
> > 5137712         5138024                 424192          no dce
> > 5032560         4997824                 298384          dce
> > 2.0%            2.7%                    29.7%           shrink
> >
> > When using smaller config file, there is a significant reduction in the
> > size of the zImage.
> >
> > We also tested this patch on a commercially available single-board
> > computer, and the comparison is as follows:
> > a15eb_config
> > 2161384         no dce
> > 2092240         dce
> > 3.2%            shrink
> >
> > The zImage size has been reduced by approximately 3.2%, which is 70KB on
> > 2.1M.
> >
> > Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>
>
> I've retested with both gcc-13 and clang-18, and so no
> more build issues. Your previous version already worked
> fine for me.
>
> I did some tests combining this with CONFIG_TRIM_UNUSED_KSYMS,
> which showed a significant improvement as expected. I also
> tried combining it with an experimental CONFIG_LTO_CLANG
> patch, but that did not show any further improvements.
>
> Tested-by: Arnd Bergmann <arnd@arndb.de>
> Reviewed-by: Arnd Bergmann <arnd@arndb.de>
>
> Adding Ard Biesheuvel and Fangrui Song to Cc, so they can comment
> on the ARM_VECTORS_TEXT workaround. I don't understand enough of
> the details of what is going on here.
>

Thanks for the cc

> Full quote of the patch below so they can see the whole thing.
>
> If they are also happy with the patch, I think you can send it
> into Russell's patch tracker at
> https://www.armlinux.org.uk/developer/patches/info.php
>

No, not happy at all :-)

The resulting kernel does not boot (built with GCC or Clang). And the
patch is buggy (see below)

> > ---
> > v2:
> >    - Support config XIP_KERNEL.
> >    - Support LLVM compilation.
> >
> > v1: https://lore.kernel.org/all/20240220081527.23408-1-liuyuntao12@huawei.com/
> > ---
> >  arch/arm/Kconfig                       |  1 +
> >  arch/arm/boot/compressed/vmlinux.lds.S |  4 ++--
> >  arch/arm/include/asm/vmlinux.lds.h     | 18 +++++++++++++++++-
> >  arch/arm/kernel/vmlinux-xip.lds.S      |  8 ++++++--
> >  arch/arm/kernel/vmlinux.lds.S          | 10 +++++++---
> >  5 files changed, 33 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > index 0af6709570d1..de78ceb821df 100644
> > --- a/arch/arm/Kconfig
> > +++ b/arch/arm/Kconfig
> > @@ -113,6 +113,7 @@ config ARM
> >       select HAVE_KERNEL_XZ
> >       select HAVE_KPROBES if !XIP_KERNEL && !CPU_ENDIAN_BE32 && !CPU_V7M
> >       select HAVE_KRETPROBES if HAVE_KPROBES
> > +     select HAVE_LD_DEAD_CODE_DATA_ELIMINATION
> >       select HAVE_MOD_ARCH_SPECIFIC
> >       select HAVE_NMI
> >       select HAVE_OPTPROBES if !THUMB2_KERNEL
> > diff --git a/arch/arm/boot/compressed/vmlinux.lds.S b/arch/arm/boot/compressed/vmlinux.lds.S
> > index 3fcb3e62dc56..da21244aa892 100644
> > --- a/arch/arm/boot/compressed/vmlinux.lds.S
> > +++ b/arch/arm/boot/compressed/vmlinux.lds.S
> > @@ -89,7 +89,7 @@ SECTIONS
> >       * The EFI stub always executes from RAM, and runs strictly before the
> >       * decompressor, so we can make an exception for its r/w data, and keep it
> >       */
> > -    *(.data.efistub .bss.efistub)
> > +    *(.data.* .bss.*)

Why is this necessary? There is a reason we don't allow .data in the
decompressor.

> >      __pecoff_data_end = .;
> >
> >      /*
> > @@ -125,7 +125,7 @@ SECTIONS
> >
> >    . = BSS_START;
> >    __bss_start = .;
> > -  .bss                       : { *(.bss) }
> > +  .bss                       : { *(.bss .bss.*) }
> >    _end = .;
> >
> >    . = ALIGN(8);              /* the stack must be 64-bit aligned */
> > diff --git a/arch/arm/include/asm/vmlinux.lds.h b/arch/arm/include/asm/vmlinux.lds.h
> > index 4c8632d5c432..dfe2b6ad6b51 100644
> > --- a/arch/arm/include/asm/vmlinux.lds.h
> > +++ b/arch/arm/include/asm/vmlinux.lds.h
> > @@ -42,7 +42,7 @@
> >  #define PROC_INFO                                                    \
> >               . = ALIGN(4);                                           \
> >               __proc_info_begin = .;                                  \
> > -             *(.proc.info.init)                                      \
> > +             KEEP(*(.proc.info.init))                                \
> >               __proc_info_end = .;
> >
> >  #define IDMAP_TEXT                                                   \
> > @@ -87,6 +87,22 @@
> >               *(.vfp11_veneer)                                        \
> >               *(.v4_bx)
> >
> > +/*
> > +When CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is enabled, it is important to
> > +annotate .vectors sections with KEEP. While linking with ld, it is
> > +acceptable to directly use KEEP with .vectors sections in ARM_VECTORS.
> > +However, when using ld.lld for linking, KEEP is not recognized within the
> > +OVERLAY command; it is treated as a regular string. Hence, it is advisable
> > +to define a distinct section here that explicitly retains the .vectors
> > +sections when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is turned on.
> > +*/
> > +#define ARM_VECTORS_TEXT                                             \
> > +     .vectors.text : {                                               \
> > +             KEEP(*(.vectors))                                       \
> > +             KEEP(*(.vectors.bhb.loop8))                             \
> > +             KEEP(*(.vectors.bhb.bpiall))                            \
> > +       }
> > +

This looks fishy to me. How is this supposed to work? You cannot emit
these sections into some random other place in the binary.

And also, ARM_VECTORS_TEXT is never used (by accident, see below)

> >  #define ARM_TEXT                                                     \
> >               IDMAP_TEXT                                              \
> >               __entry_text_start = .;                                 \
> > diff --git a/arch/arm/kernel/vmlinux-xip.lds.S b/arch/arm/kernel/vmlinux-xip.lds.S
> > index c16d196b5aad..035fa18060b3 100644
> > --- a/arch/arm/kernel/vmlinux-xip.lds.S
> > +++ b/arch/arm/kernel/vmlinux-xip.lds.S
> > @@ -63,7 +63,7 @@ SECTIONS
> >       . = ALIGN(4);
> >       __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
> >               __start___ex_table = .;
> > -             ARM_MMU_KEEP(*(__ex_table))
> > +             ARM_MMU_KEEP(KEEP(*(__ex_table)))
> >               __stop___ex_table = .;
> >       }
> >
> > @@ -83,7 +83,7 @@ SECTIONS
> >       }
> >       .init.arch.info : {
> >               __arch_info_begin = .;
> > -             *(.arch.info.init)
> > +             KEEP(*(.arch.info.init))
> >               __arch_info_end = .;
> >       }
> >       .init.tagtable : {
> > @@ -135,6 +135,10 @@ SECTIONS
> >       ARM_TCM
> >  #endif
> >
> > +#ifdef LD_DEAD_CODE_DATA_ELIMINATION

This should be CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
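
To illustrate why the missing prefix matters: Kconfig options reach the C
preprocessor (which also processes the .lds.S linker scripts) as macros
carrying the CONFIG_ prefix, so an #ifdef on the bare name is silently
false and the guarded lines are dropped without any diagnostic. A minimal
sketch, where the #define stands in for the generated kernel config header
and the function names are hypothetical:

```c
/* Stand-in for the generated config header when the option is enabled. */
#define CONFIG_LD_DEAD_CODE_DATA_ELIMINATION 1

/* Spelling used in the patch: no macro of this name is defined, so the
 * #ifdef branch is skipped and this returns 0. */
int guard_without_prefix(void)
{
#ifdef LD_DEAD_CODE_DATA_ELIMINATION
	return 1;
#else
	return 0;
#endif
}

/* Corrected spelling: the macro is defined, so this returns 1. */
int guard_with_prefix(void)
{
#ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
	return 1;
#else
	return 0;
#endif
}
```

This matches the observation that ARM_VECTORS_TEXT is never actually
emitted by the patch as posted.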

> > +     ARM_VECTORS_TEXT
> > +#endif
> > +
> >       /*
> >        * End of copied data. We need a dummy section to get its LMA.
> >        * Also located before final ALIGN() as trailing padding is not stored
> > diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
> > index bd9127c4b451..2cfb890c93fb 100644
> > --- a/arch/arm/kernel/vmlinux.lds.S
> > +++ b/arch/arm/kernel/vmlinux.lds.S
> > @@ -74,7 +74,7 @@ SECTIONS
> >       . = ALIGN(4);
> >       __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
> >               __start___ex_table = .;
> > -             ARM_MMU_KEEP(*(__ex_table))
> > +             ARM_MMU_KEEP(KEEP(*(__ex_table)))
> >               __stop___ex_table = .;
> >       }
> >
> > @@ -99,7 +99,7 @@ SECTIONS
> >       }
> >       .init.arch.info : {
> >               __arch_info_begin = .;
> > -             *(.arch.info.init)
> > +             KEEP(*(.arch.info.init))
> >               __arch_info_end = .;
> >       }
> >       .init.tagtable : {
> > @@ -116,7 +116,7 @@ SECTIONS
> >  #endif
> >       .init.pv_table : {
> >               __pv_table_begin = .;
> > -             *(.pv_table)
> > +             KEEP(*(.pv_table))
> >               __pv_table_end = .;
> >       }
>
> I previously asked about this bit, since it appeared that this
> might prevent a lot of code from being discarded when
> CONFIG_ARM_PATCH_PHYS_VIRT is set. I tested this again now,
> and found this makes very little practical difference, so
> it's all good.
>
> > @@ -134,6 +134,10 @@ SECTIONS
> >       ARM_TCM
> >  #endif
> >
> > +#ifdef LD_DEAD_CODE_DATA_ELIMINATION

Same here


> > +     ARM_VECTORS_TEXT
> > +#endif
> > +
> >  #ifdef CONFIG_STRICT_KERNEL_RWX
> >       . = ALIGN(1<<SECTION_SHIFT);
> >  #else
Ard Biesheuvel March 8, 2024, 3:37 p.m. UTC | #3
On Fri, 8 Mar 2024 at 15:27, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> This looks fishy to me. How is this supposed to work? You cannot emit
> these sections into some random other place in the binary.
>
> And also, ARM_VECTORS_TEXT is never used (by accident, see below)
>

The below appears to work for me:

--- a/arch/arm/kernel/entry-armv.S
+++ b/arch/arm/kernel/entry-armv.S
@@ -1076,7 +1076,12 @@
        W(b)    vector_irq
        W(b)    vector_fiq

+       .text
+       .reloc  ., R_ARM_NONE, .vectors
 #ifdef CONFIG_HARDEN_BRANCH_HISTORY
+       .reloc  ., R_ARM_NONE, .vectors.bhb.loop8
+       .reloc  ., R_ARM_NONE, .vectors.bhb.bpiall
+
        .section .vectors.bhb.loop8, "ax", %progbits
        W(b)    vector_rst
        W(b)    vector_bhb_loop8_und
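
For readers unfamiliar with the trick: section garbage collection is, in
essence, a reachability walk that starts from root sections (the entry
point and anything marked KEEP) and follows relocations. The R_ARM_NONE
relocations above do not change any bytes in .text, but they should still
count as references, so the .vectors sections become reachable from live
code. The toy model below only sketches that idea; the struct, the
function names, and treating .text as a root are simplifying assumptions,
not how ld or ld.lld is implemented.

```c
#include <stdbool.h>

#define MAX_REFS 4

struct section {
	const char *name;
	bool is_root;        /* entry point or KEEP() in the linker script */
	int refs[MAX_REFS];  /* sections this one has relocations against */
	int nrefs;
	bool live;
};

/* Mark a section live and follow its relocations recursively. */
static void mark(struct section *s, int idx)
{
	if (s[idx].live)
		return;
	s[idx].live = true;
	for (int i = 0; i < s[idx].nrefs; i++)
		mark(s, s[idx].refs[i]);
}

/* Toy --gc-sections: everything not reachable from a root stays dead. */
void gc_sections(struct section *s, int n)
{
	for (int i = 0; i < n; i++)
		s[i].live = false;
	for (int i = 0; i < n; i++)
		if (s[i].is_root)
			mark(s, i);
}

/* Does .vectors survive, with or without a reloc from .text pointing at it? */
bool vectors_survive(bool text_has_reloc_to_vectors)
{
	enum { TEXT, VECTORS, UNUSED_FN, NSECTIONS };
	struct section s[NSECTIONS] = {
		[TEXT]      = { .name = ".text", .is_root = true },
		[VECTORS]   = { .name = ".vectors" },
		[UNUSED_FN] = { .name = ".text.unused_fn" },
	};

	if (text_has_reloc_to_vectors)
		s[TEXT].refs[s[TEXT].nrefs++] = VECTORS;

	gc_sections(s, NSECTIONS);
	return s[VECTORS].live;
}
```

In this model vectors_survive(false) is false and vectors_survive(true) is
true, which mirrors the intent of adding the .reloc directives instead of
emitting the vectors into a separate KEEP'd output section.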
Ard Biesheuvel March 9, 2024, 12:01 a.m. UTC | #4
On Fri, 8 Mar 2024 at 16:37, Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > This looks fishy to me. How is this supposed to work? You cannot emit
> > these sections into some random other place in the binary.
> >
> > And also, ARM_VECTORS_TEXT is never used (by accident, see below)
> >
>
> The below appears to work for me:
>
> --- a/arch/arm/kernel/entry-armv.S
> +++ b/arch/arm/kernel/entry-armv.S
> @@ -1076,7 +1076,12 @@
>         W(b)    vector_irq
>         W(b)    vector_fiq
>
> +       .text
> +       .reloc  ., R_ARM_NONE, .vectors
>  #ifdef CONFIG_HARDEN_BRANCH_HISTORY
> +       .reloc  ., R_ARM_NONE, .vectors.bhb.loop8
> +       .reloc  ., R_ARM_NONE, .vectors.bhb.bpiall
> +
>         .section .vectors.bhb.loop8, "ax", %progbits
>         W(b)    vector_rst
>         W(b)    vector_bhb_loop8_und

... or even better:

--- a/arch/arm/kernel/entry-armv.S
+++ b/arch/arm/kernel/entry-armv.S
@@ -1066,4 +1066,5 @@

        .section .vectors, "ax", %progbits
+       .reloc  .text, R_ARM_NONE, .
        W(b)    vector_rst
        W(b)    vector_und
@@ -1079,4 +1080,5 @@
 #ifdef CONFIG_HARDEN_BRANCH_HISTORY
        .section .vectors.bhb.loop8, "ax", %progbits
+       .reloc  .text, R_ARM_NONE, .
        W(b)    vector_rst
        W(b)    vector_bhb_loop8_und
@@ -1091,4 +1093,5 @@

        .section .vectors.bhb.bpiall, "ax", %progbits
+       .reloc  .text, R_ARM_NONE, .
        W(b)    vector_rst
        W(b)    vector_bhb_bpiall_und
liuyuntao (F) March 9, 2024, 6:14 a.m. UTC | #5
On 2024/3/8 21:15, Arnd Bergmann wrote:
> On Thu, Mar 7, 2024, at 16:12, Yuntao Liu wrote:
>> The current arm32 architecture does not yet support the
>> HAVE_LD_DEAD_CODE_DATA_ELIMINATION feature. arm32 is widely used in
>> embedded scenarios, and enabling this feature would be beneficial for
>> reducing the size of the kernel image.
>>
>> In order to make this work, we keep the necessary tables by annotating
>> them with KEEP, also it requires further changes to linker script to KEEP
>> some tables and wildcard compiler generated sections into the right place.
>>
>> It boots normally with defconfig, vexpress_defconfig and tinyconfig.
>>
>> The size comparison of zImage is as follows:
>> defconfig       vexpress_defconfig      tinyconfig
>> 5137712         5138024                 424192          no dce
>> 5032560         4997824                 298384          dce
>> 2.0%            2.7%                    29.7%           shrink
>>
>> When using smaller config file, there is a significant reduction in the
>> size of the zImage.
>>
>> We also tested this patch on a commercially available single-board
>> computer, and the comparison is as follows:
>> a15eb_config
>> 2161384         no dce
>> 2092240         dce
>> 3.2%            shrink
>>
>> The zImage size has been reduced by approximately 3.2%, which is 70KB on
>> 2.1M.
>>
>> Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>
> 
> I've retested with both gcc-13 and clang-18, and so no
> more build issues. Your previous version already worked
> fine for me.
> 
> I did some tests combining this with CONFIG_TRIM_UNUSED_KSYMS,
> which showed a significant improvement as expected. I also
> tried combining it with an experimental CONFIG_LTO_CLANG
> patch, but that did not show any further improvements.
> 

Thanks for the tests, CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and 
CONFIG_TRIM_UNUSED_KSYMS do indeed result in a significant improvement.
I found that arm32 still doesn't support CONFIG_LTO_CLANG. I've done 
some work on it, but without success. I'd like to learn more about the 
CONFIG_LTO_CLANG patch. Do you have any relevant links?
liuyuntao (F) March 9, 2024, 6:42 a.m. UTC | #6
On 2024/3/8 22:27, Ard Biesheuvel wrote:
> On Fri, 8 Mar 2024 at 14:16, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> On Thu, Mar 7, 2024, at 16:12, Yuntao Liu wrote:
>>> The current arm32 architecture does not yet support the
>>> HAVE_LD_DEAD_CODE_DATA_ELIMINATION feature. arm32 is widely used in
>>> embedded scenarios, and enabling this feature would be beneficial for
>>> reducing the size of the kernel image.
>>>
>>> In order to make this work, we keep the necessary tables by annotating
>>> them with KEEP, also it requires further changes to linker script to KEEP
>>> some tables and wildcard compiler generated sections into the right place.
>>>
>>> It boots normally with defconfig, vexpress_defconfig and tinyconfig.
>>>
>>> The size comparison of zImage is as follows:
>>> defconfig       vexpress_defconfig      tinyconfig
>>> 5137712         5138024                 424192          no dce
>>> 5032560         4997824                 298384          dce
>>> 2.0%            2.7%                    29.7%           shrink
>>>
>>> When using smaller config file, there is a significant reduction in the
>>> size of the zImage.
>>>
>>> We also tested this patch on a commercially available single-board
>>> computer, and the comparison is as follows:
>>> a15eb_config
>>> 2161384         no dce
>>> 2092240         dce
>>> 3.2%            shrink
>>>
>>> The zImage size has been reduced by approximately 3.2%, which is 70KB on
>>> 2.1M.
>>>
>>> Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>
>>
>> I've retested with both gcc-13 and clang-18, and so no
>> more build issues. Your previous version already worked
>> fine for me.
>>
>> I did some tests combining this with CONFIG_TRIM_UNUSED_KSYMS,
>> which showed a significant improvement as expected. I also
>> tried combining it with an experimental CONFIG_LTO_CLANG
>> patch, but that did not show any further improvements.
>>
>> Tested-by: Arnd Bergmann <arnd@arndb.de>
>> Reviewed-by: Arnd Bergmann <arnd@arndb.de>
>>
>> Adding Ard Biesheuvel and Fangrui Song to Cc, so they can comment
>> on the ARM_VECTORS_TEXT workaround. I don't understand enough of
>> the details of what is going on here.
>>
> 
> Thanks for the cc
> 
>> Full quote of the patch below so they can see the whole thing.
>>
>> If they are also happy with the patch, I think you can send it
>> into Russell's patch tracker at
>> https://www.armlinux.org.uk/developer/patches/info.php
>>
> 
> No, not happy at all :-)
> 
> The resulting kernel does not boot (built with GCC or Clang). And the
> patch is buggy (see below)
> 
>>> ---
>>> v2:
>>>     - Support config XIP_KERNEL.
>>>     - Support LLVM compilation.
>>>
>>> v1: https://lore.kernel.org/all/20240220081527.23408-1-liuyuntao12@huawei.com/
>>> ---
>>>   arch/arm/Kconfig                       |  1 +
>>>   arch/arm/boot/compressed/vmlinux.lds.S |  4 ++--
>>>   arch/arm/include/asm/vmlinux.lds.h     | 18 +++++++++++++++++-
>>>   arch/arm/kernel/vmlinux-xip.lds.S      |  8 ++++++--
>>>   arch/arm/kernel/vmlinux.lds.S          | 10 +++++++---
>>>   5 files changed, 33 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
>>> index 0af6709570d1..de78ceb821df 100644
>>> --- a/arch/arm/Kconfig
>>> +++ b/arch/arm/Kconfig
>>> @@ -113,6 +113,7 @@ config ARM
>>>        select HAVE_KERNEL_XZ
>>>        select HAVE_KPROBES if !XIP_KERNEL && !CPU_ENDIAN_BE32 && !CPU_V7M
>>>        select HAVE_KRETPROBES if HAVE_KPROBES
>>> +     select HAVE_LD_DEAD_CODE_DATA_ELIMINATION
>>>        select HAVE_MOD_ARCH_SPECIFIC
>>>        select HAVE_NMI
>>>        select HAVE_OPTPROBES if !THUMB2_KERNEL
>>> diff --git a/arch/arm/boot/compressed/vmlinux.lds.S
>>> b/arch/arm/boot/compressed/vmlinux.lds.S
>>> index 3fcb3e62dc56..da21244aa892 100644
>>> --- a/arch/arm/boot/compressed/vmlinux.lds.S
>>> +++ b/arch/arm/boot/compressed/vmlinux.lds.S
>>> @@ -89,7 +89,7 @@ SECTIONS
>>>        * The EFI stub always executes from RAM, and runs strictly before
>>> the
>>>        * decompressor, so we can make an exception for its r/w data, and
>>> keep it
>>>        */
>>> -    *(.data.efistub .bss.efistub)
>>> +    *(.data.* .bss.*)
> 
> Why is this necessary? There is a reason we don't allow .data in the
> decompressor.
> 

Arnd previously asked about this bit as well. When CONFIG_EFI and
CONFIG_LD_DEAD_CODE_DATA_ELIMINATION were both enabled, I came across
a link failure using ld:

arm-linux-gnueabi-ld: warning: orphan section `.data.efi_loglevel' from `drivers/firmware/efi/libstub/printk.stub.o' being placed in section `.data.efi_loglevel'
arm-linux-gnueabi-ld: warning: orphan section `.data.screen_info_guid' from `drivers/firmware/efi/libstub/screen_info.stub.o' being placed in section `.data.screen_info_guid'
arm-linux-gnueabi-ld: warning: orphan section `.data.cpu_state_guid' from `drivers/firmware/efi/libstub/arm32-stub.stub.o' being placed in section `.data.cpu_state_guid'
arm-linux-gnueabi-ld: warning: orphan section `.data.efi_nokaslr' from `drivers/firmware/efi/libstub/efi-stub-helper.stub.o' being placed in section `.data.efi_nokaslr'
arm-linux-gnueabi-ld: error: zImage file size is incorrect

So I changed .data.efistub to .data.*, and likewise .bss.efistub to .bss.*.
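
To illustrate the mismatch behind the orphan-section warnings, here is a minimal C sketch. It is only an approximation: it assumes that ld's input-section patterns behave like shell globs, and it uses POSIX fnmatch() as a stand-in for the linker's matcher. The helper name section_matches() is made up for this example. My assumption is that the per-variable section names in the log (such as .data.efi_loglevel) are not caught by the rename of plain .data, so the literal pattern .data.efistub never sees them.

```c
#include <assert.h>
#include <fnmatch.h>

/*
 * Returns 1 if the glob-style pattern matches the section name, else 0.
 * fnmatch() is used here only as an approximation of how a linker
 * script input-section pattern is compared against section names.
 */
int section_matches(const char *pattern, const char *name)
{
	assert(pattern && name);
	return fnmatch(pattern, name, 0) == 0;
}
```

With this helper, section_matches(".data.efistub", ".data.efi_loglevel") is false, while section_matches(".data.*", ".data.efi_loglevel") is true. Note that ".data.*" also matches every other .data.<name> section, which is why the wider pattern is a concern for the decompressor.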

Perhaps it looks clearer like this.

--- a/arch/arm/boot/compressed/vmlinux.lds.S
+++ b/arch/arm/boot/compressed/vmlinux.lds.S
@@ -89,7 +89,11 @@ SECTIONS
       * The EFI stub always executes from RAM, and runs strictly before the
       * decompressor, so we can make an exception for its r/w data, and keep it
       */
+#ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
      *(.data.* .bss.*)
+#else
+    *(.data.efistub .bss.efistub)
+#endif
      __pecoff_data_end = .;

      /*

Which approach do you prefer, or do you have a better way?


> On Tue, Feb 20, 2024, at 10:53, liuyuntao (F) wrote:
>> On 2024/2/20 16:40, Arnd Bergmann wrote:
>>> On Tue, Feb 20, 2024, at 09:15, Yuntao Liu wrote:
>> #
>> # ARM discards the .data section because it disallows r/w data in the
>> # decompressor. So move our .data to .data.efistub and .bss to .bss.efistub,
>> # which are preserved explicitly by the decompressor linker script.
>> #
>> STUBCOPY_FLAGS-$(CONFIG_ARM)	+= --rename-section .data=.data.efistub	\
>> 				   --rename-section .bss=.bss.efistub,load,alloc
>>
>> ---
>>
>> I think that .data.efistub represents the entire .data section, and the
>> same applies to .bss,
>>
>> so I moved all .data and .bss into the stub here.
>>
> 
> Ok, I see. 



>>>       __pecoff_data_end = .;
>>>
>>>       /*
>>> @@ -125,7 +125,7 @@ SECTIONS
>>>
>>>     . = BSS_START;
>>>     __bss_start = .;
>>> -  .bss                       : { *(.bss) }
>>> +  .bss                       : { *(.bss .bss.*) }
>>>     _end = .;
>>>
>>>     . = ALIGN(8);              /* the stack must be 64-bit aligned */
>>> diff --git a/arch/arm/include/asm/vmlinux.lds.h
>>> b/arch/arm/include/asm/vmlinux.lds.h
>>> index 4c8632d5c432..dfe2b6ad6b51 100644
>>> --- a/arch/arm/include/asm/vmlinux.lds.h
>>> +++ b/arch/arm/include/asm/vmlinux.lds.h
>>> @@ -42,7 +42,7 @@
>>>   #define PROC_INFO                                                    \
>>>                . = ALIGN(4);                                           \
>>>                __proc_info_begin = .;                                  \
>>> -             *(.proc.info.init)                                      \
>>> +             KEEP(*(.proc.info.init))                                \
>>>                __proc_info_end = .;
>>>
>>>   #define IDMAP_TEXT                                                   \
>>> @@ -87,6 +87,22 @@
>>>                *(.vfp11_veneer)                                        \
>>>                *(.v4_bx)
>>>
>>> +/*
>>> +When CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is enabled, it is important
>>> to
>>> +annotate .vectors sections with KEEP. While linking with ld, it is
>>> +acceptable to directly use KEEP with .vectors sections in ARM_VECTORS.
>>> +However, when using ld.lld for linking, KEEP is not recognized within
>>> the
>>> +OVERLAY command; it is treated as a regular string. Hence, it is
>>> advisable
>>> +to define a distinct section here that explicitly retains the .vectors
>>> +sections when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is turned on.
>>> +*/
>>> +#define ARM_VECTORS_TEXT                                             \
>>> +     .vectors.text : {                                               \
>>> +             KEEP(*(.vectors))                                       \
>>> +             KEEP(*(.vectors.bhb.loop8))                             \
>>> +             KEEP(*(.vectors.bhb.bpiall))                            \
>>> +       }
>>> +
> 
> This looks fishy to me. How is this supposed to work? You cannot emit
> these sections into some random other place in the binary.
> 
> And also, ARM_VECTORS_TEXT is never used (by accident, see below)

Yes, this way of keeping the .vectors sections is not good.

> 
>>>   #define ARM_TEXT                                                     \
>>>                IDMAP_TEXT                                              \
>>>                __entry_text_start = .;                                 \
>>> diff --git a/arch/arm/kernel/vmlinux-xip.lds.S
>>> b/arch/arm/kernel/vmlinux-xip.lds.S
>>> index c16d196b5aad..035fa18060b3 100644
>>> --- a/arch/arm/kernel/vmlinux-xip.lds.S
>>> +++ b/arch/arm/kernel/vmlinux-xip.lds.S
>>> @@ -63,7 +63,7 @@ SECTIONS
>>>        . = ALIGN(4);
>>>        __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
>>>                __start___ex_table = .;
>>> -             ARM_MMU_KEEP(*(__ex_table))
>>> +             ARM_MMU_KEEP(KEEP(*(__ex_table)))
>>>                __stop___ex_table = .;
>>>        }
>>>
>>> @@ -83,7 +83,7 @@ SECTIONS
>>>        }
>>>        .init.arch.info : {
>>>                __arch_info_begin = .;
>>> -             *(.arch.info.init)
>>> +             KEEP(*(.arch.info.init))
>>>                __arch_info_end = .;
>>>        }
>>>        .init.tagtable : {
>>> @@ -135,6 +135,10 @@ SECTIONS
>>>        ARM_TCM
>>>   #endif
>>>
>>> +#ifdef LD_DEAD_CODE_DATA_ELIMINATION
> 
> This should be CONFIG_LD_DEAD_CODE_DATA_ELIMINATION

My mistake, and thank you for pointing it out.
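
A minimal C sketch of why the guard never fired (illustrative only, not kernel code; the function names are hypothetical). It assumes, as in the kernel build, that configuration options reach the preprocessor only under their CONFIG_-prefixed names, so the unprefixed #ifdef is always false and ARM_VECTORS_TEXT is never emitted:

```c
#include <assert.h>

/* Stand-in for the build-generated definition: the CONFIG_ prefix is always present. */
#define CONFIG_LD_DEAD_CODE_DATA_ELIMINATION 1

/* Guard as written in the v2 linker scripts: the prefix is missing. */
int unprefixed_guard_taken(void)
{
#ifdef LD_DEAD_CODE_DATA_ELIMINATION
	return 1;
#else
	return 0;
#endif
}

/* Guard with the correct, prefixed name. */
int prefixed_guard_taken(void)
{
#ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
	return 1;
#else
	return 0;
#endif
}

/* Documents the expected behaviour of both guards. */
void check_guards(void)
{
	assert(!unprefixed_guard_taken());
	assert(prefixed_guard_taken());
}
```

The preprocessor gives no diagnostic for testing an undefined macro with #ifdef, which is how the typo went unnoticed at build time.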

> 
>>> +     ARM_VECTORS_TEXT
>>> +#endif
>>> +
>>>        /*
>>>         * End of copied data. We need a dummy section to get its LMA.
>>>         * Also located before final ALIGN() as trailing padding is not stored
>>> diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
>>> index bd9127c4b451..2cfb890c93fb 100644
>>> --- a/arch/arm/kernel/vmlinux.lds.S
>>> +++ b/arch/arm/kernel/vmlinux.lds.S
>>> @@ -74,7 +74,7 @@ SECTIONS
>>>        . = ALIGN(4);
>>>        __ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
>>>                __start___ex_table = .;
>>> -             ARM_MMU_KEEP(*(__ex_table))
>>> +             ARM_MMU_KEEP(KEEP(*(__ex_table)))
>>>                __stop___ex_table = .;
>>>        }
>>>
>>> @@ -99,7 +99,7 @@ SECTIONS
>>>        }
>>>        .init.arch.info : {
>>>                __arch_info_begin = .;
>>> -             *(.arch.info.init)
>>> +             KEEP(*(.arch.info.init))
>>>                __arch_info_end = .;
>>>        }
>>>        .init.tagtable : {
>>> @@ -116,7 +116,7 @@ SECTIONS
>>>   #endif
>>>        .init.pv_table : {
>>>                __pv_table_begin = .;
>>> -             *(.pv_table)
>>> +             KEEP(*(.pv_table))
>>>                __pv_table_end = .;
>>>        }
>>
>> I previously asked about this bit, since it appeared that this
>> might prevent a lot of code from being discarded when
>> CONFIG_ARM_PATCH_PHYS_VIRT is set. I tested this again now,
>> and found this makes very little practical difference, so
>> it's all good.
>>
>>> @@ -134,6 +134,10 @@ SECTIONS
>>>        ARM_TCM
>>>   #endif
>>>
>>> +#ifdef LD_DEAD_CODE_DATA_ELIMINATION
> 
> Same here
> 
> 
>>> +     ARM_VECTORS_TEXT
>>> +#endif
>>> +
>>>   #ifdef CONFIG_STRICT_KERNEL_RWX
>>>        . = ALIGN(1<<SECTION_SHIFT);
>>>   #else
liuyuntao (F) March 9, 2024, 6:46 a.m. UTC | #7
On 2024/3/9 8:01, Ard Biesheuvel wrote:
> On Fri, 8 Mar 2024 at 16:37, Ard Biesheuvel <ardb@kernel.org> wrote:
>>
>> On Fri, 8 Mar 2024 at 15:27, Ard Biesheuvel <ardb@kernel.org> wrote:
>>>
>>> On Fri, 8 Mar 2024 at 14:16, Arnd Bergmann <arnd@arndb.de> wrote:
>>>>
>>>> On Thu, Mar 7, 2024, at 16:12, Yuntao Liu wrote:
>>>>> The current arm32 architecture does not yet support the
>>>>> HAVE_LD_DEAD_CODE_DATA_ELIMINATION feature. arm32 is widely used in
>>>>> embedded scenarios, and enabling this feature would be beneficial for
>>>>> reducing the size of the kernel image.
>>>>>
>>>>> In order to make this work, we keep the necessary tables by annotating
>>>>> them with KEEP, also it requires further changes to linker script to KEEP
>>>>> some tables and wildcard compiler generated sections into the right place.
>>>>>
>>>>> It boots normally with defconfig, vexpress_defconfig and tinyconfig.
>>>>>
>>>>> The size comparison of zImage is as follows:
>>>>> defconfig       vexpress_defconfig      tinyconfig
>>>>> 5137712         5138024                 424192          no dce
>>>>> 5032560         4997824                 298384          dce
>>>>> 2.0%            2.7%                    29.7%           shrink
>>>>>
>>>>> When using smaller config file, there is a significant reduction in the
>>>>> size of the zImage.
>>>>>
>>>>> We also tested this patch on a commercially available single-board
>>>>> computer, and the comparison is as follows:
>>>>> a15eb_config
>>>>> 2161384         no dce
>>>>> 2092240         dce
>>>>> 3.2%            shrink
>>>>>
>>>>> The zImage size has been reduced by approximately 3.2%, which is 70KB on
>>>>> 2.1M.
>>>>>
>>>>> Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>
>>>>
>>>> I've retested with both gcc-13 and clang-18, and so no
>>>> more build issues. Your previous version already worked
>>>> fine for me.
>>>>
>>>> I did some tests combining this with CONFIG_TRIM_UNUSED_KSYMS,
>>>> which showed a significant improvement as expected. I also
>>>> tried combining it with an experimental CONFIG_LTO_CLANG
>>>> patch, but that did not show any further improvements.
>>>>
>>>> Tested-by: Arnd Bergmann <arnd@arndb.de>
>>>> Reviewed-by: Arnd Bergmann <arnd@arndb.de>
>>>>
>>>> Adding Ard Biesheuvel and Fangrui Song to Cc, so they can comment
>>>> on the ARM_VECTORS_TEXT workaround. I don't understand enough of
>>>> the details of what is going on here.
>>>>
>>>
>>> Thanks for the cc
>>>
>>>> Full quote of the patch below so they can see the whole thing.
>>>>
>>>> If they are also happy with the patch, I think you can send it
>>>> into Russell's patch tracker at
>>>> https://www.armlinux.org.uk/developer/patches/info.php
>>>>
>>>
>>> No, not happy at all :-)
>>>
>>> The resulting kernel does not boot (built with GCC or Clang). And the
>>> patch is buggy (see below)
>>>
>>>>> ---
>>>>> v2:
>>>>>     - Support config XIP_KERNEL.
>>>>>     - Support LLVM compilation.
>>>>>
>>>>> v1: https://lore.kernel.org/all/20240220081527.23408-1-liuyuntao12@huawei.com/
>>>>> ---
>>>>>   arch/arm/Kconfig                       |  1 +
>>>>>   arch/arm/boot/compressed/vmlinux.lds.S |  4 ++--
>>>>>   arch/arm/include/asm/vmlinux.lds.h     | 18 +++++++++++++++++-
>>>>>   arch/arm/kernel/vmlinux-xip.lds.S      |  8 ++++++--
>>>>>   arch/arm/kernel/vmlinux.lds.S          | 10 +++++++---
>>>>>   5 files changed, 33 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
>>>>> index 0af6709570d1..de78ceb821df 100644
>>>>> --- a/arch/arm/Kconfig
>>>>> +++ b/arch/arm/Kconfig
>>>>> @@ -113,6 +113,7 @@ config ARM
>>>>>        select HAVE_KERNEL_XZ
>>>>>        select HAVE_KPROBES if !XIP_KERNEL && !CPU_ENDIAN_BE32 && !CPU_V7M
>>>>>        select HAVE_KRETPROBES if HAVE_KPROBES
>>>>> +     select HAVE_LD_DEAD_CODE_DATA_ELIMINATION
>>>>>        select HAVE_MOD_ARCH_SPECIFIC
>>>>>        select HAVE_NMI
>>>>>        select HAVE_OPTPROBES if !THUMB2_KERNEL
>>>>> diff --git a/arch/arm/boot/compressed/vmlinux.lds.S
>>>>> b/arch/arm/boot/compressed/vmlinux.lds.S
>>>>> index 3fcb3e62dc56..da21244aa892 100644
>>>>> --- a/arch/arm/boot/compressed/vmlinux.lds.S
>>>>> +++ b/arch/arm/boot/compressed/vmlinux.lds.S
>>>>> @@ -89,7 +89,7 @@ SECTIONS
>>>>>        * The EFI stub always executes from RAM, and runs strictly before
>>>>> the
>>>>>        * decompressor, so we can make an exception for its r/w data, and
>>>>> keep it
>>>>>        */
>>>>> -    *(.data.efistub .bss.efistub)
>>>>> +    *(.data.* .bss.*)
>>>
>>> Why is this necessary? There is a reason we don't allow .data in the
>>> decompressor.
>>>
>>>>>       __pecoff_data_end = .;
>>>>>
>>>>>       /*
>>>>> @@ -125,7 +125,7 @@ SECTIONS
>>>>>
>>>>>     . = BSS_START;
>>>>>     __bss_start = .;
>>>>> -  .bss                       : { *(.bss) }
>>>>> +  .bss                       : { *(.bss .bss.*) }
>>>>>     _end = .;
>>>>>
>>>>>     . = ALIGN(8);              /* the stack must be 64-bit aligned */
>>>>> diff --git a/arch/arm/include/asm/vmlinux.lds.h
>>>>> b/arch/arm/include/asm/vmlinux.lds.h
>>>>> index 4c8632d5c432..dfe2b6ad6b51 100644
>>>>> --- a/arch/arm/include/asm/vmlinux.lds.h
>>>>> +++ b/arch/arm/include/asm/vmlinux.lds.h
>>>>> @@ -42,7 +42,7 @@
>>>>>   #define PROC_INFO                                                    \
>>>>>                . = ALIGN(4);                                           \
>>>>>                __proc_info_begin = .;                                  \
>>>>> -             *(.proc.info.init)                                      \
>>>>> +             KEEP(*(.proc.info.init))                                \
>>>>>                __proc_info_end = .;
>>>>>
>>>>>   #define IDMAP_TEXT                                                   \
>>>>> @@ -87,6 +87,22 @@
>>>>>                *(.vfp11_veneer)                                        \
>>>>>                *(.v4_bx)
>>>>>
>>>>> +/*
>>>>> +When CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is enabled, it is important
>>>>> to
>>>>> +annotate .vectors sections with KEEP. While linking with ld, it is
>>>>> +acceptable to directly use KEEP with .vectors sections in ARM_VECTORS.
>>>>> +However, when using ld.lld for linking, KEEP is not recognized within
>>>>> the
>>>>> +OVERLAY command; it is treated as a regular string. Hence, it is
>>>>> advisable
>>>>> +to define a distinct section here that explicitly retains the .vectors
>>>>> +sections when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is turned on.
>>>>> +*/
>>>>> +#define ARM_VECTORS_TEXT                                             \
>>>>> +     .vectors.text : {                                               \
>>>>> +             KEEP(*(.vectors))                                       \
>>>>> +             KEEP(*(.vectors.bhb.loop8))                             \
>>>>> +             KEEP(*(.vectors.bhb.bpiall))                            \
>>>>> +       }
>>>>> +
>>>
>>> This looks fishy to me. How is this supposed to work? You cannot emit
>>> these sections into some random other place in the binary.
>>>
>>> And also, ARM_VECTORS_TEXT is never used (by accident, see below)
>>>
>>
>> The below appears to work for me:
>>
>> --- a/arch/arm/kernel/entry-armv.S
>> +++ b/arch/arm/kernel/entry-armv.S
>> @@ -1076,7 +1076,12 @@
>>          W(b)    vector_irq
>>          W(b)    vector_fiq
>>
>> +       .text
>> +       .reloc  ., R_ARM_NONE, .vectors
>>   #ifdef CONFIG_HARDEN_BRANCH_HISTORY
>> +       .reloc  ., R_ARM_NONE, .vectors.bhb.loop8
>> +       .reloc  ., R_ARM_NONE, .vectors.bhb.bpiall
>> +
>>          .section .vectors.bhb.loop8, "ax", %progbits
>>          W(b)    vector_rst
>>          W(b)    vector_bhb_loop8_und
> 
> ... or even better:
> 
> --- a/arch/arm/kernel/entry-armv.S
> +++ b/arch/arm/kernel/entry-armv.S
> @@ -1066,4 +1066,5 @@
> 
>          .section .vectors, "ax", %progbits
> +       .reloc  .text, R_ARM_NONE, .
>          W(b)    vector_rst
>          W(b)    vector_und
> @@ -1079,4 +1080,5 @@
>   #ifdef CONFIG_HARDEN_BRANCH_HISTORY
>          .section .vectors.bhb.loop8, "ax", %progbits
> +       .reloc  .text, R_ARM_NONE, .
>          W(b)    vector_rst
>          W(b)    vector_bhb_loop8_und
> @@ -1091,4 +1093,5 @@
> 
>          .section .vectors.bhb.bpiall, "ax", %progbits
> +       .reloc  .text, R_ARM_NONE, .
>          W(b)    vector_rst
>          W(b)    vector_bhb_bpiall_und

I had tried `.reloc  ., R_ARM_NONE, .vectors` to keep the .vectors section,
but it failed. It now seems that I did not use the .reloc directive correctly.
Thanks, Ard; your approach is concise and effective.
Could I submit a v3 patch that applies these changes?
liuyuntao (F) March 9, 2024, 6:56 a.m. UTC | #8
On 2024/3/8 22:27, Ard Biesheuvel wrote:
> On Fri, 8 Mar 2024 at 14:16, Arnd Bergmann <arnd@arndb.de> wrote:
>>
>> On Thu, Mar 7, 2024, at 16:12, Yuntao Liu wrote:
>>> The current arm32 architecture does not yet support the
>>> HAVE_LD_DEAD_CODE_DATA_ELIMINATION feature. arm32 is widely used in
>>> embedded scenarios, and enabling this feature would be beneficial for
>>> reducing the size of the kernel image.
>>>
>>> In order to make this work, we keep the necessary tables by annotating
>>> them with KEEP, also it requires further changes to linker script to KEEP
>>> some tables and wildcard compiler generated sections into the right place.
>>>
>>> It boots normally with defconfig, vexpress_defconfig and tinyconfig.
>>>
>>> The size comparison of zImage is as follows:
>>> defconfig       vexpress_defconfig      tinyconfig
>>> 5137712         5138024                 424192          no dce
>>> 5032560         4997824                 298384          dce
>>> 2.0%            2.7%                    29.7%           shrink
>>>
>>> When using smaller config file, there is a significant reduction in the
>>> size of the zImage.
>>>
>>> We also tested this patch on a commercially available single-board
>>> computer, and the comparison is as follows:
>>> a15eb_config
>>> 2161384         no dce
>>> 2092240         dce
>>> 3.2%            shrink
>>>
>>> The zImage size has been reduced by approximately 3.2%, which is 70KB on
>>> 2.1M.
>>>
>>> Signed-off-by: Yuntao Liu <liuyuntao12@huawei.com>
>>
>> I've retested with both gcc-13 and clang-18, and so no
>> more build issues. Your previous version already worked
>> fine for me.
>>
>> I did some tests combining this with CONFIG_TRIM_UNUSED_KSYMS,
>> which showed a significant improvement as expected. I also
>> tried combining it with an experimental CONFIG_LTO_CLANG
>> patch, but that did not show any further improvements.
>>
>> Tested-by: Arnd Bergmann <arnd@arndb.de>
>> Reviewed-by: Arnd Bergmann <arnd@arndb.de>
>>
>> Adding Ard Biesheuvel and Fangrui Song to Cc, so they can comment
>> on the ARM_VECTORS_TEXT workaround. I don't understand enough of
>> the details of what is going on here.
>>
> 
> Thanks for the cc
> 
>> Full quote of the patch below so they can see the whole thing.
>>
>> If they are also happy with the patch, I think you can send it
>> into Russell's patch tracker at
>> https://www.armlinux.org.uk/developer/patches/info.php
>>
> 
> No, not happy at all :-)
> 
> The resulting kernel does not boot (built with GCC or Clang). And the
> patch is buggy (see below)
> 
After applying `.reloc .text, R_ARM_NONE, .`, the resulting kernel boots
fine in QEMU. I tested it on both the latest linux-next master branch and
the mainline master branch, using vexpress_defconfig.
Arnd Bergmann March 9, 2024, 8:20 a.m. UTC | #9
On Sat, Mar 9, 2024, at 07:14, liuyuntao (F) wrote:
> On 2024/3/8 21:15, Arnd Bergmann wrote:
>> On Thu, Mar 7, 2024, at 16:12, Yuntao Liu wrote:
>
> Thanks for the tests, CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and 
> CONFIG_TRIM_UNUSED_KSYMS do indeed result in a significant improvement.
> I found that arm32 still doesn't support CONFIG_LTO_CLANG. I've done 
> some work on it, but without success. I'd like to learn more about the 
> CONFIG_LTO_CLANG patch. Do you have any relevant links?

I did not try to get it to boot and gave up when I did not see
any size improvement. I think there were previous attempts to
do it elsewhere, which I did not try to find.

The patch below makes it build, but it still requires disabling
CONFIG_THUMB2_KERNEL, which totally defeats the purpose of shrinking
the kernel as it adds some 40% size overhead in the vmlinux.
There are probably also runtime bugs that get introduced by this.

     Arnd

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index de78ceb821df..7ebfda4839e8 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -2,6 +2,8 @@
 config ARM
 	bool
 	default y
+	select ARCH_SUPPORTS_LTO_CLANG
+	select ARCH_SUPPORTS_LTO_CLANG_THIN
 	select ARCH_32BIT_OFF_T
 	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
 	select ARCH_HAS_BINFMT_FLAT
diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
index 726ecabcef09..f2ddce451ab9 100644
--- a/arch/arm/boot/compressed/Makefile
+++ b/arch/arm/boot/compressed/Makefile
@@ -9,6 +9,8 @@ OBJS		=
 
 HEAD	= head.o
 OBJS	+= misc.o decompress.o
+CFLAGS_REMOVE_misc.o += $(CC_FLAGS_LTO)
+CFLAGS_REMOVE_decompress.o += $(CC_FLAGS_LTO)
 ifeq ($(CONFIG_DEBUG_UNCOMPRESS),y)
 OBJS	+= debug.o
 AFLAGS_head.o += -DDEBUG
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index d19d140a10c7..aee9e13023a8 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -38,15 +38,14 @@ EXPORT_SYMBOL(arm_heavy_mb);
 static void flush_pfn_alias(unsigned long pfn, unsigned long vaddr)
 {
 	unsigned long to = FLUSH_ALIAS_START + (CACHE_COLOUR(vaddr) << PAGE_SHIFT);
-	const int zero = 0;
 
 	set_top_pte(to, pfn_pte(pfn, PAGE_KERNEL));
 
-	asm(	"mcrr	p15, 0, %1, %0, c14\n"
-	"	mcr	p15, 0, %2, c7, c10, 4"
+	asm("mcrr	p15, 0, %1, %0, c14"
 	    :
-	    : "r" (to), "r" (to + PAGE_SIZE - 1), "r" (zero)
+	    : "r" (to), "r" (to + PAGE_SIZE - 1)
 	    : "cc");
+	dsb();
 }
 
 static void flush_icache_alias(unsigned long pfn, unsigned long vaddr, unsigned long len)
@@ -68,11 +67,11 @@ void flush_cache_mm(struct mm_struct *mm)
 	}
 
 	if (cache_is_vipt_aliasing()) {
-		asm(	"mcr	p15, 0, %0, c7, c14, 0\n"
-		"	mcr	p15, 0, %0, c7, c10, 4"
+		asm("mcr	p15, 0, %0, c7, c14, 0"
 		    :
 		    : "r" (0)
 		    : "cc");
+		dsb();
 	}
 }
 
@@ -84,11 +83,11 @@ void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned
 	}
 
 	if (cache_is_vipt_aliasing()) {
-		asm(	"mcr	p15, 0, %0, c7, c14, 0\n"
-		"	mcr	p15, 0, %0, c7, c10, 4"
+		asm("mcr	p15, 0, %0, c7, c14, 0"
 		    :
 		    : "r" (0)
 		    : "cc");
+		dsb();
 	}
 
 	if (vma->vm_flags & VM_EXEC)
liuyuntao (F) March 9, 2024, 1:24 p.m. UTC | #10
On 2024/3/9 16:20, Arnd Bergmann wrote:
> On Sat, Mar 9, 2024, at 07:14, liuyuntao (F) wrote:
>> On 2024/3/8 21:15, Arnd Bergmann wrote:
>>> On Thu, Mar 7, 2024, at 16:12, Yuntao Liu wrote:
>>
>> Thanks for the tests, CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and
>> CONFIG_TRIM_UNUSED_KSYMS do indeed result in a significant improvement.
>> I found that arm32 still doesn't support CONFIG_LTO_CLANG. I've done
>> some work on it, but without success. I'd like to learn more about the
>> CONFIG_LTO_CLANG patch. Do you have any relevant links?
> 
> I did not try to get it to boot and gave up when I did not see
> any size improvement. I think there were previous attempts to
> do it elsewhere, which I did not try to find.
> 

I tested this patch, the size improvement was only about one 
ten-thousandth, and the compilation time had increased by about a quarter,
and the kernel did not boot.

Strangely, LTO has actually increased the compilation time 
significantly, which seems contrary to its purpose.

Size:

           +          +trim      +dce       +trim+dce
no lto    5995384    5858720    5841024    5299032
lto       5990040    5854544    5839992    5289576
shrink    8.9‱       7.1‱       1.7‱       17.8‱

Compilation time:

           +          +trim      +dce       +trim+dce
no lto    34.616     33.03      36.093     32.211
lto       46.881     45.324     47.247     43.246
increase  26.20%     27.10%     23.60%     25.50%
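
The shrink row of the size table can be reproduced with a few lines of C. This is only a sketch: the helper name `shrink_per_ten_thousand` is made up for illustration, and the reduction is assumed to be taken relative to the non-LTO size.

```c
/*
 * Relative size reduction in parts per ten thousand (the unit written
 * as a per-ten-thousand sign in the table). "before" is the size
 * without LTO, "after" is the size with LTO.
 */
double shrink_per_ten_thousand(unsigned long before, unsigned long after)
{
	/* Convert first so that after > before gives a negative result
	 * instead of an unsigned wrap-around. */
	return ((double)before - (double)after) * 10000.0 / (double)before;
}
```

For the first column, `shrink_per_ten_thousand(5995384, 5990040)` comes to roughly 8.9, in line with the table.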



> The patch below makes it build, but it still requires disabling
> CONFIG_THUMB2_KERNEL, which totally defeats the purpose of shrinking
> the kernel as it adds some 40% size overhead in the vmlinux.
> There are probably also runtime bugs that get introduced by this.
> 
>       Arnd
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index de78ceb821df..7ebfda4839e8 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -2,6 +2,8 @@
>   config ARM
>   	bool
>   	default y
> +	select ARCH_SUPPORTS_LTO_CLANG
> +	select ARCH_SUPPORTS_LTO_CLANG_THIN
>   	select ARCH_32BIT_OFF_T
>   	select ARCH_CORRECT_STACKTRACE_ON_KRETPROBE if HAVE_KRETPROBES && FRAME_POINTER && !ARM_UNWIND
>   	select ARCH_HAS_BINFMT_FLAT
> diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile
> index 726ecabcef09..f2ddce451ab9 100644
> --- a/arch/arm/boot/compressed/Makefile
> +++ b/arch/arm/boot/compressed/Makefile
> @@ -9,6 +9,8 @@ OBJS		=
>   
>   HEAD	= head.o
>   OBJS	+= misc.o decompress.o
> +CFLAGS_REMOVE_misc.o += $(CC_FLAGS_LTO)
> +CFLAGS_REMOVE_decompress.o += $(CC_FLAGS_LTO)

Wow, I've encountered this issue before but didn't think of solving it
this way. You really have a thorough understanding of these parameters.
On a side note, when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is enabled but
only a few rodata sections are removed and no functions are eliminated,
are there any compiler or linker options that control this behavior?
Thanks.

>   ifeq ($(CONFIG_DEBUG_UNCOMPRESS),y)
>   OBJS	+= debug.o
>   AFLAGS_head.o += -DDEBUG
> diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
> index d19d140a10c7..aee9e13023a8 100644
> --- a/arch/arm/mm/flush.c
> +++ b/arch/arm/mm/flush.c
> @@ -38,15 +38,14 @@ EXPORT_SYMBOL(arm_heavy_mb);
>   static void flush_pfn_alias(unsigned long pfn, unsigned long vaddr)
>   {
>   	unsigned long to = FLUSH_ALIAS_START + (CACHE_COLOUR(vaddr) << PAGE_SHIFT);
> -	const int zero = 0;
>   
>   	set_top_pte(to, pfn_pte(pfn, PAGE_KERNEL));
>   
> -	asm(	"mcrr	p15, 0, %1, %0, c14\n"
> -	"	mcr	p15, 0, %2, c7, c10, 4"
> +	asm("mcrr	p15, 0, %1, %0, c14"
>   	    :
> -	    : "r" (to), "r" (to + PAGE_SIZE - 1), "r" (zero)
> +	    : "r" (to), "r" (to + PAGE_SIZE - 1)
>   	    : "cc");
> +	dsb();
>   }
>   
>   static void flush_icache_alias(unsigned long pfn, unsigned long vaddr, unsigned long len)
> @@ -68,11 +67,11 @@ void flush_cache_mm(struct mm_struct *mm)
>   	}
>   
>   	if (cache_is_vipt_aliasing()) {
> -		asm(	"mcr	p15, 0, %0, c7, c14, 0\n"
> -		"	mcr	p15, 0, %0, c7, c10, 4"
> +		asm("mcr	p15, 0, %0, c7, c14, 0"
>   		    :
>   		    : "r" (0)
>   		    : "cc");
> +		dsb();
>   	}
>   }
>   
> @@ -84,11 +83,11 @@ void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned
>   	}
>   
>   	if (cache_is_vipt_aliasing()) {
> -		asm(	"mcr	p15, 0, %0, c7, c14, 0\n"
> -		"	mcr	p15, 0, %0, c7, c10, 4"
> +		asm("mcr	p15, 0, %0, c7, c14, 0"
>   		    :
>   		    : "r" (0)
>   		    : "cc");
> +		dsb();
>   	}
>   
>   	if (vma->vm_flags & VM_EXEC)
Geert Uytterhoeven March 11, 2024, 9:14 a.m. UTC | #11
Hi Yuntao,

On Sat, Mar 9, 2024 at 2:24 PM liuyuntao (F) <liuyuntao12@huawei.com> wrote:
> On 2024/3/9 16:20, Arnd Bergmann wrote:
> > On Sat, Mar 9, 2024, at 07:14, liuyuntao (F) wrote:
> >> On 2024/3/8 21:15, Arnd Bergmann wrote:
> >>> On Thu, Mar 7, 2024, at 16:12, Yuntao Liu wrote:
> >>
> >> Thanks for the tests, CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and
> >> CONFIG_TRIM_UNUSED_KSYMS do indeed result in a significant improvement.
> >> I found that arm32 still doesn't support CONFIG_LTO_CLANG. I've done
> >> some work on it, but without success. I'd like to learn more about the
> >> CONFIG_LTO_CLANG patch. Do you have any relevant links?
> >
> > I did not try to get it to boot and gave up when I did not see
> > any size improvement. I think there were previous attempts to
> > do it elsewhere, which I did not try to find.
> >
>
> I tested this patch, the size improvement was only about one
> ten-thousandth, and the compilation time had increased by about a quarter,
> and the kernel did not boot.
>
> Strangely, LTO has actually increased the compilation time
> significantly, which seems contrary to its purpose.

The purpose of LTO is to reduce code size. Doing so requires more
processing, hence the total build time increases.

Gr{oetje,eeting}s,

                        Geert
liuyuntao (F) March 11, 2024, 9:39 a.m. UTC | #12
On 2024/3/11 17:14, Geert Uytterhoeven wrote:
> Hi Yuntao,
> 
> On Sat, Mar 9, 2024 at 2:24 PM liuyuntao (F) <liuyuntao12@huawei.com> wrote:
>> On 2024/3/9 16:20, Arnd Bergmann wrote:
>>> On Sat, Mar 9, 2024, at 07:14, liuyuntao (F) wrote:
>>>> On 2024/3/8 21:15, Arnd Bergmann wrote:
>>>>> On Thu, Mar 7, 2024, at 16:12, Yuntao Liu wrote:
>>>>
>>>> Thanks for the tests, CONFIG_LD_DEAD_CODE_DATA_ELIMINATION and
>>>> CONFIG_TRIM_UNUSED_KSYMS do indeed result in a significant improvement.
>>>> I found that arm32 still doesn't support CONFIG_LTO_CLANG. I've done
>>>> some work on it, but without success. I'd like to learn more about the
>>>> CONFIG_LTO_CLANG patch. Do you have any relevant links?
>>>
>>> I did not try to get it to boot and gave up when I did not see
>>> any size improvement. I think there were previous attempts to
>>> do it elsewhere, which I did not try to find.
>>>
>>
>> I tested this patch, the size improvement was only about one
>> ten-thousandth, and the compilation time had increased by about a quarter,
>> and the kernel did not boot.
>>
>> Strangely, LTO has actually increased the compilation time
>> significantly, which seems contrary to its purpose.
> 
> The purpose of LTO is to reduce code size. Doing so requires more
> processing, hence the total build time increases.
> 
> Gr{oetje,eeting}s,
> 
>                          Geert
> 

Thanks, Geert, I got it.
Arnd Bergmann March 11, 2024, 11:41 a.m. UTC | #13
On Mon, Mar 11, 2024, at 10:14, Geert Uytterhoeven wrote:
> On Sat, Mar 9, 2024 at 2:24 PM liuyuntao (F) <liuyuntao12@huawei.com> wrote:
>> On 2024/3/9 16:20, Arnd Bergmann wrote:
>>
>> I tested this patch, the size improvement was only about one
>> ten-thousandth, and the compilation time had increased by about a quarter,
>> and the kernel did not boot.
>>
>> Strangely, LTO has actually increased the compilation time
>> significantly, which seems contrary to its purpose.
>
> The purpose of LTO is to reduce code size. Doing so requires more
> processing, hence the total build time increases.

I think llvm treats it purely as a performance optimization of
the resulting binary, allowing cross-unit inlining and constant
folding, but I don't think it actually tries to make the output
smaller, or succeeds at it. I do remember seeing size improvements
with LTO using gcc in the past, but this never made it into the
mainline kernel. The last time someone tried to add it was in 2022 [1];
I'm not sure why there was no follow-up.

     Arnd

[1] https://lore.kernel.org/lkml/20221114114344.18650-1-jirislaby@kernel.org/
Patch

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 0af6709570d1..de78ceb821df 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -113,6 +113,7 @@  config ARM
 	select HAVE_KERNEL_XZ
 	select HAVE_KPROBES if !XIP_KERNEL && !CPU_ENDIAN_BE32 && !CPU_V7M
 	select HAVE_KRETPROBES if HAVE_KPROBES
+	select HAVE_LD_DEAD_CODE_DATA_ELIMINATION
 	select HAVE_MOD_ARCH_SPECIFIC
 	select HAVE_NMI
 	select HAVE_OPTPROBES if !THUMB2_KERNEL
diff --git a/arch/arm/boot/compressed/vmlinux.lds.S b/arch/arm/boot/compressed/vmlinux.lds.S
index 3fcb3e62dc56..da21244aa892 100644
--- a/arch/arm/boot/compressed/vmlinux.lds.S
+++ b/arch/arm/boot/compressed/vmlinux.lds.S
@@ -89,7 +89,7 @@  SECTIONS
      * The EFI stub always executes from RAM, and runs strictly before the
      * decompressor, so we can make an exception for its r/w data, and keep it
      */
-    *(.data.efistub .bss.efistub)
+    *(.data.* .bss.*)
     __pecoff_data_end = .;
 
     /*
@@ -125,7 +125,7 @@  SECTIONS
 
   . = BSS_START;
   __bss_start = .;
-  .bss			: { *(.bss) }
+  .bss			: { *(.bss .bss.*) }
   _end = .;
 
   . = ALIGN(8);		/* the stack must be 64-bit aligned */
diff --git a/arch/arm/include/asm/vmlinux.lds.h b/arch/arm/include/asm/vmlinux.lds.h
index 4c8632d5c432..dfe2b6ad6b51 100644
--- a/arch/arm/include/asm/vmlinux.lds.h
+++ b/arch/arm/include/asm/vmlinux.lds.h
@@ -42,7 +42,7 @@ 
 #define PROC_INFO							\
 		. = ALIGN(4);						\
 		__proc_info_begin = .;					\
-		*(.proc.info.init)					\
+		KEEP(*(.proc.info.init))				\
 		__proc_info_end = .;
 
 #define IDMAP_TEXT							\
@@ -87,6 +87,22 @@ 
 		*(.vfp11_veneer)                                        \
 		*(.v4_bx)
 
+/*
+ * When CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is enabled, the .vectors
+ * sections must be annotated with KEEP. GNU ld accepts KEEP directly on
+ * the .vectors sections in ARM_VECTORS, but ld.lld does not recognize
+ * KEEP inside the OVERLAY command and treats it as a regular string.
+ * Define a distinct output section here that explicitly retains the
+ * .vectors sections when CONFIG_LD_DEAD_CODE_DATA_ELIMINATION is
+ * turned on.
+ */
+#define ARM_VECTORS_TEXT						\
+	.vectors.text : {						\
+		KEEP(*(.vectors))					\
+		KEEP(*(.vectors.bhb.loop8))				\
+		KEEP(*(.vectors.bhb.bpiall))				\
+	}
+
 #define ARM_TEXT							\
 		IDMAP_TEXT						\
 		__entry_text_start = .;					\
diff --git a/arch/arm/kernel/vmlinux-xip.lds.S b/arch/arm/kernel/vmlinux-xip.lds.S
index c16d196b5aad..035fa18060b3 100644
--- a/arch/arm/kernel/vmlinux-xip.lds.S
+++ b/arch/arm/kernel/vmlinux-xip.lds.S
@@ -63,7 +63,7 @@  SECTIONS
 	. = ALIGN(4);
 	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
 		__start___ex_table = .;
-		ARM_MMU_KEEP(*(__ex_table))
+		ARM_MMU_KEEP(KEEP(*(__ex_table)))
 		__stop___ex_table = .;
 	}
 
@@ -83,7 +83,7 @@  SECTIONS
 	}
 	.init.arch.info : {
 		__arch_info_begin = .;
-		*(.arch.info.init)
+		KEEP(*(.arch.info.init))
 		__arch_info_end = .;
 	}
 	.init.tagtable : {
@@ -135,6 +135,10 @@  SECTIONS
 	ARM_TCM
 #endif
 
+#ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
+	ARM_VECTORS_TEXT
+#endif
+
 	/*
 	 * End of copied data. We need a dummy section to get its LMA.
 	 * Also located before final ALIGN() as trailing padding is not stored
diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index bd9127c4b451..2cfb890c93fb 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -74,7 +74,7 @@  SECTIONS
 	. = ALIGN(4);
 	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
 		__start___ex_table = .;
-		ARM_MMU_KEEP(*(__ex_table))
+		ARM_MMU_KEEP(KEEP(*(__ex_table)))
 		__stop___ex_table = .;
 	}
 
@@ -99,7 +99,7 @@  SECTIONS
 	}
 	.init.arch.info : {
 		__arch_info_begin = .;
-		*(.arch.info.init)
+		KEEP(*(.arch.info.init))
 		__arch_info_end = .;
 	}
 	.init.tagtable : {
@@ -116,7 +116,7 @@  SECTIONS
 #endif
 	.init.pv_table : {
 		__pv_table_begin = .;
-		*(.pv_table)
+		KEEP(*(.pv_table))
 		__pv_table_end = .;
 	}
 
@@ -134,6 +134,10 @@  SECTIONS
 	ARM_TCM
 #endif
 
+#ifdef CONFIG_LD_DEAD_CODE_DATA_ELIMINATION
+	ARM_VECTORS_TEXT
+#endif
+
 #ifdef CONFIG_STRICT_KERNEL_RWX
 	. = ALIGN(1<<SECTION_SHIFT);
 #else
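
To illustrate why the tables touched above need KEEP, here is a minimal user-space sketch in C of a linker-collected table. It is not kernel code and not part of this patch. The names `demo_entry`, `demo_table`, `demo_table_count` and `demo_table_find` are hypothetical. The sketch assumes an ELF target linked with GNU ld or ld.lld, which provide `__start_<name>` and `__stop_<name>` symbols for sections whose name is a valid C identifier.

```c
#include <stddef.h>

/* Hypothetical record type; stands in for entries such as .arch.info.init. */
struct demo_entry {
	const char *name;
	int id;
};

static struct demo_entry entry_a = { "board-a", 1 };
static struct demo_entry entry_b = { "board-b", 2 };

/*
 * One pointer per entry is placed in the section "demo_table". Nothing
 * in the program refers to these variables by name, so "used" is needed
 * to stop the compiler from dropping them.
 */
static struct demo_entry *demo_ptr_a
	__attribute__((used, section("demo_table"))) = &entry_a;
static struct demo_entry *demo_ptr_b
	__attribute__((used, section("demo_table"))) = &entry_b;

/* Section boundary symbols generated by the linker (assumption: ELF). */
extern struct demo_entry *__start_demo_table[];
extern struct demo_entry *__stop_demo_table[];

/* Number of pointers the linker collected into the table. */
size_t demo_table_count(void)
{
	return (size_t)(__stop_demo_table - __start_demo_table);
}

/* Walk the table from start to stop, the way the kernel walks its own. */
const struct demo_entry *demo_table_find(int id)
{
	struct demo_entry **p;

	for (p = __start_demo_table; p < __stop_demo_table; p++)
		if ((*p)->id == id)
			return *p;
	return NULL;
}
```

The entries are reached only through the section boundaries. In the kernel linker scripts the boundary symbols (for example `__arch_info_begin` and `__arch_info_end`) are assigned inside the script itself, so no relocation points at the input sections, and `--gc-sections` would discard them unless they are wrapped in KEEP.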