
arm64: module: Widen module region to 2 GiB

Message ID 20230404135437.2744866-1-ardb@kernel.org (mailing list archive)
State New, archived
Series arm64: module: Widen module region to 2 GiB

Commit Message

Ard Biesheuvel April 4, 2023, 1:54 p.m. UTC
Shanker reports that, after loading a 110+ MiB kernel module, no other
modules can be loaded, in spite of the fact that module PLTs have been
enabled in the build.

This is due to the fact that, even with module PLTs enabled, the module
region is dimensioned as a fixed 128 MiB window, as we simply never
anticipated the need to support modules that huge.

So let's increase the size of the statically allocated module region to
2 GiB, and update the module loading logic so that we prefer loading
modules in the vicinity of the kernel text, where relative branches can
be resolved without the need for PLTs. Only when we run out of space
here (or when CONFIG_RANDOMIZE_MODULE_REGION_FULL is enabled) do we
fall back to the larger window and allocate from there.

While at it, let's try to simplify the logic a bit, to make it easier to
follow:
- remove the special cases for KASAN, which are no longer needed now
  that KASAN_VMALLOC is always enabled when KASAN is configured;
- instead of defining a global module base address, define a global
  module limit, and work our way down from it.

Cc: Shanker Donthineni <sdonthineni@nvidia.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
 Documentation/arm64/memory.rst  |  8 ++---
 arch/arm64/include/asm/memory.h |  2 +-
 arch/arm64/include/asm/module.h | 10 ++++--
 arch/arm64/kernel/kaslr.c       | 38 +++++++++++------------
 arch/arm64/kernel/module.c      | 54 ++++++++++++++++-----------------
 5 files changed, 59 insertions(+), 53 deletions(-)

Comments

Mark Rutland April 4, 2023, 3:49 p.m. UTC | #1
On Tue, Apr 04, 2023 at 03:54:37PM +0200, Ard Biesheuvel wrote:
> Shanker reports that, after loading a 110+ MiB kernel module, no other
> modules can be loaded, in spite of the fact that module PLTs have been
> enabled in the build.
> 
> This is due to the fact, even with module PLTs enabled, the module
> region is dimensioned to be a fixed 128 MiB region, as we simply never
> anticipated the need for supporting modules that huge.
> 
> So let's increase the size of the statically allocated module region to
> 2 GiB, and update the module loading logic so that we prefer loading
> modules in the vicinity of the kernel text, where relative branches can
> be resolved without the need for PLTs. Only when we run out of space
> here (or when CONFIG_RANDOMIZE_MODULE_REGION_FULL is enabled), we will
> fall back to the larger window and allocate from there.
> 
> While at it, let's try to simplify the logic a bit, to make it easier to
> follow:
> - remove the special cases for KASAN, which are no longer needed now
>   that KASAN_VMALLOC is always enabled when KASAN is configured;
> - instead of defining a global module base address, define a global
>   module limit, and work our way down from it.

I find this last change a bit confusing, and I think it'd be preferable to have
explicit start and end variables; more on that below.

> 
> Cc: Shanker Donthineni <sdonthineni@nvidia.com>
> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> ---
>  Documentation/arm64/memory.rst  |  8 ++---
>  arch/arm64/include/asm/memory.h |  2 +-
>  arch/arm64/include/asm/module.h | 10 ++++--
>  arch/arm64/kernel/kaslr.c       | 38 +++++++++++------------
>  arch/arm64/kernel/module.c      | 54 ++++++++++++++++-----------------
>  5 files changed, 59 insertions(+), 53 deletions(-)
> 
> diff --git a/Documentation/arm64/memory.rst b/Documentation/arm64/memory.rst
> index 2a641ba7be3b717a..55a55f30eed8a6ce 100644
> --- a/Documentation/arm64/memory.rst
> +++ b/Documentation/arm64/memory.rst
> @@ -33,8 +33,8 @@ AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::
>    0000000000000000	0000ffffffffffff	 256TB		user
>    ffff000000000000	ffff7fffffffffff	 128TB		kernel logical memory map
>   [ffff600000000000	ffff7fffffffffff]	  32TB		[kasan shadow region]
> -  ffff800000000000	ffff800007ffffff	 128MB		modules
> -  ffff800008000000	fffffbffefffffff	 124TB		vmalloc
> +  ffff800000000000	ffff80007fffffff	   2GB		modules
> +  ffff800080000000	fffffbffefffffff	 124TB		vmalloc
>    fffffbfff0000000	fffffbfffdffffff	 224MB		fixed mappings (top down)
>    fffffbfffe000000	fffffbfffe7fffff	   8MB		[guard region]
>    fffffbfffe800000	fffffbffff7fffff	  16MB		PCI I/O space
> @@ -50,8 +50,8 @@ AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support):
>    0000000000000000	000fffffffffffff	   4PB		user
>    fff0000000000000	ffff7fffffffffff	  ~4PB		kernel logical memory map
>   [fffd800000000000	ffff7fffffffffff]	 512TB		[kasan shadow region]
> -  ffff800000000000	ffff800007ffffff	 128MB		modules
> -  ffff800008000000	fffffbffefffffff	 124TB		vmalloc
> +  ffff800000000000	ffff80007fffffff	   2GB		modules
> +  ffff800080000000	fffffbffefffffff	 124TB		vmalloc
>    fffffbfff0000000	fffffbfffdffffff	 224MB		fixed mappings (top down)
>    fffffbfffe000000	fffffbfffe7fffff	   8MB		[guard region]
>    fffffbfffe800000	fffffbffff7fffff	  16MB		PCI I/O space
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index 78e5163836a0ab95..b58c3127323e16c8 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -46,7 +46,7 @@
>  #define KIMAGE_VADDR		(MODULES_END)
>  #define MODULES_END		(MODULES_VADDR + MODULES_VSIZE)
>  #define MODULES_VADDR		(_PAGE_END(VA_BITS_MIN))
> -#define MODULES_VSIZE		(SZ_128M)
> +#define MODULES_VSIZE		(SZ_2G)
>  #define VMEMMAP_START		(-(UL(1) << (VA_BITS - VMEMMAP_SHIFT)))
>  #define VMEMMAP_END		(VMEMMAP_START + VMEMMAP_SIZE)
>  #define PCI_IO_END		(VMEMMAP_START - SZ_8M)
> diff --git a/arch/arm64/include/asm/module.h b/arch/arm64/include/asm/module.h
> index 18734fed3bdd7609..98dae9f87b521f07 100644
> --- a/arch/arm64/include/asm/module.h
> +++ b/arch/arm64/include/asm/module.h
> @@ -31,9 +31,15 @@ u64 module_emit_veneer_for_adrp(struct module *mod, Elf64_Shdr *sechdrs,
>  				void *loc, u64 val);
>  
>  #ifdef CONFIG_RANDOMIZE_BASE
> -extern u64 module_alloc_base;
> +extern u64 module_alloc_limit;
>  #else
> -#define module_alloc_base	((u64)_etext - MODULES_VSIZE)
> +#define module_alloc_limit	MODULE_REF_END
> +#endif
> +
> +#ifdef CONFIG_ARM64_MODULE_PLTS
> +#define MODULE_REF_END		((u64)_end)
> +#else
> +#define MODULE_REF_END		((u64)_etext)
>  #endif

I was initially a bit confused by this. I think it's a bit misleading for the
!CONFIG_ARM64_MODULE_PLTS case, since modules can still have data references
between _etext and _end, it's just that we're (hopefully) unlikely to have the
text be <128M with >2G of subsequent data.

I'd find this clearer if we could express the two constraints separately. e.g.
have something like:

#define MODULE_DATA_REF_END		((u64)_end)
#define MODULE_TEXT_REF_END		((u64)_etext)

That'd allow us to do something like the below, which I think would be clearer.

u64 module_alloc_end;
u64 module_alloc_base_noplt;
u64 module_alloc_base_plt;

/*
 * Call this somewhere after choosing the KASLR limits
 */
void module_limits_init(void)
{
	module_alloc_end = (u64)_stext;

	/*
	 * 32-bit relative data references must always fall within 2G of the
	 * end of the kernel image.
	 */
	module_alloc_base_plt = max(MODULES_VADDR, MODULE_DATA_REF_END - SZ_2G);

	/*
	 * Direct branches must be within 128M of the end of the kernel text.
	 */
	module_alloc_base_noplt = max(module_alloc_base_plt,
				      MODULE_TEXT_REF_END - SZ_128M);
}

void *module_alloc(unsigned long size)
{
	gfp_t gfp_mask = GFP_KERNEL;
	void *p = NULL;

	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
		gfp_mask |= __GFP_NOWARN;

	/*
	 * By default, prefer to place modules such that they don't require
	 * PLTs. With CONFIG_RANDOMIZE_MODULE_REGION_FULL=y we'll prefer to use
	 * the entire range possible with PLTs, to get as much entropy as possible.
	 */
	if (!IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
		p = __vmalloc_node_range(size, MODULE_ALIGN,
					 module_alloc_base_noplt, module_alloc_end,
					 gfp_mask, PAGE_KERNEL,
					 VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
					 __builtin_return_address(0));
	}

	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS)) {
		p = __vmalloc_node_range(size, MODULE_ALIGN,
					 module_alloc_base_plt, module_alloc_end,
					 gfp_mask, PAGE_KERNEL,
					 VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
					 __builtin_return_address(0));
	}

	[... play with KASAN here ...]
}

Is that viable, or am I missing something?

Thanks,
Mark.

>  struct plt_entry {
> diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
> index e7477f21a4c9d062..14e96c3f707a74a3 100644
> --- a/arch/arm64/kernel/kaslr.c
> +++ b/arch/arm64/kernel/kaslr.c
> @@ -8,6 +8,7 @@
>  #include <linux/init.h>
>  #include <linux/libfdt.h>
>  #include <linux/mm_types.h>
> +#include <linux/module.h>
>  #include <linux/sched.h>
>  #include <linux/types.h>
>  #include <linux/pgtable.h>
> @@ -17,10 +18,11 @@
>  #include <asm/kernel-pgtable.h>
>  #include <asm/memory.h>
>  #include <asm/mmu.h>
> +#include <asm/module.h>
>  #include <asm/sections.h>
>  #include <asm/setup.h>
>  
> -u64 __ro_after_init module_alloc_base;
> +u64 __ro_after_init module_alloc_limit = MODULE_REF_END;
>  u16 __initdata memstart_offset_seed;
>  
>  struct arm64_ftr_override kaslr_feature_override __initdata;
> @@ -30,12 +32,6 @@ static int __init kaslr_init(void)
>  	u64 module_range;
>  	u32 seed;
>  
> -	/*
> -	 * Set a reasonable default for module_alloc_base in case
> -	 * we end up running with module randomization disabled.
> -	 */
> -	module_alloc_base = (u64)_etext - MODULES_VSIZE;
> -
>  	if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
>  		pr_info("KASLR disabled on command line\n");
>  		return 0;
> @@ -69,24 +65,28 @@ static int __init kaslr_init(void)
>  		 * resolved via PLTs. (Branches between modules will be
>  		 * resolved normally.)
>  		 */
> -		module_range = SZ_2G - (u64)(_end - _stext);
> -		module_alloc_base = max((u64)_end - SZ_2G, (u64)MODULES_VADDR);
> +		module_range = SZ_2G;
>  	} else {
>  		/*
> -		 * Randomize the module region by setting module_alloc_base to
> -		 * a PAGE_SIZE multiple in the range [_etext - MODULES_VSIZE,
> -		 * _stext) . This guarantees that the resulting region still
> -		 * covers [_stext, _etext], and that all relative branches can
> -		 * be resolved without veneers unless this region is exhausted
> -		 * and we fall back to a larger 2GB window in module_alloc()
> -		 * when ARM64_MODULE_PLTS is enabled.
> +		 * Randomize the module region over a 128 MB window covering
> +		 * the kernel text. This guarantees that the resulting region
> +		 * still covers [_stext, _etext], and that all relative
> +		 * branches can be resolved without veneers unless this region
> +		 * is exhausted and we fall back to a larger 2GB window in
> +		 * module_alloc() when ARM64_MODULE_PLTS is enabled.
>  		 */
> -		module_range = MODULES_VSIZE - (u64)(_etext - _stext);
> +		module_range = SZ_128M;
>  	}
>  
> +	/*
> +	 * Subtract the size of the core kernel region that must be in range
> +	 * for all loaded modules.
> +	 */
> +	module_range -= MODULE_REF_END - (u64)_stext;
> +
>  	/* use the lower 21 bits to randomize the base of the module region */
> -	module_alloc_base += (module_range * (seed & ((1 << 21) - 1))) >> 21;
> -	module_alloc_base &= PAGE_MASK;
> +	module_alloc_limit += (module_range * (seed & ((1 << 21) - 1))) >> 21;
> +	module_alloc_limit &= PAGE_MASK;
>  
>  	return 0;
>  }
> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index 5af4975caeb58ff7..aa61493957c010b2 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -24,7 +24,6 @@
>  
>  void *module_alloc(unsigned long size)
>  {
> -	u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
>  	gfp_t gfp_mask = GFP_KERNEL;
>  	void *p;
>  
> @@ -32,33 +31,34 @@ void *module_alloc(unsigned long size)
>  	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
>  		gfp_mask |= __GFP_NOWARN;
>  
> -	if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
> -	    IS_ENABLED(CONFIG_KASAN_SW_TAGS))
> -		/* don't exceed the static module region - see below */
> -		module_alloc_end = MODULES_END;
> -
> -	p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> -				module_alloc_end, gfp_mask, PAGE_KERNEL, VM_DEFER_KMEMLEAK,
> -				NUMA_NO_NODE, __builtin_return_address(0));
> +	/*
> +	 * First, try to allocate from the 128 MB region just below the limit.
> +	 * If KASLR is disabled, or CONFIG_RANDOMIZE_MODULE_REGION_FULL is not
> +	 * set, this will produce an allocation that allows all relative
> +	 * branches into the kernel text to be resolved without the need for
> +	 * veneers (PLTs). If CONFIG_RANDOMIZE_MODULE_REGION_FULL is set, this
> +	 * 128 MB window might not cover the kernel text, but branches between
> +	 * modules will still be in relative branching range.
> +	 */
> +	p = __vmalloc_node_range(size, MODULE_ALIGN,
> +				 module_alloc_limit - SZ_128M,
> +				 module_alloc_limit, gfp_mask, PAGE_KERNEL,
> +				 VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
> +				 __builtin_return_address(0));
>  
> -	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
> -	    (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
> -	     (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
> -	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))
> -		/*
> -		 * KASAN without KASAN_VMALLOC can only deal with module
> -		 * allocations being served from the reserved module region,
> -		 * since the remainder of the vmalloc region is already
> -		 * backed by zero shadow pages, and punching holes into it
> -		 * is non-trivial. Since the module region is not randomized
> -		 * when KASAN is enabled without KASAN_VMALLOC, it is even
> -		 * less likely that the module region gets exhausted, so we
> -		 * can simply omit this fallback in that case.
> -		 */
> -		p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
> -				module_alloc_base + SZ_2G, GFP_KERNEL,
> -				PAGE_KERNEL, 0, NUMA_NO_NODE,
> -				__builtin_return_address(0));
> +	/*
> +	 * If the prior allocation failed, and we have configured support for
> +	 * fixing up out-of-range relative branches through the use of PLTs,
> +	 * fall back to a 2 GB window for module allocations. This is the
> +	 * maximum we can support, due to the use of 32-bit place relative
> +	 * symbol references, which cannot be fixed up using PLTs.
> +	 */
> +	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> +		p = __vmalloc_node_range(size, MODULE_ALIGN,
> +					 module_alloc_limit - SZ_2G,
> +					 module_alloc_limit, GFP_KERNEL,
> +					 PAGE_KERNEL, 0, NUMA_NO_NODE,
> +					 __builtin_return_address(0));
>  
>  	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
>  		vfree(p);
> -- 
> 2.39.2
>
Ard Biesheuvel April 4, 2023, 4:07 p.m. UTC | #2
On Tue, 4 Apr 2023 at 17:49, Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Apr 04, 2023 at 03:54:37PM +0200, Ard Biesheuvel wrote:
> > Shanker reports that, after loading a 110+ MiB kernel module, no other
> > modules can be loaded, in spite of the fact that module PLTs have been
> > enabled in the build.
> >
> > This is due to the fact, even with module PLTs enabled, the module
> > region is dimensioned to be a fixed 128 MiB region, as we simply never
> > anticipated the need for supporting modules that huge.
> >
> > So let's increase the size of the statically allocated module region to
> > 2 GiB, and update the module loading logic so that we prefer loading
> > modules in the vicinity of the kernel text, where relative branches can
> > be resolved without the need for PLTs. Only when we run out of space
> > here (or when CONFIG_RANDOMIZE_MODULE_REGION_FULL is enabled), we will
> > fall back to the larger window and allocate from there.
> >
> > While at it, let's try to simplify the logic a bit, to make it easier to
> > follow:
> > - remove the special cases for KASAN, which are no longer needed now
> >   that KASAN_VMALLOC is always enabled when KASAN is configured;
> > - instead of defining a global module base address, define a global
> >   module limit, and work our way down from it.
>
> I find this last change a bit confusing, and I think it'd be preferable to have
> explicit start and end variables; more on that below.
>
> >
> > Cc: Shanker Donthineni <sdonthineni@nvidia.com>
> > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > ---
> >  Documentation/arm64/memory.rst  |  8 ++---
> >  arch/arm64/include/asm/memory.h |  2 +-
> >  arch/arm64/include/asm/module.h | 10 ++++--
> >  arch/arm64/kernel/kaslr.c       | 38 +++++++++++------------
> >  arch/arm64/kernel/module.c      | 54 ++++++++++++++++-----------------
> >  5 files changed, 59 insertions(+), 53 deletions(-)
> >
> > diff --git a/Documentation/arm64/memory.rst b/Documentation/arm64/memory.rst
> > index 2a641ba7be3b717a..55a55f30eed8a6ce 100644
> > --- a/Documentation/arm64/memory.rst
> > +++ b/Documentation/arm64/memory.rst
> > @@ -33,8 +33,8 @@ AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::
> >    0000000000000000   0000ffffffffffff         256TB          user
> >    ffff000000000000   ffff7fffffffffff         128TB          kernel logical memory map
> >   [ffff600000000000   ffff7fffffffffff]         32TB          [kasan shadow region]
> > -  ffff800000000000   ffff800007ffffff         128MB          modules
> > -  ffff800008000000   fffffbffefffffff         124TB          vmalloc
> > +  ffff800000000000   ffff80007fffffff           2GB          modules
> > +  ffff800080000000   fffffbffefffffff         124TB          vmalloc
> >    fffffbfff0000000   fffffbfffdffffff         224MB          fixed mappings (top down)
> >    fffffbfffe000000   fffffbfffe7fffff           8MB          [guard region]
> >    fffffbfffe800000   fffffbffff7fffff          16MB          PCI I/O space
> > @@ -50,8 +50,8 @@ AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support):
> >    0000000000000000   000fffffffffffff           4PB          user
> >    fff0000000000000   ffff7fffffffffff          ~4PB          kernel logical memory map
> >   [fffd800000000000   ffff7fffffffffff]        512TB          [kasan shadow region]
> > -  ffff800000000000   ffff800007ffffff         128MB          modules
> > -  ffff800008000000   fffffbffefffffff         124TB          vmalloc
> > +  ffff800000000000   ffff80007fffffff           2GB          modules
> > +  ffff800080000000   fffffbffefffffff         124TB          vmalloc
> >    fffffbfff0000000   fffffbfffdffffff         224MB          fixed mappings (top down)
> >    fffffbfffe000000   fffffbfffe7fffff           8MB          [guard region]
> >    fffffbfffe800000   fffffbffff7fffff          16MB          PCI I/O space
> > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> > index 78e5163836a0ab95..b58c3127323e16c8 100644
> > --- a/arch/arm64/include/asm/memory.h
> > +++ b/arch/arm64/include/asm/memory.h
> > @@ -46,7 +46,7 @@
> >  #define KIMAGE_VADDR         (MODULES_END)
> >  #define MODULES_END          (MODULES_VADDR + MODULES_VSIZE)
> >  #define MODULES_VADDR                (_PAGE_END(VA_BITS_MIN))
> > -#define MODULES_VSIZE                (SZ_128M)
> > +#define MODULES_VSIZE                (SZ_2G)
> >  #define VMEMMAP_START                (-(UL(1) << (VA_BITS - VMEMMAP_SHIFT)))
> >  #define VMEMMAP_END          (VMEMMAP_START + VMEMMAP_SIZE)
> >  #define PCI_IO_END           (VMEMMAP_START - SZ_8M)
> > diff --git a/arch/arm64/include/asm/module.h b/arch/arm64/include/asm/module.h
> > index 18734fed3bdd7609..98dae9f87b521f07 100644
> > --- a/arch/arm64/include/asm/module.h
> > +++ b/arch/arm64/include/asm/module.h
> > @@ -31,9 +31,15 @@ u64 module_emit_veneer_for_adrp(struct module *mod, Elf64_Shdr *sechdrs,
> >                               void *loc, u64 val);
> >
> >  #ifdef CONFIG_RANDOMIZE_BASE
> > -extern u64 module_alloc_base;
> > +extern u64 module_alloc_limit;
> >  #else
> > -#define module_alloc_base    ((u64)_etext - MODULES_VSIZE)
> > +#define module_alloc_limit   MODULE_REF_END
> > +#endif
> > +
> > +#ifdef CONFIG_ARM64_MODULE_PLTS
> > +#define MODULE_REF_END               ((u64)_end)
> > +#else
> > +#define MODULE_REF_END               ((u64)_etext)
> >  #endif
>
> I was initially a bit confused by this. I think it's a bit misleading for the
> !CONFIG_ARM64_MODULE_PLTS case, since modules can still have data references
> between _etext and _end, it's just that we're (hopefully) unlikely to have the
> text be <128M with >2G of subsequent data.
>

So the reason for this is that otherwise, with PLTs disabled, we lose
(_end - _etext) worth of module space for no good reason, and this
could potentially be a substantial chunk of space. However, when PLTs
are enabled, we cannot safely use _etext as the upper bound, as _etext
- SZ_2G may produce an address that is out of range for a PREL32
relocation. In that case, though, we can tolerate the waste, so we can
just use _end instead.

> I'd find this clearer if we could express the two constraints separately. e.g.
> have something like:
>
> #define MODULE_DATA_REF_END             ((u64)_end)
> #define MODULE_TEXT_REF_END             ((u64)_etext)
>
> That'd allow us to do something like the below, which I think would be clearer.
>
> u64 module_alloc_end;
> u64 module_alloc_base_noplt;
> u64 module_alloc_base_plt;
>
> /*
>  * Call this somewhere after choosing the KASLR limits
>  */
> void module_limits_init(void)
> {
>         module_alloc_end = (u64)_stext;
>
>         /*
>          * 32-bit relative data references must always fall within 2G of the
>          * end of the kernel image.
>          */
>         module_alloc_base_plt = max(MODULES_VADDR, MODULE_DATA_REF_END - SZ_2G);
>
>         /*
>          * Direct branches must be within 128M of the end of the kernel text.
>          */
>         module_alloc_base_noplt = max(module_alloc_base_plt,
>                                       MODULE_TEXT_REF_END - SZ_128M);
> }
>

Currently, the randomization of the module region is independent from
the randomization of the kernel itself, and once we incorporate that
here, I don't think it will be any clearer tbh.


> void *module_alloc(unsigned long size)
> {
>         gfp_t gfp_mask = GFP_KERNEL;
>         void *p = NULL;
>
>         if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
>                 gfp_mask |= __GFP_NOWARN;
>
>         /*
>          * By default, prefer to place modules such that they don't require
>          * PLTs. With CONFIG_RANDOMIZE_MODULE_REGION_FULL=y we'll prefer to use
>          * the entire range possible with PLTs, to get as much entropy as possible.
>          */
>         if (!IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
>                 p = __vmalloc_node_range(size, MODULE_ALIGN,
>                                          module_alloc_base_noplt, module_alloc_end,
>                                          gfp_mask, PAGE_KERNEL,
>                                          VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
>                                          __builtin_return_address(0));
>         }
>
>         if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS)) {
>                 p = __vmalloc_node_range(size, MODULE_ALIGN,
>                                          module_alloc_base_plt, module_alloc_end,
>                                          gfp_mask, PAGE_KERNEL,
>                                          VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
>                                          __builtin_return_address(0));
>         }
>
>         [... play with KASAN here ...]
> }
>
> Is that viable, or am I missing something?
>

It looks correct to me, but I'll defer to others to decide which is preferable.
Mark Rutland April 5, 2023, 11:37 a.m. UTC | #3
On Tue, Apr 04, 2023 at 06:07:04PM +0200, Ard Biesheuvel wrote:
> On Tue, 4 Apr 2023 at 17:49, Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Tue, Apr 04, 2023 at 03:54:37PM +0200, Ard Biesheuvel wrote:
> > > Shanker reports that, after loading a 110+ MiB kernel module, no other
> > > modules can be loaded, in spite of the fact that module PLTs have been
> > > enabled in the build.
> > >
> > > This is due to the fact, even with module PLTs enabled, the module
> > > region is dimensioned to be a fixed 128 MiB region, as we simply never
> > > anticipated the need for supporting modules that huge.
> > >
> > > So let's increase the size of the statically allocated module region to
> > > 2 GiB, and update the module loading logic so that we prefer loading
> > > modules in the vicinity of the kernel text, where relative branches can
> > > be resolved without the need for PLTs. Only when we run out of space
> > > here (or when CONFIG_RANDOMIZE_MODULE_REGION_FULL is enabled), we will
> > > fall back to the larger window and allocate from there.
> > >
> > > While at it, let's try to simplify the logic a bit, to make it easier to
> > > follow:
> > > - remove the special cases for KASAN, which are no longer needed now
> > >   that KASAN_VMALLOC is always enabled when KASAN is configured;
> > > - instead of defining a global module base address, define a global
> > >   module limit, and work our way down from it.
> >
> > I find this last change a bit confusing, and I think it'd be preferable to have
> > explicit start and end variables; more on that below.
> >
> > >
> > > Cc: Shanker Donthineni <sdonthineni@nvidia.com>
> > > Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
> > > ---
> > >  Documentation/arm64/memory.rst  |  8 ++---
> > >  arch/arm64/include/asm/memory.h |  2 +-
> > >  arch/arm64/include/asm/module.h | 10 ++++--
> > >  arch/arm64/kernel/kaslr.c       | 38 +++++++++++------------
> > >  arch/arm64/kernel/module.c      | 54 ++++++++++++++++-----------------
> > >  5 files changed, 59 insertions(+), 53 deletions(-)
> > >
> > > diff --git a/Documentation/arm64/memory.rst b/Documentation/arm64/memory.rst
> > > index 2a641ba7be3b717a..55a55f30eed8a6ce 100644
> > > --- a/Documentation/arm64/memory.rst
> > > +++ b/Documentation/arm64/memory.rst
> > > @@ -33,8 +33,8 @@ AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::
> > >    0000000000000000   0000ffffffffffff         256TB          user
> > >    ffff000000000000   ffff7fffffffffff         128TB          kernel logical memory map
> > >   [ffff600000000000   ffff7fffffffffff]         32TB          [kasan shadow region]
> > > -  ffff800000000000   ffff800007ffffff         128MB          modules
> > > -  ffff800008000000   fffffbffefffffff         124TB          vmalloc
> > > +  ffff800000000000   ffff80007fffffff           2GB          modules
> > > +  ffff800080000000   fffffbffefffffff         124TB          vmalloc
> > >    fffffbfff0000000   fffffbfffdffffff         224MB          fixed mappings (top down)
> > >    fffffbfffe000000   fffffbfffe7fffff           8MB          [guard region]
> > >    fffffbfffe800000   fffffbffff7fffff          16MB          PCI I/O space
> > > @@ -50,8 +50,8 @@ AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support):
> > >    0000000000000000   000fffffffffffff           4PB          user
> > >    fff0000000000000   ffff7fffffffffff          ~4PB          kernel logical memory map
> > >   [fffd800000000000   ffff7fffffffffff]        512TB          [kasan shadow region]
> > > -  ffff800000000000   ffff800007ffffff         128MB          modules
> > > -  ffff800008000000   fffffbffefffffff         124TB          vmalloc
> > > +  ffff800000000000   ffff80007fffffff           2GB          modules
> > > +  ffff800080000000   fffffbffefffffff         124TB          vmalloc
> > >    fffffbfff0000000   fffffbfffdffffff         224MB          fixed mappings (top down)
> > >    fffffbfffe000000   fffffbfffe7fffff           8MB          [guard region]
> > >    fffffbfffe800000   fffffbffff7fffff          16MB          PCI I/O space
> > > diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> > > index 78e5163836a0ab95..b58c3127323e16c8 100644
> > > --- a/arch/arm64/include/asm/memory.h
> > > +++ b/arch/arm64/include/asm/memory.h
> > > @@ -46,7 +46,7 @@
> > >  #define KIMAGE_VADDR         (MODULES_END)
> > >  #define MODULES_END          (MODULES_VADDR + MODULES_VSIZE)
> > >  #define MODULES_VADDR                (_PAGE_END(VA_BITS_MIN))
> > > -#define MODULES_VSIZE                (SZ_128M)
> > > +#define MODULES_VSIZE                (SZ_2G)
> > >  #define VMEMMAP_START                (-(UL(1) << (VA_BITS - VMEMMAP_SHIFT)))
> > >  #define VMEMMAP_END          (VMEMMAP_START + VMEMMAP_SIZE)
> > >  #define PCI_IO_END           (VMEMMAP_START - SZ_8M)
> > > diff --git a/arch/arm64/include/asm/module.h b/arch/arm64/include/asm/module.h
> > > index 18734fed3bdd7609..98dae9f87b521f07 100644
> > > --- a/arch/arm64/include/asm/module.h
> > > +++ b/arch/arm64/include/asm/module.h
> > > @@ -31,9 +31,15 @@ u64 module_emit_veneer_for_adrp(struct module *mod, Elf64_Shdr *sechdrs,
> > >                               void *loc, u64 val);
> > >
> > >  #ifdef CONFIG_RANDOMIZE_BASE
> > > -extern u64 module_alloc_base;
> > > +extern u64 module_alloc_limit;
> > >  #else
> > > -#define module_alloc_base    ((u64)_etext - MODULES_VSIZE)
> > > +#define module_alloc_limit   MODULE_REF_END
> > > +#endif
> > > +
> > > +#ifdef CONFIG_ARM64_MODULE_PLTS
> > > +#define MODULE_REF_END               ((u64)_end)
> > > +#else
> > > +#define MODULE_REF_END               ((u64)_etext)
> > >  #endif
> >
> > I was initially a bit confused by this. I think it's a bit misleading for the
> > !CONFIG_ARM64_MODULE_PLTS case, since modules can still have data references
> > between _etext and _end, it's just that we're (hopefully) unlikely to have the
> > text be <128M with >2G of subsequent data.
> >
> 
> So the reason for this is that otherwise, with PLTs disabled, we lose
> (_end - _etext) worth of module space for no good reason, and this
> could potentially be a substantial chunk of space. However, when PLTs
> are enabled, we cannot safely use _etext as the upper bound, as _etext
> - SZ_2G may produce an address that is out of range for a PREL32
> relocation. 

I understood those two constraints, which are:

(1) For PREL32 references, the module must be within 2GiB of _end regardless of
    whether we use PLTs.

(2) For direct branches without PLTs, the module must be within 128MiB of
    _etext.

What I find confusing is having a single conditional MODULE_REF_END
definition: it implicitly relies on other properties being maintained
elsewhere when we allocate modules relative to it, and I don't think those
always align as we'd prefer.

Consider a config:

	CONFIG_MODULE_PLTS=y
	RANDOMIZE_BASE=n
	CONFIG_RANDOMIZE_MODULE_REGION_FULL=n

In that config, with your patch we'd have:

	#define module_alloc_limit	MODULE_REF_END
	#define MODULE_REF_END		((u64)_end)

In module_alloc(), our first attempt at allocating the module would be:

	p = __vmalloc_node_range(size, MODULE_ALIGN,
				 module_alloc_limit - SZ_128M,
				 module_alloc_limit, gfp_mask, PAGE_KERNEL,
				 VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
				 __builtin_return_address(0));

In this case, module_alloc_limit is '(u64)_end', so if there's a significant
quantity of data between _etext and _end we will fail to allocate in the
preferred 128M region that avoids PLTs more often than necessary, before
falling back to the 2G region that may require PLTs.

That's not a functional problem since we'll fall back to using PLTs, but I
don't think that's as intended.

Consider another config with:

	CONFIG_MODULE_PLTS=n
	RANDOMIZE_BASE=n
	CONFIG_RANDOMIZE_MODULE_REGION_FULL=n

In that config we'd have:

	#define module_alloc_limit	MODULE_REF_END
	#define MODULE_REF_END		((u64)_etext)

In module_alloc(), our only attempt at allocating the module would be:

	p = __vmalloc_node_range(size, MODULE_ALIGN,
				 module_alloc_limit - SZ_128M,
				 module_alloc_limit, gfp_mask, PAGE_KERNEL,
				 VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
				 __builtin_return_address(0));

Say I've built an incredibly unlikely kernel with 64M of text and 2G-96M of
data between _etext and _end. In this case, 'module_alloc_limit - SZ_128M'
would be 32M below '_end - SZ_2G', so PREL32 relocations could be out-of-range.

> However, in that case, we can tolerate the waste, so we can just use _end
> instead.

I get the idea, but as above, I don't think that's always correct.

> > I'd find this clearer if we could express the two constraints separately. e.g.
> > have something like:
> >
> > #define MODULE_DATA_REF_END             ((u64)_end)
> > #define MODULE_TEXT_REF_END             ((u64)_etext)
> >
> > That'd allow us to do something like the below, which I think would be clearer.
> >
> > u64 module_alloc_end;
> > u64 module_alloc_base_noplt;
> > u64 module_alloc_base_plt;
> >
> > /*
> >  * Call this somewhere after choosing the KASLR limits
> >  */
> > void module_limits_init(void)
> > {
> >         module_alloc_end = (u64)_stext;
> >
> >         /*
> >          * 32-bit relative data references must always fall within 2G of the
> >          * end of the kernel image.
> >          */
> >         module_alloc_base_plt = max(MODULE_VADDR, MODULE_DATA_REF_END - SZ_2G);
> >
> >         /*
> >          * Direct branches must be within 128M of the end of the kernel text.
> >          */
> >         module_alloc_base_noplt = max(module_alloc_base_plt,
> >                                       MODULE_TEXT_REF_END - SZ_128M);
> > }
> >
> 
> Currently, the randomization of the module region is independent from
> the randomization of the kernel itself, and once we incorporate that
> here, I don't think it will be any clearer tbh.

Ok; I've clearly missed that aspect, and I'll have to go page that in.

Thanks,
Mark.

> > void *module_alloc(unsigned long size)
> > {
> >         gfp_t gfp_mask = GFP_KERNEL;
> >         void *p = NULL;
> >
> >         if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
> >                 gfp_mask |= __GFP_NOWARN;
> >
> >         /*
> >          * By default, prefer to place modules such that they don't require
> >          * PLTs. With CONFIG_RANDOMIZE_MODULE_REGION_FULL=y we'll prefer to use
> >          * the entire range possible with PLTs, to get as much entropy as possible.
> >          */
> >         if (!IS_ENABLED(CONFIG_RANDOMIZE_MODULE_REGION_FULL)) {
> >                 p = __vmalloc_node_range(size, MODULE_ALIGN,
> >                                          module_base_noplt, module_end,
> >                                          gfp_mask, PAGE_KERNEL,
> >                                          VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
> >                                          __builtin_return_address(0));
> >         }
> >
> >         if (!p && IS_ENABLED(CONFIG_MODULE_PLTS)) {
> >                 p = __vmalloc_node_range(size, MODULE_ALIGN,
> >                                          module_base_plt, module_end,
> >                                          gfp_mask, PAGE_KERNEL,
> >                                          VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
> >                                          __builtin_return_address(0));
> >         }
> >
> >         [... play with KASAN here ...]
> > }
> >
> > Is that viable, or am I missing something?
> >
> 
> It looks correct to me, but I'll defer to others to decide which is preferable.
Ard Biesheuvel April 5, 2023, 1:16 p.m. UTC | #4
On Wed, 5 Apr 2023 at 13:38, Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Apr 04, 2023 at 06:07:04PM +0200, Ard Biesheuvel wrote:
> > On Tue, 4 Apr 2023 at 17:49, Mark Rutland <mark.rutland@arm.com> wrote:
> > >
> > > On Tue, Apr 04, 2023 at 03:54:37PM +0200, Ard Biesheuvel wrote:
...
> > > >  #ifdef CONFIG_RANDOMIZE_BASE
> > > > -extern u64 module_alloc_base;
> > > > +extern u64 module_alloc_limit;
> > > >  #else
> > > > -#define module_alloc_base    ((u64)_etext - MODULES_VSIZE)
> > > > +#define module_alloc_limit   MODULE_REF_END
> > > > +#endif
> > > > +
> > > > +#ifdef CONFIG_ARM64_MODULE_PLTS
> > > > +#define MODULE_REF_END               ((u64)_end)
> > > > +#else
> > > > +#define MODULE_REF_END               ((u64)_etext)
> > > >  #endif
> > >
> > > I was initially a bit confused by this. I think it's a bit misleading for the
> > > !CONFIG_ARM64_MODULE_PLTS case, since modules can still have data references
> > > between _etext and _end, it's just that we're (hopefully) unlikely to have the
> > > text be <128M with >2G of subsequent data.
> > >
> >
> > So the reason for this is that otherwise, with PLTs disabled, we lose
> > (_end - _etext) worth of module space for no good reason, and this
> > could potentially be a substantial chunk of space. However, when PLTs
> > are enabled, we cannot safely use _etext as the upper bound, as _etext
> > - SZ_2G may produce an address that is out of range for a PREL32
> > relocation.
>
> I understood those two constraints, which are:
>
> (1) For PREL32 references, the module must be within 2GiB of _end regardless of
>     whether we use PLTs.
>
> (2) For direct branches without PLTs, the module must be within 128MiB of
>     _etext.
>

To be pedantic, let's define it as

(1) for PREL32 references, the module region must be at most 2 GiB in
size and include the kernel range [_stext, _end), so that PREL32
references from any module to any other module or the kernel are
always within -/+ 2 GiB

(2) for CALL26 references, the module region must be at most 128 MiB
in size and include the kernel range [_stext, _etext) so that CALL26
references from any module to any other module or the kernel are
always within -/+ 128 MiB


> What I find confusing is having a single conditional MODULE_REF_END definition,
> as it implicitly relies on other properties being maintained elsewhere, when we
> try to allocate modules relative to this, and I don't think those always align
> as we'd prefer.
>
> Consider a config:
>
>         CONFIG_MODULE_PLTS=y
>         RANDOMIZE_BASE=n
>         CONFIG_RANDOMIZE_MODULE_REGION_FULL=n
>
> In that config, with your patch we'd have:
>
>         #define module_alloc_limit      MODULE_REF_END
>         #define MODULE_REF_END          ((u64)_end)
>
> In module alloc(), our first attempt at allocating the module would be:
>
>         p = __vmalloc_node_range(size, MODULE_ALIGN,
>                                  module_alloc_limit - SZ_128M,
>                                  module_alloc_limit, gfp_mask, PAGE_KERNEL,
>                                  VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
>                                  __builtin_return_address(0));
>
> In this case, module_alloc_limit is '(u64)_end', so if there's a significant
> quantity of data between _etext and _end we will fail to allocate in the
> preferred 128M region that avoids PLTs more often than necessary, before
> falling back to the 2G region that may require PLTs.
>
> That's not a functional problem since we'll fall back to using PLTs, but I
> don't think that's as intended.
>

The allocations occur bottom up, so we will fall back earlier than
strictly necessary, but only after exhausting a significant chunk of
the module region. I don't see that as a problem.


> Consider another config with:
>
>         CONFIG_MODULE_PLTS=n
>         RANDOMIZE_BASE=n
>         CONFIG_RANDOMIZE_MODULE_REGION_FULL=n
>
> In that config we'd have:
>
>         #define module_alloc_limit      MODULE_REF_END
>         #define MODULE_REF_END          ((u64)_etext)
>
> In module alloc(), our only attempt at allocating the module would be:
>
>         p = __vmalloc_node_range(size, MODULE_ALIGN,
>                                  module_alloc_limit - SZ_128M,
>                                  module_alloc_limit, gfp_mask, PAGE_KERNEL,
>                                  VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
>                                  __builtin_return_address(0));
>
> Say I've built an incredibly unlikely kernel with 64M of text and 2G-96M of
> data between _etext and _end. In this case, 'module_alloc_limit - SZ_128M'
> would be 32M below '_end - SZ_2G', so PREL32 relocations could be out-of-range.
>

I think that ~2 GiB kernel images have their own special set of
challenges, and this is probably not the nastiest one.

> > However, in that case, we can tolerate the waste, so we can just use _end
> > instead.
>
> I get the idea, but as above, I don't think that's always correct.
>
> > > I'd find this clearer if we could express the two constraints separately. e.g.
> > > have something like:
> > >
> > > #define MODULE_DATA_REF_END             ((u64)_end)
> > > #define MODULE_TEXT_REF_END             ((u64)_etext)
> > >
> > > That'd allow us to do something like the below, which I think would be clearer.
> > >
> > > u64 module_alloc_end;
> > > u64 module_alloc_base_noplt;
> > > u64 module_alloc_base_plt;
> > >
> > > /*
> > >  * Call this somewhere after choosing the KASLR limits
> > >  */
> > > void module_limits_init(void)
> > > {
> > >         module_alloc_end = (u64)_stext;
> > >
> > >         /*
> > >          * 32-bit relative data references must always fall within 2G of the
> > >          * end of the kernel image.
> > >          */
> > >         module_alloc_base_plt = max(MODULE_VADDR, MODULE_DATA_REF_END - SZ_2G);
> > >
> > >         /*
> > >          * Direct branches must be within 128M of the end of the kernel text.
> > >          */
> > >         module_alloc_base_noplt = max(module_alloc_base_plt,
> > >                                       MODULE_TEXT_REF_END - SZ_128M);
> > > }
> > >
> >
> > Currently, the randomization of the module region is independent from
> > the randomization of the kernel itself, and once we incorporate that
> > here, I don't think it will be any clearer tbh.
>
> Ok; I've clearly missed that aspect, and I'll have to go page that in.
>

ok
Mark Rutland May 2, 2023, 11:31 a.m. UTC | #5
On Wed, Apr 05, 2023 at 03:16:25PM +0200, Ard Biesheuvel wrote:
> On Wed, 5 Apr 2023 at 13:38, Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > On Tue, Apr 04, 2023 at 06:07:04PM +0200, Ard Biesheuvel wrote:
> > > On Tue, 4 Apr 2023 at 17:49, Mark Rutland <mark.rutland@arm.com> wrote:
> > > >
> > > > On Tue, Apr 04, 2023 at 03:54:37PM +0200, Ard Biesheuvel wrote:
> ...
> > > > >  #ifdef CONFIG_RANDOMIZE_BASE
> > > > > -extern u64 module_alloc_base;
> > > > > +extern u64 module_alloc_limit;
> > > > >  #else
> > > > > -#define module_alloc_base    ((u64)_etext - MODULES_VSIZE)
> > > > > +#define module_alloc_limit   MODULE_REF_END
> > > > > +#endif
> > > > > +
> > > > > +#ifdef CONFIG_ARM64_MODULE_PLTS
> > > > > +#define MODULE_REF_END               ((u64)_end)
> > > > > +#else
> > > > > +#define MODULE_REF_END               ((u64)_etext)
> > > > >  #endif
> > > >
> > > > I was initially a bit confused by this. I think it's a bit misleading for the
> > > > !CONFIG_ARM64_MODULE_PLTS case, since modules can still have data references
> > > > between _etext and _end, it's just that we're (hopefully) unlikely to have the
> > > > text be <128M with >2G of subsequent data.
> > > >
> > >
> > > So the reason for this is that otherwise, with PLTs disabled, we lose
> > > (_end - _etext) worth of module space for no good reason, and this
> > > could potentially be a substantial chunk of space. However, when PLTs
> > > are enabled, we cannot safely use _etext as the upper bound, as _etext
> > > - SZ_2G may produce an address that is out of range for a PREL32
> > > relocation.
> >
> > I understood those two constraints, which are:
> >
> > (1) For PREL32 references, the module must be within 2GiB of _end regardless of
> >     whether we use PLTs.
> >
> > (2) For direct branches without PLTs, the module must be within 128MiB of
> >     _etext.
> >
> 
> To be pedantic, let's define it as
> 
> (1) for PREL32 references, the module region must be at most 2 GiB in
> size and include the kernel range [_stext, _end), so that PREL32
> references from any module to any other module or the kernel are
> always within -/+ 2 GiB
> 
> (2) for CALL26 references, the module region must be at most 128 MiB
> in size and include the kernel range [_stext, _etext) so that CALL26
> references from any module to any other module or the kernel are
> always within -/+ 128 MiB

Yes, agreed.

As discussed in-person at Linaro connect, I'd prefer if we could have the
module logic in one place, and explicitly handle the (rare/unlikely) cases
where the kernel is too big to allow us to select a 128M or 2G region.

I've pushed a branch (atop v6.3 for now) to:

  https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/module-space-rework

Which does that (and I think handles the randomization aspects correctly). I
intend to rebase and post that come v6.4-rc1.

Thanks,
Mark.

> > What I find confusing is having a single conditional MODULE_REF_END definition,
> > as it implicitly relies on other properties being maintained elsewhere, when we
> > try to allocate modules relative to this, and I don't think those always align
> > as we'd prefer.
> >
> > Consider a config:
> >
> >         CONFIG_MODULE_PLTS=y
> >         RANDOMIZE_BASE=n
> >         CONFIG_RANDOMIZE_MODULE_REGION_FULL=n
> >
> > In that config, with your patch we'd have:
> >
> >         #define module_alloc_limit      MODULE_REF_END
> >         #define MODULE_REF_END          ((u64)_end)
> >
> > In module alloc(), our first attempt at allocating the module would be:
> >
> >         p = __vmalloc_node_range(size, MODULE_ALIGN,
> >                                  module_alloc_limit - SZ_128M,
> >                                  module_alloc_limit, gfp_mask, PAGE_KERNEL,
> >                                  VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
> >                                  __builtin_return_address(0));
> >
> > In this case, module_alloc_limit is '(u64)_end', so if there's a significant
> > quantity of data between _etext and _end we will fail to allocate in the
> > preferred 128M region that avoids PLTs more often than necessary, before
> > falling back to the 2G region that may require PLTs.
> >
> > That's not a functional problem since we'll fall back to using PLTs, but I
> > don't think that's as intended.
> >
> 
> The allocations occur bottom up, so we will fall back earlier than
> strictly necessary, but only after exhausting a significant chunk of
> the module region. I don't see that as a problem.
> 
> 
> > Consider another config with:
> >
> >         CONFIG_MODULE_PLTS=n
> >         RANDOMIZE_BASE=n
> >         CONFIG_RANDOMIZE_MODULE_REGION_FULL=n
> >
> > In that config we'd have:
> >
> >         #define module_alloc_limit      MODULE_REF_END
> >         #define MODULE_REF_END          ((u64)_etext)
> >
> > In module alloc(), our only attempt at allocating the module would be:
> >
> >         p = __vmalloc_node_range(size, MODULE_ALIGN,
> >                                  module_alloc_limit - SZ_128M,
> >                                  module_alloc_limit, gfp_mask, PAGE_KERNEL,
> >                                  VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
> >                                  __builtin_return_address(0));
> >
> > Say I've built an incredibly unlikely kernel with 64M of text and 2G-96M of
> > data between _etext and _end. In this case, 'module_alloc_limit - SZ_128M'
> > would be 32M below '_end - SZ_2G', so PREL32 relocations could be out-of-range.
> >
> 
> I think that ~2 GiB kernel images have their own special set of
> challenges, and this is probably not the nastiest one.
> 
> > > However, in that case, we can tolerate the waste, so we can just use _end
> > > instead.
> >
> > I get the idea, but as above, I don't think that's always correct.
> >
> > > > I'd find this clearer if we could express the two constraints separately. e.g.
> > > > have something like:
> > > >
> > > > #define MODULE_DATA_REF_END             ((u64)_end)
> > > > #define MODULE_TEXT_REF_END             ((u64)_etext)
> > > >
> > > > That'd allow us to do something like the below, which I think would be clearer.
> > > >
> > > > u64 module_alloc_end;
> > > > u64 module_alloc_base_noplt;
> > > > u64 module_alloc_base_plt;
> > > >
> > > > /*
> > > >  * Call this somewhere after choosing the KASLR limits
> > > >  */
> > > > void module_limits_init(void)
> > > > {
> > > >         module_alloc_end = (u64)_stext;
> > > >
> > > >         /*
> > > >          * 32-bit relative data references must always fall within 2G of the
> > > >          * end of the kernel image.
> > > >          */
> > > >         module_alloc_base_plt = max(MODULE_VADDR, MODULE_DATA_REF_END - SZ_2G);
> > > >
> > > >         /*
> > > >          * Direct branches must be within 128M of the end of the kernel text.
> > > >          */
> > > >         module_alloc_base_noplt = max(module_alloc_base_plt,
> > > >                                       MODULE_TEXT_REF_END - SZ_128M);
> > > > }
> > > >
> > >
> > > Currently, the randomization of the module region is independent from
> > > the randomization of the kernel itself, and once we incorporate that
> > > here, I don't think it will be any clearer tbh.
> >
> > Ok; I've clearly missed that aspect, and I'll have to go page that in.
> >
> 
> ok
diff mbox series

Patch

diff --git a/Documentation/arm64/memory.rst b/Documentation/arm64/memory.rst
index 2a641ba7be3b717a..55a55f30eed8a6ce 100644
--- a/Documentation/arm64/memory.rst
+++ b/Documentation/arm64/memory.rst
@@ -33,8 +33,8 @@  AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit)::
   0000000000000000	0000ffffffffffff	 256TB		user
   ffff000000000000	ffff7fffffffffff	 128TB		kernel logical memory map
  [ffff600000000000	ffff7fffffffffff]	  32TB		[kasan shadow region]
-  ffff800000000000	ffff800007ffffff	 128MB		modules
-  ffff800008000000	fffffbffefffffff	 124TB		vmalloc
+  ffff800000000000	ffff80007fffffff	   2GB		modules
+  ffff800080000000	fffffbffefffffff	 124TB		vmalloc
   fffffbfff0000000	fffffbfffdffffff	 224MB		fixed mappings (top down)
   fffffbfffe000000	fffffbfffe7fffff	   8MB		[guard region]
   fffffbfffe800000	fffffbffff7fffff	  16MB		PCI I/O space
@@ -50,8 +50,8 @@  AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support):
   0000000000000000	000fffffffffffff	   4PB		user
   fff0000000000000	ffff7fffffffffff	  ~4PB		kernel logical memory map
  [fffd800000000000	ffff7fffffffffff]	 512TB		[kasan shadow region]
-  ffff800000000000	ffff800007ffffff	 128MB		modules
-  ffff800008000000	fffffbffefffffff	 124TB		vmalloc
+  ffff800000000000	ffff80007fffffff	   2GB		modules
+  ffff800080000000	fffffbffefffffff	 124TB		vmalloc
   fffffbfff0000000	fffffbfffdffffff	 224MB		fixed mappings (top down)
   fffffbfffe000000	fffffbfffe7fffff	   8MB		[guard region]
   fffffbfffe800000	fffffbffff7fffff	  16MB		PCI I/O space
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index 78e5163836a0ab95..b58c3127323e16c8 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -46,7 +46,7 @@ 
 #define KIMAGE_VADDR		(MODULES_END)
 #define MODULES_END		(MODULES_VADDR + MODULES_VSIZE)
 #define MODULES_VADDR		(_PAGE_END(VA_BITS_MIN))
-#define MODULES_VSIZE		(SZ_128M)
+#define MODULES_VSIZE		(SZ_2G)
 #define VMEMMAP_START		(-(UL(1) << (VA_BITS - VMEMMAP_SHIFT)))
 #define VMEMMAP_END		(VMEMMAP_START + VMEMMAP_SIZE)
 #define PCI_IO_END		(VMEMMAP_START - SZ_8M)
diff --git a/arch/arm64/include/asm/module.h b/arch/arm64/include/asm/module.h
index 18734fed3bdd7609..98dae9f87b521f07 100644
--- a/arch/arm64/include/asm/module.h
+++ b/arch/arm64/include/asm/module.h
@@ -31,9 +31,15 @@  u64 module_emit_veneer_for_adrp(struct module *mod, Elf64_Shdr *sechdrs,
 				void *loc, u64 val);
 
 #ifdef CONFIG_RANDOMIZE_BASE
-extern u64 module_alloc_base;
+extern u64 module_alloc_limit;
 #else
-#define module_alloc_base	((u64)_etext - MODULES_VSIZE)
+#define module_alloc_limit	MODULE_REF_END
+#endif
+
+#ifdef CONFIG_ARM64_MODULE_PLTS
+#define MODULE_REF_END		((u64)_end)
+#else
+#define MODULE_REF_END		((u64)_etext)
 #endif
 
 struct plt_entry {
diff --git a/arch/arm64/kernel/kaslr.c b/arch/arm64/kernel/kaslr.c
index e7477f21a4c9d062..14e96c3f707a74a3 100644
--- a/arch/arm64/kernel/kaslr.c
+++ b/arch/arm64/kernel/kaslr.c
@@ -8,6 +8,7 @@ 
 #include <linux/init.h>
 #include <linux/libfdt.h>
 #include <linux/mm_types.h>
+#include <linux/module.h>
 #include <linux/sched.h>
 #include <linux/types.h>
 #include <linux/pgtable.h>
@@ -17,10 +18,11 @@ 
 #include <asm/kernel-pgtable.h>
 #include <asm/memory.h>
 #include <asm/mmu.h>
+#include <asm/module.h>
 #include <asm/sections.h>
 #include <asm/setup.h>
 
-u64 __ro_after_init module_alloc_base;
+u64 __ro_after_init module_alloc_limit = MODULE_REF_END;
 u16 __initdata memstart_offset_seed;
 
 struct arm64_ftr_override kaslr_feature_override __initdata;
@@ -30,12 +32,6 @@  static int __init kaslr_init(void)
 	u64 module_range;
 	u32 seed;
 
-	/*
-	 * Set a reasonable default for module_alloc_base in case
-	 * we end up running with module randomization disabled.
-	 */
-	module_alloc_base = (u64)_etext - MODULES_VSIZE;
-
 	if (kaslr_feature_override.val & kaslr_feature_override.mask & 0xf) {
 		pr_info("KASLR disabled on command line\n");
 		return 0;
@@ -69,24 +65,28 @@  static int __init kaslr_init(void)
 		 * resolved via PLTs. (Branches between modules will be
 		 * resolved normally.)
 		 */
-		module_range = SZ_2G - (u64)(_end - _stext);
-		module_alloc_base = max((u64)_end - SZ_2G, (u64)MODULES_VADDR);
+		module_range = SZ_2G;
 	} else {
 		/*
-		 * Randomize the module region by setting module_alloc_base to
-		 * a PAGE_SIZE multiple in the range [_etext - MODULES_VSIZE,
-		 * _stext) . This guarantees that the resulting region still
-		 * covers [_stext, _etext], and that all relative branches can
-		 * be resolved without veneers unless this region is exhausted
-		 * and we fall back to a larger 2GB window in module_alloc()
-		 * when ARM64_MODULE_PLTS is enabled.
+		 * Randomize the module region over a 128 MB window covering
+		 * the kernel text. This guarantees that the resulting region
+		 * still covers [_stext, _etext], and that all relative
+		 * branches can be resolved without veneers unless this region
+		 * is exhausted and we fall back to a larger 2GB window in
+		 * module_alloc() when ARM64_MODULE_PLTS is enabled.
 		 */
-		module_range = MODULES_VSIZE - (u64)(_etext - _stext);
+		module_range = SZ_128M;
 	}
 
+	/*
+	 * Subtract the size of the core kernel region that must be in range
+	 * for all loaded modules.
+	 */
+	module_range -= MODULE_REF_END - (u64)_stext;
+
 	/* use the lower 21 bits to randomize the base of the module region */
-	module_alloc_base += (module_range * (seed & ((1 << 21) - 1))) >> 21;
-	module_alloc_base &= PAGE_MASK;
+	module_alloc_limit += (module_range * (seed & ((1 << 21) - 1))) >> 21;
+	module_alloc_limit &= PAGE_MASK;
 
 	return 0;
 }
diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
index 5af4975caeb58ff7..aa61493957c010b2 100644
--- a/arch/arm64/kernel/module.c
+++ b/arch/arm64/kernel/module.c
@@ -24,7 +24,6 @@ 
 
 void *module_alloc(unsigned long size)
 {
-	u64 module_alloc_end = module_alloc_base + MODULES_VSIZE;
 	gfp_t gfp_mask = GFP_KERNEL;
 	void *p;
 
@@ -32,33 +31,34 @@  void *module_alloc(unsigned long size)
 	if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
 		gfp_mask |= __GFP_NOWARN;
 
-	if (IS_ENABLED(CONFIG_KASAN_GENERIC) ||
-	    IS_ENABLED(CONFIG_KASAN_SW_TAGS))
-		/* don't exceed the static module region - see below */
-		module_alloc_end = MODULES_END;
-
-	p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
-				module_alloc_end, gfp_mask, PAGE_KERNEL, VM_DEFER_KMEMLEAK,
-				NUMA_NO_NODE, __builtin_return_address(0));
+	/*
+	 * First, try to allocate from the 128 MB region just below the limit.
+	 * If KASLR is disabled, or CONFIG_RANDOMIZE_MODULE_REGION_FULL is not
+	 * set, this will produce an allocation that allows all relative
+	 * branches into the kernel text to be resolved without the need for
+	 * veneers (PLTs). If CONFIG_RANDOMIZE_MODULE_REGION_FULL is set, this
+	 * 128 MB window might not cover the kernel text, but branches between
+	 * modules will still be in relative branching range.
+	 */
+	p = __vmalloc_node_range(size, MODULE_ALIGN,
+				 module_alloc_limit - SZ_128M,
+				 module_alloc_limit, gfp_mask, PAGE_KERNEL,
+				 VM_DEFER_KMEMLEAK, NUMA_NO_NODE,
+				 __builtin_return_address(0));
 
-	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
-	    (IS_ENABLED(CONFIG_KASAN_VMALLOC) ||
-	     (!IS_ENABLED(CONFIG_KASAN_GENERIC) &&
-	      !IS_ENABLED(CONFIG_KASAN_SW_TAGS))))
-		/*
-		 * KASAN without KASAN_VMALLOC can only deal with module
-		 * allocations being served from the reserved module region,
-		 * since the remainder of the vmalloc region is already
-		 * backed by zero shadow pages, and punching holes into it
-		 * is non-trivial. Since the module region is not randomized
-		 * when KASAN is enabled without KASAN_VMALLOC, it is even
-		 * less likely that the module region gets exhausted, so we
-		 * can simply omit this fallback in that case.
-		 */
-		p = __vmalloc_node_range(size, MODULE_ALIGN, module_alloc_base,
-				module_alloc_base + SZ_2G, GFP_KERNEL,
-				PAGE_KERNEL, 0, NUMA_NO_NODE,
-				__builtin_return_address(0));
+	/*
+	 * If the prior allocation failed, and we have configured support for
+	 * fixing up out-of-range relative branches through the use of PLTs,
+	 * fall back to a 2 GB window for module allocations. This is the
+	 * maximum we can support, due to the use of 32-bit place relative
+	 * symbol references, which cannot be fixed up using PLTs.
+	 */
+	if (!p && IS_ENABLED(CONFIG_ARM64_MODULE_PLTS))
+		p = __vmalloc_node_range(size, MODULE_ALIGN,
+					 module_alloc_limit - SZ_2G,
+					 module_alloc_limit, GFP_KERNEL,
+					 PAGE_KERNEL, 0, NUMA_NO_NODE,
+					 __builtin_return_address(0));
 
 	if (p && (kasan_alloc_module_shadow(p, size, gfp_mask) < 0)) {
 		vfree(p);