
[v9,3/3] riscv: Use PUD/P4D/PGD pages for the linear mapping

Message ID 20230324155421.271544-4-alexghiti@rivosinc.com (mailing list archive)
State Accepted
Commit 3335068f87217ea59d08f462187dc856652eea15
Series riscv: Use PUD/P4D/PGD pages for the linear mapping

Checks

Context Check Description
conchuod/cover_letter success Series has a cover letter
conchuod/tree_selection success Guessed tree name to be for-next
conchuod/fixes_present success Fixes tag not required for -next series
conchuod/maintainers_pattern success MAINTAINERS pattern errors before the patch: 1 and now 1
conchuod/verify_signedoff success Signed-off-by tag matches author and committer
conchuod/kdoc success Errors and warnings before: 0 this patch: 0
conchuod/build_rv64_clang_allmodconfig success Errors and warnings before: 2310 this patch: 2310
conchuod/module_param success Was 0 now: 0
conchuod/build_rv64_gcc_allmodconfig success Errors and warnings before: 17794 this patch: 17794
conchuod/build_rv32_defconfig success Build OK
conchuod/dtb_warn_rv64 success Errors and warnings before: 3 this patch: 3
conchuod/header_inline success No static functions without inline keyword in header files
conchuod/checkpatch warning WARNING: Do not crash the kernel unless it is absolutely unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants
conchuod/source_inline success Was 0 now: 0
conchuod/build_rv64_nommu_k210_defconfig success Build OK
conchuod/verify_fixes success No Fixes tag
conchuod/build_rv64_nommu_virt_defconfig success Build OK

Commit Message

Alexandre Ghiti March 24, 2023, 3:54 p.m. UTC
During the early page table creation, we used to set the mapping for
PAGE_OFFSET to the kernel load address: but the kernel load address is
always offset by PMD_SIZE, which makes it impossible to use PUD/P4D/PGD
pages, as this physical address is not aligned on a PUD/P4D/PGD boundary
(whereas PAGE_OFFSET is).

But actually we don't have to establish this mapping (i.e. set
va_pa_offset) that early in the boot process, because:

- first, setup_vm installs a temporary kernel mapping and, among other
  things, discovers the system memory,
- then, setup_vm_final creates the final kernel mapping and takes
  advantage of the discovered system memory to create the linear
  mapping.

During the first phase, we don't know the start of the system memory, so
until the second phase is finished we can't use the linear mapping at all,
and phys_to_virt/virt_to_phys translations must not be used, because they
would produce a different translation from the 'real' one once the final
mapping is installed.

So here we simply delay the initialization of va_pa_offset until after the
system memory discovery. But to make sure no one uses the linear mapping
before then, we add guards enabled by the DEBUG_VIRTUAL config.

Finally, we can use PUD/P4D/PGD hugepages when possible, which results in
better TLB utilization.

Note that:
- this does not apply to rv32, as the kernel mapping lies in the linear
  mapping.
- we rely on the firmware to protect itself using PMP.

Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Acked-by: Rob Herring <robh@kernel.org> # DT bits
---
 arch/riscv/include/asm/page.h | 16 ++++++++++
 arch/riscv/mm/init.c          | 58 +++++++++++++++++++++++++++++++----
 arch/riscv/mm/physaddr.c      | 16 ++++++++++
 drivers/of/fdt.c              | 11 ++++---
 4 files changed, 90 insertions(+), 11 deletions(-)
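
For illustration, here is a minimal standalone sketch of the mapping-size
selection that best_map_size implements after this patch. The sizes below
assume an Sv57-style layout (PMD 2M, PUD 1G, P4D 512G, PGD 256T) and the
sample addresses are hypothetical, not taken from the series:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE	(1UL << 12)
#define PMD_SIZE	(1UL << 21)
#define PUD_SIZE	(1UL << 30)
#define P4D_SIZE	(1UL << 39)
#define PGDIR_SIZE	(1UL << 48)

/* Pick the largest page size whose alignment and span both fit. */
static uint64_t best_map_size(uint64_t base, uint64_t size)
{
	if (!(base & (PGDIR_SIZE - 1)) && size >= PGDIR_SIZE)
		return PGDIR_SIZE;
	if (!(base & (P4D_SIZE - 1)) && size >= P4D_SIZE)
		return P4D_SIZE;
	if (!(base & (PUD_SIZE - 1)) && size >= PUD_SIZE)
		return PUD_SIZE;
	if (!(base & (PMD_SIZE - 1)) && size >= PMD_SIZE)
		return PMD_SIZE;
	return PAGE_SIZE;
}

int main(void)
{
	/* Old scheme: base is the kernel load address, offset by PMD_SIZE,
	 * so only 2M-aligned -> a 4G region maps with PMD pages at best. */
	printf("0x%llx\n",
	       (unsigned long long)best_map_size(0x80200000UL, 4UL << 30));

	/* New scheme: base is the start of DRAM, 1G-aligned, so the same
	 * 4G region can now use PUD mappings. */
	printf("0x%llx\n",
	       (unsigned long long)best_map_size(0x80000000UL, 4UL << 30));
	return 0;
}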

Comments

Andrew Jones March 27, 2023, 9:39 a.m. UTC | #1
On Fri, Mar 24, 2023 at 04:54:21PM +0100, Alexandre Ghiti wrote:
> [...]
> @@ -1118,6 +1155,15 @@ static void __init create_linear_mapping_page_table(void)
>  
>  		create_linear_mapping_range(start, end);
>  	}
> +
> +#ifdef CONFIG_STRICT_KERNEL_RWX
> +	create_linear_mapping_range(ktext_start, ktext_start + ktext_size);
> +	create_linear_mapping_range(krodata_start,
> +				    krodata_start + krodata_size);

Just for my own education, it looks to me like the rodata is left writable
until the end of start_kernel(), when mark_rodata_ro() is called. Is that
correct?

Thanks,
drew

Alexandre Ghiti March 27, 2023, 11:15 a.m. UTC | #2
Hi Andrew,

On Mon, Mar 27, 2023 at 11:39 AM Andrew Jones <ajones@ventanamicro.com> wrote:
>
> On Fri, Mar 24, 2023 at 04:54:21PM +0100, Alexandre Ghiti wrote:
> > [...]
> > @@ -1118,6 +1155,15 @@ static void __init create_linear_mapping_page_table(void)
> >
> >               create_linear_mapping_range(start, end);
> >       }
> > +
> > +#ifdef CONFIG_STRICT_KERNEL_RWX
> > +     create_linear_mapping_range(ktext_start, ktext_start + ktext_size);
> > +     create_linear_mapping_range(krodata_start,
> > +                                 krodata_start + krodata_size);
>
> Just for my own education, it looks to me like the rodata is left writable
> until the end of start_kernel(), when mark_rodata_ro() is called. Is that
> correct?

Yes, right before init is triggered, certainly that late because the
rodata section embeds the "__ro_after_init" variables.
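
To illustrate the point with a hedged, hypothetical example (not taken
from this series): a __ro_after_init variable lives in the rodata section
but is legitimately written once by built-in code during boot, so the
section can only be made read-only after all initcalls have run:

/* Hypothetical built-in (non-modular) kernel snippet */
#include <linux/cache.h>
#include <linux/init.h>

static unsigned long example_boot_value __ro_after_init;

static int __init example_setup(void)
{
	/* Legal: mark_rodata_ro() has not run yet, rodata is still writable */
	example_boot_value = 42;
	return 0;
}
early_initcall(example_setup);

/* After mark_rodata_ro() runs (right before init is spawned), any further
 * write to example_boot_value would fault. */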


Andrew Jones March 27, 2023, 11:37 a.m. UTC | #3
On Mon, Mar 27, 2023 at 01:15:43PM +0200, Alexandre Ghiti wrote:
> Hi Andrew,
> 
> On Mon, Mar 27, 2023 at 11:39 AM Andrew Jones <ajones@ventanamicro.com> wrote:
> >
> > On Fri, Mar 24, 2023 at 04:54:21PM +0100, Alexandre Ghiti wrote:
> > > [...]
> > > @@ -1118,6 +1155,15 @@ static void __init create_linear_mapping_page_table(void)
> > >
> > >               create_linear_mapping_range(start, end);
> > >       }
> > > +
> > > +#ifdef CONFIG_STRICT_KERNEL_RWX
> > > +     create_linear_mapping_range(ktext_start, ktext_start + ktext_size);
> > > +     create_linear_mapping_range(krodata_start,
> > > +                                 krodata_start + krodata_size);
> >
> > Just for my own education, it looks to me like the rodata is left writable
> > until the end of start_kernel(), when mark_rodata_ro() is called. Is that
> > correct?
> 
> Yes, right before init is triggered, certainly that late because the
> rodata section embeds the "__ro_after_init" variables.

Ah, that indeed helps clarify why. Sounds good.

Thanks,
drew
Andrew Jones March 27, 2023, 11:37 a.m. UTC | #4
On Fri, Mar 24, 2023 at 04:54:21PM +0100, Alexandre Ghiti wrote:
> [...]

Reviewed-by: Andrew Jones <ajones@ventanamicro.com>

Thanks,
drew
Anup Patel March 27, 2023, 12:13 p.m. UTC | #5
On Fri, Mar 24, 2023 at 9:27 PM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
>
> [...]
> Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> Acked-by: Rob Herring <robh@kernel.org> # DT bits

Reviewed-by: Anup Patel <anup@brainfault.org>
Tested-by: Anup Patel <anup@brainfault.org>

Regards,
Anup


Patch

diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 8dc686f549b6..ea1a0e237211 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,6 +90,14 @@ typedef struct page *pgtable_t;
 #define PTE_FMT "%08lx"
 #endif
 
+#ifdef CONFIG_64BIT
+/*
+ * We override this value as its generic definition uses __pa too early in
+ * the boot process (before kernel_map.va_pa_offset is set).
+ */
+#define MIN_MEMBLOCK_ADDR      0
+#endif
+
 #ifdef CONFIG_MMU
 #define ARCH_PFN_OFFSET		(PFN_DOWN((unsigned long)phys_ram_base))
 #else
@@ -121,7 +129,11 @@ extern phys_addr_t phys_ram_base;
 #define is_linear_mapping(x)	\
 	((x) >= PAGE_OFFSET && (!IS_ENABLED(CONFIG_64BIT) || (x) < PAGE_OFFSET + KERN_VIRT_SIZE))
 
+#ifndef CONFIG_DEBUG_VIRTUAL
 #define linear_mapping_pa_to_va(x)	((void *)((unsigned long)(x) + kernel_map.va_pa_offset))
+#else
+void *linear_mapping_pa_to_va(unsigned long x);
+#endif
 #define kernel_mapping_pa_to_va(y)	({					\
 	unsigned long _y = (unsigned long)(y);					\
 	(IS_ENABLED(CONFIG_XIP_KERNEL) && _y < phys_ram_base) ?			\
@@ -130,7 +142,11 @@ extern phys_addr_t phys_ram_base;
 	})
 #define __pa_to_va_nodebug(x)		linear_mapping_pa_to_va(x)
 
+#ifndef CONFIG_DEBUG_VIRTUAL
 #define linear_mapping_va_to_pa(x)	((unsigned long)(x) - kernel_map.va_pa_offset)
+#else
+phys_addr_t linear_mapping_va_to_pa(unsigned long x);
+#endif
 #define kernel_mapping_va_to_pa(y) ({						\
 	unsigned long _y = (unsigned long)(y);					\
 	(IS_ENABLED(CONFIG_XIP_KERNEL) && _y < kernel_map.virt_addr + XIP_OFFSET) ? \
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 3b37d8606920..f803671d18b2 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -213,6 +213,14 @@ static void __init setup_bootmem(void)
 	phys_ram_end = memblock_end_of_DRAM();
 	if (!IS_ENABLED(CONFIG_XIP_KERNEL))
 		phys_ram_base = memblock_start_of_DRAM();
+
+	/*
+	 * In 64-bit, any use of __va/__pa before this point is wrong as we
+	 * did not know the start of DRAM before.
+	 */
+	if (IS_ENABLED(CONFIG_64BIT))
+		kernel_map.va_pa_offset = PAGE_OFFSET - phys_ram_base;
+
 	/*
 	 * memblock allocator is not aware of the fact that last 4K bytes of
 	 * the addressable memory can not be mapped because of IS_ERR_VALUE
@@ -667,9 +675,16 @@ void __init create_pgd_mapping(pgd_t *pgdp,
 
 static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
 {
-	/* Upgrade to PMD_SIZE mappings whenever possible */
-	base &= PMD_SIZE - 1;
-	if (!base && size >= PMD_SIZE)
+	if (!(base & (PGDIR_SIZE - 1)) && size >= PGDIR_SIZE)
+		return PGDIR_SIZE;
+
+	if (!(base & (P4D_SIZE - 1)) && size >= P4D_SIZE)
+		return P4D_SIZE;
+
+	if (!(base & (PUD_SIZE - 1)) && size >= PUD_SIZE)
+		return PUD_SIZE;
+
+	if (!(base & (PMD_SIZE - 1)) && size >= PMD_SIZE)
 		return PMD_SIZE;
 
 	return PAGE_SIZE;
@@ -978,11 +993,22 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 	set_satp_mode();
 #endif
 
-	kernel_map.va_pa_offset = PAGE_OFFSET - kernel_map.phys_addr;
+	/*
+	 * In 64-bit, we defer the setup of va_pa_offset to setup_bootmem,
+	 * where we have the system memory layout: this allows us to align
+	 * the physical and virtual mappings and then make use of PUD/P4D/PGD
+	 * for the linear mapping. This is only possible because the kernel
+	 * mapping lies outside the linear mapping.
+	 * In 32-bit however, as the kernel resides in the linear mapping,
+	 * setup_vm_final can not change the mapping established here,
+	 * otherwise the same kernel addresses would get mapped to different
+	 * physical addresses (if the start of dram is different from the
+	 * kernel physical address start).
+	 */
+	kernel_map.va_pa_offset = IS_ENABLED(CONFIG_64BIT) ?
+				0UL : PAGE_OFFSET - kernel_map.phys_addr;
 	kernel_map.va_kernel_pa_offset = kernel_map.virt_addr - kernel_map.phys_addr;
 
-	phys_ram_base = kernel_map.phys_addr;
-
 	/*
 	 * The default maximal physical memory size is KERN_VIRT_SIZE for 32-bit
 	 * kernel, whereas for 64-bit kernel, the end of the virtual address
@@ -1106,6 +1132,17 @@ static void __init create_linear_mapping_page_table(void)
 	phys_addr_t start, end;
 	u64 i;
 
+#ifdef CONFIG_STRICT_KERNEL_RWX
+	phys_addr_t ktext_start = __pa_symbol(_start);
+	phys_addr_t ktext_size = __init_data_begin - _start;
+	phys_addr_t krodata_start = __pa_symbol(__start_rodata);
+	phys_addr_t krodata_size = _data - __start_rodata;
+
+	/* Isolate kernel text and rodata so they don't get mapped with a PUD */
+	memblock_mark_nomap(ktext_start,  ktext_size);
+	memblock_mark_nomap(krodata_start, krodata_size);
+#endif
+
 	/* Map all memory banks in the linear mapping */
 	for_each_mem_range(i, &start, &end) {
 		if (start >= end)
@@ -1118,6 +1155,15 @@ static void __init create_linear_mapping_page_table(void)
 
 		create_linear_mapping_range(start, end);
 	}
+
+#ifdef CONFIG_STRICT_KERNEL_RWX
+	create_linear_mapping_range(ktext_start, ktext_start + ktext_size);
+	create_linear_mapping_range(krodata_start,
+				    krodata_start + krodata_size);
+
+	memblock_clear_nomap(ktext_start,  ktext_size);
+	memblock_clear_nomap(krodata_start, krodata_size);
+#endif
 }
 
 static void __init setup_vm_final(void)
diff --git a/arch/riscv/mm/physaddr.c b/arch/riscv/mm/physaddr.c
index 9b18bda74154..18706f457da7 100644
--- a/arch/riscv/mm/physaddr.c
+++ b/arch/riscv/mm/physaddr.c
@@ -33,3 +33,19 @@ phys_addr_t __phys_addr_symbol(unsigned long x)
 	return __va_to_pa_nodebug(x);
 }
 EXPORT_SYMBOL(__phys_addr_symbol);
+
+phys_addr_t linear_mapping_va_to_pa(unsigned long x)
+{
+	BUG_ON(!kernel_map.va_pa_offset);
+
+	return ((unsigned long)(x) - kernel_map.va_pa_offset);
+}
+EXPORT_SYMBOL(linear_mapping_va_to_pa);
+
+void *linear_mapping_pa_to_va(unsigned long x)
+{
+	BUG_ON(!kernel_map.va_pa_offset);
+
+	return ((void *)((unsigned long)(x) + kernel_map.va_pa_offset));
+}
+EXPORT_SYMBOL(linear_mapping_pa_to_va);
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index d1a68b6d03b3..d14735a81301 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -887,12 +887,13 @@ const void * __init of_flat_dt_match_machine(const void *default_match,
 static void __early_init_dt_declare_initrd(unsigned long start,
 					   unsigned long end)
 {
-	/* ARM64 would cause a BUG to occur here when CONFIG_DEBUG_VM is
-	 * enabled since __va() is called too early. ARM64 does make use
-	 * of phys_initrd_start/phys_initrd_size so we can skip this
-	 * conversion.
+	/*
+	 * __va() is not yet available this early on some platforms. In that
+	 * case, the platform uses phys_initrd_start/phys_initrd_size instead
+	 * and does the VA conversion itself.
 	 */
-	if (!IS_ENABLED(CONFIG_ARM64)) {
+	if (!IS_ENABLED(CONFIG_ARM64) &&
+	    !(IS_ENABLED(CONFIG_RISCV) && IS_ENABLED(CONFIG_64BIT))) {
 		initrd_start = (unsigned long)__va(start);
 		initrd_end = (unsigned long)__va(end);
 		initrd_below_start_ok = 1;
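
As a final, hedged illustration of the DEBUG_VIRTUAL guards added above:
with CONFIG_DEBUG_VIRTUAL=y, __va() now resolves to the out-of-line
linear_mapping_pa_to_va(), so a translation attempted before setup_bootmem()
has set kernel_map.va_pa_offset trips the BUG_ON instead of silently
returning a bogus virtual address. The call site below is hypothetical,
not part of the series:

/* Hypothetical early boot code, kernel built with CONFIG_DEBUG_VIRTUAL=y */
#include <linux/init.h>
#include <linux/printk.h>
#include <linux/types.h>
#include <asm/page.h>

static void __init early_probe(phys_addr_t pa)
{
	/*
	 * If this runs before setup_bootmem(), linear_mapping_pa_to_va()
	 * hits BUG_ON(!kernel_map.va_pa_offset), flagging the premature
	 * use of the linear mapping.
	 */
	void *va = __va(pa);

	pr_info("phys %pa maps to %p\n", &pa, va);
}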