diff mbox series

[v5,2/4] xen/riscv: introduce setup_initial_pages

Message ID 5b27693bcdf6d64381314aeef72cfe03dee8d73a.1681918194.git.oleksii.kurochko@gmail.com (mailing list archive)
State Superseded
Headers show
Series enable MMU for RISC-V | expand

Commit Message

Oleksii April 19, 2023, 3:42 p.m. UTC
The idea was taken from xvisor but the following changes
were done:
* Use only a minimal part of the code enough to enable MMU
* rename {_}setup_initial_pagetables functions
* add an argument for setup_initial_mapping to allow
  setting PTE flags.
* update setup_initial_pagetables function to map sections
  with correct PTE flags.
* Rewrite enable_mmu() to C.
* map the linker address range to the load address range without
  a 1:1 mapping. It will be 1:1 only when
  load_start_addr is equal to linker_start_addr.
* add safety checks such as:
  * Xen size is less than page size
  * linker addresses range doesn't overlap load addresses
    range
* Rework macros {THIRD,SECOND,FIRST,ZEROETH}_{SHIFT,MASK}
* change PTE_LEAF_DEFAULT to RW instead of RWX.
* Remove phys_offset as it is not used now
* Remove alignment of {map, pa}_start &= XEN_PT_LEVEL_MAP_MASK(0);
  in setup_initial_mapping() as they should already be aligned.
  Add a check that {map, pa}_start are aligned.
* Remove clear_pagetables() as initial pagetables will be
  zeroed during bss initialization
* Remove __attribute__((section(".entry")) for setup_initial_pagetables()
  as there is no such section in xen.lds.S
* Update the argument of pte_is_valid() to "const pte_t *p"
* Add check that Xen's load address is aligned at 4k boundary
* Refactor setup_initial_pagetables() so it maps the linker
  address range to the load address range, and afterwards sets the
  needed permissions for specific sections (such as .text, .rodata, etc.);
  otherwise RW permissions are set by default.
* Add function to check that requested SATP_MODE is supported

Origin: git@github.com:xvisor/xvisor.git 9be2fdd7
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V5:
	* Indent fields of pte_t struct
	* Rename addr_to_pte() and ppn_to_paddr() to match their content
---
Changes in V4:
  * use GB() macros instead of defining SZ_1G
  * hardcode XEN_VIRT_START and add comment (ADDRESS_SPACE_END + 1 - GB(1))
  * remove unnecessary 'asm' word at the end of #error
  * encapsulate pte_t definition in a struct
  * rename addr_to_ppn() to ppn_to_paddr().
  * change type of paddr argument from const unsigned long to paddr_t
  * pte_to_paddr() update prototype.
  * calculate size of Xen binary based on an amount of page tables
  * use unsigned int instead of uint32_t as fixed-width types
    aren't warranted here.
  * remove extern of bss_{start,end} as they aren't used in mm.c anymore
  * fix code style
  * add an argument to the HANDLE_PGTBL macro instead of the curr_lvl_num variable
  * mark enable_mmu() as noinline to prevent it from being inlined under
    link-time optimization, because of the nature of enable_mmu()
  * add function to check that SATP_MODE is supported.
  * update the commit message
  * update setup_initial_pagetables to set correct PTE flags in one pass
    instead of calling setup_pte_permissions after setup_initial_pagetables()
    as setup_initial_pagetables() isn't used to change permission flags.
---
Changes in V3:
 - update definition of pte_t structure to have a proper size of pte_t
   in case of RV32.
 - update asm/mm.h with new functions and remove unnecessary 'extern'.
 - remove LEVEL_* macros as only XEN_PT_LEVEL_* are enough.
 - update paddr_to_pte() to receive permissions as an argument.
 - add a check that map_start & pa_start are properly aligned.
 - move  defines PAGETABLE_ORDER, PAGETABLE_ENTRIES, PTE_PPN_SHIFT to
   <asm/page-bits.h>
 - Rename PTE_SHIFT to PTE_PPN_SHIFT
 - refactor setup_initial_pagetables: map all LINK addresses to LOAD addresses
   and afterwards set up PTE permissions for sections; update the check that
   linker and load addresses don't overlap.
 - refactor setup_initial_mapping: allocate pagetable 'dynamically' if it is
   necessary.
 - rewrite enable_mmu in C; add the check that map_start and pa_start are
   aligned on 4k boundary.
 - update the comment for the setup_initial_pagetables function
 - Add RV_STAGE1_MODE to support different MMU modes
 - set XEN_VIRT_START very high to not overlap with load address range
 - align bss section
---
Changes in V2:
 * update the commit message:
 * Remove {ZEROETH,FIRST,...}_{SHIFT,MASK,SIZE,...} and
   introduce XEN_PT_LEVEL_*() and LEVEL_* instead
 * Rework pt_linear_offset() and pt_index() based on XEN_PT_LEVEL_*()
 * Remove clear_pagetables() functions as pagetables were zeroed during
   .bss initialization
 * Rename _setup_initial_pagetables() to setup_initial_mapping()
 * Make PTE_DEFAULT equal to RX.
 * Update prototype of setup_initial_mapping(..., bool writable) -> 
   setup_initial_mapping(..., UL flags)  
 * Update calls of setup_initial_mapping according to new prototype
 * Remove unnecessary call of:
   _setup_initial_pagetables(..., load_addr_start, load_addr_end, load_addr_start, ...)
 * Define index* in the loop of setup_initial_mapping
 * Remove attribute "__attribute__((section(".entry")))" for setup_initial_pagetables()
   as we don't have such section
 * make arguments of paddr_to_pte() and pte_is_valid() as const.
 * make xen_second_pagetable static.
 * use <xen/kernel.h> instead of declaring extern unsigned long _stext, _etext, _srodata, _erodata
 * update  'extern unsigned long __init_begin' to 'extern unsigned long __init_begin[]'
 * use aligned() instead of "__attribute__((__aligned__(PAGE_SIZE)))"
 * set __section(".bss.page_aligned") for page tables arrays
 * fix indentation
 * Change '__attribute__((section(".entry")))' to '__init'
 * Remove phys_offset as it isn't used now.
 * Remove alignment of {map, pa}_start &= XEN_PT_LEVEL_MAP_MASK(0); in
   setup_initial_mapping() as they should already be aligned.
 * Remove clear_pagetables() as initial pagetables will be
   zeroed during bss initialization
 * Remove __attribute__((section(".entry")) for setup_initial_pagetables()
   as there is no such section in xen.lds.S
 * Update the argument of pte_is_valid() to "const pte_t *p"
---

 xen/arch/riscv/Makefile                |   1 +
 xen/arch/riscv/include/asm/config.h    |  12 +-
 xen/arch/riscv/include/asm/mm.h        |   9 +
 xen/arch/riscv/include/asm/page-bits.h |  10 +
 xen/arch/riscv/include/asm/page.h      |  63 +++++
 xen/arch/riscv/mm.c                    | 319 +++++++++++++++++++++++++
 xen/arch/riscv/riscv64/head.S          |   2 +
 xen/arch/riscv/setup.c                 |  11 +
 xen/arch/riscv/xen.lds.S               |   4 +
 9 files changed, 430 insertions(+), 1 deletion(-)
 create mode 100644 xen/arch/riscv/include/asm/mm.h
 create mode 100644 xen/arch/riscv/include/asm/page.h
 create mode 100644 xen/arch/riscv/mm.c

Comments

Jan Beulich April 20, 2023, 2:36 p.m. UTC | #1
On 19.04.2023 17:42, Oleksii Kurochko wrote:
> --- a/xen/arch/riscv/include/asm/page-bits.h
> +++ b/xen/arch/riscv/include/asm/page-bits.h
> @@ -1,6 +1,16 @@
>  #ifndef __RISCV_PAGE_BITS_H__
>  #define __RISCV_PAGE_BITS_H__
>  
> +#ifdef CONFIG_RISCV_64
> +#define PAGETABLE_ORDER         (9)
> +#else /* CONFIG_RISCV_32 */
> +#define PAGETABLE_ORDER         (10)
> +#endif
> +
> +#define PAGETABLE_ENTRIES       (1 << PAGETABLE_ORDER)
> +
> +#define PTE_PPN_SHIFT           10
> +
>  #define PAGE_SHIFT              12 /* 4 KiB Pages */
>  #define PADDR_BITS              56 /* 44-bit PPN */

Personally I think these two would better remain at the top. But maybe
that's just me ...

> --- /dev/null
> +++ b/xen/arch/riscv/include/asm/page.h
> @@ -0,0 +1,63 @@
> +#ifndef _ASM_RISCV_PAGE_H
> +#define _ASM_RISCV_PAGE_H
> +
> +#include <xen/const.h>
> +#include <xen/types.h>
> +
> +#define VPN_MASK                    ((unsigned long)(PAGETABLE_ENTRIES - 1))
> +
> +#define XEN_PT_LEVEL_ORDER(lvl)     ((lvl) * PAGETABLE_ORDER)
> +#define XEN_PT_LEVEL_SHIFT(lvl)     (XEN_PT_LEVEL_ORDER(lvl) + PAGE_SHIFT)
> +#define XEN_PT_LEVEL_SIZE(lvl)      (_AT(paddr_t, 1) << XEN_PT_LEVEL_SHIFT(lvl))
> +#define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
> +#define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))
> +
> +#define PTE_VALID                   BIT(0, UL)
> +#define PTE_READABLE                BIT(1, UL)
> +#define PTE_WRITABLE                BIT(2, UL)
> +#define PTE_EXECUTABLE              BIT(3, UL)
> +#define PTE_USER                    BIT(4, UL)
> +#define PTE_GLOBAL                  BIT(5, UL)
> +#define PTE_ACCESSED                BIT(6, UL)
> +#define PTE_DIRTY                   BIT(7, UL)
> +#define PTE_RSW                     (BIT(8, UL) | BIT(9, UL))
> +
> +#define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
> +#define PTE_TABLE                   (PTE_VALID)
> +
> +/* Calculate the offsets into the pagetables for a given VA */
> +#define pt_linear_offset(lvl, va)   ((va) >> XEN_PT_LEVEL_SHIFT(lvl))
> +
> +#define pt_index(lvl, va) pt_linear_offset(lvl, (va) & XEN_PT_LEVEL_MASK(lvl))
> +
> +/* Page Table entry */
> +typedef struct {
> +#ifdef CONFIG_RISCV_64
> +    uint64_t pte;
> +#else
> +    uint32_t pte;
> +#endif
> +} pte_t;
> +
> +#define pte_to_addr(x) (((x) >> PTE_PPN_SHIFT) << PAGE_SHIFT)

This will be at risk of overflow for RV32 without a cast to paddr_t (which
I expect would be a 64-bit type on RV32 just like it is on RV64).
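
For illustration, a minimal sketch of the cast being asked for (an assumption
about how a later revision might address this, not the actual follow-up change):

    /*
     * Hypothetical fix: casting to paddr_t (64-bit even on RV32) before
     * shifting keeps the PPN-to-address conversion from overflowing the
     * 32-bit pte field.
     */
    #define pte_to_addr(x) (((paddr_t)(x) >> PTE_PPN_SHIFT) << PAGE_SHIFT)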

> +#define addr_to_pte(x) (((x) >> PAGE_SHIFT) << PTE_PPN_SHIFT)
> +
> +static inline pte_t paddr_to_pte(const paddr_t paddr,
> +                                 const unsigned long permissions)
> +{
> +    unsigned long tmp = addr_to_pte(paddr);
> +    return (pte_t) { .pte = tmp | permissions };
> +}
> +
> +static inline paddr_t pte_to_paddr(const pte_t pte)
> +{
> +    return pte_to_addr(pte.pte);
> +}
> +
> +static inline bool pte_is_valid(const pte_t p)
> +{
> +    return p.pte & PTE_VALID;
> +}

A remark on all of the "const" here: It's fine if you want to keep them,
but generally we care about const-correctness only for pointed-to types.
Scalar and compound parameter values are owned by called function anyway,
so all the "const" achieves is that the function can't alter its own
argument(s).

> --- /dev/null
> +++ b/xen/arch/riscv/mm.c
> @@ -0,0 +1,319 @@
> +#include <xen/compiler.h>
> +#include <xen/init.h>
> +#include <xen/kernel.h>
> +
> +#include <asm/early_printk.h>
> +#include <asm/config.h>
> +#include <asm/csr.h>
> +#include <asm/mm.h>
> +#include <asm/page.h>
> +#include <asm/processor.h>
> +
> +struct mmu_desc {
> +    unsigned long num_levels;

Isn't "unsigned int" sufficient here?

> +/*
> + * It is expected that Xen won't be more then 2 MB.
> + * The check in xen.lds.S guarantees that.
> + * At least 4 page (in case when Sv48 or Sv57 are used )
> + * tables are needed to cover 2 MB. One for each page level
> + * table with PAGE_SIZE = 4 Kb
> + *
> + * One L0 page table can cover 2 MB
> + * (512 entries of one page table * PAGE_SIZE).
> + *
> + * It might be needed one more page table in case when
> + * Xen load address isn't 2 MB aligned.
> + *
> + */
> +#define PGTBL_INITIAL_COUNT     (5)

On x86 we have a CONFIG_PAGING_LEVELS constant. If you had something
like this as well, this 5 would better match the comment as
((CONFIG_PAGING_LEVELS - 1) + 1), avoiding to make space for the two
levels you won't need as long as only Sv39 is really meant to be used.
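
A sketch of what such a definition might look like, assuming a
CONFIG_PAGING_LEVELS Kconfig constant existed for RISC-V as it does on x86
(hypothetical, not a symbol introduced by this series):

    /*
     * One page table per non-root level, plus one extra table in case
     * Xen's load address isn't 2 MB aligned (hypothetical variant).
     */
    #define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) + 1)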

> +#define PGTBL_ENTRY_AMOUNT  (PAGE_SIZE / sizeof(pte_t))
> +
> +pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
> +stage1_pgtbl_root[PGTBL_ENTRY_AMOUNT];
> +
> +pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
> +stage1_pgtbl_nonroot[PGTBL_INITIAL_COUNT * PGTBL_ENTRY_AMOUNT];
> 
> +#define HANDLE_PGTBL(curr_lvl_num)                                          \
> +    index = pt_index(curr_lvl_num, page_addr);                              \
> +    if ( pte_is_valid(pgtbl[index]) )                                       \
> +    {                                                                       \
> +        /* Find L{ 0-3 } table */                                           \
> +        pgtbl = (pte_t *)pte_to_paddr(pgtbl[index]);                        \
> +    }                                                                       \
> +    else                                                                    \
> +    {                                                                       \
> +        /* Allocate new L{0-3} page table */                                \
> +        if ( mmu_desc->pgtbl_count == PGTBL_INITIAL_COUNT )                 \
> +        {                                                                   \
> +            early_printk("(XEN) No initial table available\n");             \
> +            /* panic(), BUG() or ASSERT() aren't ready now. */              \
> +            die();                                                          \
> +        }                                                                   \
> +        mmu_desc->pgtbl_count++;                                            \
> +        pgtbl[index] = paddr_to_pte((unsigned long)mmu_desc->next_pgtbl,    \
> +                                    PTE_VALID);                             \
> +        pgtbl = mmu_desc->next_pgtbl;                                       \
> +        mmu_desc->next_pgtbl += PGTBL_ENTRY_AMOUNT;                         \
> +    }
> +
> +static void __init setup_initial_mapping(struct mmu_desc *mmu_desc,
> +                                         unsigned long map_start,
> +                                         unsigned long map_end,
> +                                         unsigned long pa_start,
> +                                         unsigned long permissions)

Wouldn't the last one more sensibly be "unsigned int"?

> +{
> +    unsigned int index;
> +    pte_t *pgtbl;
> +    unsigned long page_addr;
> +    pte_t pte_to_be_written;
> +    unsigned long paddr;
> +    unsigned long tmp_permissions;
> +
> +    if ( (unsigned long)_start % XEN_PT_LEVEL_SIZE(0) )
> +    {
> +        early_printk("(XEN) Xen should be loaded at 4k boundary\n");
> +        die();
> +    }
> +
> +    if ( map_start & ~XEN_PT_LEVEL_MAP_MASK(0) ||
> +         pa_start & ~XEN_PT_LEVEL_MAP_MASK(0) )
> +    {
> +        early_printk("(XEN) map and pa start addresses should be aligned\n");
> +        /* panic(), BUG() or ASSERT() aren't ready now. */
> +        die();
> +    }
> +
> +    page_addr = map_start;
> +    while ( page_addr < map_end )
> +    {
> +        pgtbl = mmu_desc->pgtbl_base;
> +
> +        switch (mmu_desc->num_levels)

Nit: Style (missing blanks).

> +        {
> +            case 4: /* Level 3 */

Nit: Indentation of case labels matches that of the opening brace in our
style.

> +                HANDLE_PGTBL(3);
> +            case 3: /* Level 2 */
> +                HANDLE_PGTBL(2);
> +            case 2: /* Level 1 */
> +                HANDLE_PGTBL(1);
> +            case 1: /* Level 0 */
> +                index = pt_index(0, page_addr);
> +                paddr = (page_addr - map_start) + pa_start;
> +
> +                tmp_permissions = permissions;
> +
> +                if ( is_kernel_text(LINK_TO_LOAD(page_addr)) ||
> +                     is_kernel_inittext(LINK_TO_LOAD(page_addr)) )
> +                    tmp_permissions =
> +                        PTE_EXECUTABLE | PTE_READABLE | PTE_VALID;
> +
> +                if ( is_kernel_rodata(LINK_TO_LOAD(page_addr)) )
> +                    tmp_permissions = PTE_READABLE | PTE_VALID;
> +
> +                pte_to_be_written = paddr_to_pte(paddr, tmp_permissions);
> +
> +                if ( !pte_is_valid(pgtbl[index]) )
> +                    pgtbl[index] = pte_to_be_written;
> +                else
> +                {
> +                    /*
> +                    * get an adresses of current pte and that one to
> +                    * be written without permission flags
> +                    */
> +                    unsigned long curr_pte =
> +                        pgtbl[index].pte & ~((1 << PTE_PPN_SHIFT) - 1);
> +
> +                    pte_to_be_written.pte &= ~((1 << PTE_PPN_SHIFT) - 1);

If you mean to only compare addresses, why don't you use pte_to_paddr()?
Question though is whether it's correct to ignore permission differences.
I'd expect you only want to mask off PTE_ACCESSED and PTE_DIRTY.
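
One possible shape of that comparison, sketched under the assumption that
only the accessed/dirty bits should be ignored (not the reworked patch):

    /*
     * Compare the already-present PTE with the one about to be written,
     * ignoring only the PTE_ACCESSED and PTE_DIRTY bits.
     */
    unsigned long ignore_mask = PTE_ACCESSED | PTE_DIRTY;

    if ( (pgtbl[index].pte & ~ignore_mask) !=
         (pte_to_be_written.pte & ~ignore_mask) )
    {
        early_printk("PTE overridden has occurred\n");
        /* panic(), <asm/bug.h> aren't ready now. */
        die();
    }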

> +                    if ( curr_pte != pte_to_be_written.pte )
> +                    {
> +                        early_printk("PTE that is intended to write isn't the"
> +                                    "same that the once are overwriting\n");

Nit: One-off indentation. As to the message text - I take it that's
temporary only anyway, and hence there's little point thinking about
improving it?

> +                        /* panic(), <asm/bug.h> aren't ready now. */
> +                        die();
> +                    }
> +                }
> +        }
> +
> +        /* Point to next page */
> +        page_addr += XEN_PT_LEVEL_SIZE(0);

Seeing this as well as the loop heading - maybe more suitable as a
for(;;) loop?

> +    }
> +}

Since HANDLE_PGTBL() is strictly for use above only, I think you'd better
#undef it here.

> +static void __init calc_pgtbl_lvls_num(struct  mmu_desc *mmu_desc)
> +{
> +    unsigned long satp_mode = RV_STAGE1_MODE;
> +
> +    /* Number of page table levels */
> +    switch (satp_mode)
> +    {
> +        case SATP_MODE_SV32:
> +            mmu_desc->num_levels = 2;
> +            break;
> +        case SATP_MODE_SV39:
> +            mmu_desc->num_levels = 3;
> +            break;
> +        case SATP_MODE_SV48:
> +            mmu_desc->num_levels = 4;
> +            break;
> +        default:
> +            early_printk("(XEN) Unsupported SATP_MODE\n");
> +            die();
> +    }
> +}
> +
> +static bool __init check_pgtbl_mode_support(struct mmu_desc *mmu_desc,
> +                                            unsigned long load_start,
> +                                            unsigned long satp_mode)
> +{
> +    bool is_mode_supported = false;
> +    unsigned int index;
> +    unsigned int page_table_level = (mmu_desc->num_levels - 1);
> +    unsigned level_map_mask = XEN_PT_LEVEL_MAP_MASK(page_table_level);
> +
> +    unsigned long aligned_load_start = load_start & level_map_mask;
> +    unsigned long aligned_page_size = XEN_PT_LEVEL_SIZE(page_table_level);
> +    unsigned long xen_size = (unsigned long)(_end - start);
> +
> +    if ( (load_start + xen_size) > (aligned_load_start + aligned_page_size) )
> +    {
> +        early_printk("please place Xen to be in range of PAGE_SIZE "
> +                     "where PAGE_SIZE is XEN_PT_LEVEL_SIZE( {L3 | L2 | L1} ) "
> +                     "depending on expected SATP_MODE \n"
> +                     "XEN_PT_LEVEL_SIZE is defined in <asm/page.h>\n");
> +        die();
> +    }
> +
> +    index = pt_index(page_table_level, aligned_load_start);
> +    stage1_pgtbl_root[index] = paddr_to_pte(aligned_load_start,
> +                                            PTE_LEAF_DEFAULT | PTE_EXECUTABLE);
> +
> +    asm volatile("sfence.vma");

Nit (here and several times again below): Style (missing three blanks, as
"asm" is a keyword).

> +    csr_write(CSR_SATP,
> +              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |
> +              satp_mode << SATP_MODE_SHIFT);
> +
> +    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode )
> +        is_mode_supported = true;
> +
> +    /* Clean MMU root page table and disable MMU */
> +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
> +
> +    csr_write(CSR_SATP, 0);
> +    asm volatile("sfence.vma");

I guess what you do in this function could do with some more comments.
Looks like you're briefly enabling the MMU to check that what you wrote
to SATP you can also read back. (Isn't there a register reporting
whether the feature is available?)

If so, I think you cannot clear the stage1_pgtbl_root[] slot before
you've disabled the MMU again.

(As a minor aspect, I'd like to encourage you to write plain zero just
as 0, not as 0x0, unless used in contexts where other hex numbers nearby
make this desirable.)
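
A sketch of the ordering being asked for, i.e. turning translation off before
tearing down the temporary root entry (an assumption about the reworked
function, not the actual next revision):

    /* Probe result has been read back; disable the MMU first ... */
    csr_write(CSR_SATP, 0);

    /* ... and only then clear the temporary root page table entry. */
    stage1_pgtbl_root[index] = paddr_to_pte(0, 0);

    asm volatile ( "sfence.vma" );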

> +    return is_mode_supported;
> +}
> +
> +/*
> + * setup_initial_pagetables:
> + *
> + * Build the page tables for Xen that map the following:
> + *  1. Calculate page table's level numbers.
> + *  2. Init mmu description structure.
> + *  3. Check that linker addresses range doesn't overlap
> + *     with load addresses range
> + *  4. Map all linker addresses and load addresses ( it shouldn't
> + *     be 1:1 mapped and will be 1:1 mapped only in case if
> + *     linker address is equal to load address ) with
> + *     RW permissions by default.
> + *  5. Setup proper PTE permissions for each section.
> + */
> +void __init setup_initial_pagetables(void)
> +{
> +    struct mmu_desc mmu_desc = { 0, 0, NULL, 0 };

Just {} ought to do as initializer, but if you want to spell things out,
then please use 0 / NULL consistently for integral / pointer type fields.

> +    /*
> +     * Access to _{stard, end } is always PC-relative

I guess you mean _start. For just a leading underscore I also don't think
using this braced form is useful.

> +     * thereby when access them we will get load adresses
> +     * of start and end of Xen
> +     * To get linker addresses LOAD_TO_LINK() is required
> +     * to use
> +     */
> +    unsigned long load_start    = (unsigned long)_start;
> +    unsigned long load_end      = (unsigned long)_end;
> +    unsigned long linker_start  = LOAD_TO_LINK(load_start);
> +    unsigned long linker_end    = LOAD_TO_LINK(load_end);
> +
> +    if ( (linker_start != load_start) &&
> +         (linker_start <= load_end) && (load_start <= linker_end) ) {

Nit: Style (brace placement).

> +        early_printk("(XEN) linker and load address ranges overlap\n");
> +        die();
> +    }
> +
> +    calc_pgtbl_lvls_num(&mmu_desc);
> +
> +    if ( !check_pgtbl_mode_support(&mmu_desc, load_start, RV_STAGE1_MODE) )
> +    {
> +        early_printk("requested MMU mode isn't supported by CPU\n"
> +                     "Please choose different in <asm/config.h>\n");
> +        die();
> +    }
> +
> +    mmu_desc.pgtbl_base = stage1_pgtbl_root;
> +    mmu_desc.next_pgtbl = stage1_pgtbl_nonroot;
> +
> +    setup_initial_mapping(&mmu_desc,
> +                          linker_start,
> +                          linker_end,
> +                          load_start,
> +                          PTE_LEAF_DEFAULT);
> +}
> +
> +void __init noinline enable_mmu()
> +{
> +    /*
> +     * Calculate a linker time address of the mmu_is_enabled
> +     * label and update CSR_STVEC with it.
> +     * MMU is configured in a way where linker addresses are mapped
> +     * on load addresses so in a case when linker addresses are not equal
> +     * to load addresses, after MMU is enabled, it will cause
> +     * an exception and jump to linker time addresses.
> +     * Otherwise if load addresses are equal to linker addresses the code
> +     * after mmu_is_enabled label will be executed without exception.
> +     */
> +    csr_write(CSR_STVEC, LOAD_TO_LINK((unsigned long)&&mmu_is_enabled));
> +
> +    /* Ensure page table writes precede loading the SATP */
> +    asm volatile("sfence.vma");
> +
> +    /* Enable the MMU and load the new pagetable for Xen */
> +    csr_write(CSR_SATP,
> +              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |

Please try to avoid open-coding of existing constructs: Here you mean
either PFN_DOWN() or paddr_to_pfn() (you see, we have even two). (As I
notice I did overlook at least similar earlier instance.)
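
For example, with one of the existing helpers the write could look like this
(a sketch, assuming PFN_DOWN() is usable here as on other architectures):

    csr_write(CSR_SATP,
              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
              RV_STAGE1_MODE << SATP_MODE_SHIFT);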

> +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
> +
> +    asm volatile(".align 2");
> +      mmu_is_enabled:

How in the world is one to spot this label? Yes, it shouldn't be entirely
unindented. But it also should be indented less than the surrounding code
(with the exception of switch() statement case labels, where a non-case
label might be intended the same as a case ones). Our rule of thumb is to
indent such labels by a single space.

> +    /*
> +     * Stack should be re-inited as:
> +     * 1. Right now an address of the stack is relative to load time
> +     *    addresses what will cause an issue in case of load start address
> +     *    isn't equal to linker start address.
> +     * 2. Addresses in stack are all load time relative which can be an
> +     *    issue in case when load start address isn't equal to linker
> +     *    start address.
> +     */
> +    asm volatile ("mv sp, %0" : : "r"((unsigned long)cpu0_boot_stack + STACK_SIZE));

Nit: Style (overly long line).

What's worse - I don't think it is permitted to alter sp in the middle of
a function. The compiler may maintain local variables on the stack which
don't correspond to any programmer specified ones, including pointer ones
which point into the stack frame. This is specifically why both x86 and
Arm have switch_stack_and_jump() macros.

> +    /*
> +     * We can't return to the caller because the stack was reseted
> +     * and it may have stash some variable on the stack.
> +     * Jump to a brand new function as the stack was reseted
> +    */

Nit: Style (indentation).

> +    cont_after_mmu_is_enabled();
> +}
> +
> diff --git a/xen/arch/riscv/riscv64/head.S b/xen/arch/riscv/riscv64/head.S
> index 8887f0cbd4..b3309d902c 100644
> --- a/xen/arch/riscv/riscv64/head.S
> +++ b/xen/arch/riscv/riscv64/head.S
> @@ -1,4 +1,5 @@
>  #include <asm/asm.h>
> +#include <asm/asm-offsets.h>
>  #include <asm/riscv_encoding.h>
>  
>          .section .text.header, "ax", %progbits
> @@ -32,3 +33,4 @@ ENTRY(start)
>          add     sp, sp, t0
>  
>          tail    start_xen
> +

???

> --- a/xen/arch/riscv/xen.lds.S
> +++ b/xen/arch/riscv/xen.lds.S
> @@ -136,6 +136,7 @@ SECTIONS
>      . = ALIGN(POINTER_ALIGN);
>      __init_end = .;
>  
> +    . = ALIGN(PAGE_SIZE);
>      .bss : {                     /* BSS */
>          __bss_start = .;
>          *(.bss.stack_aligned)

Why do you need this? You properly use __aligned(PAGE_SIZE) for the
page tables you define, and PAGE_SIZE wouldn't be enough here anyway
if STACK_SIZE > PAGE_SIZE (as .bss.stack_aligned comes first). The
only time you'd need such an ALIGN() is if the following label
(__bss_start in this case) needed to be aligned at a certain
boundary. (I'm a little puzzled though that __bss_start isn't
[immediately] preceded by ". = ALIGN(POINTER_ALIGN);" - didn't .bss
clearing rely on such alignment?)

Jan
Oleksii April 21, 2023, 4:01 p.m. UTC | #2
On Thu, 2023-04-20 at 16:36 +0200, Jan Beulich wrote:
> On 19.04.2023 17:42, Oleksii Kurochko wrote:
> > --- a/xen/arch/riscv/include/asm/page-bits.h
> > +++ b/xen/arch/riscv/include/asm/page-bits.h
> > @@ -1,6 +1,16 @@
> >  #ifndef __RISCV_PAGE_BITS_H__
> >  #define __RISCV_PAGE_BITS_H__
> >  
> > +#ifdef CONFIG_RISCV_64
> > +#define PAGETABLE_ORDER         (9)
> > +#else /* CONFIG_RISCV_32 */
> > +#define PAGETABLE_ORDER         (10)
> > +#endif
> > +
> > +#define PAGETABLE_ENTRIES       (1 << PAGETABLE_ORDER)
> > +
> > +#define PTE_PPN_SHIFT           10
> > +
> >  #define PAGE_SHIFT              12 /* 4 KiB Pages */
> >  #define PADDR_BITS              56 /* 44-bit PPN */
> 
> Personally I think these two would better remain at the top. But
> maybe
> that's just me ...
I'm ok with keeping them at the top.

> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/include/asm/page.h
> > @@ -0,0 +1,63 @@
> > +#ifndef _ASM_RISCV_PAGE_H
> > +#define _ASM_RISCV_PAGE_H
> > +
> > +#include <xen/const.h>
> > +#include <xen/types.h>
> > +
> > +#define VPN_MASK                    ((unsigned
> > long)(PAGETABLE_ENTRIES - 1))
> > +
> > +#define XEN_PT_LEVEL_ORDER(lvl)     ((lvl) * PAGETABLE_ORDER)
> > +#define XEN_PT_LEVEL_SHIFT(lvl)     (XEN_PT_LEVEL_ORDER(lvl) +
> > PAGE_SHIFT)
> > +#define XEN_PT_LEVEL_SIZE(lvl)      (_AT(paddr_t, 1) <<
> > XEN_PT_LEVEL_SHIFT(lvl))
> > +#define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) -
> > 1))
> > +#define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK <<
> > XEN_PT_LEVEL_SHIFT(lvl))
> > +
> > +#define PTE_VALID                   BIT(0, UL)
> > +#define PTE_READABLE                BIT(1, UL)
> > +#define PTE_WRITABLE                BIT(2, UL)
> > +#define PTE_EXECUTABLE              BIT(3, UL)
> > +#define PTE_USER                    BIT(4, UL)
> > +#define PTE_GLOBAL                  BIT(5, UL)
> > +#define PTE_ACCESSED                BIT(6, UL)
> > +#define PTE_DIRTY                   BIT(7, UL)
> > +#define PTE_RSW                     (BIT(8, UL) | BIT(9, UL))
> > +
> > +#define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE)
> > +#define PTE_TABLE                   (PTE_VALID)
> > +
> > +/* Calculate the offsets into the pagetables for a given VA */
> > +#define pt_linear_offset(lvl, va)   ((va) >>
> > XEN_PT_LEVEL_SHIFT(lvl))
> > +
> > +#define pt_index(lvl, va) pt_linear_offset(lvl, (va) &
> > XEN_PT_LEVEL_MASK(lvl))
> > +
> > +/* Page Table entry */
> > +typedef struct {
> > +#ifdef CONFIG_RISCV_64
> > +    uint64_t pte;
> > +#else
> > +    uint32_t pte;
> > +#endif
> > +} pte_t;
> > +
> > +#define pte_to_addr(x) (((x) >> PTE_PPN_SHIFT) << PAGE_SHIFT)
> 
> This will be at risk of overflow for RV32 without a cast to paddr_t
> (which
> I expect would be a 64-bit type on RV32 just like it is on RV64).
Yeah, paddr_t is u64 for both RV32 and RV64.
I'll add cast to paddr_t to avoid possible risk of overflow.

> 
> > +#define addr_to_pte(x) (((x) >> PAGE_SHIFT) << PTE_PPN_SHIFT)
> > +
> > +static inline pte_t paddr_to_pte(const paddr_t paddr,
> > +                                 const unsigned long permissions)
> > +{
> > +    unsigned long tmp = addr_to_pte(paddr);
> > +    return (pte_t) { .pte = tmp | permissions };
> > +}
> > +
> > +static inline paddr_t pte_to_paddr(const pte_t pte)
> > +{
> > +    return pte_to_addr(pte.pte);
> > +}
> > +
> > +static inline bool pte_is_valid(const pte_t p)
> > +{
> > +    return p.pte & PTE_VALID;
> > +}
> 
> A remark on all of the "const" here: It's fine if you want to keep
> them,
> but generally we care about const-correctness only for pointed-to
> types.
> Scalar and compound parameter values are owned by called function
> anyway,
> so all the "const" achieves is that the function can't alter its own
> argument(s).
Then it doesn't make sense for the current case, and removing them will
simplify the paddr_to_pte() function. So I prefer to remove them.

> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/mm.c
> > @@ -0,0 +1,319 @@
> > +#include <xen/compiler.h>
> > +#include <xen/init.h>
> > +#include <xen/kernel.h>
> > +
> > +#include <asm/early_printk.h>
> > +#include <asm/config.h>
> > +#include <asm/csr.h>
> > +#include <asm/mm.h>
> > +#include <asm/page.h>
> > +#include <asm/processor.h>
> > +
> > +struct mmu_desc {
> > +    unsigned long num_levels;
> 
> Isn't "unsigned int" sufficient here?
Yes, it will be sufficient.

> 
> > +/*
> > + * It is expected that Xen won't be more then 2 MB.
> > + * The check in xen.lds.S guarantees that.
> > + * At least 4 page (in case when Sv48 or Sv57 are used )
> > + * tables are needed to cover 2 MB. One for each page level
> > + * table with PAGE_SIZE = 4 Kb
> > + *
> > + * One L0 page table can cover 2 MB
> > + * (512 entries of one page table * PAGE_SIZE).
> > + *
> > + * It might be needed one more page table in case when
> > + * Xen load address isn't 2 MB aligned.
> > + *
> > + */
> > +#define PGTBL_INITIAL_COUNT     (5)
> 
> On x86 we have a CONFIG_PAGING_LEVELS constant. If you had something
> like this as well, this 5 would better match the comment as
> ((CONFIG_PAGING_LEVELS - 1) + 1), avoiding to make space for the two
> levels you won't need as long as only Sv39 is really meant to be
> used.
Thanks for note. I'll use CONFIG_PAGING_LEVELS too.

> 
> > +#define PGTBL_ENTRY_AMOUNT  (PAGE_SIZE / sizeof(pte_t))
> > +
> > +pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
> > +stage1_pgtbl_root[PGTBL_ENTRY_AMOUNT];
> > +
> > +pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
> > +stage1_pgtbl_nonroot[PGTBL_INITIAL_COUNT * PGTBL_ENTRY_AMOUNT];
> > 
> > +#define HANDLE_PGTBL(curr_lvl_num)                                          \
> > +    index = pt_index(curr_lvl_num, page_addr);                              \
> > +    if ( pte_is_valid(pgtbl[index]) )                                       \
> > +    {                                                                       \
> > +        /* Find L{ 0-3 } table */                                           \
> > +        pgtbl = (pte_t *)pte_to_paddr(pgtbl[index]);                        \
> > +    }                                                                       \
> > +    else                                                                    \
> > +    {                                                                       \
> > +        /* Allocate new L{0-3} page table */                                \
> > +        if ( mmu_desc->pgtbl_count == PGTBL_INITIAL_COUNT )                 \
> > +        {                                                                   \
> > +            early_printk("(XEN) No initial table available\n");             \
> > +            /* panic(), BUG() or ASSERT() aren't ready now. */              \
> > +            die();                                                          \
> > +        }                                                                   \
> > +        mmu_desc->pgtbl_count++;                                            \
> > +        pgtbl[index] = paddr_to_pte((unsigned long)mmu_desc->next_pgtbl,    \
> > +                                    PTE_VALID);                             \
> > +        pgtbl = mmu_desc->next_pgtbl;                                       \
> > +        mmu_desc->next_pgtbl += PGTBL_ENTRY_AMOUNT;                         \
> > +    }
> > +
> > +static void __init setup_initial_mapping(struct mmu_desc *mmu_desc,
> > +                                         unsigned long map_start,
> > +                                         unsigned long map_end,
> > +                                         unsigned long pa_start,
> > +                                         unsigned long permissions)
> 
> Wouldn't the last one more sensibly be "unsigned int"?
It would. Thanks, I'll update the code.
> 
> > +{
> > +    unsigned int index;
> > +    pte_t *pgtbl;
> > +    unsigned long page_addr;
> > +    pte_t pte_to_be_written;
> > +    unsigned long paddr;
> > +    unsigned long tmp_permissions;
> > +
> > +    if ( (unsigned long)_start % XEN_PT_LEVEL_SIZE(0) )
> > +    {
> > +        early_printk("(XEN) Xen should be loaded at 4k
> > boundary\n");
> > +        die();
> > +    }
> > +
> > +    if ( map_start & ~XEN_PT_LEVEL_MAP_MASK(0) ||
> > +         pa_start & ~XEN_PT_LEVEL_MAP_MASK(0) )
> > +    {
> > +        early_printk("(XEN) map and pa start addresses should be
> > aligned\n");
> > +        /* panic(), BUG() or ASSERT() aren't ready now. */
> > +        die();
> > +    }
> > +
> > +    page_addr = map_start;
> > +    while ( page_addr < map_end )
> > +    {
> > +        pgtbl = mmu_desc->pgtbl_base;
> > +
> > +        switch (mmu_desc->num_levels)
> 
> Nit: Style (missing blanks).
Thanks. I'll update.
> 
> > +        {
> > +            case 4: /* Level 3 */
> 
> Nit: Indentation of case labels matches that of the opening brace in
> our
> style.
Sure. Missed that.
> 
> > +                HANDLE_PGTBL(3);
> > +            case 3: /* Level 2 */
> > +                HANDLE_PGTBL(2);
> > +            case 2: /* Level 1 */
> > +                HANDLE_PGTBL(1);
> > +            case 1: /* Level 0 */
> > +                index = pt_index(0, page_addr);
> > +                paddr = (page_addr - map_start) + pa_start;
> > +
> > +                tmp_permissions = permissions;
> > +
> > +                if ( is_kernel_text(LINK_TO_LOAD(page_addr)) ||
> > +                     is_kernel_inittext(LINK_TO_LOAD(page_addr)) )
> > +                    tmp_permissions =
> > +                        PTE_EXECUTABLE | PTE_READABLE | PTE_VALID;
> > +
> > +                if ( is_kernel_rodata(LINK_TO_LOAD(page_addr)) )
> > +                    tmp_permissions = PTE_READABLE | PTE_VALID;
> > +
> > +                pte_to_be_written = paddr_to_pte(paddr,
> > tmp_permissions);
> > +
> > +                if ( !pte_is_valid(pgtbl[index]) )
> > +                    pgtbl[index] = pte_to_be_written;
> > +                else
> > +                {
> > +                    /*
> > +                    * get an adresses of current pte and that one
> > to
> > +                    * be written without permission flags
> > +                    */
> > +                    unsigned long curr_pte =
> > +                        pgtbl[index].pte & ~((1 << PTE_PPN_SHIFT)
> > - 1);
> > +
> > +                    pte_to_be_written.pte &= ~((1 <<
> > PTE_PPN_SHIFT) - 1);
> 
> If you mean to only compare addresses, why don't you use
> pte_to_paddr()?
Yes, it will be better to use pte_to_paddr().
> Question though is whether it's correct to ignore permission
> differenes.
> I'd expect you only want to mask off PTE_ACCESSED and PTE_DIRTY.
Initially I wanted to do a rough check, but you are probably right and
it will be better to mask off only PTE_ACCESSED and PTE_DIRTY.

> 
> > +                    if ( curr_pte != pte_to_be_written.pte )
> > +                    {
> > +                        early_printk("PTE that is intended to
> > write isn't the"
> > +                                    "same that the once are
> > overwriting\n");
> 
> Nit: One-off indentation. As to the message text - I take it that's
> temporary only anyway, and hence there's little point thinking about
> improving it?
Probably it would be more clear:
"PTE overridden has occurred"
> 
> > +                        /* panic(), <asm/bug.h> aren't ready now.
> > */
> > +                        die();
> > +                    }
> > +                }
> > +        }
> > +
> > +        /* Point to next page */
> > +        page_addr += XEN_PT_LEVEL_SIZE(0);
> 
> Seeing this as well as the loop heading - maybe more suitable as a
> for(;;) loop?
I am not sure that I understand the benefits of changing "while (
page_addr < map_end )" to "for(;;)".
Could you please explain me what the benefits are?
> 
> > +    }
> > +}
> 
> Since HANDLE_PGTBL() is strictly for use above only, I think you'd
> better
> #undef it here.
> 
> > +static void __init calc_pgtbl_lvls_num(struct  mmu_desc *mmu_desc)
> > +{
> > +    unsigned long satp_mode = RV_STAGE1_MODE;
> > +
> > +    /* Number of page table levels */
> > +    switch (satp_mode)
> > +    {
> > +        case SATP_MODE_SV32:
> > +            mmu_desc->num_levels = 2;
> > +            break;
> > +        case SATP_MODE_SV39:
> > +            mmu_desc->num_levels = 3;
> > +            break;
> > +        case SATP_MODE_SV48:
> > +            mmu_desc->num_levels = 4;
> > +            break;
> > +        default:
> > +            early_printk("(XEN) Unsupported SATP_MODE\n");
> > +            die();
> > +    }
> > +}
> > +
> > +static bool __init check_pgtbl_mode_support(struct mmu_desc
> > *mmu_desc,
> > +                                            unsigned long
> > load_start,
> > +                                            unsigned long
> > satp_mode)
> > +{
> > +    bool is_mode_supported = false;
> > +    unsigned int index;
> > +    unsigned int page_table_level = (mmu_desc->num_levels - 1);
> > +    unsigned level_map_mask =
> > XEN_PT_LEVEL_MAP_MASK(page_table_level);
> > +
> > +    unsigned long aligned_load_start = load_start &
> > level_map_mask;
> > +    unsigned long aligned_page_size =
> > XEN_PT_LEVEL_SIZE(page_table_level);
> > +    unsigned long xen_size = (unsigned long)(_end - start);
> > +
> > +    if ( (load_start + xen_size) > (aligned_load_start +
> > aligned_page_size) )
> > +    {
> > +        early_printk("please place Xen to be in range of PAGE_SIZE
> > "
> > +                     "where PAGE_SIZE is XEN_PT_LEVEL_SIZE( {L3 |
> > L2 | L1} ) "
> > +                     "depending on expected SATP_MODE \n"
> > +                     "XEN_PT_LEVEL_SIZE is defined in
> > <asm/page.h>\n");
> > +        die();
> > +    }
> > +
> > +    index = pt_index(page_table_level, aligned_load_start);
> > +    stage1_pgtbl_root[index] = paddr_to_pte(aligned_load_start,
> > +                                            PTE_LEAF_DEFAULT |
> > PTE_EXECUTABLE);
> > +
> > +    asm volatile("sfence.vma");
> 
> Nit (here and several times again below): Style (missing three
> blanks, as
> "asm" is a keyword).
Thanks. I'll add them
> 
> > +    csr_write(CSR_SATP,
> > +              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |
> > +              satp_mode << SATP_MODE_SHIFT);
> > +
> > +    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode )
> > +        is_mode_supported = true;
> > +
> > +    /* Clean MMU root page table and disable MMU */
> > +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
> > +
> > +    csr_write(CSR_SATP, 0);
> > +    asm volatile("sfence.vma");
> 
> I guess what you do in this function could do with some more
> comments.
> Looks like you're briefly enabling the MMU to check that what you
> wrote
> to SATP you can also read back. (Isn't there a register reporting
> whether the feature is available?)
I supposed that it has to be but I couldn't find a register in docs.

> 
> If so, I think you cannot clear the stage1_pgtbl_root[] slot before
> you've disabled the MMU again.
> 
> (As a minor aspect, I'd like to encourage you to write plain zero
> just
> as 0, not as 0x0, unless used in contexts where other hex numbers
> nearby
> make this desirable.)
You are right it should be cleared after csr_write(CSR_SATP, 0)
> 
> > +    return is_mode_supported;
> > +}
> > +
> > +/*
> > + * setup_initial_pagetables:
> > + *
> > + * Build the page tables for Xen that map the following:
> > + *  1. Calculate page table's level numbers.
> > + *  2. Init mmu description structure.
> > + *  3. Check that linker addresses range doesn't overlap
> > + *     with load addresses range
> > + *  4. Map all linker addresses and load addresses ( it shouldn't
> > + *     be 1:1 mapped and will be 1:1 mapped only in case if
> > + *     linker address is equal to load address ) with
> > + *     RW permissions by default.
> > + *  5. Setup proper PTE permissions for each section.
> > + */
> > +void __init setup_initial_pagetables(void)
> > +{
> > +    struct mmu_desc mmu_desc = { 0, 0, NULL, 0 };
> 
> Just {} ought to do as initializer, but if you want to spell things
> out,
> then please use 0 / NULL consistently for integral / pointer type
> fields.
Thanks.

> 
> > +    /*
> > +     * Access to _{stard, end } is always PC-relative
> 
> I guess you mean _start. For just a leading underscore I also don't
> think
> using this braced form is useful.
OK then I'll update the comment w/o usage of braced form.
> 
> > +     * thereby when access them we will get load adresses
> > +     * of start and end of Xen
> > +     * To get linker addresses LOAD_TO_LINK() is required
> > +     * to use
> > +     */
> > +    unsigned long load_start    = (unsigned long)_start;
> > +    unsigned long load_end      = (unsigned long)_end;
> > +    unsigned long linker_start  = LOAD_TO_LINK(load_start);
> > +    unsigned long linker_end    = LOAD_TO_LINK(load_end);
> > +
> > +    if ( (linker_start != load_start) &&
> > +         (linker_start <= load_end) && (load_start <= linker_end)
> > ) {
> 
> Nit: Style (brace placement).
Thanks,

> 
> > +        early_printk("(XEN) linker and load address ranges
> > overlap\n");
> > +        die();
> > +    }
> > +
> > +    calc_pgtbl_lvls_num(&mmu_desc);
> > +
> > +    if ( !check_pgtbl_mode_support(&mmu_desc, load_start,
> > RV_STAGE1_MODE) )
> > +    {
> > +        early_printk("requested MMU mode isn't supported by CPU\n"
> > +                     "Please choose different in
> > <asm/config.h>\n");
> > +        die();
> > +    }
> > +
> > +    mmu_desc.pgtbl_base = stage1_pgtbl_root;
> > +    mmu_desc.next_pgtbl = stage1_pgtbl_nonroot;
> > +
> > +    setup_initial_mapping(&mmu_desc,
> > +                          linker_start,
> > +                          linker_end,
> > +                          load_start,
> > +                          PTE_LEAF_DEFAULT);
> > +}
> > +
> > +void __init noinline enable_mmu()
> > +{
> > +    /*
> > +     * Calculate a linker time address of the mmu_is_enabled
> > +     * label and update CSR_STVEC with it.
> > +     * MMU is configured in a way where linker addresses are
> > mapped
> > +     * on load addresses so in a case when linker addresses are
> > not equal
> > +     * to load addresses, after MMU is enabled, it will cause
> > +     * an exception and jump to linker time addresses.
> > +     * Otherwise if load addresses are equal to linker addresses
> > the code
> > +     * after mmu_is_enabled label will be executed without
> > exception.
> > +     */
> > +    csr_write(CSR_STVEC, LOAD_TO_LINK((unsigned
> > long)&&mmu_is_enabled));
> > +
> > +    /* Ensure page table writes precede loading the SATP */
> > +    asm volatile("sfence.vma");
> > +
> > +    /* Enable the MMU and load the new pagetable for Xen */
> > +    csr_write(CSR_SATP,
> > +              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |
> 
> Please try to avoid open-coding of existing constructs: Here you mean
> either PFN_DOWN() or paddr_to_pfn() (you see, we have even two). (As
> I
> notice I did overlook at least similar earlier instance.)
Sure. Thanks.
> 
> > +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
> > +
> > +    asm volatile(".align 2");
> > +      mmu_is_enabled:
> 
> How in the world is one to spot this label? Yes, it shouldn't be
> entirely
> unindented. But it also should be indented less than the surrounding
> code
> (with the exception of switch() statement case labels, where a non-
> case
> label might be intended the same as a case ones). Our rule of thumb
> is to
> indent such labels by a single space.
Thanks. It looks like I misunderstood previous comment about label
indentation.
> 
> > +    /*
> > +     * Stack should be re-inited as:
> > +     * 1. Right now an address of the stack is relative to load
> > time
> > +     *    addresses what will cause an issue in case of load start
> > address
> > +     *    isn't equal to linker start address.
> > +     * 2. Addresses in stack are all load time relative which can
> > be an
> > +     *    issue in case when load start address isn't equal to
> > linker
> > +     *    start address.
> > +     */
> > +    asm volatile ("mv sp, %0" : : "r"((unsigned
> > long)cpu0_boot_stack + STACK_SIZE));
> 
> Nit: Style (overly long line).
> 
> What's worse - I don't think it is permitted to alter sp in the
> middle of
> a function. The compiler may maintain local variables on the stack
> which
> don't correspond to any programmer specified ones, including pointer
> ones
> which point into the stack frame. This is specifically why both x86
> and
> Arm have switch_stack_and_jump() macros.
but the macro (from Arm) looks equivalent to the code mentioned above:
#define switch_stack_and_jump(stack, fn) do {                                      \
    asm volatile ("mov sp,%0; b " STR(fn) : : "r" (stack), "X" (fn) : "memory" );  \
    unreachable();                                                                  \
} while ( false )

Here is part of disassembled enable_mmu():

ffffffffc004aedc:       18079073                csrw    satp,a5
ffffffffc004aee0:       00016797                auipc   a5,0x16
ffffffffc004aee4:       12078793                addi    a5,a5,288
ffffffffc004aee8:       813e                    mv      sp,a5
ffffffffc004af00:       0f4000ef                jal     ra,ffffffffc004aff4 <cont_after_mmu_is_enabled>
...


> 
> > +    /*
> > +     * We can't return to the caller because the stack was reseted
> > +     * and it may have stash some variable on the stack.
> > +     * Jump to a brand new function as the stack was reseted
> > +    */
> 
> Nit: Style (indentation).
Thanks.
> 
> > +    cont_after_mmu_is_enabled();
> > +}
> > +
> > diff --git a/xen/arch/riscv/riscv64/head.S
> > b/xen/arch/riscv/riscv64/head.S
> > index 8887f0cbd4..b3309d902c 100644
> > --- a/xen/arch/riscv/riscv64/head.S
> > +++ b/xen/arch/riscv/riscv64/head.S
> > @@ -1,4 +1,5 @@
> >  #include <asm/asm.h>
> > +#include <asm/asm-offsets.h>
> >  #include <asm/riscv_encoding.h>
> >  
> >          .section .text.header, "ax", %progbits
> > @@ -32,3 +33,4 @@ ENTRY(start)
> >          add     sp, sp, t0
> >  
> >          tail    start_xen
> > +
> 
> ???
Shouldn't it be the one empty line at the end of a file?

> 
> > --- a/xen/arch/riscv/xen.lds.S
> > +++ b/xen/arch/riscv/xen.lds.S
> > @@ -136,6 +136,7 @@ SECTIONS
> >      . = ALIGN(POINTER_ALIGN);
> >      __init_end = .;
> >  
> > +    . = ALIGN(PAGE_SIZE);
> >      .bss : {                     /* BSS */
> >          __bss_start = .;
> >          *(.bss.stack_aligned)
> 
> Why do you need this? You properly use __aligned(PAGE_SIZE) for the
> page tables you define, and PAGE_SIZE wouldn't be enough here anyway
> if STACK_SIZE > PAGE_SIZE (as .bss.stack_aligned comes first). The
> only time you'd need such an ALIGN() is if the following label
> (__bss_start in this case) needed to be aligned at a certain
> boundary. (I'm a little puzzled though that __bss_start isn't
> [immediately] preceded by ". = ALIGN(POINTER_ALIGN);" - didn't .bss
> clearing rely on such alignment?)
ALIGN(PAGE_SIZE)  isn't needed anymore.
I used it to have 4k aligned physical address for PTE when I mapped
each section separately ( it was so in the previous version of MMU
patch series )

Regarding ". = ALIGN(POINTER_ALIGN);" I would say that it is enough to
have aligned __bss_end ( what was done ) to be sure that we can clear
__SIZEOF_POINTER__ bytes each iteration of .bss clearing loop and don't
worry that size of .bss section may not be divisible by
__SIZEOF_POINTER__.

~ Oleksii
Jan Beulich April 24, 2023, 10:18 a.m. UTC | #3
On 21.04.2023 18:01, Oleksii wrote:
> On Thu, 2023-04-20 at 16:36 +0200, Jan Beulich wrote:
>> On 19.04.2023 17:42, Oleksii Kurochko wrote:
>>> +                        /* panic(), <asm/bug.h> aren't ready now.
>>> */
>>> +                        die();
>>> +                    }
>>> +                }
>>> +        }
>>> +
>>> +        /* Point to next page */
>>> +        page_addr += XEN_PT_LEVEL_SIZE(0);
>>
>> Seeing this as well as the loop heading - maybe more suitable as a
>> for(;;) loop?
> I am not sure that I understand the benefits of changing "while (
> page_addr < map_end )" to "for(;;)".
> Could you please explain me what the benefits are?

The common case of using while() is in situations where one cannot
use for(). for() is (imo) preferable in all cases where the third of
the controlling expressions isn't empty (and is to be carried out
after every iteration): Any use of "continue" inside the loop will
then properly effect loop progress. (Of course there are cases where
this behavior isn't wanted; that's where while() may come into play
then.)
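
As an illustration of that preference, the loop in question might be written
as below (a sketch only, with the body elided):

    for ( page_addr = map_start; page_addr < map_end;
          page_addr += XEN_PT_LEVEL_SIZE(0) )
    {
        /*
         * Walk/allocate the intermediate page-table levels and write the
         * leaf PTE for this 4 KiB page; a "continue" anywhere in here
         * still advances page_addr via the loop expression.
         */
    }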

>>> +    csr_write(CSR_SATP,
>>> +              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |
>>> +              satp_mode << SATP_MODE_SHIFT);
>>> +
>>> +    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode )
>>> +        is_mode_supported = true;
>>> +
>>> +    /* Clean MMU root page table and disable MMU */
>>> +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
>>> +
>>> +    csr_write(CSR_SATP, 0);
>>> +    asm volatile("sfence.vma");
>>
>> I guess what you do in this function could do with some more
>> comments.
>> Looks like you're briefly enabling the MMU to check that what you
>> wrote
>> to SATP you can also read back. (Isn't there a register reporting
>> whether the feature is available?)
> I supposed that it has to be but I couldn't find a register in docs.

Well, yes, interestingly the register is marked WARL, so apparently
intended to be used for probing like you do. (I find the definition of
WARL a little odd though, as such writes supposedly aren't necessarily
value preserving. For SATP this might mean that translation is enabled
by a write of an unsupported mode, with a different number of levels.
This isn't going to work very well, I'm afraid.)

>>> +    /*
>>> +     * Stack should be re-inited as:
>>> +     * 1. Right now an address of the stack is relative to load
>>> time
>>> +     *    addresses what will cause an issue in case of load start
>>> address
>>> +     *    isn't equal to linker start address.
>>> +     * 2. Addresses in stack are all load time relative which can
>>> be an
>>> +     *    issue in case when load start address isn't equal to
>>> linker
>>> +     *    start address.
>>> +     */
>>> +    asm volatile ("mv sp, %0" : : "r"((unsigned
>>> long)cpu0_boot_stack + STACK_SIZE));
>>
>> Nit: Style (overly long line).
>>
>> What's worse - I don't think it is permitted to alter sp in the
>> middle of
>> a function. The compiler may maintain local variables on the stack
>> which
>> don't correspond to any programmer specified ones, including pointer
>> ones
>> which point into the stack frame. This is specifically why both x86
>> and
>> Arm have switch_stack_and_jump() macros.
> but the macros (from ARM) looks equal to the code mentioned above:
> #define switch_stack_and_jump(stack, fn) do {                         
> \
>     asm volatile ("mov sp,%0; b " STR(fn) : : "r" (stack), "X" (fn) :
> "memory" ); \

Note how writing SP and branch are contained in a single asm() there.
By checking ...

>     unreachable();                                                    
> \
> } while ( false )
> 
> Here is part of disassembled enable_mmu():
> 
> ffffffffc004aedc:       18079073                csrw    satp,a5
> ffffffffc004aee0:       00016797                auipc   a5,0x16
> ffffffffc004aee4:       12078793                addi    a5,a5,288
> ffffffffc004aee8:       813e                    mv      sp,a5
> ffffffffc004af00:       0f4000ef                jal    
> ra,ffffffffc004aff4 <cont_after_mmu_is_enabled>
> ...

... what the generated code in your case is you won't guarantee that
things remain that way with future (or simply different) compilers.
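
By way of illustration, a RISC-V counterpart keeping the stack switch and the
branch inside a single asm() might look roughly like this (a hypothetical
sketch, not an existing Xen macro):

    #define switch_stack_and_jump(stack, fn) do {                        \
        asm volatile ( "mv sp, %0\n"                                     \
                       "j " STR(fn)                                      \
                       : : "r" (stack), "X" (fn) : "memory" );           \
        unreachable();                                                   \
    } while ( false )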

>>> --- a/xen/arch/riscv/riscv64/head.S
>>> +++ b/xen/arch/riscv/riscv64/head.S
>>> @@ -1,4 +1,5 @@
>>>  #include <asm/asm.h>
>>> +#include <asm/asm-offsets.h>
>>>  #include <asm/riscv_encoding.h>
>>>  
>>>          .section .text.header, "ax", %progbits
>>> @@ -32,3 +33,4 @@ ENTRY(start)
>>>          add     sp, sp, t0
>>>  
>>>          tail    start_xen
>>> +
>>
>> ???
> Shouldn't it be the one empty line at the end of a file?

There should be a newline at the end of a file, but not normally a
blank one. When you introduce a new file, it can be viewed as a matter
of taste whether to have an empty last line, but when you have a
seemingly unrelated change to a file like the one here, this is at
least odd.

>>> --- a/xen/arch/riscv/xen.lds.S
>>> +++ b/xen/arch/riscv/xen.lds.S
>>> @@ -136,6 +136,7 @@ SECTIONS
>>>      . = ALIGN(POINTER_ALIGN);
>>>      __init_end = .;
>>>  
>>> +    . = ALIGN(PAGE_SIZE);
>>>      .bss : {                     /* BSS */
>>>          __bss_start = .;
>>>          *(.bss.stack_aligned)
>>
>> Why do you need this? You properly use __aligned(PAGE_SIZE) for the
>> page tables you define, and PAGE_SIZE wouldn't be enough here anyway
>> if STACK_SIZE > PAGE_SIZE (as .bss.stack_aligned comes first). The
>> only time you'd need such an ALIGN() is if the following label
>> (__bss_start in this case) needed to be aligned at a certain
>> boundary. (I'm a little puzzled though that __bss_start isn't
>> [immediately] preceded by ". = ALIGN(POINTER_ALIGN);" - didn't .bss
>> clearing rely on such alignment?)
> ALIGN(PAGE_SIZE)  isn't needed anymore.
> I used it to have 4k aligned physical address for PTE when I mapped
> each section separately ( it was so in the previous version of MMU
> patch series )
> 
> Regarding ". = ALIGN(POINTER_ALIGN);" I would say that it is enough to
> have aligned __bss_end ( what was done ) to be sure that we can clear
> __SIZEOF_POINTER__ bytes each iteration of .bss clearing loop and don't
> worry that size of .bss section may not be divisible by
> __SIZEOF_POINTER__.

How would guaranteeing this only for __bss_end help? __bss_start could
still be misaligned, and then you'd
(a) use misaligned stores for clearing and
(b) extend clearing to outside of the .bss (as the last of the misaligned
stores would cross the __bss_end boundary).

Jan
Oleksii April 24, 2023, 3:16 p.m. UTC | #4
On Mon, 2023-04-24 at 12:18 +0200, Jan Beulich wrote:
> On 21.04.2023 18:01, Oleksii wrote:
> > On Thu, 2023-04-20 at 16:36 +0200, Jan Beulich wrote:
> > > On 19.04.2023 17:42, Oleksii Kurochko wrote:
> > > > +                        /* panic(), <asm/bug.h> aren't ready now. */
> > > > +                        die();
> > > > +                    }
> > > > +                }
> > > > +        }
> > > > +
> > > > +        /* Point to next page */
> > > > +        page_addr += XEN_PT_LEVEL_SIZE(0);
> > > 
> > > Seeing this as well as the loop heading - maybe more suitable as
> > > a
> > > for(;;) loop?
> > I am not sure that I understand the benefits of changing "while (
> > page_addr < map_end )" to "for(;;)".
> > Could you please explain me what the benefits are?
> 
> The common case of using while() is in situations where one cannot
> use for(). for() is (imo) preferable in all cases where the third of
> the controlling expressions isn't empty (and is to be carried out
> after every iteration): Any use of "continue" inside the loop will
> then properly effect loop progress. (Of course there are cases where
> this behavior isn't wanted; that's where while() may come into play
> then.)
Thanks for the clarification. Now it is clearer.
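
(For reference, a minimal sketch of the loop in that shape, reusing the
identifiers from the patch; map_one_page() below is only a hypothetical
stand-in for the existing per-page body of setup_initial_mapping():)

    unsigned long page_addr;

    for ( page_addr = map_start; page_addr < map_end;
          page_addr += XEN_PT_LEVEL_SIZE(0) )
    {
        /* Walk/allocate the intermediate tables and install the L0 PTE. */
        map_one_page(mmu_desc, page_addr,
                     pa_start + (page_addr - map_start), permissions);
    }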
> 
> > > > +    csr_write(CSR_SATP,
> > > > +              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |
> > > > +              satp_mode << SATP_MODE_SHIFT);
> > > > +
> > > > +    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode )
> > > > +        is_mode_supported = true;
> > > > +
> > > > +    /* Clean MMU root page table and disable MMU */
> > > > +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
> > > > +
> > > > +    csr_write(CSR_SATP, 0);
> > > > +    asm volatile("sfence.vma");
> > > 
> > > I guess what you do in this function could do with some more
> > > comments.
> > > Looks like you're briefly enabling the MMU to check that what you
> > > wrote
> > > to SATP you can also read back. (Isn't there a register reporting
> > > whether the feature is available?)
> > I supposed that it has to be but I couldn't find a register in
> > docs.
> 
> Well, yes, interestingly the register is marked WARL, so apparently
> intended to by used for probing like you do. (I find the definition
> of
> WARL a little odd though, as such writes supposedly aren't
> necessarily
> value preserving. For SATP this might mean that translation is
> enabled
> by a write of an unsupported mode, with a different number of levels.
> This isn't going to work very well, I'm afraid.)
Agree. It will be an issue in the case of a different number of levels.

Then it looks like there is no way to check whether a SATP mode is
supported.

So we have to rely on the developer having specified RV_STAGE1_MODE
correctly in the config file.

> 
> > > > +    /*
> > > > +     * Stack should be re-inited as:
> > > > +     * 1. Right now an address of the stack is relative to
> > > > load
> > > > time
> > > > +     *    addresses what will cause an issue in case of load
> > > > start
> > > > address
> > > > +     *    isn't equal to linker start address.
> > > > +     * 2. Addresses in stack are all load time relative which
> > > > can
> > > > be an
> > > > +     *    issue in case when load start address isn't equal to
> > > > linker
> > > > +     *    start address.
> > > > +     */
> > > > +    asm volatile ("mv sp, %0" : : "r"((unsigned long)cpu0_boot_stack + STACK_SIZE));
> > > 
> > > Nit: Style (overly long line).
> > > 
> > > What's worse - I don't think it is permitted to alter sp in the
> > > middle of
> > > a function. The compiler may maintain local variables on the
> > > stack
> > > which
> > > don't correspond to any programmer specified ones, including
> > > pointer
> > > ones
> > > which point into the stack frame. This is specifically why both
> > > x86
> > > and
> > > Arm have switch_stack_and_jump() macros.
> > but the macros (from ARM) looks equal to the code mentioned above:
> > #define switch_stack_and_jump(stack, fn) do {                        \
> >     asm volatile ("mov sp,%0; b " STR(fn) : : "r" (stack), "X" (fn) : "memory" ); \
> 
> Note how writing SP and branch are contained in a single asm() there.
> By checking ...
> 
> >     unreachable();                                                   \
> > } while ( false )
> > 
> > Here is part of disassembled enable_mmu():
> > 
> > ffffffffc004aedc:       18079073                csrw    satp,a5
> > ffffffffc004aee0:       00016797                auipc   a5,0x16
> > ffffffffc004aee4:       12078793                addi    a5,a5,288
> > ffffffffc004aee8:       813e                    mv      sp,a5
> > ffffffffc004af00:       0f4000ef                jal     ra,ffffffffc004aff4 <cont_after_mmu_is_enabled>
> > ...
> 
> ... what the generated code in your case is you won't guarantee that
> things remain that way with future (or simply different) compilers.
Agree. Thanks for the clarification. I'll take it into account in the
next version of the patch series.

> 
> > > > --- a/xen/arch/riscv/riscv64/head.S
> > > > +++ b/xen/arch/riscv/riscv64/head.S
> > > > @@ -1,4 +1,5 @@
> > > >  #include <asm/asm.h>
> > > > +#include <asm/asm-offsets.h>
> > > >  #include <asm/riscv_encoding.h>
> > > >  
> > > >          .section .text.header, "ax", %progbits
> > > > @@ -32,3 +33,4 @@ ENTRY(start)
> > > >          add     sp, sp, t0
> > > >  
> > > >          tail    start_xen
> > > > +
> > > 
> > > ???
> > Shouldn't it be the one empty line at the end of a file?
> 
> There should be a newline at the end of a file, but not normally a
> blank one. When you introduce a new file, it can be viewed as a
> matter
> of taste whether to have an empty last line, but when you have a
> seemingly unrelated change to a file like the one here, this is at
> least odd.
Agree. Then I'll remove this change from the patch series.

> 
> > > > --- a/xen/arch/riscv/xen.lds.S
> > > > +++ b/xen/arch/riscv/xen.lds.S
> > > > @@ -136,6 +136,7 @@ SECTIONS
> > > >      . = ALIGN(POINTER_ALIGN);
> > > >      __init_end = .;
> > > >  
> > > > +    . = ALIGN(PAGE_SIZE);
> > > >      .bss : {                     /* BSS */
> > > >          __bss_start = .;
> > > >          *(.bss.stack_aligned)
> > > 
> > > Why do you need this? You properly use __aligned(PAGE_SIZE) for
> > > the
> > > page tables you define, and PAGE_SIZE wouldn't be enough here
> > > anyway
> > > if STACK_SIZE > PAGE_SIZE (as .bss.stack_aligned comes first).
> > > The
> > > only time you'd need such an ALIGN() is if the following label
> > > (__bss_start in this case) needed to be aligned at a certain
> > > boundary. (I'm a little puzzled though that __bss_start isn't
> > > [immediately] preceded by ". = ALIGN(POINTER_ALIGN);" - didn't
> > > .bss
> > > clearing rely on such alignment?)
> > ALIGN(PAGE_SIZE)  isn't needed anymore.
> > I used it to have 4k aligned physical address for PTE when I mapped
> > each section separately ( it was so in the previous version of MMU
> > patch series )
> > 
> > Regarding ". = ALIGN(POINTER_ALIGN);" I would say that it is enough
> > to
> > have aligned __bss_end ( what was done ) to be sure that we can
> > clear
> > __SIZEOF_POINTER__ bytes each iteration of .bss clearing loop and
> > don't
> > worry that size of .bss section may not be divisible by
> > __SIZEOF_POINTER__.
> 
> How would guaranteeing this only for __bss_end help? __bss_start
> could
> still be misaligned, and then you'd
> (a) use misaligned stores for clearing and
> (b) extend clearing to outside of the .bss (as the last of the
> misaligned
> stores would cross the __bss_end boundary).
It seems you are right. I'll create a separate commit to align
__bss_start properly.
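
(Roughly what that could look like in xen.lds.S; only a sketch, and the
input-section list below is merely indicative of what the script already
contains:)

    .bss : {                     /* BSS */
        . = ALIGN(POINTER_ALIGN);   /* make __bss_start word-aligned ...  */
        __bss_start = .;
        *(.bss.stack_aligned)
        *(.bss.page_aligned)
        *(.bss .bss.*)
        . = ALIGN(POINTER_ALIGN);   /* ... to match the aligned __bss_end */
        __bss_end = .;
    }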

~ Oleksii
Jan Beulich April 24, 2023, 3:35 p.m. UTC | #5
On 24.04.2023 17:16, Oleksii wrote:
> On Mon, 2023-04-24 at 12:18 +0200, Jan Beulich wrote:
>> On 21.04.2023 18:01, Oleksii wrote:
>>> On Thu, 2023-04-20 at 16:36 +0200, Jan Beulich wrote:
>>>> On 19.04.2023 17:42, Oleksii Kurochko wrote:
>>>>> +    csr_write(CSR_SATP,
>>>>> +              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT)
>>>>> |
>>>>> +              satp_mode << SATP_MODE_SHIFT);
>>>>> +
>>>>> +    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode
>>>>> )
>>>>> +        is_mode_supported = true;
>>>>> +
>>>>> +    /* Clean MMU root page table and disable MMU */
>>>>> +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
>>>>> +
>>>>> +    csr_write(CSR_SATP, 0);
>>>>> +    asm volatile("sfence.vma");
>>>>
>>>> I guess what you do in this function could do with some more
>>>> comments.
>>>> Looks like you're briefly enabling the MMU to check that what you
>>>> wrote
>>>> to SATP you can also read back. (Isn't there a register reporting
>>>> whether the feature is available?)
>>> I supposed that it has to be but I couldn't find a register in
>>> docs.
>>
>> Well, yes, interestingly the register is marked WARL, so apparently
>> intended to by used for probing like you do. (I find the definition
>> of
>> WARL a little odd though, as such writes supposedly aren't
>> necessarily
>> value preserving. For SATP this might mean that translation is
>> enabled
>> by a write of an unsupported mode, with a different number of levels.
>> This isn't going to work very well, I'm afraid.)
> Agree. It will be an issue in case of a different number of levels.
> 
> Then it looks there is no way to check if SATP mode is supported.
> 
> So we have to rely on the fact that the developer specified
> RV_STAGE1_MODE correctly in the config file.

Well, maybe the spec could be clarified in this regard. That WARL
behavior may be okay for some registers, but as said I think it isn't
enough of a guarantee for SATP probing. Alistair, Bob - any thoughts?

Jan
Oleksii April 28, 2023, 8:05 p.m. UTC | #6
Hi Jan,

On Mon, 2023-04-24 at 17:35 +0200, Jan Beulich wrote:
> On 24.04.2023 17:16, Oleksii wrote:
> > On Mon, 2023-04-24 at 12:18 +0200, Jan Beulich wrote:
> > > On 21.04.2023 18:01, Oleksii wrote:
> > > > On Thu, 2023-04-20 at 16:36 +0200, Jan Beulich wrote:
> > > > > On 19.04.2023 17:42, Oleksii Kurochko wrote:
> > > > > > +    csr_write(CSR_SATP,
> > > > > > +              ((unsigned long)stage1_pgtbl_root >>
> > > > > > PAGE_SHIFT)
> > > > > > > 
> > > > > > +              satp_mode << SATP_MODE_SHIFT);
> > > > > > +
> > > > > > +    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) ==
> > > > > > satp_mode
> > > > > > )
> > > > > > +        is_mode_supported = true;
> > > > > > +
> > > > > > +    /* Clean MMU root page table and disable MMU */
> > > > > > +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
> > > > > > +
> > > > > > +    csr_write(CSR_SATP, 0);
> > > > > > +    asm volatile("sfence.vma");
> > > > > 
> > > > > I guess what you do in this function could do with some more
> > > > > comments.
> > > > > Looks like you're briefly enabling the MMU to check that what
> > > > > you
> > > > > wrote
> > > > > to SATP you can also read back. (Isn't there a register
> > > > > reporting
> > > > > whether the feature is available?)
> > > > I supposed that it has to be but I couldn't find a register in
> > > > docs.
> > > 
> > > Well, yes, interestingly the register is marked WARL, so
> > > apparently
> > > intended to by used for probing like you do. (I find the
> > > definition
> > > of
> > > WARL a little odd though, as such writes supposedly aren't
> > > necessarily
> > > value preserving. For SATP this might mean that translation is
> > > enabled
> > > by a write of an unsupported mode, with a different number of
> > > levels.
> > > This isn't going to work very well, I'm afraid.)
> > Agree. It will be an issue in case of a different number of levels.
> > 
> > Then it looks there is no way to check if SATP mode is supported.
> > 
> > So we have to rely on the fact that the developer specified
> > RV_STAGE1_MODE correctly in the config file.
> 
> Well, maybe the spec could be clarified in this regard. That WARL
> behavior may be okay for some registers, but as said I think it isn't
> enough of a guarantee for SATP probing. Alistair, Bob - any thoughts?
I've re-read the manual regarding CSR_SATP, and the code detecting the
SATP mode will work fine.
From the manual ( 4.1.11
Supervisor Address Translation and Protection (satp) Register ):
“Implementations are not required to support all MODE settings, and if
satp is written with an unsupported MODE, the entire write has no
effect; no fields in satp are modified.”

So there is no open question left; I'll provide a new patch series.
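
So the probe boils down to the sequence check_pgtbl_mode_support()
already uses (shown here stripped down for clarity):

    /*
     * Try to switch SATP to the wanted mode. Per the quoted wording an
     * unsupported MODE turns the whole write into a no-op, so reading
     * SATP back tells us whether the mode is implemented.
     */
    csr_write(CSR_SATP,
              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |
              satp_mode << SATP_MODE_SHIFT);

    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode )
        is_mode_supported = true;

    /* Disable translation again and flush. */
    csr_write(CSR_SATP, 0);
    asm volatile("sfence.vma");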

~ Oleksii
Jan Beulich May 2, 2023, 6:15 a.m. UTC | #7
On 28.04.2023 22:05, Oleksii wrote:
> Hi Jan,
> 
> On Mon, 2023-04-24 at 17:35 +0200, Jan Beulich wrote:
>> On 24.04.2023 17:16, Oleksii wrote:
>>> On Mon, 2023-04-24 at 12:18 +0200, Jan Beulich wrote:
>>>> On 21.04.2023 18:01, Oleksii wrote:
>>>>> On Thu, 2023-04-20 at 16:36 +0200, Jan Beulich wrote:
>>>>>> On 19.04.2023 17:42, Oleksii Kurochko wrote:
>>>>>>> +    csr_write(CSR_SATP,
>>>>>>> +              ((unsigned long)stage1_pgtbl_root >>
>>>>>>> PAGE_SHIFT)
>>>>>>>>
>>>>>>> +              satp_mode << SATP_MODE_SHIFT);
>>>>>>> +
>>>>>>> +    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) ==
>>>>>>> satp_mode
>>>>>>> )
>>>>>>> +        is_mode_supported = true;
>>>>>>> +
>>>>>>> +    /* Clean MMU root page table and disable MMU */
>>>>>>> +    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
>>>>>>> +
>>>>>>> +    csr_write(CSR_SATP, 0);
>>>>>>> +    asm volatile("sfence.vma");
>>>>>>
>>>>>> I guess what you do in this function could do with some more
>>>>>> comments.
>>>>>> Looks like you're briefly enabling the MMU to check that what
>>>>>> you
>>>>>> wrote
>>>>>> to SATP you can also read back. (Isn't there a register
>>>>>> reporting
>>>>>> whether the feature is available?)
>>>>> I supposed that it has to be but I couldn't find a register in
>>>>> docs.
>>>>
>>>> Well, yes, interestingly the register is marked WARL, so
>>>> apparently
>>>> intended to by used for probing like you do. (I find the
>>>> definition
>>>> of
>>>> WARL a little odd though, as such writes supposedly aren't
>>>> necessarily
>>>> value preserving. For SATP this might mean that translation is
>>>> enabled
>>>> by a write of an unsupported mode, with a different number of
>>>> levels.
>>>> This isn't going to work very well, I'm afraid.)
>>> Agree. It will be an issue in case of a different number of levels.
>>>
>>> Then it looks there is no way to check if SATP mode is supported.
>>>
>>> So we have to rely on the fact that the developer specified
>>> RV_STAGE1_MODE correctly in the config file.
>>
>> Well, maybe the spec could be clarified in this regard. That WARL
>> behavior may be okay for some registers, but as said I think it isn't
>> enough of a guarantee for SATP probing. Alistair, Bob - any thoughts?
> I've re-read the manual regarding CSR_SATP and the code of detecting
> SATP mode will work fine.
> From the manual ( 4.1.11
> Supervisor Address Translation and Protection (satp) Register ):
> “Implementations are not required to support all MODE settings, and if
> satp is written with an unsupported MODE, the entire write has no
> effect; no fields in satp are modified.”

Ah, I see. That's the sentence I had overlooked (and that's unhelpfully
not only not implied to be that way, but actively implied to be different
by figures 4.11 and 4.12 naming [all] the individual fields as WARL).

Jan
diff mbox series

Patch

diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index 443f6bf15f..956ceb02df 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -1,5 +1,6 @@ 
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 obj-y += entry.o
+obj-y += mm.o
 obj-$(CONFIG_RISCV_64) += riscv64/
 obj-y += sbi.o
 obj-y += setup.o
diff --git a/xen/arch/riscv/include/asm/config.h b/xen/arch/riscv/include/asm/config.h
index 0c860e88ce..4cd4f7a701 100644
--- a/xen/arch/riscv/include/asm/config.h
+++ b/xen/arch/riscv/include/asm/config.h
@@ -70,12 +70,22 @@ 
   name:
 #endif
 
-#define XEN_VIRT_START  _AT(UL, 0x80200000)
+#ifdef CONFIG_RISCV_64
+#define XEN_VIRT_START 0xFFFFFFFFC0000000 /* (_AC(-1, UL) + 1 - GB(1)) */
+#else
+#error "RV32 isn't supported"
+#endif
 
 #define SMP_CACHE_BYTES (1 << 6)
 
 #define STACK_SIZE PAGE_SIZE
 
+#ifdef CONFIG_RISCV_64
+#define RV_STAGE1_MODE SATP_MODE_SV39
+#else
+#define RV_STAGE1_MODE SATP_MODE_SV32
+#endif
+
 #endif /* __RISCV_CONFIG_H__ */
 /*
  * Local variables:
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
new file mode 100644
index 0000000000..e16ce66fae
--- /dev/null
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -0,0 +1,9 @@ 
+#ifndef _ASM_RISCV_MM_H
+#define _ASM_RISCV_MM_H
+
+void setup_initial_pagetables(void);
+
+void enable_mmu(void);
+void cont_after_mmu_is_enabled(void);
+
+#endif /* _ASM_RISCV_MM_H */
diff --git a/xen/arch/riscv/include/asm/page-bits.h b/xen/arch/riscv/include/asm/page-bits.h
index 1801820294..0879a527f2 100644
--- a/xen/arch/riscv/include/asm/page-bits.h
+++ b/xen/arch/riscv/include/asm/page-bits.h
@@ -1,6 +1,16 @@ 
 #ifndef __RISCV_PAGE_BITS_H__
 #define __RISCV_PAGE_BITS_H__
 
+#ifdef CONFIG_RISCV_64
+#define PAGETABLE_ORDER         (9)
+#else /* CONFIG_RISCV_32 */
+#define PAGETABLE_ORDER         (10)
+#endif
+
+#define PAGETABLE_ENTRIES       (1 << PAGETABLE_ORDER)
+
+#define PTE_PPN_SHIFT           10
+
 #define PAGE_SHIFT              12 /* 4 KiB Pages */
 #define PADDR_BITS              56 /* 44-bit PPN */
 
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
new file mode 100644
index 0000000000..daa880558e
--- /dev/null
+++ b/xen/arch/riscv/include/asm/page.h
@@ -0,0 +1,63 @@ 
+#ifndef _ASM_RISCV_PAGE_H
+#define _ASM_RISCV_PAGE_H
+
+#include <xen/const.h>
+#include <xen/types.h>
+
+#define VPN_MASK                    ((unsigned long)(PAGETABLE_ENTRIES - 1))
+
+#define XEN_PT_LEVEL_ORDER(lvl)     ((lvl) * PAGETABLE_ORDER)
+#define XEN_PT_LEVEL_SHIFT(lvl)     (XEN_PT_LEVEL_ORDER(lvl) + PAGE_SHIFT)
+#define XEN_PT_LEVEL_SIZE(lvl)      (_AT(paddr_t, 1) << XEN_PT_LEVEL_SHIFT(lvl))
+#define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
+#define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))
+
+#define PTE_VALID                   BIT(0, UL)
+#define PTE_READABLE                BIT(1, UL)
+#define PTE_WRITABLE                BIT(2, UL)
+#define PTE_EXECUTABLE              BIT(3, UL)
+#define PTE_USER                    BIT(4, UL)
+#define PTE_GLOBAL                  BIT(5, UL)
+#define PTE_ACCESSED                BIT(6, UL)
+#define PTE_DIRTY                   BIT(7, UL)
+#define PTE_RSW                     (BIT(8, UL) | BIT(9, UL))
+
+#define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
+#define PTE_TABLE                   (PTE_VALID)
+
+/* Calculate the offsets into the pagetables for a given VA */
+#define pt_linear_offset(lvl, va)   ((va) >> XEN_PT_LEVEL_SHIFT(lvl))
+
+#define pt_index(lvl, va) pt_linear_offset(lvl, (va) & XEN_PT_LEVEL_MASK(lvl))
+
+/* Page Table entry */
+typedef struct {
+#ifdef CONFIG_RISCV_64
+    uint64_t pte;
+#else
+    uint32_t pte;
+#endif
+} pte_t;
+
+#define pte_to_addr(x) (((x) >> PTE_PPN_SHIFT) << PAGE_SHIFT)
+
+#define addr_to_pte(x) (((x) >> PAGE_SHIFT) << PTE_PPN_SHIFT)
+
+static inline pte_t paddr_to_pte(const paddr_t paddr,
+                                 const unsigned long permissions)
+{
+    unsigned long tmp = addr_to_pte(paddr);
+    return (pte_t) { .pte = tmp | permissions };
+}
+
+static inline paddr_t pte_to_paddr(const pte_t pte)
+{
+    return pte_to_addr(pte.pte);
+}
+
+static inline bool pte_is_valid(const pte_t p)
+{
+    return p.pte & PTE_VALID;
+}
+
+#endif /* _ASM_RISCV_PAGE_H */
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
new file mode 100644
index 0000000000..43b7181c33
--- /dev/null
+++ b/xen/arch/riscv/mm.c
@@ -0,0 +1,319 @@ 
+#include <xen/compiler.h>
+#include <xen/init.h>
+#include <xen/kernel.h>
+
+#include <asm/early_printk.h>
+#include <asm/config.h>
+#include <asm/csr.h>
+#include <asm/mm.h>
+#include <asm/page.h>
+#include <asm/processor.h>
+
+struct mmu_desc {
+    unsigned long num_levels;
+    unsigned int pgtbl_count;
+    pte_t *next_pgtbl;
+    pte_t *pgtbl_base;
+};
+
+extern unsigned char cpu0_boot_stack[STACK_SIZE];
+
+#define PHYS_OFFSET ((unsigned long)_start - XEN_VIRT_START)
+#define LOAD_TO_LINK(addr) ((addr) - PHYS_OFFSET)
+#define LINK_TO_LOAD(addr) ((addr) + PHYS_OFFSET)
+
+
+/*
+ * It is expected that Xen won't be more than 2 MB.
+ * The check in xen.lds.S guarantees that.
+ * At least 4 page tables (in case Sv48 or Sv57 is used)
+ * are needed to cover 2 MB: one page table per page table
+ * level, with PAGE_SIZE = 4 KB.
+ *
+ * One L0 page table can cover 2 MB
+ * (512 entries of one page table * PAGE_SIZE).
+ *
+ * One more page table might be needed in case the
+ * Xen load address isn't 2 MB aligned.
+ *
+ */
+#define PGTBL_INITIAL_COUNT     (5)
+
+#define PGTBL_ENTRY_AMOUNT  (PAGE_SIZE / sizeof(pte_t))
+
+pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
+stage1_pgtbl_root[PGTBL_ENTRY_AMOUNT];
+
+pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
+stage1_pgtbl_nonroot[PGTBL_INITIAL_COUNT * PGTBL_ENTRY_AMOUNT];
+
+#define HANDLE_PGTBL(curr_lvl_num)                                          \
+    index = pt_index(curr_lvl_num, page_addr);                              \
+    if ( pte_is_valid(pgtbl[index]) )                                       \
+    {                                                                       \
+        /* Find L{ 0-3 } table */                                           \
+        pgtbl = (pte_t *)pte_to_paddr(pgtbl[index]);                        \
+    }                                                                       \
+    else                                                                    \
+    {                                                                       \
+        /* Allocate new L{0-3} page table */                                \
+        if ( mmu_desc->pgtbl_count == PGTBL_INITIAL_COUNT )                 \
+        {                                                                   \
+            early_printk("(XEN) No initial table available\n");             \
+            /* panic(), BUG() or ASSERT() aren't ready now. */              \
+            die();                                                          \
+        }                                                                   \
+        mmu_desc->pgtbl_count++;                                            \
+        pgtbl[index] = paddr_to_pte((unsigned long)mmu_desc->next_pgtbl,    \
+                                    PTE_VALID);                             \
+        pgtbl = mmu_desc->next_pgtbl;                                       \
+        mmu_desc->next_pgtbl += PGTBL_ENTRY_AMOUNT;                         \
+    }
+
+static void __init setup_initial_mapping(struct mmu_desc *mmu_desc,
+                                         unsigned long map_start,
+                                         unsigned long map_end,
+                                         unsigned long pa_start,
+                                         unsigned long permissions)
+{
+    unsigned int index;
+    pte_t *pgtbl;
+    unsigned long page_addr;
+    pte_t pte_to_be_written;
+    unsigned long paddr;
+    unsigned long tmp_permissions;
+
+    if ( (unsigned long)_start % XEN_PT_LEVEL_SIZE(0) )
+    {
+        early_printk("(XEN) Xen should be loaded at 4k boundary\n");
+        die();
+    }
+
+    if ( map_start & ~XEN_PT_LEVEL_MAP_MASK(0) ||
+         pa_start & ~XEN_PT_LEVEL_MAP_MASK(0) )
+    {
+        early_printk("(XEN) map and pa start addresses should be aligned\n");
+        /* panic(), BUG() or ASSERT() aren't ready now. */
+        die();
+    }
+
+    page_addr = map_start;
+    while ( page_addr < map_end )
+    {
+        pgtbl = mmu_desc->pgtbl_base;
+
+        switch (mmu_desc->num_levels)
+        {
+            case 4: /* Level 3 */
+                HANDLE_PGTBL(3);
+            case 3: /* Level 2 */
+                HANDLE_PGTBL(2);
+            case 2: /* Level 1 */
+                HANDLE_PGTBL(1);
+            case 1: /* Level 0 */
+                index = pt_index(0, page_addr);
+                paddr = (page_addr - map_start) + pa_start;
+
+                tmp_permissions = permissions;
+
+                if ( is_kernel_text(LINK_TO_LOAD(page_addr)) ||
+                     is_kernel_inittext(LINK_TO_LOAD(page_addr)) )
+                    tmp_permissions =
+                        PTE_EXECUTABLE | PTE_READABLE | PTE_VALID;
+
+                if ( is_kernel_rodata(LINK_TO_LOAD(page_addr)) )
+                    tmp_permissions = PTE_READABLE | PTE_VALID;
+
+                pte_to_be_written = paddr_to_pte(paddr, tmp_permissions);
+
+                if ( !pte_is_valid(pgtbl[index]) )
+                    pgtbl[index] = pte_to_be_written;
+                else
+                {
+                    /*
+                     * Get the addresses of the current PTE and of the one
+                     * to be written, without the permission flags.
+                     */
+                    unsigned long curr_pte =
+                        pgtbl[index].pte & ~((1 << PTE_PPN_SHIFT) - 1);
+
+                    pte_to_be_written.pte &= ~((1 << PTE_PPN_SHIFT) - 1);
+
+                    if ( curr_pte != pte_to_be_written.pte )
+                    {
+                        early_printk("The PTE to be written isn't the same "
+                                     "as the one being overwritten\n");
+                        /* panic(), <asm/bug.h> aren't ready now. */
+                        die();
+                    }
+                }
+        }
+
+        /* Point to next page */
+        page_addr += XEN_PT_LEVEL_SIZE(0);
+    }
+}
+
+static void __init calc_pgtbl_lvls_num(struct mmu_desc *mmu_desc)
+{
+    unsigned long satp_mode = RV_STAGE1_MODE;
+
+    /* Number of page table levels */
+    switch (satp_mode)
+    {
+        case SATP_MODE_SV32:
+            mmu_desc->num_levels = 2;
+            break;
+        case SATP_MODE_SV39:
+            mmu_desc->num_levels = 3;
+            break;
+        case SATP_MODE_SV48:
+            mmu_desc->num_levels = 4;
+            break;
+        default:
+            early_printk("(XEN) Unsupported SATP_MODE\n");
+            die();
+    }
+}
+
+static bool __init check_pgtbl_mode_support(struct mmu_desc *mmu_desc,
+                                            unsigned long load_start,
+                                            unsigned long satp_mode)
+{
+    bool is_mode_supported = false;
+    unsigned int index;
+    unsigned int page_table_level = (mmu_desc->num_levels - 1);
+    unsigned level_map_mask = XEN_PT_LEVEL_MAP_MASK(page_table_level);
+
+    unsigned long aligned_load_start = load_start & level_map_mask;
+    unsigned long aligned_page_size = XEN_PT_LEVEL_SIZE(page_table_level);
+    unsigned long xen_size = (unsigned long)(_end - start);
+
+    if ( (load_start + xen_size) > (aligned_load_start + aligned_page_size) )
+    {
+        early_printk("please place Xen to be in range of PAGE_SIZE "
+                     "where PAGE_SIZE is XEN_PT_LEVEL_SIZE( {L3 | L2 | L1} ) "
+                     "depending on expected SATP_MODE \n"
+                     "XEN_PT_LEVEL_SIZE is defined in <asm/page.h>\n");
+        die();
+    }
+
+    index = pt_index(page_table_level, aligned_load_start);
+    stage1_pgtbl_root[index] = paddr_to_pte(aligned_load_start,
+                                            PTE_LEAF_DEFAULT | PTE_EXECUTABLE);
+
+    asm volatile("sfence.vma");
+    csr_write(CSR_SATP,
+              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |
+              satp_mode << SATP_MODE_SHIFT);
+
+    if ( (csr_read(CSR_SATP) >> SATP_MODE_SHIFT) == satp_mode )
+        is_mode_supported = true;
+
+    /* Clean MMU root page table and disable MMU */
+    stage1_pgtbl_root[index] = paddr_to_pte(0x0, 0x0);
+
+    csr_write(CSR_SATP, 0);
+    asm volatile("sfence.vma");
+
+    return is_mode_supported;
+}
+
+/*
+ * setup_initial_pagetables:
+ *
+ * Build the initial page tables for Xen:
+ *  1. Calculate the number of page table levels.
+ *  2. Init the mmu description structure.
+ *  3. Check that the linker address range doesn't overlap
+ *     with the load address range.
+ *  4. Map the linker address range to the load address range
+ *     ( the mapping is 1:1 only in case the linker address
+ *     is equal to the load address ) with
+ *     RW permissions by default.
+ *  5. Setup proper PTE permissions for each section.
+ */
+void __init setup_initial_pagetables(void)
+{
+    struct mmu_desc mmu_desc = { 0, 0, NULL, 0 };
+
+    /*
+     * Access to _{start, end} is always PC-relative, so when
+     * accessing them we get the load addresses of the start and
+     * end of Xen.
+     * To get the linker addresses, LOAD_TO_LINK() has to be
+     * used.
+     */
+    unsigned long load_start    = (unsigned long)_start;
+    unsigned long load_end      = (unsigned long)_end;
+    unsigned long linker_start  = LOAD_TO_LINK(load_start);
+    unsigned long linker_end    = LOAD_TO_LINK(load_end);
+
+    if ( (linker_start != load_start) &&
+         (linker_start <= load_end) && (load_start <= linker_end) ) {
+        early_printk("(XEN) linker and load address ranges overlap\n");
+        die();
+    }
+
+    calc_pgtbl_lvls_num(&mmu_desc);
+
+    if ( !check_pgtbl_mode_support(&mmu_desc, load_start, RV_STAGE1_MODE) )
+    {
+        early_printk("requested MMU mode isn't supported by CPU\n"
+                     "Please choose different in <asm/config.h>\n");
+        die();
+    }
+
+    mmu_desc.pgtbl_base = stage1_pgtbl_root;
+    mmu_desc.next_pgtbl = stage1_pgtbl_nonroot;
+
+    setup_initial_mapping(&mmu_desc,
+                          linker_start,
+                          linker_end,
+                          load_start,
+                          PTE_LEAF_DEFAULT);
+}
+
+void __init noinline enable_mmu(void)
+{
+    /*
+     * Calculate the linker-time address of the mmu_is_enabled
+     * label and update CSR_STVEC with it.
+     * The MMU is configured so that linker addresses are mapped
+     * to load addresses, hence in case linker addresses are not
+     * equal to load addresses, enabling the MMU will cause
+     * an exception and a jump to the linker-time addresses.
+     * Otherwise, if load addresses are equal to linker addresses,
+     * the code after the mmu_is_enabled label runs without an exception.
+     */
+    csr_write(CSR_STVEC, LOAD_TO_LINK((unsigned long)&&mmu_is_enabled));
+
+    /* Ensure page table writes precede loading the SATP */
+    asm volatile("sfence.vma");
+
+    /* Enable the MMU and load the new pagetable for Xen */
+    csr_write(CSR_SATP,
+              ((unsigned long)stage1_pgtbl_root >> PAGE_SHIFT) |
+              RV_STAGE1_MODE << SATP_MODE_SHIFT);
+
+    asm volatile(".align 2");
+      mmu_is_enabled:
+    /*
+     * The stack should be re-initialized because:
+     * 1. Right now the stack pointer holds a load-time address, which
+     *    will cause an issue in case the load start address isn't
+     *    equal to the linker start address.
+     * 2. Addresses stored on the stack are all load-time relative,
+     *    which can be an issue in case the load start address isn't
+     *    equal to the linker start address.
+     */
+    asm volatile ("mv sp, %0" : : "r"((unsigned long)cpu0_boot_stack + STACK_SIZE));
+
+    /*
+     * We can't return to the caller because the stack was reset
+     * and it may have stashed some variables on the stack.
+     * Jump to a brand new function as the stack was reset.
+     */
+    cont_after_mmu_is_enabled();
+}
+
diff --git a/xen/arch/riscv/riscv64/head.S b/xen/arch/riscv/riscv64/head.S
index 8887f0cbd4..b3309d902c 100644
--- a/xen/arch/riscv/riscv64/head.S
+++ b/xen/arch/riscv/riscv64/head.S
@@ -1,4 +1,5 @@ 
 #include <asm/asm.h>
+#include <asm/asm-offsets.h>
 #include <asm/riscv_encoding.h>
 
         .section .text.header, "ax", %progbits
@@ -32,3 +33,4 @@  ENTRY(start)
         add     sp, sp, t0
 
         tail    start_xen
+
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index 3786f337e0..315804aa87 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -2,6 +2,7 @@ 
 #include <xen/init.h>
 
 #include <asm/early_printk.h>
+#include <asm/mm.h>
 
 /* Xen stack for bringing up the first CPU. */
 unsigned char __initdata cpu0_boot_stack[STACK_SIZE]
@@ -26,3 +27,13 @@  void __init noreturn start_xen(unsigned long bootcpu_id,
 
     unreachable();
 }
+
+void __init noreturn cont_after_mmu_is_enabled(void)
+{
+    early_printk("All set up\n");
+
+    for ( ;; )
+        asm volatile ("wfi");
+
+    unreachable();
+}
diff --git a/xen/arch/riscv/xen.lds.S b/xen/arch/riscv/xen.lds.S
index 31e0d3576c..2f7f76bee6 100644
--- a/xen/arch/riscv/xen.lds.S
+++ b/xen/arch/riscv/xen.lds.S
@@ -136,6 +136,7 @@  SECTIONS
     . = ALIGN(POINTER_ALIGN);
     __init_end = .;
 
+    . = ALIGN(PAGE_SIZE);
     .bss : {                     /* BSS */
         __bss_start = .;
         *(.bss.stack_aligned)
@@ -172,3 +173,6 @@  ASSERT(IS_ALIGNED(__bss_end,        POINTER_ALIGN), "__bss_end is misaligned")
 
 ASSERT(!SIZEOF(.got),      ".got non-empty")
 ASSERT(!SIZEOF(.got.plt),  ".got.plt non-empty")
+
+ASSERT(_end - _start <= MB(2), "Xen too large for early-boot assumptions")
+
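
(Not part of this patch, but for orientation: a hypothetical sketch of
how the new entry points are expected to be chained from the boot path;
the actual call sites are introduced elsewhere in the series.)

    /* Hypothetical boot-path glue, for illustration only. */
    static void __init noreturn bring_up_mmu(void)
    {
        setup_initial_pagetables();  /* map linker range onto load range */
        enable_mmu();                /* program SATP, switch stacks, and
                                      * continue in
                                      * cont_after_mmu_is_enabled() */
        unreachable();               /* enable_mmu() never returns here */
    }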