
[v1,1/3] xen/riscv: introduce setup_initial_pages

Message ID a145fbbfb166d9f6bd4859b669d23a1f52004b2b.1677250203.git.oleksii.kurochko@gmail.com (mailing list archive)
State Superseded
Series enable MMU for RISC-V

Commit Message

Oleksii Feb. 24, 2023, 3:06 p.m. UTC
Most of the code for setup_initial_pages was taken from Bobby's
repo, except for the following changes:
* Use only the minimal part of the code needed to enable the MMU
* Rename the {_}setup_initial_pagetables functions
* Add a 'writable' argument to _setup_initial_pagetables so that
  some sections can be made read-only
* Update the setup_initial_pagetables function to make some sections
  read-only
* Change the order of the _setup_initial_pagetables() calls
  in setup_initial_pagetables():
  * first it is called for the text, init and rodata sections
  * then it is called for the ranges [link_addr_start : link_addr_end]
    and [load_addr_start : load_addr_end]
  Previously the ranges were mapped first and the sections after, but
  in that case the mappings are created writable and, since the
  sections' addresses lie inside the ranges, their read-only status is
  never applied because the entries were already set up.

Origin: https://gitlab.com/xen-on-risc-v/xen/-/tree/riscv-rebase 4af165b468af
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
 xen/arch/riscv/Makefile           |   1 +
 xen/arch/riscv/include/asm/mm.h   |   9 ++
 xen/arch/riscv/include/asm/page.h |  90 ++++++++++++
 xen/arch/riscv/mm.c               | 223 ++++++++++++++++++++++++++++++
 4 files changed, 323 insertions(+)
 create mode 100644 xen/arch/riscv/include/asm/mm.h
 create mode 100644 xen/arch/riscv/include/asm/page.h
 create mode 100644 xen/arch/riscv/mm.c

Comments

Andrew Cooper Feb. 24, 2023, 3:23 p.m. UTC | #1
On 24/02/2023 3:06 pm, Oleksii Kurochko wrote:
> diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
> new file mode 100644
> index 0000000000..fabbe1305f
> --- /dev/null
> +++ b/xen/arch/riscv/include/asm/page.h
> @@ -0,0 +1,90 @@
> +#ifndef _ASM_RISCV_PAGE_H
> +#define _ASM_RISCV_PAGE_H
> +
> +#include <xen/const.h>
> +#include <xen/types.h>
> +
> +#define PAGE_ENTRIES            512
> +#define VPN_BITS                (9)
> +#define VPN_MASK                ((unsigned long)((1 << VPN_BITS) - 1))
> +
> +#ifdef CONFIG_RISCV_64
> +/* L3 index Bit[47:39] */
> +#define THIRD_SHIFT             (39)
> +#define THIRD_MASK              (VPN_MASK << THIRD_SHIFT)
> +/* L2 index Bit[38:30] */
> +#define SECOND_SHIFT            (30)
> +#define SECOND_MASK             (VPN_MASK << SECOND_SHIFT)
> +/* L1 index Bit[29:21] */
> +#define FIRST_SHIFT             (21)
> +#define FIRST_MASK              (VPN_MASK << FIRST_SHIFT)
> +/* L0 index Bit[20:12] */
> +#define ZEROETH_SHIFT           (12)
> +#define ZEROETH_MASK            (VPN_MASK << ZEROETH_SHIFT)

Don't name these with words.  That's an error ultimately inherited from
an architectural mistake on ARM.

These should be named L1 (4k) thru L4 (512T), and don't need separate
masks or shifts because it looks like RISC-V designed their
pagetables in a coherent and uniform way.

You'll find the code simplifies substantially if you have
PAGETABLE_ORDER 9 somewhere in here.

The shift is always (PAGE_SHIFT + level * PAGETABLE_ORDER), and it's
rare that you need something other than "(addr >> shift) & mask".  About
the only time you need a virtual address masked but unshifted is for
debugging.
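
Something along these lines, as an untested sketch (names purely
illustrative, assuming PAGE_SHIFT is 12 and level 0 is the 4k level):

    #define PAGETABLE_ORDER     9
    #define PAGETABLE_ENTRIES   (1UL << PAGETABLE_ORDER)

    /* Shift of the index field for a given level (level 0 == 4k pages). */
    #define pt_level_shift(lvl) (PAGE_SHIFT + (lvl) * PAGETABLE_ORDER)

    /* Index into the level 'lvl' table for virtual address 'va'. */
    #define pt_index(lvl, va)   \
        (((va) >> pt_level_shift(lvl)) & (PAGETABLE_ENTRIES - 1))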

~Andrew
Julien Grall Feb. 25, 2023, 5:53 p.m. UTC | #2
Hi Oleksii,

On 24/02/2023 15:06, Oleksii Kurochko wrote:
> Mostly the code for setup_initial_pages was taken from Bobby's
> repo except for the following changes:
> * Use only a minimal part of the code enough to enable MMU
> * rename {_}setup_initial_pagetables functions
> * add writable argument for _setup_initial_pagetables to have
>    an opportunity to make some sections read-only
> * update setup_initial_pagetables function to make some sections
>    read-only
> * change the order of _setup_inital_pagetables()
>    in setup_initial_pagetable():
>    * first it is called for text, init, rodata sections
>    * after call it for ranges [link_addr_start : link_addr_end] and
>      [load_addr_start : load_addr_end]
>    Before it was done first for the ranges and after for sections but
>    in that case read-only status will be equal to 'true' and
>    as sections' addresses  can/are inside the ranges the read-only status
>    won't be updated for them as it was set up before.
> 
> Origin: https://gitlab.com/xen-on-risc-v/xen/-/tree/riscv-rebase 4af165b468af
> Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> ---
>   xen/arch/riscv/Makefile           |   1 +
>   xen/arch/riscv/include/asm/mm.h   |   9 ++
>   xen/arch/riscv/include/asm/page.h |  90 ++++++++++++
>   xen/arch/riscv/mm.c               | 223 ++++++++++++++++++++++++++++++
>   4 files changed, 323 insertions(+)
>   create mode 100644 xen/arch/riscv/include/asm/mm.h
>   create mode 100644 xen/arch/riscv/include/asm/page.h
>   create mode 100644 xen/arch/riscv/mm.c
> 
> diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
> index 443f6bf15f..956ceb02df 100644
> --- a/xen/arch/riscv/Makefile
> +++ b/xen/arch/riscv/Makefile
> @@ -1,5 +1,6 @@
>   obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
>   obj-y += entry.o
> +obj-y += mm.o
>   obj-$(CONFIG_RISCV_64) += riscv64/
>   obj-y += sbi.o
>   obj-y += setup.o
> diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
> new file mode 100644
> index 0000000000..fc1866b1d8
> --- /dev/null
> +++ b/xen/arch/riscv/include/asm/mm.h
> @@ -0,0 +1,9 @@
> +#ifndef _ASM_RISCV_MM_H
> +#define _ASM_RISCV_MM_H
> +
> +void setup_initial_pagetables(unsigned long load_addr_start,
> +                              unsigned long load_addr_end,
> +                              unsigned long linker_addr_start,
> +                              unsigned long linker_addr_end);
> +
> +#endif /* _ASM_RISCV_MM_H */
> diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
> new file mode 100644
> index 0000000000..fabbe1305f
> --- /dev/null
> +++ b/xen/arch/riscv/include/asm/page.h
> @@ -0,0 +1,90 @@
> +#ifndef _ASM_RISCV_PAGE_H
> +#define _ASM_RISCV_PAGE_H
> +
> +#include <xen/const.h>
> +#include <xen/types.h>
> +
> +#define PAGE_ENTRIES            512

NIT: AFAIU, the number here is based on ...

> +#define VPN_BITS                (9)

... this. So I would suggest to define PAGE_ENTRIES using VPN_BITS.

> +#define VPN_MASK                ((unsigned long)((1 << VPN_BITS) - 1))
NIT: Use 1UL and you can avoid the cast.
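
I.e. something like this, covering this nit and the PAGE_ENTRIES one
above (sketch):

    #define VPN_BITS        9
    #define VPN_MASK        ((1UL << VPN_BITS) - 1)
    #define PAGE_ENTRIES    (1UL << VPN_BITS)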

> +
> +#ifdef CONFIG_RISCV_64
> +/* L3 index Bit[47:39] */
> +#define THIRD_SHIFT             (39)
> +#define THIRD_MASK              (VPN_MASK << THIRD_SHIFT)
> +/* L2 index Bit[38:30] */
> +#define SECOND_SHIFT            (30)
> +#define SECOND_MASK             (VPN_MASK << SECOND_SHIFT)
> +/* L1 index Bit[29:21] */
> +#define FIRST_SHIFT             (21)
> +#define FIRST_MASK              (VPN_MASK << FIRST_SHIFT)
> +/* L0 index Bit[20:12] */
> +#define ZEROETH_SHIFT           (12)
> +#define ZEROETH_MASK            (VPN_MASK << ZEROETH_SHIFT)

On Arm, we are trying to phase out ZEROETH_* and co because the name is 
too generic. Instead, we now introduce a generic macro that takes a level 
and then computes the mask/shift (see XEN_PT_LEVEL_*).

You should be able to do the same in RISC-V and reduce the number of 
defines introduced.
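
For reference, the Arm-style macros could look roughly like this on
RISC-V (sketch only, assuming PAGETABLE_ORDER is 9 and level 0 is the
4k level; XEN_PT_LEVEL_* are the Arm names):

    #define XEN_PT_LEVEL_SHIFT(lvl)     (PAGE_SHIFT + (lvl) * PAGETABLE_ORDER)
    #define XEN_PT_LEVEL_SIZE(lvl)      (1UL << XEN_PT_LEVEL_SHIFT(lvl))
    #define XEN_PT_LEVEL_MAP_MASK(lvl)  (~(XEN_PT_LEVEL_SIZE(lvl) - 1))
    #define XEN_PT_LEVEL_MASK(lvl)      (VPN_MASK << XEN_PT_LEVEL_SHIFT(lvl))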

> +
> +#else // CONFIG_RISCV_32

Coding style: comments in Xen are using /* ... */

> +
> +/* L1 index Bit[31:22] */
> +#define FIRST_SHIFT             (22)
> +#define FIRST_MASK              (VPN_MASK << FIRST_SHIFT)
> +
> +/* L0 index Bit[21:12] */
> +#define ZEROETH_SHIFT           (12)
> +#define ZEROETH_MASK            (VPN_MASK << ZEROETH_SHIFT)
> +#endif
> +
> +#define THIRD_SIZE              (1 << THIRD_SHIFT)
> +#define THIRD_MAP_MASK          (~(THIRD_SIZE - 1))
> +#define SECOND_SIZE             (1 << SECOND_SHIFT)
> +#define SECOND_MAP_MASK         (~(SECOND_SIZE - 1))
> +#define FIRST_SIZE              (1 << FIRST_SHIFT)
> +#define FIRST_MAP_MASK          (~(FIRST_SIZE - 1))
> +#define ZEROETH_SIZE            (1 << ZEROETH_SHIFT)
> +#define ZEROETH_MAP_MASK        (~(ZEROETH_SIZE - 1))
> +
> +#define PTE_SHIFT               10
> +
> +#define PTE_VALID               BIT(0, UL)
> +#define PTE_READABLE            BIT(1, UL)
> +#define PTE_WRITABLE            BIT(2, UL)
> +#define PTE_EXECUTABLE          BIT(3, UL)
> +#define PTE_USER                BIT(4, UL)
> +#define PTE_GLOBAL              BIT(5, UL)
> +#define PTE_ACCESSED            BIT(6, UL)
> +#define PTE_DIRTY               BIT(7, UL)
> +#define PTE_RSW                 (BIT(8, UL) | BIT(9, UL))
> +
> +#define PTE_LEAF_DEFAULT        (PTE_VALID | PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)

We should avoid vulnerable default flags. So this should either be RW or RX.
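
E.g. (purely illustrative names):

    /* Never both writable and executable by default. */
    #define PTE_LEAF_DEFAULT     (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
    #define PTE_LEAF_DEFAULT_X   (PTE_VALID | PTE_READABLE | PTE_EXECUTABLE)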

> +#define PTE_TABLE               (PTE_VALID)
> +
> +/* Calculate the offsets into the pagetables for a given VA */
> +#define zeroeth_linear_offset(va)   ((va) >> ZEROETH_SHIFT)
> +#define first_linear_offset(va)     ((va) >> FIRST_SHIFT)
> +#define second_linear_offset(va)    ((va) >> SECOND_SHIFT)
> +#define third_linear_offset(va)     ((va) >> THIRD_SHIFT)
> +
> +#define pagetable_zeroeth_index(va) zeroeth_linear_offset((va) & ZEROETH_MASK)
> +#define pagetable_first_index(va)   first_linear_offset((va) & FIRST_MASK)
> +#define pagetable_second_index(va)  second_linear_offset((va) & SECOND_MASK)
> +#define pagetable_third_index(va)   third_linear_offset((va) & THIRD_MASK)
> +
> +/* Page Table entry */
> +typedef struct {
> +    uint64_t pte;
> +} pte_t;
> +
> +/* Shift the VPN[x] or PPN[x] fields of a virtual or physical address
> + * to become the shifted PPN[x] fields of a page table entry */
> +#define addr_to_ppn(x) (((x) >> PAGE_SHIFT) << PTE_SHIFT)
> +
> +static inline pte_t paddr_to_pte(unsigned long paddr)
> +{
> +    return (pte_t) { .pte = addr_to_ppn(paddr) };
> +}
> +
> +static inline bool pte_is_valid(pte_t *p)
> +{
> +    return p->pte & PTE_VALID;
> +}
> +
> +#endif /* _ASM_RISCV_PAGE_H */
> diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
> new file mode 100644
> index 0000000000..6e172376eb
> --- /dev/null
> +++ b/xen/arch/riscv/mm.c
> @@ -0,0 +1,223 @@
> +#include <xen/init.h>
> +#include <xen/lib.h>
> +
> +#include <asm/csr.h>
> +#include <asm/mm.h>
> +#include <asm/page.h>
> +
> +/*
> + * xen_second_pagetable is indexed with the VPN[2] page table entry field
> + * xen_first_pagetable is accessed from the VPN[1] page table entry field
> + * xen_zeroeth_pagetable is accessed from the VPN[0] page table entry field
> + */
> +pte_t xen_second_pagetable[PAGE_ENTRIES] __attribute__((__aligned__(PAGE_SIZE)));
> +static pte_t xen_first_pagetable[PAGE_ENTRIES]
> +    __attribute__((__aligned__(PAGE_SIZE)));
> +static pte_t xen_zeroeth_pagetable[PAGE_ENTRIES]
> +    __attribute__((__aligned__(PAGE_SIZE)));
> +
> +extern unsigned long _stext;
> +extern unsigned long _etext;
> +extern unsigned long __init_begin;
> +extern unsigned long __init_end;
> +extern unsigned long _srodata;
> +extern unsigned long _erodata;
> +
> +paddr_t phys_offset;

This is defined, set but not used.

> +
> +#define resolve_early_addr(x) \

This helper seems to behave the same way as linker_addr(). So any 
reason not to use it?

I will assume it can be used and not comment on the 
implementation of resolve_early_addr().

> +    ({                                                                          \
> +         unsigned long * __##x;                                                 \
> +        if ( load_addr_start <= x && x < load_addr_end )                        \
> +            __##x = (unsigned long *)x;                                         \
> +        else                                                                    \
> +            __##x = (unsigned long *)(x + load_addr_start - linker_addr_start); \
> +        __##x;                                                                  \
> +     })
> +
> +static void __init clear_pagetables(unsigned long load_addr_start,
> +                             unsigned long load_addr_end,
> +                             unsigned long linker_addr_start,
> +                             unsigned long linker_addr_end)
> +{
> +    unsigned long *p;
> +    unsigned long page;
> +    unsigned long i;
> +
> +    page = (unsigned long)&xen_second_pagetable[0];
> +
> +    p = resolve_early_addr(page);
> +    for ( i = 0; i < ARRAY_SIZE(xen_second_pagetable); i++ )

The entries in xen_second_pagetable are a pte_t (uint64_t). But ...

> +    {
> +        p[i] = 0ULL;

... the type here will be unsigned long. So you may not fully zero the 
page-table on a 32-bit architecture. Therefore you want to define it as pte_t.

That said, given the page tables will be part of the BSS, you should not need 
to zero them again, assuming you clear the BSS beforehand.

If you clear afterwards, then you *must* move them out of BSS.

The same applies for xen_{first, zeroeth}_pagetable below.

> +    }
> +
> +    page = (unsigned long)&xen_first_pagetable[0];
> +    p = resolve_early_addr(page);
> +    for ( i = 0; i < ARRAY_SIZE(xen_first_pagetable); i++ )
> +    {
> +        p[i] = 0ULL;
> +    }
> +
> +    page = (unsigned long)&xen_zeroeth_pagetable[0];
> +    p = resolve_early_addr(page);
> +    for ( i = 0; i < ARRAY_SIZE(xen_zeroeth_pagetable); i++ )
> +    {
> +        p[i] = 0ULL;
> +    }
> +}
> +
> +/*
> + * WARNING: load_addr() and linker_addr() are to be called only when the MMU is
> + * disabled and only when executed by the primary CPU.  They cannot refer to
> + * any global variable or functions.

I find it interesting that you are saying this when _setup_initial_pagetables() 
is called from setup_initial_pagetables(). Would you be able to explain how 
this is different?

> + */
> +
> +/*
> + * Convert an addressed layed out at link time to the address where it was loaded

Typo: s/addressed/address/ ?

> + * by the bootloader.
> + */

Looking at the implementation, you seem to consider that any address not 
in the range [linker_addr_start, linker_addr_end[ will have a 1:1 mapping.

I am not sure this is what you want. So I would consider throwing an 
error if such an address is passed.

> +#define load_addr(linker_address)                                              \
> +    ({                                                                         \
> +        unsigned long __linker_address = (unsigned long)(linker_address);      \
> +        if ( linker_addr_start <= __linker_address &&                          \
> +            __linker_address < linker_addr_end )                               \
> +        {                                                                      \
> +            __linker_address =                                                 \
> +                __linker_address - linker_addr_start + load_addr_start;        \
> +        }                                                                      \
> +        __linker_address;                                                      \
> +    })
> +
> +/* Convert boot-time Xen address from where it was loaded by the boot loader to the address it was layed out
> + * at link-time.
> + */

Coding style: The first line is too long and multi-line comments look like:

/*
  * Foo
  * Bar
  */

> +#define linker_addr(load_address)                                              \

Same remark as for load_addr() above.

> +    ({                                                                         \
> +        unsigned long __load_address = (unsigned long)(load_address);          \
> +        if ( load_addr_start <= __load_address &&                              \
> +            __load_address < load_addr_end )                                   \
> +        {                                                                      \
> +            __load_address =                                                   \
> +                __load_address - load_addr_start + linker_addr_start;          \
> +        }                                                                      \
> +        __load_address;                                                        \
> +    })
> +
> +static void __attribute__((section(".entry")))
> +_setup_initial_pagetables(pte_t *second, pte_t *first, pte_t *zeroeth,
Can this be renamed to setup_initial_mapping() so this is clearer and 
avoids the single '_' difference with the function below?

> +                         unsigned long map_start,
> +                         unsigned long map_end,
> +                         unsigned long pa_start,
> +                         bool writable)

What about the executable bit?

> +{
> +    unsigned long page_addr;
> +    unsigned long index2;
> +    unsigned long index1;
> +    unsigned long index0;

index* could be defined in the loop below.

> +
> +    /* align start addresses */
> +    map_start &= ZEROETH_MAP_MASK;
> +    pa_start &= ZEROETH_MAP_MASK;

Hmmm... I would actually expect the address to be properly aligned and 
therefore not require an alignment here.

Otherwise, this raises the question of what happens if you have two regions 
using the same page?

> +
> +    page_addr = map_start;
> +    while ( page_addr < map_end )

Looking at the loop, it looks like you are assuming that the region will 
never cross a boundary of a page-table (either L0, L1, L2). I am not 
convinced you can make such an assumption (see below).

But if you really want to make such assumption then you should add some 
guard (either BUILD_BUG_ON(), ASSERT(), proper check) in your code to 
avoid any surprise in the future.
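
For instance (rough sketch; how you report the failure at this early
stage is up to you):

    /*
     * The code re-uses a single zeroeth table for the whole region, so
     * refuse anything crossing an L1 (2M) boundary for now.
     */
    if ( first_linear_offset(map_start) != first_linear_offset(map_end - 1) )
        while ( 1 )
            ;   /* unsupported: region spans multiple zeroeth tables */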

> +    {
> +        index2 = pagetable_second_index(page_addr);
> +        index1 = pagetable_first_index(page_addr);
> +        index0 = pagetable_zeroeth_index(page_addr);
> +
> +        /* Setup level2 table */
> +        second[index2] = paddr_to_pte((unsigned long)first);
> +        second[index2].pte |= PTE_TABLE;
> +
> +        /* Setup level1 table */
> +        first[index1] = paddr_to_pte((unsigned long)zeroeth);
> +        first[index1].pte |= PTE_TABLE;
> +
> +        /* Setup level0 table */
> +        if ( !pte_is_valid(&zeroeth[index0]) )

Can you explain why you are checking !pte_is_valid() for the L0 entry 
but not for the other levels?

> +        {
> +            /* Update level0 table */
> +            zeroeth[index0] = paddr_to_pte((page_addr - map_start) + pa_start);
> +            zeroeth[index0].pte |= PTE_LEAF_DEFAULT;
> +            zeroeth[index0].pte &= ~((!writable) ? PTE_WRITABLE : 0);

Looking at the default value, it would mean that a non-writable mapping 
is automatically executable. This seems wrong for sections that are not 
meant to be executable (like rodata).

> +        }
> +
> +        /* Point to next page */
> +        page_addr += ZEROETH_SIZE;
> +    }
> +}
> +
> +/*
> + * setup_initial_pagetables:
> + *
> + * 1) Build the page tables for Xen that map the following:
> + *   1.1)  The physical location of Xen (where the bootloader loaded it)
> + *   1.2)  The link-time location of Xen (where the linker expected Xen's
> + *         addresses to be)
> + * 2) Load the page table into the SATP and enable the MMU
> + */
> +void __attribute__((section(".entry")))

I couldn't find a section ".entry" in the linker.

> +setup_initial_pagetables(unsigned long load_addr_start,
> +                         unsigned long load_addr_end,
> +                         unsigned long linker_addr_start,
> +                         unsigned long linker_addr_end)
> +{
> +    pte_t *second;
> +    pte_t *first;
> +    pte_t *zeroeth;
> +
> +    clear_pagetables(load_addr_start, load_addr_end,
> +                     linker_addr_start, linker_addr_end);
> +
> +    /* Get the addresses where the page tables were loaded */
> +    second  = (pte_t *)load_addr(&xen_second_pagetable);
> +    first   = (pte_t *)load_addr(&xen_first_pagetable);
> +    zeroeth = (pte_t *)load_addr(&xen_zeroeth_pagetable);

I would consider embedding the type cast in load_addr() so you add 
some type safety to your code.

> +
> +    /*
> +     * Create a mapping from Xen's link-time addresses to where they were actually loaded.

This line is way longer than 80 characters. Please make sure to wrap it 
at 80 characters.

> +     */
> +    _setup_initial_pagetables(second, first, zeroeth,
> +                              linker_addr(&_stext),
> +                              linker_addr(&_etext),
> +                              load_addr(&_stext),
> +                              false);
> +    _setup_initial_pagetables(second, first, zeroeth,
> +                              linker_addr(&__init_begin),
> +                              linker_addr(&__init_end),
> +                              load_addr(&__init_begin),
> +                              true);
> +    _setup_initial_pagetables(second, first, zeroeth,
> +                              linker_addr(&_srodata),
> +                              linker_addr(&_erodata),
> +                              load_addr(&_srodata),
> +                              false);
> +    _setup_initial_pagetables(second, first, zeroeth,
> +                              linker_addr_start,
> +                              linker_addr_end,
> +                              load_addr_start,
> +                              true);

Where do you guarantee that Xen will always fit in an L0 table and the 
start address is aligned to the size of an L0 table?

> +
> +    /*
> +     * Create a mapping of the load time address range to... the load time address range.

Same about the line length here.

> +     * This mapping is used at boot time only.
> +     */
> +    _setup_initial_pagetables(second, first, zeroeth,

This can only work if Xen is loaded at its linked address. So you need a 
separate set of L0, L1 tables for the identity mapping.

That said, this would not be sufficient because:
   1) Xen may not be loaded at a 2M boundary (you can control with 
U-boot, but not with EFI). So this may cross a boundary and therefore 
need multiple pages.
   2) The load region may overlap the link address

While I think it would be good to handle those cases from the start, I 
understand why they are not easy to solve. So I think the minimum is to 
throw some errors if you are in a case you can't support.
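
As a minimal sketch (illustrative only; the exact failure handling is up
to you), the unsupported cases could at least be detected at the top of
setup_initial_pagetables():

    /* Not yet supported: load region overlapping the linked region. */
    if ( load_addr_start < linker_addr_end &&
         linker_addr_start < load_addr_end )
        while ( 1 )
            ;

    /* Not yet supported: Xen not loaded at a 2M boundary. */
    if ( load_addr_start & ~FIRST_MAP_MASK )
        while ( 1 )
            ;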

> +                              load_addr_start,
> +                              load_addr_end,
> +                              load_addr_start,
> +                              true);
> +
> +    /* Ensure page table writes precede loading the SATP */
> +    asm volatile("sfence.vma");
> +
> +    /* Enable the MMU and load the new pagetable for Xen */
> +    csr_write(CSR_SATP,
> +              (load_addr(xen_second_pagetable) >> PAGE_SHIFT) | SATP_MODE_SV39 << SATP_MODE_SHIFT);

IMHO, it would make sense to introduce within the series the code to 
jump off the identity mapping and then remove it.

> +
> +    phys_offset = load_addr_start - linker_addr_start;
> +
> +    return;
> +}

Cheers,
Jan Beulich Feb. 27, 2023, 3:12 p.m. UTC | #3
On 24.02.2023 16:06, Oleksii Kurochko wrote:
> --- /dev/null
> +++ b/xen/arch/riscv/include/asm/page.h
> @@ -0,0 +1,90 @@
> +#ifndef _ASM_RISCV_PAGE_H
> +#define _ASM_RISCV_PAGE_H
> +
> +#include <xen/const.h>
> +#include <xen/types.h>
> +
> +#define PAGE_ENTRIES            512
> +#define VPN_BITS                (9)
> +#define VPN_MASK                ((unsigned long)((1 << VPN_BITS) - 1))
> +
> +#ifdef CONFIG_RISCV_64
> +/* L3 index Bit[47:39] */
> +#define THIRD_SHIFT             (39)
> +#define THIRD_MASK              (VPN_MASK << THIRD_SHIFT)
> +/* L2 index Bit[38:30] */
> +#define SECOND_SHIFT            (30)
> +#define SECOND_MASK             (VPN_MASK << SECOND_SHIFT)
> +/* L1 index Bit[29:21] */
> +#define FIRST_SHIFT             (21)
> +#define FIRST_MASK              (VPN_MASK << FIRST_SHIFT)
> +/* L0 index Bit[20:12] */
> +#define ZEROETH_SHIFT           (12)
> +#define ZEROETH_MASK            (VPN_MASK << ZEROETH_SHIFT)
> +
> +#else // CONFIG_RISCV_32
> +
> +/* L1 index Bit[31:22] */
> +#define FIRST_SHIFT             (22)
> +#define FIRST_MASK              (VPN_MASK << FIRST_SHIFT)
> +
> +/* L0 index Bit[21:12] */
> +#define ZEROETH_SHIFT           (12)
> +#define ZEROETH_MASK            (VPN_MASK << ZEROETH_SHIFT)
> +#endif
> +
> +#define THIRD_SIZE              (1 << THIRD_SHIFT)
> +#define THIRD_MAP_MASK          (~(THIRD_SIZE - 1))
> +#define SECOND_SIZE             (1 << SECOND_SHIFT)
> +#define SECOND_MAP_MASK         (~(SECOND_SIZE - 1))
> +#define FIRST_SIZE              (1 << FIRST_SHIFT)
> +#define FIRST_MAP_MASK          (~(FIRST_SIZE - 1))
> +#define ZEROETH_SIZE            (1 << ZEROETH_SHIFT)
> +#define ZEROETH_MAP_MASK        (~(ZEROETH_SIZE - 1))
> +
> +#define PTE_SHIFT               10
> +
> +#define PTE_VALID               BIT(0, UL)
> +#define PTE_READABLE            BIT(1, UL)
> +#define PTE_WRITABLE            BIT(2, UL)
> +#define PTE_EXECUTABLE          BIT(3, UL)
> +#define PTE_USER                BIT(4, UL)
> +#define PTE_GLOBAL              BIT(5, UL)
> +#define PTE_ACCESSED            BIT(6, UL)
> +#define PTE_DIRTY               BIT(7, UL)
> +#define PTE_RSW                 (BIT(8, UL) | BIT(9, UL))
> +
> +#define PTE_LEAF_DEFAULT        (PTE_VALID | PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)
> +#define PTE_TABLE               (PTE_VALID)
> +
> +/* Calculate the offsets into the pagetables for a given VA */
> +#define zeroeth_linear_offset(va)   ((va) >> ZEROETH_SHIFT)
> +#define first_linear_offset(va)     ((va) >> FIRST_SHIFT)
> +#define second_linear_offset(va)    ((va) >> SECOND_SHIFT)
> +#define third_linear_offset(va)     ((va) >> THIRD_SHIFT)
> +
> +#define pagetable_zeroeth_index(va) zeroeth_linear_offset((va) & ZEROETH_MASK)
> +#define pagetable_first_index(va)   first_linear_offset((va) & FIRST_MASK)
> +#define pagetable_second_index(va)  second_linear_offset((va) & SECOND_MASK)
> +#define pagetable_third_index(va)   third_linear_offset((va) & THIRD_MASK)
> +
> +/* Page Table entry */
> +typedef struct {
> +    uint64_t pte;
> +} pte_t;
> +
> +/* Shift the VPN[x] or PPN[x] fields of a virtual or physical address
> + * to become the shifted PPN[x] fields of a page table entry */
> +#define addr_to_ppn(x) (((x) >> PAGE_SHIFT) << PTE_SHIFT)
> +
> +static inline pte_t paddr_to_pte(unsigned long paddr)
> +{
> +    return (pte_t) { .pte = addr_to_ppn(paddr) };
> +}
> +
> +static inline bool pte_is_valid(pte_t *p)

Btw - const whenever possible please, especially in such basic helpers.
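
I.e.:

    static inline bool pte_is_valid(const pte_t *p)
    {
        return p->pte & PTE_VALID;
    }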

> --- /dev/null
> +++ b/xen/arch/riscv/mm.c
> @@ -0,0 +1,223 @@
> +#include <xen/init.h>
> +#include <xen/lib.h>
> +
> +#include <asm/csr.h>
> +#include <asm/mm.h>
> +#include <asm/page.h>
> +
> +/*
> + * xen_second_pagetable is indexed with the VPN[2] page table entry field
> + * xen_first_pagetable is accessed from the VPN[1] page table entry field
> + * xen_zeroeth_pagetable is accessed from the VPN[0] page table entry field
> + */
> +pte_t xen_second_pagetable[PAGE_ENTRIES] __attribute__((__aligned__(PAGE_SIZE)));

static?

> +static pte_t xen_first_pagetable[PAGE_ENTRIES]
> +    __attribute__((__aligned__(PAGE_SIZE)));
> +static pte_t xen_zeroeth_pagetable[PAGE_ENTRIES]
> +    __attribute__((__aligned__(PAGE_SIZE)));

Please use __aligned() instead of open-coding it. You also may want to
specify the section here explicitly, as .bss.page_aligned (as we do
elsewhere).
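
E.g. (sketch):

    static pte_t xen_first_pagetable[PAGE_ENTRIES]
        __aligned(PAGE_SIZE) __section(".bss.page_aligned");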

> +extern unsigned long _stext;
> +extern unsigned long _etext;
> +extern unsigned long __init_begin;
> +extern unsigned long __init_end;
> +extern unsigned long _srodata;
> +extern unsigned long _erodata;

Please use kernel.h and drop the colliding declarations. For what's
left please use array types, as suggested elsewhere already.
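
I.e. something like this for whatever xen/kernel.h does not already
declare (sketch):

    extern char __init_begin[], __init_end[];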

> +paddr_t phys_offset;
> +
> +#define resolve_early_addr(x) \
> +    ({                                                                          \
> +         unsigned long * __##x;                                                 \
> +        if ( load_addr_start <= x && x < load_addr_end )                        \

Nit: Mismatched indentation.

> +            __##x = (unsigned long *)x;                                         \
> +        else                                                                    \
> +            __##x = (unsigned long *)(x + load_addr_start - linker_addr_start); \
> +        __##x;                                                                  \
> +     })
> +
> +static void __init clear_pagetables(unsigned long load_addr_start,
> +                             unsigned long load_addr_end,
> +                             unsigned long linker_addr_start,
> +                             unsigned long linker_addr_end)

Nit (style): Indentation.

> +{
> +    unsigned long *p;
> +    unsigned long page;
> +    unsigned long i;
> +
> +    page = (unsigned long)&xen_second_pagetable[0];
> +
> +    p = resolve_early_addr(page);
> +    for ( i = 0; i < ARRAY_SIZE(xen_second_pagetable); i++ )
> +    {
> +        p[i] = 0ULL;
> +    }

We typically omit braces around single-statement bodies. Here,
though: Why do you do this in the first place? These static arrays
all start out zero-initialized anyway (from when you clear .bss).
Plus even if they didn't - why not memset()?
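
I.e., if zeroing were needed at all, simply:

    memset(p, 0, sizeof(xen_second_pagetable));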

> +    page = (unsigned long)&xen_first_pagetable[0];
> +    p = resolve_early_addr(page);
> +    for ( i = 0; i < ARRAY_SIZE(xen_first_pagetable); i++ )
> +    {
> +        p[i] = 0ULL;
> +    }
> +
> +    page = (unsigned long)&xen_zeroeth_pagetable[0];
> +    p = resolve_early_addr(page);
> +    for ( i = 0; i < ARRAY_SIZE(xen_zeroeth_pagetable); i++ )
> +    {
> +        p[i] = 0ULL;
> +    }
> +}
> +
> +/*
> + * WARNING: load_addr() and linker_addr() are to be called only when the MMU is
> + * disabled and only when executed by the primary CPU.  They cannot refer to
> + * any global variable or functions.
> + */
> +
> +/*
> + * Convert an addressed layed out at link time to the address where it was loaded
> + * by the bootloader.
> + */
> +#define load_addr(linker_address)                                              \
> +    ({                                                                         \
> +        unsigned long __linker_address = (unsigned long)(linker_address);      \
> +        if ( linker_addr_start <= __linker_address &&                          \
> +            __linker_address < linker_addr_end )                               \
> +        {                                                                      \
> +            __linker_address =                                                 \
> +                __linker_address - linker_addr_start + load_addr_start;        \
> +        }                                                                      \
> +        __linker_address;                                                      \
> +    })
> +
> +/* Convert boot-time Xen address from where it was loaded by the boot loader to the address it was layed out
> + * at link-time.
> + */
> +#define linker_addr(load_address)                                              \
> +    ({                                                                         \
> +        unsigned long __load_address = (unsigned long)(load_address);          \
> +        if ( load_addr_start <= __load_address &&                              \
> +            __load_address < load_addr_end )                                   \
> +        {                                                                      \
> +            __load_address =                                                   \
> +                __load_address - load_addr_start + linker_addr_start;          \
> +        }                                                                      \
> +        __load_address;                                                        \
> +    })
> +
> +static void __attribute__((section(".entry")))
> +_setup_initial_pagetables(pte_t *second, pte_t *first, pte_t *zeroeth,

Why the special section (also again further down)?

Jan
Jan Beulich Feb. 27, 2023, 3:19 p.m. UTC | #4
On 27.02.2023 16:12, Jan Beulich wrote:
> On 24.02.2023 16:06, Oleksii Kurochko wrote:
>> +static void __attribute__((section(".entry")))
>> +_setup_initial_pagetables(pte_t *second, pte_t *first, pte_t *zeroeth,
> 
> Why the special section (also again further down)?

Looking at patch 2 it occurred to me that you probably mean __init here.

Jan
Oleksii Feb. 27, 2023, 4:52 p.m. UTC | #5
On Sat, 2023-02-25 at 17:53 +0000, Julien Grall wrote:
> Hi Oleksii,
> 
> On 24/02/2023 15:06, Oleksii Kurochko wrote:
> > Mostly the code for setup_initial_pages was taken from Bobby's
> > repo except for the following changes:
> > * Use only a minimal part of the code enough to enable MMU
> > * rename {_}setup_initial_pagetables functions
> > * add writable argument for _setup_initial_pagetables to have
> >    an opportunity to make some sections read-only
> > * update setup_initial_pagetables function to make some sections
> >    read-only
> > * change the order of _setup_inital_pagetables()
> >    in setup_initial_pagetable():
> >    * first it is called for text, init, rodata sections
> >    * after call it for ranges [link_addr_start : link_addr_end] and
> >      [load_addr_start : load_addr_end]
> >    Before it was done first for the ranges and after for sections
> > but
> >    in that case read-only status will be equal to 'true' and
> >    as sections' addresses  can/are inside the ranges the read-only
> > status
> >    won't be updated for them as it was set up before.
> > 
> > Origin: https://gitlab.com/xen-on-risc-v/xen/-/tree/riscv-rebase 4af165b468af
> > Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
> > ---
> >   xen/arch/riscv/Makefile           |   1 +
> >   xen/arch/riscv/include/asm/mm.h   |   9 ++
> >   xen/arch/riscv/include/asm/page.h |  90 ++++++++++++
> >   xen/arch/riscv/mm.c               | 223
> > ++++++++++++++++++++++++++++++
> >   4 files changed, 323 insertions(+)
> >   create mode 100644 xen/arch/riscv/include/asm/mm.h
> >   create mode 100644 xen/arch/riscv/include/asm/page.h
> >   create mode 100644 xen/arch/riscv/mm.c
> > 
> > diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
> > index 443f6bf15f..956ceb02df 100644
> > --- a/xen/arch/riscv/Makefile
> > +++ b/xen/arch/riscv/Makefile
> > @@ -1,5 +1,6 @@
> >   obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
> >   obj-y += entry.o
> > +obj-y += mm.o
> >   obj-$(CONFIG_RISCV_64) += riscv64/
> >   obj-y += sbi.o
> >   obj-y += setup.o
> > diff --git a/xen/arch/riscv/include/asm/mm.h
> > b/xen/arch/riscv/include/asm/mm.h
> > new file mode 100644
> > index 0000000000..fc1866b1d8
> > --- /dev/null
> > +++ b/xen/arch/riscv/include/asm/mm.h
> > @@ -0,0 +1,9 @@
> > +#ifndef _ASM_RISCV_MM_H
> > +#define _ASM_RISCV_MM_H
> > +
> > +void setup_initial_pagetables(unsigned long load_addr_start,
> > +                              unsigned long load_addr_end,
> > +                              unsigned long linker_addr_start,
> > +                              unsigned long linker_addr_end);
> > +
> > +#endif /* _ASM_RISCV_MM_H */
> > diff --git a/xen/arch/riscv/include/asm/page.h
> > b/xen/arch/riscv/include/asm/page.h
> > new file mode 100644
> > index 0000000000..fabbe1305f
> > --- /dev/null
> > +++ b/xen/arch/riscv/include/asm/page.h
> > @@ -0,0 +1,90 @@
> > +#ifndef _ASM_RISCV_PAGE_H
> > +#define _ASM_RISCV_PAGE_H
> > +
> > +#include <xen/const.h>
> > +#include <xen/types.h>
> > +
> > +#define PAGE_ENTRIES            512
> 
> NIT: AFAIU, the number here is based on ...
> 
> > +#define VPN_BITS                (9)
> 
> ... this. So I would suggest to define PAGE_ENTRIES using VPN_BITS.
Sure. It should be defined using VPN_BITS. Thanks.
> 
> > +#define VPN_MASK                ((unsigned long)((1 << VPN_BITS) -
> > 1))
> NIT: Use 1UL and you can avoid the cast.
Thanks. I'll update that in the next version of patch series.
> 
> > +
> > +#ifdef CONFIG_RISCV_64
> > +/* L3 index Bit[47:39] */
> > +#define THIRD_SHIFT             (39)
> > +#define THIRD_MASK              (VPN_MASK << THIRD_SHIFT)
> > +/* L2 index Bit[38:30] */
> > +#define SECOND_SHIFT            (30)
> > +#define SECOND_MASK             (VPN_MASK << SECOND_SHIFT)
> > +/* L1 index Bit[29:21] */
> > +#define FIRST_SHIFT             (21)
> > +#define FIRST_MASK              (VPN_MASK << FIRST_SHIFT)
> > +/* L0 index Bit[20:12] */
> > +#define ZEROETH_SHIFT           (12)
> > +#define ZEROETH_MASK            (VPN_MASK << ZEROETH_SHIFT)
> 
> On Arm, we are trying to phase out ZEROETH_* and co because the name
> is 
> too generic. Instead, we now introduce a generic macro that take a
> level 
> and then compute the mask/shift (see XEN_PT_LEVEL_*).
> 
> You should be able to do in RISC-V and reduce the amount of defines 
> introduced.
Thanks. I'll look at XEN_PT_LEVEL_*. I'll re-read Andrew's comment, but
as far as I understand after a quick reading we can remove most of that.
> 
> > +
> > +#else // CONFIG_RISCV_32
> 
> Coding style: comments in Xen are using /* ... */
> 
> > +
> > +/* L1 index Bit[31:22] */
> > +#define FIRST_SHIFT             (22)
> > +#define FIRST_MASK              (VPN_MASK << FIRST_SHIFT)
> > +
> > +/* L0 index Bit[21:12] */
> > +#define ZEROETH_SHIFT           (12)
> > +#define ZEROETH_MASK            (VPN_MASK << ZEROETH_SHIFT)
> > +#endif
> > +
> > +#define THIRD_SIZE              (1 << THIRD_SHIFT)
> > +#define THIRD_MAP_MASK          (~(THIRD_SIZE - 1))
> > +#define SECOND_SIZE             (1 << SECOND_SHIFT)
> > +#define SECOND_MAP_MASK         (~(SECOND_SIZE - 1))
> > +#define FIRST_SIZE              (1 << FIRST_SHIFT)
> > +#define FIRST_MAP_MASK          (~(FIRST_SIZE - 1))
> > +#define ZEROETH_SIZE            (1 << ZEROETH_SHIFT)
> > +#define ZEROETH_MAP_MASK        (~(ZEROETH_SIZE - 1))
> > +
> > +#define PTE_SHIFT               10
> > +
> > +#define PTE_VALID               BIT(0, UL)
> > +#define PTE_READABLE            BIT(1, UL)
> > +#define PTE_WRITABLE            BIT(2, UL)
> > +#define PTE_EXECUTABLE          BIT(3, UL)
> > +#define PTE_USER                BIT(4, UL)
> > +#define PTE_GLOBAL              BIT(5, UL)
> > +#define PTE_ACCESSED            BIT(6, UL)
> > +#define PTE_DIRTY               BIT(7, UL)
> > +#define PTE_RSW                 (BIT(8, UL) | BIT(9, UL))
> > +
> > +#define PTE_LEAF_DEFAULT        (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE | PTE_EXECUTABLE)
> 
> We should avoid vulnerable default flags. So this should either be RW
> or RX.
Thanks. I'll take it into account.
> 
> > +#define PTE_TABLE               (PTE_VALID)
> > +
> > +/* Calculate the offsets into the pagetables for a given VA */
> > +#define zeroeth_linear_offset(va)   ((va) >> ZEROETH_SHIFT)
> > +#define first_linear_offset(va)     ((va) >> FIRST_SHIFT)
> > +#define second_linear_offset(va)    ((va) >> SECOND_SHIFT)
> > +#define third_linear_offset(va)     ((va) >> THIRD_SHIFT)
> > +
> > +#define pagetable_zeroeth_index(va) zeroeth_linear_offset((va) &
> > ZEROETH_MASK)
> > +#define pagetable_first_index(va)   first_linear_offset((va) &
> > FIRST_MASK)
> > +#define pagetable_second_index(va)  second_linear_offset((va) &
> > SECOND_MASK)
> > +#define pagetable_third_index(va)   third_linear_offset((va) &
> > THIRD_MASK)
> > +
> > +/* Page Table entry */
> > +typedef struct {
> > +    uint64_t pte;
> > +} pte_t;
> > +
> > +/* Shift the VPN[x] or PPN[x] fields of a virtual or physical
> > address
> > + * to become the shifted PPN[x] fields of a page table entry */
> > +#define addr_to_ppn(x) (((x) >> PAGE_SHIFT) << PTE_SHIFT)
> > +
> > +static inline pte_t paddr_to_pte(unsigned long paddr)
> > +{
> > +    return (pte_t) { .pte = addr_to_ppn(paddr) };
> > +}
> > +
> > +static inline bool pte_is_valid(pte_t *p)
> > +{
> > +    return p->pte & PTE_VALID;
> > +}
> > +
> > +#endif /* _ASM_RISCV_PAGE_H */
> > diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
> > new file mode 100644
> > index 0000000000..6e172376eb
> > --- /dev/null
> > +++ b/xen/arch/riscv/mm.c
> > @@ -0,0 +1,223 @@
> > +#include <xen/init.h>
> > +#include <xen/lib.h>
> > +
> > +#include <asm/csr.h>
> > +#include <asm/mm.h>
> > +#include <asm/page.h>
> > +
> > +/*
> > + * xen_second_pagetable is indexed with the VPN[2] page table
> > entry field
> > + * xen_first_pagetable is accessed from the VPN[1] page table
> > entry field
> > + * xen_zeroeth_pagetable is accessed from the VPN[0] page table
> > entry field
> > + */
> > +pte_t xen_second_pagetable[PAGE_ENTRIES]
> > __attribute__((__aligned__(PAGE_SIZE)));
> > +static pte_t xen_first_pagetable[PAGE_ENTRIES]
> > +    __attribute__((__aligned__(PAGE_SIZE)));
> > +static pte_t xen_zeroeth_pagetable[PAGE_ENTRIES]
> > +    __attribute__((__aligned__(PAGE_SIZE)));
> > +
> > +extern unsigned long _stext;
> > +extern unsigned long _etext;
> > +extern unsigned long __init_begin;
> > +extern unsigned long __init_end;
> > +extern unsigned long _srodata;
> > +extern unsigned long _erodata;
> > +
> > +paddr_t phys_offset;
> 
> This is defined, set but not used.
> 
> > +
> > +#define resolve_early_addr(x) \
> 
> This helper seems to behave the same wasy as linker_addr(). So any 
> reason to not use it?
linker_addr() can be used instead. It looks like I missed something
before and it got split out into two identical macros.
> 
> I will make this assumption this can be used and not comment on the 
> implement of resolve_early_addr().
> 
> > +    ({                                                                          \
> > +         unsigned long * __##x;                                                 \
> > +        if ( load_addr_start <= x && x < load_addr_end )                        \
> > +            __##x = (unsigned long *)x;                                         \
> > +        else                                                                    \
> > +            __##x = (unsigned long *)(x + load_addr_start - linker_addr_start); \
> > +        __##x;                                                                  \
> > +     })
> > +
> > +static void __init clear_pagetables(unsigned long load_addr_start,
> > +                             unsigned long load_addr_end,
> > +                             unsigned long linker_addr_start,
> > +                             unsigned long linker_addr_end)
> > +{
> > +    unsigned long *p;
> > +    unsigned long page;
> > +    unsigned long i;
> > +
> > +    page = (unsigned long)&xen_second_pagetable[0];
> > +
> > +    p = resolve_early_addr(page);
> > +    for ( i = 0; i < ARRAY_SIZE(xen_second_pagetable); i++ )
> 
> The entries in xen_second_pagetable are a pte_t (uint64_t). But ...
> 
> > +    {
> > +        p[i] = 0ULL;
> 
> ... the type here will be unsigned long. So you may not fully zero
> the 
> page-table on 32-bit architecture. Therefore you want to define as
> pte_t.
> 
> That said, given the page table will be part of BSS, you should not
> need 
> to zero again assuming you clear BSS before hand.
> 
> If you clear afterwards, then you *must* move them out of BSS.
> 
> The same applies for xen_{first, zeroeth}_pagetable below.
I didn't have initialized page tables, so that is why I needed
clear_pagetables(), but I think you are right and clear_pagetables() can
be removed entirely as the page tables will be initialized during BSS
initialization.
> 
> > +    }
> > +
> > +    page = (unsigned long)&xen_first_pagetable[0];
> > +    p = resolve_early_addr(page);
> > +    for ( i = 0; i < ARRAY_SIZE(xen_first_pagetable); i++ )
> > +    {
> > +        p[i] = 0ULL;
> > +    }
> > +
> > +    page = (unsigned long)&xen_zeroeth_pagetable[0];
> > +    p = resolve_early_addr(page);
> > +    for ( i = 0; i < ARRAY_SIZE(xen_zeroeth_pagetable); i++ )
> > +    {
> > +        p[i] = 0ULL;
> > +    }
> > +}
> > +
> > +/*
> > + * WARNING: load_addr() and linker_addr() are to be called only
> > when the MMU is
> > + * disabled and only when executed by the primary CPU.  They
> > cannot refer to
> > + * any global variable or functions.
> 
> I find interesting you are saying when _setup_initial_pagetables() is
> called from setup_initial_pagetables(). Would you be able to explain
> how 
> this is different?
I am not sure that I understand your question correctly, but
_setup_initial_pagetables() was introduced to map some addresses with a
read/write flag. Probably I have to rename it to something clearer.
> 
> > + */
> > +
> > +/*
> > + * Convert an addressed layed out at link time to the address
> > where it was loaded
> 
> Typo: s/addressed/address/ ?
Yes, it should be 'address', and 'layed out' should be changed to 'laid
out'...
> 
> > + * by the bootloader.
> > + */
> 
> Looking at the implementation, you seem to consider that any address
> not 
> in the range [linker_addr_start, linker_addr_end[ will have a 1:1
> mappings.
> 
> I am not sure this is what you want. So I would consider to throw an 
> error if such address is passed.
I thought that at this stage, if no relocation was done, it is 1:1
except in the case when load_addr_start != linker_addr_start.


> 
> > +#define load_addr(linker_address)                                              \
> > +    ({                                                                         \
> > +        unsigned long __linker_address = (unsigned long)(linker_address);      \
> > +        if ( linker_addr_start <= __linker_address &&                          \
> > +            __linker_address < linker_addr_end )                               \
> > +        {                                                                      \
> > +            __linker_address =                                                 \
> > +                __linker_address - linker_addr_start + load_addr_start;        \
> > +        }                                                                      \
> > +        __linker_address;                                                      \
> > +    })
> > +
> > +/* Convert boot-time Xen address from where it was loaded by the
> > boot loader to the address it was layed out
> > + * at link-time.
> > + */
> 
> Coding style: The first line is too long and multi-line comments look
> like:
> 
> /*
>   * Foo
>   * Bar
>   */
> 
> > +#define linker_addr(load_address)                                              \
> 
> Same remark as for load_addr() above.
> 
> > +    ({                                                                         \
> > +        unsigned long __load_address = (unsigned long)(load_address);          \
> > +        if ( load_addr_start <= __load_address &&                              \
> > +            __load_address < load_addr_end )                                   \
> > +        {                                                                      \
> > +            __load_address =                                                   \
> > +                __load_address - load_addr_start + linker_addr_start;          \
> > +        }                                                                      \
> > +        __load_address;                                                        \
> > +    })
> > +
> > +static void __attribute__((section(".entry")))
> > +_setup_initial_pagetables(pte_t *second, pte_t *first, pte_t
> > *zeroeth,
> Can this be named to setup_initial_mapping() so this is clearer and 
> avoid the one '_' different with the function below.
Sure. It will be better.
> 
> > +                         unsigned long map_start,
> > +                         unsigned long map_end,
> > +                         unsigned long pa_start,
> > +                         bool writable)
> 
> What about the executable bit?
It's always executable... But as you mentioned above PTE_LEAF_DEFAULT
should be either RX or RW.
I think it makes sense to add flags instead of writable.
> 
> > +{
> > +    unsigned long page_addr;
> > +    unsigned long index2;
> > +    unsigned long index1;
> > +    unsigned long index0;
> 
> index* could be defined in the loop below.
It could. But I am curious why it is better?
> 
> > +
> > +    /* align start addresses */
> > +    map_start &= ZEROETH_MAP_MASK;
> > +    pa_start &= ZEROETH_MAP_MASK;
> 
> Hmmm... I would actually expect the address to be properly aligned
> and 
> therefore not require an alignment here.
> 
> Otherwise, this raise the question of what happen if you have region 
> using the same page?
That map_start &= ZEROETH_MAP_MASK is needed to get the page number of the
address, i.e. the address without the page offset.

> 
> > +
> > +    page_addr = map_start;
> > +    while ( page_addr < map_end )
> 
> Looking at the loop, it looks like you are assuming that the region
> will 
> never cross a boundary of a page-table (either L0, L1, L2). I am not 
> convinced you can make such assumption (see below).
> 
> But if you really want to make such assumption then you should add
> some 
> guard (either BUILD_BUG_ON(), ASSERT(), proper check) in your code to
> avoid any surprise in the future.
I am not sure that I fully understand what the problem is here.
The address is aligned on a (1<<12) boundary and each iteration maps a
(1<<12) page, so all looks fine, unless I have misunderstood you.
> 
> > +    {
> > +        index2 = pagetable_second_index(page_addr);
> > +        index1 = pagetable_first_index(page_addr);
> > +        index0 = pagetable_zeroeth_index(page_addr);
> > +
> > +        /* Setup level2 table */
> > +        second[index2] = paddr_to_pte((unsigned long)first);
> > +        second[index2].pte |= PTE_TABLE;
> > +
> > +        /* Setup level1 table */
> > +        first[index1] = paddr_to_pte((unsigned long)zeroeth);
> > +        first[index1].pte |= PTE_TABLE;
> > +
> > +        /* Setup level0 table */
> > +        if ( !pte_is_valid(&zeroeth[index0]) )
> 
> Can you explain why you are checking !pte_is_valid() for the L0 entry
> but not the other?
My mistake; it should be checked for each level.
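
Something like this, for the record (rough sketch based on the existing
code):

    /* Only link the next-level table if the entry isn't already set up. */
    if ( !pte_is_valid(&second[index2]) )
    {
        second[index2] = paddr_to_pte((unsigned long)first);
        second[index2].pte |= PTE_TABLE;
    }

and likewise for first[index1] before filling the zeroeth table.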
> 
> > +        {
> > +            /* Update level0 table */
> > +            zeroeth[index0] = paddr_to_pte((page_addr - map_start)
> > + pa_start);
> > +            zeroeth[index0].pte |= PTE_LEAF_DEFAULT;
> > +            zeroeth[index0].pte &= ~((!writable) ? PTE_WRITABLE :
> > 0);
> 
> Looking at the default value, it would mean that a non-writable
> mapping 
> is automatically executable. This seems wrong for the section is not 
> meant to be executable (like rodata).
Yes, you are right. I'll rework setup_initial_mapping() to pass flags
instead of a write/read-only flag.
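
Roughly (sketch of the reworked interface; names illustrative):

    static void __init setup_initial_mapping(pte_t *second, pte_t *first,
                                             pte_t *zeroeth,
                                             unsigned long map_start,
                                             unsigned long map_end,
                                             unsigned long pa_start,
                                             unsigned long flags);

so that e.g. rodata can be mapped with just (PTE_VALID | PTE_READABLE)
and text with (PTE_VALID | PTE_READABLE | PTE_EXECUTABLE).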
> 
> > +        }
> > +
> > +        /* Point to next page */
> > +        page_addr += ZEROETH_SIZE;
> > +    }
> > +}
> > +
> > +/*
> > + * setup_initial_pagetables:
> > + *
> > + * 1) Build the page tables for Xen that map the following:
> > + *   1.1)  The physical location of Xen (where the bootloader
> > loaded it)
> > + *   1.2)  The link-time location of Xen (where the linker
> > expected Xen's
> > + *         addresses to be)
> > + * 2) Load the page table into the SATP and enable the MMU
> > + */
> > +void __attribute__((section(".entry")))
> 
> I couldn't find a section ".entry" in the linker.
> 
> > +setup_initial_pagetables(unsigned long load_addr_start,
> > +                         unsigned long load_addr_end,
> > +                         unsigned long linker_addr_start,
> > +                         unsigned long linker_addr_end)
> > +{
> > +    pte_t *second;
> > +    pte_t *first;
> > +    pte_t *zeroeth;
> > +
> > +    clear_pagetables(load_addr_start, load_addr_end,
> > +                     linker_addr_start, linker_addr_end);
> > +
> > +    /* Get the addresses where the page tables were loaded */
> > +    second  = (pte_t *)load_addr(&xen_second_pagetable);
> > +    first   = (pte_t *)load_addr(&xen_first_pagetable);
> > +    zeroeth = (pte_t *)load_addr(&xen_zeroeth_pagetable);
> 
> I would consider to embed the type cast in load_addr() so you are
> adding 
> some type safety within your code.
Definitely it would be better, but setup_initial_mapping() uses 'unsigned
long' for passing the address that should be mapped.
> 
> > +
> > +    /*
> > +     * Create a mapping from Xen's link-time addresses to where
> > they were actually loaded.
> 
> This is line is way long than 80 characters. Please make sure to wrap
> it 
> 80 characters.
> 
> > +     */
> > +    _setup_initial_pagetables(second, first, zeroeth,
> > +                              linker_addr(&_stext),
> > +                              linker_addr(&_etext),
> > +                              load_addr(&_stext),
> > +                              false);
> > +    _setup_initial_pagetables(second, first, zeroeth,
> > +                              linker_addr(&__init_begin),
> > +                              linker_addr(&__init_end),
> > +                              load_addr(&__init_begin),
> > +                              true);
> > +    _setup_initial_pagetables(second, first, zeroeth,
> > +                              linker_addr(&_srodata),
> > +                              linker_addr(&_erodata),
> > +                              load_addr(&_srodata),
> > +                              false);
> > +    _setup_initial_pagetables(second, first, zeroeth,
> > +                              linker_addr_start,
> > +                              linker_addr_end,
> > +                              load_addr_start,
> > +                              true);
> 
> Where do you guarantee that Xen will always fit in an L0 table and
> the 
> start address is aligned to the size of an L0 table?
I don't guarantee that it fits in an L0 table, but the start address is
aligned to the size of an L0 table at the start.
> 
> > +
> > +    /*
> > +     * Create a mapping of the load time address range to... the
> > load time address range.
> 
> Same about the line length here.
> 
> > +     * This mapping is used at boot time only.
> > +     */
> > +    _setup_initial_pagetables(second, first, zeroeth,
> 
> This can only work if Xen is loaded at its linked address. So you
> need a 
> separate set of L0, L1 tables for the identity mapping.
> 
> That said, this would not be sufficient because:
>    1) Xen may not be loaded at a 2M boundary (you can control with 
> U-boot, but not with EFI). So this may cross a boundary and therefore
> need multiple pages.
>    2) The load region may overlap the link address
> 
> While I think it would be good to handle those cases from the start,
> I 
> would understand why are not easy to solve. So I think the minimum is
> to 
> throw some errors if you are in a case you can't support.
Do you mean to throw some error in load_addr()/linker_addr()?
> 
> > +                              load_addr_start,
> > +                              load_addr_end,
> > +                              load_addr_start,
> > +                              true); > +
> > +    /* Ensure page table writes precede loading the SATP */
> > +    asm volatile("sfence.vma");
> > +
> > +    /* Enable the MMU and load the new pagetable for Xen */
> > +    csr_write(CSR_SATP,
> > +              (load_addr(xen_second_pagetable) >> PAGE_SHIFT) |
> > SATP_MODE_SV39 << SATP_MODE_SHIFT);
> 
> IHMO, it would make sense to introduce within the series the code to 
> jump off the identity mapping and then remove it.
Probably I have to spend more time to understand the identity mapping, but
it looks like the current RISC-V port from Bobby's repo doesn't jump off
and remove the identity mapping.
> 
> > +
> > +    phys_offset = load_addr_start - linker_addr_start;
> > +
> > +    return;
> > +}
> 
> Cheers,
> 
~ Oleksii
Julien Grall Feb. 27, 2023, 5:36 p.m. UTC | #6
Hi Oleksii,

On 27/02/2023 16:52, Oleksii wrote:
> On Sat, 2023-02-25 at 17:53 +0000, Julien Grall wrote:
>>> +/*
>>> + * WARNING: load_addr() and linker_addr() are to be called only
>>> when the MMU is
>>> + * disabled and only when executed by the primary CPU.  They
>>> cannot refer to
>>> + * any global variable or functions.
>>
>> I find interesting you are saying when _setup_initial_pagetables() is
>> called from setup_initial_pagetables(). Would you be able to explain
>> how
>> this is different?
> I am not sure that I understand your question correctly but
> _setup_initial_pagetables() was introduced to map some addresses with
> write/read flag. Probably I have to rename it to something that is more
> clear.

So the comment suggests that your code cannot refer to global 
functions/variables when the MMU is off. So I have multiple questions:
   * Why only global? IOW, why would static be OK?
   * setup_initial_pagetables() has a call to 
_setup_initial_pagetables() (IOW referring to another function). Why is 
it fine?
   * You have code in the next patch referring to global variables 
(mainly _start and _end). How is this different?

>>
>>> + */
>>> +
>>> +/*
>>> + * Convert an addressed layed out at link time to the address
>>> where it was loaded
>>
>> Typo: s/addressed/address/ ?
> Yes, it should be address. and 'layed out' should be changed to 'laid
> out'...
>>
>>> + * by the bootloader.
>>> + */
>>
>> Looking at the implementation, you seem to consider that any address
>> not
>> in the range [linker_addr_start, linker_addr_end[ will have a 1:1
>> mappings.
>>
>> I am not sure this is what you want. So I would consider to throw an
>> error if such address is passed.
> I thought that at this stage and if no relocation was done it is 1:1
> except the case when load_addr_start != linker_addr_start.

The problem is that what you try to map one to one may clash with the linked 
region for Xen. So it is not always possible to map the region 1:1.

Therefore, I don't see any use for the else part here.
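
To illustrate (a completely untested sketch on top of the macro from this
patch; the name __addr and the infinite loop are only placeholders for
whatever early error handling you end up with):

    #define load_addr(linker_address)                                     \
        ({                                                                 \
            unsigned long __addr = (unsigned long)(linker_address);        \
            /* Out of range: refuse rather than guess a 1:1 mapping. */    \
            if ( __addr < linker_addr_start ||                             \
                 __addr >= linker_addr_end )                               \
                for ( ;; )                                                 \
                    ;                                                      \
            __addr - linker_addr_start + load_addr_start;                  \
        })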

> 
> 
>>
>>> +#define load_addr(linker_address)                                              \
>>> +    ({                                                                         \
>>> +        unsigned long __linker_address = (unsigned long)(linker_address);      \
>>> +        if ( linker_addr_start <= __linker_address &&                          \
>>> +            __linker_address < linker_addr_end )                               \
>>> +        {                                                                      \
>>> +            __linker_address =                                                 \
>>> +                __linker_address - linker_addr_start + load_addr_start;        \
>>> +        }                                                                      \
>>> +        __linker_address;                                                      \
>>> +    })
>>> +
>>> +/* Convert boot-time Xen address from where it was loaded by the
>>> boot loader to the address it was layed out
>>> + * at link-time.
>>> + */
>>
>> Coding style: The first line is too long and multi-line comments look
>> like:
>>
>> /*
>>    * Foo
>>    * Bar
>>    */
>>
>>> +#define linker_addr(load_address)                                              \
>>
>> Same remark as for load_addr() above.
>>
>>> +    ({                                                                         \
>>> +        unsigned long __load_address = (unsigned long)(load_address);          \
>>> +        if ( load_addr_start <= __load_address &&                              \
>>> +            __load_address < load_addr_end )                                   \
>>> +        {                                                                      \
>>> +            __load_address =                                                   \
>>> +                __load_address - load_addr_start + linker_addr_start;          \
>>> +        }                                                                      \
>>> +        __load_address;                                                        \
>>> +    })
>>> +
>>> +static void __attribute__((section(".entry")))
>>> +_setup_initial_pagetables(pte_t *second, pte_t *first, pte_t
>>> *zeroeth,
>> Can this be named to setup_initial_mapping() so this is clearer and
>> avoid the one '_' different with the function below.
> Sure. It will be better.
>>
>>> +                         unsigned long map_start,
>>> +                         unsigned long map_end,
>>> +                         unsigned long pa_start,
>>> +                         bool writable)
>>
>> What about the executable bit?
> It's always executable... But as you mentioned above PTE_LEAF_DEFAULT
> should be either RX or RW.
> I think it makes sense to add flags instead of writable.
>>
>>> +{
>>> +    unsigned long page_addr;
>>> +    unsigned long index2;
>>> +    unsigned long index1;
>>> +    unsigned long index0;
>>
>> index* could be defined in the loop below.
> It could. But I am curious why it is better?
>>
>>> +
>>> +    /* align start addresses */
>>> +    map_start &= ZEROETH_MAP_MASK;
>>> +    pa_start &= ZEROETH_MAP_MASK;
>>
>> Hmmm... I would actually expect the address to be properly aligned
>> and
>> therefore not require an alignment here.
>>
>> Otherwise, this raise the question of what happen if you have region
>> using the same page?
> That map_start &=  ZEROETH_MAP_MASK is needed to page number of address
> w/o page offset.

My point is why would the page offset be non-zero?

>>
>>> +
>>> +    page_addr = map_start;
>>> +    while ( page_addr < map_end )
>>
>> Looking at the loop, it looks like you are assuming that the region
>> will
>> never cross a boundary of a page-table (either L0, L1, L2). I am not
>> convinced you can make such assumption (see below).
>>
>> But if you really want to make such assumption then you should add
>> some
>> guard (either BUILD_BUG_ON(), ASSERT(), proper check) in your code to
>> avoid any surprise in the future.
> I am not sure that I fully understand what is the problem here.
> The address is aligned on (1<<12) boundary and each itearation is
> mapped (1<<12) page so all looks fine or I misunderstood you.

Let's take an example: imagine the region you want to map is 4MB. 
AFAICT, you are only passing one L0 page-table. So your code will end up 
overwriting the previous entries in the zeroeth page-table and then adding 
another link in the L1 page-table.
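
Roughly, I would expect one L0 table per 2MB slot. A completely untested
sketch (zeroeth_pool, next_zeroeth and pte_to_paddr are made-up names, and
'flags' stands for whatever permission bits the caller passes; this is only
to show the shape of the loop):

    #define pte_to_paddr(p) ((unsigned long)(((p).pte >> PTE_SHIFT) << PAGE_SHIFT))

    static pte_t zeroeth_pool[8][PAGE_ENTRIES] __aligned(PAGE_SIZE);
    static unsigned int next_zeroeth;

    for ( page_addr = map_start; page_addr < map_end; page_addr += ZEROETH_SIZE )
    {
        unsigned long index1 = pagetable_first_index(page_addr);
        unsigned long index0 = pagetable_zeroeth_index(page_addr);
        pte_t *zeroeth;

        if ( !pte_is_valid(&first[index1]) )
        {
            /* First page in this 2MB slot: link in a fresh L0 table. */
            zeroeth = zeroeth_pool[next_zeroeth++];
            first[index1] = paddr_to_pte((unsigned long)zeroeth);
            first[index1].pte |= PTE_TABLE;
        }
        else
            zeroeth = (pte_t *)pte_to_paddr(first[index1]);

        zeroeth[index0] = paddr_to_pte((page_addr - map_start) + pa_start);
        zeroeth[index0].pte |= flags;
    }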

>>
>>> +    {
>>> +        index2 = pagetable_second_index(page_addr);
>>> +        index1 = pagetable_first_index(page_addr);
>>> +        index0 = pagetable_zeroeth_index(page_addr);
>>> +
>>> +        /* Setup level2 table */
>>> +        second[index2] = paddr_to_pte((unsigned long)first);
>>> +        second[index2].pte |= PTE_TABLE;
>>> +
>>> +        /* Setup level1 table */
>>> +        first[index1] = paddr_to_pte((unsigned long)zeroeth);
>>> +        first[index1].pte |= PTE_TABLE;
>>> +
>>> +        /* Setup level0 table */
>>> +        if ( !pte_is_valid(&zeroeth[index0]) )
>>
>> Can you explain why you are checking !pte_is_valid() for the L0 entry
>> but not the other?
> My mistake it should be checked for each level.

In which case, shouldn't you return an error if the entry is already valid?

>>
>>> +        {
>>> +            /* Update level0 table */
>>> +            zeroeth[index0] = paddr_to_pte((page_addr - map_start)
>>> + pa_start);
>>> +            zeroeth[index0].pte |= PTE_LEAF_DEFAULT;
>>> +            zeroeth[index0].pte &= ~((!writable) ? PTE_WRITABLE :
>>> 0);
>>
>> Looking at the default value, it would mean that a non-writable
>> mapping
>> is automatically executable. This seems wrong for the section is not
>> meant to be executable (like rodata).
> Yes, you are right. I'll reowrk setup_initial_mapping() to pass flags
> instead of write/read - only flag.
>>
>>> +        }
>>> +
>>> +        /* Point to next page */
>>> +        page_addr += ZEROETH_SIZE;
>>> +    }
>>> +}
>>> +
>>> +/*
>>> + * setup_initial_pagetables:
>>> + *
>>> + * 1) Build the page tables for Xen that map the following:
>>> + *   1.1)  The physical location of Xen (where the bootloader
>>> loaded it)
>>> + *   1.2)  The link-time location of Xen (where the linker
>>> expected Xen's
>>> + *         addresses to be)
>>> + * 2) Load the page table into the SATP and enable the MMU
>>> + */
>>> +void __attribute__((section(".entry")))
>>
>> I couldn't find a section ".entry" in the linker.
>>
>>> +setup_initial_pagetables(unsigned long load_addr_start,
>>> +                         unsigned long load_addr_end,
>>> +                         unsigned long linker_addr_start,
>>> +                         unsigned long linker_addr_end)
>>> +{
>>> +    pte_t *second;
>>> +    pte_t *first;
>>> +    pte_t *zeroeth;
>>> +
>>> +    clear_pagetables(load_addr_start, load_addr_end,
>>> +                     linker_addr_start, linker_addr_end);
>>> +
>>> +    /* Get the addresses where the page tables were loaded */
>>> +    second  = (pte_t *)load_addr(&xen_second_pagetable);
>>> +    first   = (pte_t *)load_addr(&xen_first_pagetable);
>>> +    zeroeth = (pte_t *)load_addr(&xen_zeroeth_pagetable);
>>
>> I would consider to embed the type cast in load_addr() so you are
>> adding
>> some type safety within your code.
> Definitely it will be better but setup_initial_mapping() uses 'unsigned
> long' for passing address that should be mapped.

One possibility would be to introduce a new wrapper for type safety. 
Anyway, it is not essential for now. Let's at least get the logic 
correct first :).

>>
>>> +
>>> +    /*
>>> +     * Create a mapping from Xen's link-time addresses to where
>>> they were actually loaded.
>>
>> This is line is way long than 80 characters. Please make sure to wrap
>> it
>> 80 characters.
>>
>>> +     */
>>> +    _setup_initial_pagetables(second, first, zeroeth,
>>> +                              linker_addr(&_stext),
>>> +                              linker_addr(&_etext),
>>> +                              load_addr(&_stext),
>>> +                              false);
>>> +    _setup_initial_pagetables(second, first, zeroeth,
>>> +                              linker_addr(&__init_begin),
>>> +                              linker_addr(&__init_end),
>>> +                              load_addr(&__init_begin),
>>> +                              true);
>>> +    _setup_initial_pagetables(second, first, zeroeth,
>>> +                              linker_addr(&_srodata),
>>> +                              linker_addr(&_erodata),
>>> +                              load_addr(&_srodata),
>>> +                              false);
>>> +    _setup_initial_pagetables(second, first, zeroeth,
>>> +                              linker_addr_start,
>>> +                              linker_addr_end,
>>> +                              load_addr_start,
>>> +                              true);
>>
>> Where do you guarantee that Xen will always fit in an L0 table and
>> the
>> start address is aligned to the size of an L0 table?
> I don't guarantee that it fit in an L0 table but the start address is
> aligned to the size of the L0 table at the start.
Then it should be fixed.

>>
>>> +
>>> +    /*
>>> +     * Create a mapping of the load time address range to... the
>>> load time address range.
>>
>> Same about the line length here.
>>
>>> +     * This mapping is used at boot time only.
>>> +     */
>>> +    _setup_initial_pagetables(second, first, zeroeth,
>>
>> This can only work if Xen is loaded at its linked address. So you
>> need a
>> separate set of L0, L1 tables for the identity mapping.
>>
>> That said, this would not be sufficient because:
>>     1) Xen may not be loaded at a 2M boundary (you can control with
>> U-boot, but not with EFI). So this may cross a boundary and therefore
>> need multiple pages.
>>     2) The load region may overlap the link address
>>
>> While I think it would be good to handle those cases from the start,
>> I
>> would understand why are not easy to solve. So I think the minimum is
>> to
>> throw some errors if you are in a case you can't support.
> Do you mean to throw some error in load_addr()/linkder_addr()?

In this case, I meant to check if load_addr != linker_addr, then throw 
an error.
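
I.e. something as simple as (sketch only; the wfi loop stands in for proper
error reporting, which we don't have that early):

    if ( load_addr_start != linker_addr_start )
        /* Can't cope with a relocated Xen yet; park the CPU. */
        for ( ;; )
            asm volatile ("wfi");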

Cheers,
Oleksii March 5, 2023, 4:25 p.m. UTC | #7
Hi Julien,

On Mon, 2023-02-27 at 17:36 +0000, Julien Grall wrote:
> Hi Oleksii,
> 
> On 27/02/2023 16:52, Oleksii wrote:
> > On Sat, 2023-02-25 at 17:53 +0000, Julien Grall wrote:
> > > > +/*
> > > > + * WARNING: load_addr() and linker_addr() are to be called
> > > > only
> > > > when the MMU is
> > > > + * disabled and only when executed by the primary CPU.  They
> > > > cannot refer to
> > > > + * any global variable or functions.
> > > 
> > > I find interesting you are saying when
> > > _setup_initial_pagetables() is
> > > called from setup_initial_pagetables(). Would you be able to
> > > explain
> > > how
> > > this is different?
> > I am not sure that I understand your question correctly but
> > _setup_initial_pagetables() was introduced to map some addresses
> > with
> > write/read flag. Probably I have to rename it to something that is
> > more
> > clear.
> 
> So the comment suggests that you code cannot refer to global 
> functions/variables when the MMU is off. So I have multiple
> questions:
>    * Why only global? IOW, why static would be OK?
>    * setup_initial_pagetables() has a call to 
> _setup_initial_pagetables() (IOW referring to another function). Why
> is 
> it fine?
>    * You have code in the next patch referring to global variables 
> (mainly _start and _end). How is this different?
> 
> > > 
> > > > + */
> > > > +
> > > > +/*
> > > > + * Convert an addressed layed out at link time to the address
> > > > where it was loaded
> > > 
> > > Typo: s/addressed/address/ ?
> > Yes, it should be address. and 'layed out' should be changed to
> > 'laid
> > out'...
> > > 
> > > > + * by the bootloader.
> > > > + */
> > > 
> > > Looking at the implementation, you seem to consider that any
> > > address
> > > not
> > > in the range [linker_addr_start, linker_addr_end[ will have a 1:1
> > > mappings.
> > > 
> > > I am not sure this is what you want. So I would consider to throw
> > > an
> > > error if such address is passed.
> > I thought that at this stage and if no relocation was done it is
> > 1:1
> > except the case when load_addr_start != linker_addr_start.
> 
> The problem is what you try to map one to one may clash with the
> linked 
> region for Xen. So it is not always possible to map the region 1:1.
> 
> Therefore, I don't see any use for the else part here.
Got it. Thanks.

I am curious, then, what the correct approach is in general to handle
this situation.
I mean, throwing an error is one option, but if I would like to do
that w/o throwing an error, should some relocation be done in that
case?

> 
> > 
> > 
> > > 
> > > > +#define load_addr(linker_address)                                              \
> > > > +    ({                                                                         \
> > > > +        unsigned long __linker_address = (unsigned long)(linker_address);      \
> > > > +        if ( linker_addr_start <= __linker_address &&                          \
> > > > +            __linker_address < linker_addr_end )                               \
> > > > +        {                                                                      \
> > > > +            __linker_address =                                                 \
> > > > +                __linker_address - linker_addr_start + load_addr_start;        \
> > > > +        }                                                                      \
> > > > +        __linker_address;                                                      \
> > > > +    })
> > > > +
> > > > +/* Convert boot-time Xen address from where it was loaded by
> > > > the
> > > > boot loader to the address it was layed out
> > > > + * at link-time.
> > > > + */
> > > 
> > > Coding style: The first line is too long and multi-line comments
> > > look
> > > like:
> > > 
> > > /*
> > >    * Foo
> > >    * Bar
> > >    */
> > > 
> > > > +#define linker_addr(load_address)                                              \
> > > 
> > > Same remark as for load_addr() above.
> > > 
> > > > +    ({                                                                         \
> > > > +        unsigned long __load_address = (unsigned long)(load_address);          \
> > > > +        if ( load_addr_start <= __load_address &&                              \
> > > > +            __load_address < load_addr_end )                                   \
> > > > +        {                                                                      \
> > > > +            __load_address =                                                   \
> > > > +                __load_address - load_addr_start + linker_addr_start;          \
> > > > +        }                                                                      \
> > > > +        __load_address;                                                        \
> > > > +    })
> > > > +
> > > > +static void __attribute__((section(".entry")))
> > > > +_setup_initial_pagetables(pte_t *second, pte_t *first, pte_t
> > > > *zeroeth,
> > > Can this be named to setup_initial_mapping() so this is clearer
> > > and
> > > avoid the one '_' different with the function below.
> > Sure. It will be better.
> > > 
> > > > +                         unsigned long map_start,
> > > > +                         unsigned long map_end,
> > > > +                         unsigned long pa_start,
> > > > +                         bool writable)
> > > 
> > > What about the executable bit?
> > It's always executable... But as you mentioned above
> > PTE_LEAF_DEFAULT
> > should be either RX or RW.
> > I think it makes sense to add flags instead of writable.
> > > 
> > > > +{
> > > > +    unsigned long page_addr;
> > > > +    unsigned long index2;
> > > > +    unsigned long index1;
> > > > +    unsigned long index0;
> > > 
> > > index* could be defined in the loop below.
> > It could. But I am curious why it is better?
> > > 
> > > > +
> > > > +    /* align start addresses */
> > > > +    map_start &= ZEROETH_MAP_MASK;
> > > > +    pa_start &= ZEROETH_MAP_MASK;
> > > 
> > > Hmmm... I would actually expect the address to be properly
> > > aligned
> > > and
> > > therefore not require an alignment here.
> > > 
> > > Otherwise, this raise the question of what happen if you have
> > > region
> > > using the same page?
> > That map_start &=  ZEROETH_MAP_MASK is needed to page number of
> > address
> > w/o page offset.
> 
> My point is why would the page offset be non-zero?
I checked the linker script and the addresses passed to
setup_initial_mapping(), and they are really always aligned, so there is
no sense in additional alignment.

> 
> > > 
> > > > +
> > > > +    page_addr = map_start;
> > > > +    while ( page_addr < map_end )
> > > 
> > > Looking at the loop, it looks like you are assuming that the
> > > region
> > > will
> > > never cross a boundary of a page-table (either L0, L1, L2). I am
> > > not
> > > convinced you can make such assumption (see below).
> > > 
> > > But if you really want to make such assumption then you should
> > > add
> > > some
> > > guard (either BUILD_BUG_ON(), ASSERT(), proper check) in your
> > > code to
> > > avoid any surprise in the future.
> > I am not sure that I fully understand what is the problem here.
> > The address is aligned on (1<<12) boundary and each itearation is
> > mapped (1<<12) page so all looks fine or I misunderstood you.
> 
> Let's take an example, imagine the region you want to map is 4MB. 
> AFAICT, you are only passing one L0 page-table. So your code will end
> up 
> to overwrite the previous entries in the zeroeth page-table and then
> add 
> another link in the L1 page-table.
Got it. Then it looks like the current approach isn't correct at all...

> 
> > > 
> > > > +    {
> > > > +        index2 = pagetable_second_index(page_addr);
> > > > +        index1 = pagetable_first_index(page_addr);
> > > > +        index0 = pagetable_zeroeth_index(page_addr);
> > > > +
> > > > +        /* Setup level2 table */
> > > > +        second[index2] = paddr_to_pte((unsigned long)first);
> > > > +        second[index2].pte |= PTE_TABLE;
> > > > +
> > > > +        /* Setup level1 table */
> > > > +        first[index1] = paddr_to_pte((unsigned long)zeroeth);
> > > > +        first[index1].pte |= PTE_TABLE;
> > > > +
> > > > +        /* Setup level0 table */
> > > > +        if ( !pte_is_valid(&zeroeth[index0]) )
> > > 
> > > Can you explain why you are checking !pte_is_valid() for the L0
> > > entry
> > > but not the other?
> > My mistake it should be checked for each level.
> 
> In which case, shouldn't you return an error if the entry is always
> valid?
> 
> > > 
> > > > +        {
> > > > +            /* Update level0 table */
> > > > +            zeroeth[index0] = paddr_to_pte((page_addr -
> > > > map_start)
> > > > + pa_start);
> > > > +            zeroeth[index0].pte |= PTE_LEAF_DEFAULT;
> > > > +            zeroeth[index0].pte &= ~((!writable) ?
> > > > PTE_WRITABLE :
> > > > 0);
> > > 
> > > Looking at the default value, it would mean that a non-writable
> > > mapping
> > > is automatically executable. This seems wrong for the section is
> > > not
> > > meant to be executable (like rodata).
> > Yes, you are right. I'll reowrk setup_initial_mapping() to pass
> > flags
> > instead of write/read - only flag.
> > > 
> > > > +        }
> > > > +
> > > > +        /* Point to next page */
> > > > +        page_addr += ZEROETH_SIZE;
> > > > +    }
> > > > +}
> > > > +
> > > > +/*
> > > > + * setup_initial_pagetables:
> > > > + *
> > > > + * 1) Build the page tables for Xen that map the following:
> > > > + *   1.1)  The physical location of Xen (where the bootloader
> > > > loaded it)
> > > > + *   1.2)  The link-time location of Xen (where the linker
> > > > expected Xen's
> > > > + *         addresses to be)
> > > > + * 2) Load the page table into the SATP and enable the MMU
> > > > + */
> > > > +void __attribute__((section(".entry")))
> > > 
> > > I couldn't find a section ".entry" in the linker.
> > > 
> > > > +setup_initial_pagetables(unsigned long load_addr_start,
> > > > +                         unsigned long load_addr_end,
> > > > +                         unsigned long linker_addr_start,
> > > > +                         unsigned long linker_addr_end)
> > > > +{
> > > > +    pte_t *second;
> > > > +    pte_t *first;
> > > > +    pte_t *zeroeth;
> > > > +
> > > > +    clear_pagetables(load_addr_start, load_addr_end,
> > > > +                     linker_addr_start, linker_addr_end);
> > > > +
> > > > +    /* Get the addresses where the page tables were loaded */
> > > > +    second  = (pte_t *)load_addr(&xen_second_pagetable);
> > > > +    first   = (pte_t *)load_addr(&xen_first_pagetable);
> > > > +    zeroeth = (pte_t *)load_addr(&xen_zeroeth_pagetable);
> > > 
> > > I would consider to embed the type cast in load_addr() so you are
> > > adding
> > > some type safety within your code.
> > Definitely it will be better but setup_initial_mapping() uses
> > 'unsigned
> > long' for passing address that should be mapped.
> 
> One possibility would be to introduce a new wrapper for the
> typesafety. 
> Anyway, it is not essential for now. Let's at least get the logic 
> correct first :).
Agreed, the logic should be fixed first.
> 
> > > 
> > > > +
> > > > +    /*
> > > > +     * Create a mapping from Xen's link-time addresses to
> > > > where
> > > > they were actually loaded.
> > > 
> > > This is line is way long than 80 characters. Please make sure to
> > > wrap
> > > it
> > > 80 characters.
> > > 
> > > > +     */
> > > > +    _setup_initial_pagetables(second, first, zeroeth,
> > > > +                              linker_addr(&_stext),
> > > > +                              linker_addr(&_etext),
> > > > +                              load_addr(&_stext),
> > > > +                              false);
> > > > +    _setup_initial_pagetables(second, first, zeroeth,
> > > > +                              linker_addr(&__init_begin),
> > > > +                              linker_addr(&__init_end),
> > > > +                              load_addr(&__init_begin),
> > > > +                              true);
> > > > +    _setup_initial_pagetables(second, first, zeroeth,
> > > > +                              linker_addr(&_srodata),
> > > > +                              linker_addr(&_erodata),
> > > > +                              load_addr(&_srodata),
> > > > +                              false);
> > > > +    _setup_initial_pagetables(second, first, zeroeth,
> > > > +                              linker_addr_start,
> > > > +                              linker_addr_end,
> > > > +                              load_addr_start,
> > > > +                              true);
> > > 
> > > Where do you guarantee that Xen will always fit in an L0 table
> > > and
> > > the
> > > start address is aligned to the size of an L0 table?
> > I don't guarantee that it fit in an L0 table but the start address
> > is
> > aligned to the size of the L0 table at the start.
> Then it should be fixed.
> 
> > > 
> > > > +
> > > > +    /*
> > > > +     * Create a mapping of the load time address range to...
> > > > the
> > > > load time address range.
> > > 
> > > Same about the line length here.
> > > 
> > > > +     * This mapping is used at boot time only.
> > > > +     */
> > > > +    _setup_initial_pagetables(second, first, zeroeth,
> > > 
> > > This can only work if Xen is loaded at its linked address. So you
> > > need a
> > > separate set of L0, L1 tables for the identity mapping.
> > > 
> > > That said, this would not be sufficient because:
> > >     1) Xen may not be loaded at a 2M boundary (you can control
> > > with
> > > U-boot, but not with EFI). So this may cross a boundary and
> > > therefore
> > > need multiple pages.
> > >     2) The load region may overlap the link address
> > > 
> > > While I think it would be good to handle those cases from the
> > > start,
> > > I
> > > would understand why are not easy to solve. So I think the
> > > minimum is
> > > to
> > > throw some errors if you are in a case you can't support.
> > Do you mean to throw some error in load_addr()/linkder_addr()?
> 
> In this case, I meant to check if load_addr != linker_addr, then
> throw 
> an error.
I am not sure that it is needed now, and it is easier to throw an error,
but does an option exist to handle the situation when load_addr != linker_addr
other than throwing an error? Relocation?


~ Oleksii
Oleksii March 5, 2023, 9:28 p.m. UTC | #8
> 
> > 
> > > > 
> > > > > +
> > > > > +    page_addr = map_start;
> > > > > +    while ( page_addr < map_end )
> > > > 
> > > > Looking at the loop, it looks like you are assuming that the
> > > > region
> > > > will
> > > > never cross a boundary of a page-table (either L0, L1, L2). I
> > > > am
> > > > not
> > > > convinced you can make such assumption (see below).
> > > > 
> > > > But if you really want to make such assumption then you should
> > > > add
> > > > some
> > > > guard (either BUILD_BUG_ON(), ASSERT(), proper check) in your
> > > > code to
> > > > avoid any surprise in the future.
> > > I am not sure that I fully understand what is the problem here.
> > > The address is aligned on (1<<12) boundary and each itearation is
> > > mapped (1<<12) page so all looks fine or I misunderstood you.
> > 
> > Let's take an example, imagine the region you want to map is 4MB. 
> > AFAICT, you are only passing one L0 page-table. So your code will
> > end
> > up 
> > to overwrite the previous entries in the zeroeth page-table and
> > then
> > add 
> > another link in the L1 page-table.
> Got it. Then it looks that current approach isn't correct totally...
Or as an option we can add to xen.lds.S something like:

  ASSERT(_end - _start <= L0_ENTRIES * PAGE_SIZE, "Xen too large")
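
(512 L0 entries of 4K each is 2M, i.e. what a single zeroeth table can cover,
so that would be the bound enforced here.)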

~ Oleksii
Oleksii March 6, 2023, 6:38 a.m. UTC | #9
On Mon, 2023-02-27 at 16:12 +0100, Jan Beulich wrote:
> On 24.02.2023 16:06, Oleksii Kurochko wrote:
> > --- /dev/null
> > +++ b/xen/arch/riscv/include/asm/page.h
> > @@ -0,0 +1,90 @@
> > +#ifndef _ASM_RISCV_PAGE_H
> > +#define _ASM_RISCV_PAGE_H
> > +
> > +#include <xen/const.h>
> > +#include <xen/types.h>
> > +
> > +#define PAGE_ENTRIES            512
> > +#define VPN_BITS                (9)
> > +#define VPN_MASK                ((unsigned long)((1 << VPN_BITS) -
> > 1))
> > +
> > +#ifdef CONFIG_RISCV_64
> > +/* L3 index Bit[47:39] */
> > +#define THIRD_SHIFT             (39)
> > +#define THIRD_MASK              (VPN_MASK << THIRD_SHIFT)
> > +/* L2 index Bit[38:30] */
> > +#define SECOND_SHIFT            (30)
> > +#define SECOND_MASK             (VPN_MASK << SECOND_SHIFT)
> > +/* L1 index Bit[29:21] */
> > +#define FIRST_SHIFT             (21)
> > +#define FIRST_MASK              (VPN_MASK << FIRST_SHIFT)
> > +/* L0 index Bit[20:12] */
> > +#define ZEROETH_SHIFT           (12)
> > +#define ZEROETH_MASK            (VPN_MASK << ZEROETH_SHIFT)
> > +
> > +#else // CONFIG_RISCV_32
> > +
> > +/* L1 index Bit[31:22] */
> > +#define FIRST_SHIFT             (22)
> > +#define FIRST_MASK              (VPN_MASK << FIRST_SHIFT)
> > +
> > +/* L0 index Bit[21:12] */
> > +#define ZEROETH_SHIFT           (12)
> > +#define ZEROETH_MASK            (VPN_MASK << ZEROETH_SHIFT)
> > +#endif
> > +
> > +#define THIRD_SIZE              (1 << THIRD_SHIFT)
> > +#define THIRD_MAP_MASK          (~(THIRD_SIZE - 1))
> > +#define SECOND_SIZE             (1 << SECOND_SHIFT)
> > +#define SECOND_MAP_MASK         (~(SECOND_SIZE - 1))
> > +#define FIRST_SIZE              (1 << FIRST_SHIFT)
> > +#define FIRST_MAP_MASK          (~(FIRST_SIZE - 1))
> > +#define ZEROETH_SIZE            (1 << ZEROETH_SHIFT)
> > +#define ZEROETH_MAP_MASK        (~(ZEROETH_SIZE - 1))
> > +
> > +#define PTE_SHIFT               10
> > +
> > +#define PTE_VALID               BIT(0, UL)
> > +#define PTE_READABLE            BIT(1, UL)
> > +#define PTE_WRITABLE            BIT(2, UL)
> > +#define PTE_EXECUTABLE          BIT(3, UL)
> > +#define PTE_USER                BIT(4, UL)
> > +#define PTE_GLOBAL              BIT(5, UL)
> > +#define PTE_ACCESSED            BIT(6, UL)
> > +#define PTE_DIRTY               BIT(7, UL)
> > +#define PTE_RSW                 (BIT(8, UL) | BIT(9, UL))
> > +
> > +#define PTE_LEAF_DEFAULT        (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE | PTE_EXECUTABLE)
> > +#define PTE_TABLE               (PTE_VALID)
> > +
> > +/* Calculate the offsets into the pagetables for a given VA */
> > +#define zeroeth_linear_offset(va)   ((va) >> ZEROETH_SHIFT)
> > +#define first_linear_offset(va)     ((va) >> FIRST_SHIFT)
> > +#define second_linear_offset(va)    ((va) >> SECOND_SHIFT)
> > +#define third_linear_offset(va)     ((va) >> THIRD_SHIFT)
> > +
> > +#define pagetable_zeroeth_index(va) zeroeth_linear_offset((va) &
> > ZEROETH_MASK)
> > +#define pagetable_first_index(va)   first_linear_offset((va) &
> > FIRST_MASK)
> > +#define pagetable_second_index(va)  second_linear_offset((va) &
> > SECOND_MASK)
> > +#define pagetable_third_index(va)   third_linear_offset((va) &
> > THIRD_MASK)
> > +
> > +/* Page Table entry */
> > +typedef struct {
> > +    uint64_t pte;
> > +} pte_t;
> > +
> > +/* Shift the VPN[x] or PPN[x] fields of a virtual or physical
> > address
> > + * to become the shifted PPN[x] fields of a page table entry */
> > +#define addr_to_ppn(x) (((x) >> PAGE_SHIFT) << PTE_SHIFT)
> > +
> > +static inline pte_t paddr_to_pte(unsigned long paddr)
> > +{
> > +    return (pte_t) { .pte = addr_to_ppn(paddr) };
> > +}
> > +
> > +static inline bool pte_is_valid(pte_t *p)
> 
> Btw - const whenever possible please, especially in such basic
> helpers.
Sure. Thanks.
> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/mm.c
> > @@ -0,0 +1,223 @@
> > +#include <xen/init.h>
> > +#include <xen/lib.h>
> > +
> > +#include <asm/csr.h>
> > +#include <asm/mm.h>
> > +#include <asm/page.h>
> > +
> > +/*
> > + * xen_second_pagetable is indexed with the VPN[2] page table
> > entry field
> > + * xen_first_pagetable is accessed from the VPN[1] page table
> > entry field
> > + * xen_zeroeth_pagetable is accessed from the VPN[0] page table
> > entry field
> > + */
> > +pte_t xen_second_pagetable[PAGE_ENTRIES]
> > __attribute__((__aligned__(PAGE_SIZE)));
> 
> static?
It should be static.
Thanks.
> 
> > +static pte_t xen_first_pagetable[PAGE_ENTRIES]
> > +    __attribute__((__aligned__(PAGE_SIZE)));
> > +static pte_t xen_zeroeth_pagetable[PAGE_ENTRIES]
> > +    __attribute__((__aligned__(PAGE_SIZE)));
> 
> Please use __aligned() instead of open-coding it. You also may want
> to
> specifiy the section here explicitly, as .bss.page_aligned (as we do
> elsewhere).
> 
> > +extern unsigned long _stext;
> > +extern unsigned long _etext;
> > +extern unsigned long __init_begin;
> > +extern unsigned long __init_end;
> > +extern unsigned long _srodata;
> > +extern unsigned long _erodata;
> 
> Please use kernel.h and drop then colliding declarations. For what's
> left please use array types, as suggested elsewhere already.
Thanks. I'll take it into account.
> 
> > +paddr_t phys_offset;
> > +
> > +#define resolve_early_addr(x) \
> > +    ({                                                                          \
> > +         unsigned long * __##x;                                                 \
> > +        if ( load_addr_start <= x && x < load_addr_end )                        \
> 
> Nit: Mismatched indentation.
> 
> > +            __##x = (unsigned long *)x;                                         \
> > +        else                                                                    \
> > +            __##x = (unsigned long *)(x + load_addr_start - linker_addr_start); \
> > +        __##x;                                                                  \
> > +     })
> > +
> > +static void __init clear_pagetables(unsigned long load_addr_start,
> > +                             unsigned long load_addr_end,
> > +                             unsigned long linker_addr_start,
> > +                             unsigned long linker_addr_end)
> 
> Nit (style): Indentation.
> 
> > +{
> > +    unsigned long *p;
> > +    unsigned long page;
> > +    unsigned long i;
> > +
> > +    page = (unsigned long)&xen_second_pagetable[0];
> > +
> > +    p = resolve_early_addr(page);
> > +    for ( i = 0; i < ARRAY_SIZE(xen_second_pagetable); i++ )
> > +    {
> > +        p[i] = 0ULL;
> > +    }
> 
> We typically omit braces around single-statement bodies. Here,
> though: Why do you do this in the first place? These static arrays
> all start out zero-initialized anyway (from when you clear .bss).
> Plus even if they didn't - why not memset()?
clear_pagetables() will be deleted; as you mentioned, the page tables will
be zeroed during initialization of .bss anyway.
It wasn't done that way before because .bss wasn't initialized. Now the
patches that do it are waiting to be merged.
> 
> > +    page = (unsigned long)&xen_first_pagetable[0];
> > +    p = resolve_early_addr(page);
> > +    for ( i = 0; i < ARRAY_SIZE(xen_first_pagetable); i++ )
> > +    {
> > +        p[i] = 0ULL;
> > +    }
> > +
> > +    page = (unsigned long)&xen_zeroeth_pagetable[0];
> > +    p = resolve_early_addr(page);
> > +    for ( i = 0; i < ARRAY_SIZE(xen_zeroeth_pagetable); i++ )
> > +    {
> > +        p[i] = 0ULL;
> > +    }
> > +}
> > +
> > +/*
> > + * WARNING: load_addr() and linker_addr() are to be called only
> > when the MMU is
> > + * disabled and only when executed by the primary CPU.  They
> > cannot refer to
> > + * any global variable or functions.
> > + */
> > +
> > +/*
> > + * Convert an addressed layed out at link time to the address
> > where it was loaded
> > + * by the bootloader.
> > + */
> > +#define
> > load_addr(linker_address)                                          
> >     \
> > +   
> > ({                                                                 
> >         \
> > +        unsigned long __linker_address = (unsigned
> > long)(linker_address);      \
> > +        if ( linker_addr_start <= __linker_address
> > &&                          \
> > +            __linker_address < linker_addr_end
> > )                               \
> > +       
> > {                                                                  
> >     \
> > +            __linker_address
> > =                                                 \
> > +                __linker_address - linker_addr_start +
> > load_addr_start;        \
> > +       
> > }                                                                  
> >     \
> > +       
> > __linker_address;                                                  
> >     \
> > +    })
> > +
> > +/* Convert boot-time Xen address from where it was loaded by the
> > boot loader to the address it was layed out
> > + * at link-time.
> > + */
> > +#define
> > linker_addr(load_address)                                          
> >     \
> > +   
> > ({                                                                 
> >         \
> > +        unsigned long __load_address = (unsigned
> > long)(load_address);          \
> > +        if ( load_addr_start <= __load_address
> > &&                              \
> > +            __load_address < load_addr_end
> > )                                   \
> > +       
> > {                                                                  
> >     \
> > +            __load_address
> > =                                                   \
> > +                __load_address - load_addr_start +
> > linker_addr_start;          \
> > +       
> > }                                                                  
> >     \
> > +       
> > __load_address;                                                    
> >     \
> > +    })
> > +
> > +static void __attribute__((section(".entry")))
> > +_setup_initial_pagetables(pte_t *second, pte_t *first, pte_t
> > *zeroeth,
> 
> Why the special section (also again further down)?
There is no specific reason in the case of Xen. I'll remove that.

~ Oleksii
Oleksii March 6, 2023, 6:39 a.m. UTC | #10
On Mon, 2023-02-27 at 16:19 +0100, Jan Beulich wrote:
> On 27.02.2023 16:12, Jan Beulich wrote:
> > On 24.02.2023 16:06, Oleksii Kurochko wrote:
> > > +static void __attribute__((section(".entry")))
> > > +_setup_initial_pagetables(pte_t *second, pte_t *first, pte_t
> > > *zeroeth,
> > 
> > Why the special section (also again further down)?
> 
> Looking at patch 2 it occurred to me that you probably mean __init
> here.
Yes, you are right.
> 
> Jan
Julien Grall March 21, 2023, 4:25 p.m. UTC | #11
On 05/03/2023 16:25, Oleksii wrote:
> Hi Julien,

Hi,

Sorry for the late answer. I was away for the past couple of weeks.

> On Mon, 2023-02-27 at 17:36 +0000, Julien Grall wrote:
>> Hi Oleksii,
>>
>> On 27/02/2023 16:52, Oleksii wrote:
>>> On Sat, 2023-02-25 at 17:53 +0000, Julien Grall wrote:
>>>>> +/*
>>>>> + * WARNING: load_addr() and linker_addr() are to be called
>>>>> only
>>>>> when the MMU is
>>>>> + * disabled and only when executed by the primary CPU.  They
>>>>> cannot refer to
>>>>> + * any global variable or functions.
>>>>
>>>> I find interesting you are saying when
>>>> _setup_initial_pagetables() is
>>>> called from setup_initial_pagetables(). Would you be able to
>>>> explain
>>>> how
>>>> this is different?
>>> I am not sure that I understand your question correctly but
>>> _setup_initial_pagetables() was introduced to map some addresses
>>> with
>>> write/read flag. Probably I have to rename it to something that is
>>> more
>>> clear.
>>
>> So the comment suggests that you code cannot refer to global
>> functions/variables when the MMU is off. So I have multiple
>> questions:
>>     * Why only global? IOW, why static would be OK?
>>     * setup_initial_pagetables() has a call to
>> _setup_initial_pagetables() (IOW referring to another function). Why
>> is
>> it fine?
>>     * You have code in the next patch referring to global variables
>> (mainly _start and _end). How is this different?
>>
>>>>
>>>>> + */
>>>>> +
>>>>> +/*
>>>>> + * Convert an addressed layed out at link time to the address
>>>>> where it was loaded
>>>>
>>>> Typo: s/addressed/address/ ?
>>> Yes, it should be address. and 'layed out' should be changed to
>>> 'laid
>>> out'...
>>>>
>>>>> + * by the bootloader.
>>>>> + */
>>>>
>>>> Looking at the implementation, you seem to consider that any
>>>> address
>>>> not
>>>> in the range [linker_addr_start, linker_addr_end[ will have a 1:1
>>>> mappings.
>>>>
>>>> I am not sure this is what you want. So I would consider to throw
>>>> an
>>>> error if such address is passed.
>>> I thought that at this stage and if no relocation was done it is
>>> 1:1
>>> except the case when load_addr_start != linker_addr_start.
>>
>> The problem is what you try to map one to one may clash with the
>> linked
>> region for Xen. So it is not always possible to map the region 1:1.
>>
>> Therefore, I don't see any use for the else part here.
> Got it. Thanks.
> 
> I am curious than what is the correct approach in general to handle
> this situation?
There are multiple approaches to handle it and I don't know which one 
would be best :). Relocation is one...

> I mean that throw an error it is one option but if I would like to do
> that w/o throwing an error. Should it done some relocation in that
> case?
... solution. For Arm, I decided to avoid relocation as it requires more 
work in assembly.

Let me describe what we did and you can decide what you want to do in 
RISC-V.

For Arm64, as we have plenty of virtual address space, I decided to 
reshuffle the layout so Xen is running at a very high address (so it is 
unlikely to clash).

For Arm32, we have a smaller address space (4GB) so instead we are going 
through a temporary area to enable the MMU when the load and runtime 
region clash. The sequence is:

   1) Map Xen to a temporary area
   2) Enable the MMU and jump to the temporary area
   3) Map Xen to the runtime area
   4) Jump to the runtime area
   5) Remove the temporary area

[...]

>>>> Hmmm... I would actually expect the address to be properly
>>>> aligned
>>>> and
>>>> therefore not require an alignment here.
>>>>
>>>> Otherwise, this raise the question of what happen if you have
>>>> region
>>>> using the same page?
>>> That map_start &=  ZEROETH_MAP_MASK is needed to page number of
>>> address
>>> w/o page offset.
>>
>> My point is why would the page offset be non-zero?
> I checked a linker script and addresses that passed to
> setup_initial_mapping() and they are really always aligned so there is
> no any sense in additional alignment.

Ok. I would suggest adding some ASSERT()/BUG_ON() in order to confirm 
this is always the case.
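
For instance, something like this at the top of setup_initial_mapping()
(a sketch, assuming BUG_ON() is usable that early in boot):

    /* The mapping loop below relies on page-aligned inputs. */
    BUG_ON(map_start & ~ZEROETH_MAP_MASK);
    BUG_ON(pa_start & ~ZEROETH_MAP_MASK);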

[...]

>>>>
>>>>> +
>>>>> +    /*
>>>>> +     * Create a mapping of the load time address range to...
>>>>> the
>>>>> load time address range.
>>>>
>>>> Same about the line length here.
>>>>
>>>>> +     * This mapping is used at boot time only.
>>>>> +     */
>>>>> +    _setup_initial_pagetables(second, first, zeroeth,
>>>>
>>>> This can only work if Xen is loaded at its linked address. So you
>>>> need a
>>>> separate set of L0, L1 tables for the identity mapping.
>>>>
>>>> That said, this would not be sufficient because:
>>>>      1) Xen may not be loaded at a 2M boundary (you can control
>>>> with
>>>> U-boot, but not with EFI). So this may cross a boundary and
>>>> therefore
>>>> need multiple pages.
>>>>      2) The load region may overlap the link address
>>>>
>>>> While I think it would be good to handle those cases from the
>>>> start,
>>>> I
>>>> would understand why are not easy to solve. So I think the
>>>> minimum is
>>>> to
>>>> throw some errors if you are in a case you can't support.
>>> Do you mean to throw some error in load_addr()/linkder_addr()?
>>
>> In this case, I meant to check if load_addr != linker_addr, then
>> throw
>> an error.
> I am not sure that it is needed now and it is easier to throw an error
> but is option exist to handler situation when load_addr != linker_addr
> except throwing an error? relocate?

I believe I answered this above.

Cheers,
Oleksii March 22, 2023, 9:14 a.m. UTC | #12
Hi Julien,

On Tue, 2023-03-21 at 16:25 +0000, Julien Grall wrote:
> 
> 
> On 05/03/2023 16:25, Oleksii wrote:
> > Hi Julien,
> 
> Hi,
> 
> Sorry for the late answer. I was away for the past couple of weeks.
> 
> > On Mon, 2023-02-27 at 17:36 +0000, Julien Grall wrote:
> > > Hi Oleksii,
> > > 
> > > On 27/02/2023 16:52, Oleksii wrote:
> > > > On Sat, 2023-02-25 at 17:53 +0000, Julien Grall wrote:
> > > > > > +/*
> > > > > > + * WARNING: load_addr() and linker_addr() are to be called
> > > > > > only
> > > > > > when the MMU is
> > > > > > + * disabled and only when executed by the primary CPU. 
> > > > > > They
> > > > > > cannot refer to
> > > > > > + * any global variable or functions.
> > > > > 
> > > > > I find interesting you are saying when
> > > > > _setup_initial_pagetables() is
> > > > > called from setup_initial_pagetables(). Would you be able to
> > > > > explain
> > > > > how
> > > > > this is different?
> > > > I am not sure that I understand your question correctly but
> > > > _setup_initial_pagetables() was introduced to map some
> > > > addresses
> > > > with
> > > > write/read flag. Probably I have to rename it to something that
> > > > is
> > > > more
> > > > clear.
> > > 
> > > So the comment suggests that you code cannot refer to global
> > > functions/variables when the MMU is off. So I have multiple
> > > questions:
> > >     * Why only global? IOW, why static would be OK?
> > >     * setup_initial_pagetables() has a call to
> > > _setup_initial_pagetables() (IOW referring to another function).
> > > Why
> > > is
> > > it fine?
> > >     * You have code in the next patch referring to global
> > > variables
> > > (mainly _start and _end). How is this different?
> > > 
> > > > > 
> > > > > > + */
> > > > > > +
> > > > > > +/*
> > > > > > + * Convert an addressed layed out at link time to the
> > > > > > address
> > > > > > where it was loaded
> > > > > 
> > > > > Typo: s/addressed/address/ ?
> > > > Yes, it should be address. and 'layed out' should be changed to
> > > > 'laid
> > > > out'...
> > > > > 
> > > > > > + * by the bootloader.
> > > > > > + */
> > > > > 
> > > > > Looking at the implementation, you seem to consider that any
> > > > > address
> > > > > not
> > > > > in the range [linker_addr_start, linker_addr_end[ will have a
> > > > > 1:1
> > > > > mappings.
> > > > > 
> > > > > I am not sure this is what you want. So I would consider to
> > > > > throw
> > > > > an
> > > > > error if such address is passed.
> > > > I thought that at this stage and if no relocation was done it
> > > > is
> > > > 1:1
> > > > except the case when load_addr_start != linker_addr_start.
> > > 
> > > The problem is what you try to map one to one may clash with the
> > > linked
> > > region for Xen. So it is not always possible to map the region
> > > 1:1.
> > > 
> > > Therefore, I don't see any use for the else part here.
> > Got it. Thanks.
> > 
> > I am curious than what is the correct approach in general to handle
> > this situation?
> There are multiple approach to handle it and I don't know which one 
> would be best :). Relocation is one...
> 
> > I mean that throw an error it is one option but if I would like to
> > do
> > that w/o throwing an error. Should it done some relocation in that
> > case?
> ... solution. For Arm, I decided to avoid relocation it requires more
> work in assembly.
> 
> Let me describe what we did and you can decide what you want to do in
> RISC-V.
> 
> For Arm64, as we have plenty of virtual address space, I decided to 
> reshuffle the layout so Xen is running a very high address (so it is 
> unlikely to clash).
I thought about running Xen at a very high address.
Thanks. I think it is a nice option to do the same for RISC-V64.

> 
> For Arm32, we have a smaller address space (4GB) so instead we are
> going 
> through a temporary area to enable the MMU when the load and runtime 
> region clash. The sequence is:
> 
>    1) Map Xen to a temporary area
>    2) Enable the MMU and jump to the temporary area
>    3) Map Xen to the runtime area
>    4) Jump to the runtime area
>    5) Remove the temporary area
> 
It is the same for RV32. As we don't support RV32 I will use:
  #error "Add support of MMU for RV32"
> [...]
> 
> > > > > Hmmm... I would actually expect the address to be properly
> > > > > aligned
> > > > > and
> > > > > therefore not require an alignment here.
> > > > > 
> > > > > Otherwise, this raise the question of what happen if you have
> > > > > region
> > > > > using the same page?
> > > > That map_start &=  ZEROETH_MAP_MASK is needed to page number of
> > > > address
> > > > w/o page offset.
> > > 
> > > My point is why would the page offset be non-zero?
> > I checked a linker script and addresses that passed to
> > setup_initial_mapping() and they are really always aligned so there
> > is
> > no any sense in additional alignment.
> 
> Ok. I would suggest to add some ASSERT()/BUG_ON() in order to confirm
> this is always the case.
> 
> [...]
> 
> > > > > 
> > > > > > +
> > > > > > +    /*
> > > > > > +     * Create a mapping of the load time address range
> > > > > > to...
> > > > > > the
> > > > > > load time address range.
> > > > > 
> > > > > Same about the line length here.
> > > > > 
> > > > > > +     * This mapping is used at boot time only.
> > > > > > +     */
> > > > > > +    _setup_initial_pagetables(second, first, zeroeth,
> > > > > 
> > > > > This can only work if Xen is loaded at its linked address. So
> > > > > you
> > > > > need a
> > > > > separate set of L0, L1 tables for the identity mapping.
> > > > > 
> > > > > That said, this would not be sufficient because:
> > > > >      1) Xen may not be loaded at a 2M boundary (you can
> > > > > control
> > > > > with
> > > > > U-boot, but not with EFI). So this may cross a boundary and
> > > > > therefore
> > > > > need multiple pages.
> > > > >      2) The load region may overlap the link address
> > > > > 
> > > > > While I think it would be good to handle those cases from the
> > > > > start,
> > > > > I
> > > > > would understand why are not easy to solve. So I think the
> > > > > minimum is
> > > > > to
> > > > > throw some errors if you are in a case you can't support.
> > > > Do you mean to throw some error in load_addr()/linkder_addr()?
> > > 
> > > In this case, I meant to check if load_addr != linker_addr, then
> > > throw
> > > an error.
> > I am not sure that it is needed now and it is easier to throw an
> > error
> > but is option exist to handler situation when load_addr !=
> > linker_addr
> > except throwing an error? relocate?
> 
> I believe I answered this above.
Yeah, you answered my question completely. Thank you very much.

~ Oleksii
diff mbox series

Patch

diff --git a/xen/arch/riscv/Makefile b/xen/arch/riscv/Makefile
index 443f6bf15f..956ceb02df 100644
--- a/xen/arch/riscv/Makefile
+++ b/xen/arch/riscv/Makefile
@@ -1,5 +1,6 @@ 
 obj-$(CONFIG_EARLY_PRINTK) += early_printk.o
 obj-y += entry.o
+obj-y += mm.o
 obj-$(CONFIG_RISCV_64) += riscv64/
 obj-y += sbi.o
 obj-y += setup.o
diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
new file mode 100644
index 0000000000..fc1866b1d8
--- /dev/null
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -0,0 +1,9 @@ 
+#ifndef _ASM_RISCV_MM_H
+#define _ASM_RISCV_MM_H
+
+void setup_initial_pagetables(unsigned long load_addr_start,
+                              unsigned long load_addr_end,
+                              unsigned long linker_addr_start,
+                              unsigned long linker_addr_end);
+
+#endif /* _ASM_RISCV_MM_H */
diff --git a/xen/arch/riscv/include/asm/page.h b/xen/arch/riscv/include/asm/page.h
new file mode 100644
index 0000000000..fabbe1305f
--- /dev/null
+++ b/xen/arch/riscv/include/asm/page.h
@@ -0,0 +1,90 @@ 
+#ifndef _ASM_RISCV_PAGE_H
+#define _ASM_RISCV_PAGE_H
+
+#include <xen/const.h>
+#include <xen/types.h>
+
+#define PAGE_ENTRIES            512
+#define VPN_BITS                (9)
+#define VPN_MASK                ((unsigned long)((1 << VPN_BITS) - 1))
+
+#ifdef CONFIG_RISCV_64
+/* L3 index Bit[47:39] */
+#define THIRD_SHIFT             (39)
+#define THIRD_MASK              (VPN_MASK << THIRD_SHIFT)
+/* L2 index Bit[38:30] */
+#define SECOND_SHIFT            (30)
+#define SECOND_MASK             (VPN_MASK << SECOND_SHIFT)
+/* L1 index Bit[29:21] */
+#define FIRST_SHIFT             (21)
+#define FIRST_MASK              (VPN_MASK << FIRST_SHIFT)
+/* L0 index Bit[20:12] */
+#define ZEROETH_SHIFT           (12)
+#define ZEROETH_MASK            (VPN_MASK << ZEROETH_SHIFT)
+
+#else // CONFIG_RISCV_32
+
+/* L1 index Bit[31:22] */
+#define FIRST_SHIFT             (22)
+#define FIRST_MASK              (VPN_MASK << FIRST_SHIFT)
+
+/* L0 index Bit[21:12] */
+#define ZEROETH_SHIFT           (12)
+#define ZEROETH_MASK            (VPN_MASK << ZEROETH_SHIFT)
+#endif
+
+#define THIRD_SIZE              (1 << THIRD_SHIFT)
+#define THIRD_MAP_MASK          (~(THIRD_SIZE - 1))
+#define SECOND_SIZE             (1 << SECOND_SHIFT)
+#define SECOND_MAP_MASK         (~(SECOND_SIZE - 1))
+#define FIRST_SIZE              (1 << FIRST_SHIFT)
+#define FIRST_MAP_MASK          (~(FIRST_SIZE - 1))
+#define ZEROETH_SIZE            (1 << ZEROETH_SHIFT)
+#define ZEROETH_MAP_MASK        (~(ZEROETH_SIZE - 1))
+
+#define PTE_SHIFT               10
+
+#define PTE_VALID               BIT(0, UL)
+#define PTE_READABLE            BIT(1, UL)
+#define PTE_WRITABLE            BIT(2, UL)
+#define PTE_EXECUTABLE          BIT(3, UL)
+#define PTE_USER                BIT(4, UL)
+#define PTE_GLOBAL              BIT(5, UL)
+#define PTE_ACCESSED            BIT(6, UL)
+#define PTE_DIRTY               BIT(7, UL)
+#define PTE_RSW                 (BIT(8, UL) | BIT(9, UL))
+
+#define PTE_LEAF_DEFAULT        (PTE_VALID | PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE)
+#define PTE_TABLE               (PTE_VALID)
+
+/* Calculate the offsets into the pagetables for a given VA */
+#define zeroeth_linear_offset(va)   ((va) >> ZEROETH_SHIFT)
+#define first_linear_offset(va)     ((va) >> FIRST_SHIFT)
+#define second_linear_offset(va)    ((va) >> SECOND_SHIFT)
+#define third_linear_offset(va)     ((va) >> THIRD_SHIFT)
+
+#define pagetable_zeroeth_index(va) zeroeth_linear_offset((va) & ZEROETH_MASK)
+#define pagetable_first_index(va)   first_linear_offset((va) & FIRST_MASK)
+#define pagetable_second_index(va)  second_linear_offset((va) & SECOND_MASK)
+#define pagetable_third_index(va)   third_linear_offset((va) & THIRD_MASK)
+
+/* Page Table entry */
+typedef struct {
+    uint64_t pte;
+} pte_t;
+
+/* Shift the VPN[x] or PPN[x] fields of a virtual or physical address
+ * to become the shifted PPN[x] fields of a page table entry */
+#define addr_to_ppn(x) (((x) >> PAGE_SHIFT) << PTE_SHIFT)
+
+static inline pte_t paddr_to_pte(unsigned long paddr)
+{
+    return (pte_t) { .pte = addr_to_ppn(paddr) };
+}
+
+static inline bool pte_is_valid(pte_t *p)
+{
+    return p->pte & PTE_VALID;
+}
+
+#endif /* _ASM_RISCV_PAGE_H */
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
new file mode 100644
index 0000000000..6e172376eb
--- /dev/null
+++ b/xen/arch/riscv/mm.c
@@ -0,0 +1,223 @@ 
+#include <xen/init.h>
+#include <xen/lib.h>
+
+#include <asm/csr.h>
+#include <asm/mm.h>
+#include <asm/page.h>
+
+/*
+ * xen_second_pagetable is indexed with the VPN[2] page table entry field
+ * xen_first_pagetable is accessed from the VPN[1] page table entry field
+ * xen_zeroeth_pagetable is accessed from the VPN[0] page table entry field
+ */
+pte_t xen_second_pagetable[PAGE_ENTRIES] __attribute__((__aligned__(PAGE_SIZE)));
+static pte_t xen_first_pagetable[PAGE_ENTRIES]
+    __attribute__((__aligned__(PAGE_SIZE)));
+static pte_t xen_zeroeth_pagetable[PAGE_ENTRIES]
+    __attribute__((__aligned__(PAGE_SIZE)));
+
+extern unsigned long _stext;
+extern unsigned long _etext;
+extern unsigned long __init_begin;
+extern unsigned long __init_end;
+extern unsigned long _srodata;
+extern unsigned long _erodata;
+
+paddr_t phys_offset;
+
+#define resolve_early_addr(x) \
+    ({                                                                          \
+         unsigned long * __##x;                                                 \
+        if ( load_addr_start <= x && x < load_addr_end )                        \
+            __##x = (unsigned long *)x;                                         \
+        else                                                                    \
+            __##x = (unsigned long *)(x + load_addr_start - linker_addr_start); \
+        __##x;                                                                  \
+     })
+
+static void __init clear_pagetables(unsigned long load_addr_start,
+                             unsigned long load_addr_end,
+                             unsigned long linker_addr_start,
+                             unsigned long linker_addr_end)
+{
+    unsigned long *p;
+    unsigned long page;
+    unsigned long i;
+
+    page = (unsigned long)&xen_second_pagetable[0];
+
+    p = resolve_early_addr(page);
+    for ( i = 0; i < ARRAY_SIZE(xen_second_pagetable); i++ )
+    {
+        p[i] = 0ULL;
+    }
+
+    page = (unsigned long)&xen_first_pagetable[0];
+    p = resolve_early_addr(page);
+    for ( i = 0; i < ARRAY_SIZE(xen_first_pagetable); i++ )
+    {
+        p[i] = 0ULL;
+    }
+
+    page = (unsigned long)&xen_zeroeth_pagetable[0];
+    p = resolve_early_addr(page);
+    for ( i = 0; i < ARRAY_SIZE(xen_zeroeth_pagetable); i++ )
+    {
+        p[i] = 0ULL;
+    }
+}
+
+/*
+ * WARNING: load_addr() and linker_addr() are to be called only when the MMU is
+ * disabled and only when executed by the primary CPU.  They cannot refer to
+ * any global variable or functions.
+ */
+
+/*
+ * Convert an addressed layed out at link time to the address where it was loaded
+ * by the bootloader.
+ */
+#define load_addr(linker_address)                                              \
+    ({                                                                         \
+        unsigned long __linker_address = (unsigned long)(linker_address);      \
+        if ( linker_addr_start <= __linker_address &&                          \
+            __linker_address < linker_addr_end )                               \
+        {                                                                      \
+            __linker_address =                                                 \
+                __linker_address - linker_addr_start + load_addr_start;        \
+        }                                                                      \
+        __linker_address;                                                      \
+    })
+
+/*
+ * Convert a boot-time Xen address from where it was loaded by the bootloader
+ * to the address it was laid out at link-time.
+ */
+#define linker_addr(load_address)                                              \
+    ({                                                                         \
+        unsigned long __load_address = (unsigned long)(load_address);          \
+        if ( load_addr_start <= __load_address &&                              \
+            __load_address < load_addr_end )                                   \
+        {                                                                      \
+            __load_address =                                                   \
+                __load_address - load_addr_start + linker_addr_start;          \
+        }                                                                      \
+        __load_address;                                                        \
+    })
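
A worked example may help; the addresses below are made up purely for illustration:

/*
 * Hypothetical example:
 *   linker_addr_start = 0xffffffc000000000, load_addr_start = 0x80200000
 *
 *   load_addr(0xffffffc000001000)  -> 0x80201000
 *   linker_addr(0x80201000)        -> 0xffffffc000001000
 *
 * An address outside the respective [start, end) range is returned unchanged.
 */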
+
+static void __attribute__((section(".entry")))
+_setup_initial_pagetables(pte_t *second, pte_t *first, pte_t *zeroeth,
+                         unsigned long map_start,
+                         unsigned long map_end,
+                         unsigned long pa_start,
+                         bool writable)
+{
+    unsigned long page_addr;
+    unsigned long index2;
+    unsigned long index1;
+    unsigned long index0;
+
+    /* align start addresses */
+    map_start &= ZEROETH_MAP_MASK;
+    pa_start &= ZEROETH_MAP_MASK;
+
+    page_addr = map_start;
+    while ( page_addr < map_end )
+    {
+        index2 = pagetable_second_index(page_addr);
+        index1 = pagetable_first_index(page_addr);
+        index0 = pagetable_zeroeth_index(page_addr);
+
+        /* Set up the level 2 entry to point at the level 1 table */
+        second[index2] = paddr_to_pte((unsigned long)first);
+        second[index2].pte |= PTE_TABLE;
+
+        /* Set up the level 1 entry to point at the level 0 table */
+        first[index1] = paddr_to_pte((unsigned long)zeroeth);
+        first[index1].pte |= PTE_TABLE;
+
+        /* Set up the level 0 (leaf) entry, if it is not yet valid */
+        if ( !pte_is_valid(&zeroeth[index0]) )
+        {
+            /* Update the level 0 table */
+            zeroeth[index0] = paddr_to_pte((page_addr - map_start) + pa_start);
+            zeroeth[index0].pte |= PTE_LEAF_DEFAULT;
+            if ( !writable )
+                zeroeth[index0].pte &= ~PTE_WRITABLE;
+        }
+
+        /* Point to next page */
+        page_addr += ZEROETH_SIZE;
+    }
+}
+
+/*
+ * setup_initial_pagetables:
+ *
+ * 1) Build the page tables for Xen that map the following:
+ *   1.1)  The physical location of Xen (where the bootloader loaded it)
+ *   1.2)  The link-time location of Xen (where the linker expected Xen's
+ *         addresses to be)
+ * 2) Load the page table into the SATP and enable the MMU
+ */
+void __attribute__((section(".entry")))
+setup_initial_pagetables(unsigned long load_addr_start,
+                         unsigned long load_addr_end,
+                         unsigned long linker_addr_start,
+                         unsigned long linker_addr_end)
+{
+    pte_t *second;
+    pte_t *first;
+    pte_t *zeroeth;
+
+    clear_pagetables(load_addr_start, load_addr_end,
+                     linker_addr_start, linker_addr_end);
+
+    /* Get the addresses where the page tables were loaded */
+    second  = (pte_t *)load_addr(&xen_second_pagetable);
+    first   = (pte_t *)load_addr(&xen_first_pagetable);
+    zeroeth = (pte_t *)load_addr(&xen_zeroeth_pagetable);
+
+    /*
+     * Create a mapping from Xen's link-time addresses to where they were
+     * actually loaded.
+     */
+    _setup_initial_pagetables(second, first, zeroeth,
+                              linker_addr(&_stext),
+                              linker_addr(&_etext),
+                              load_addr(&_stext),
+                              false);
+    _setup_initial_pagetables(second, first, zeroeth,
+                              linker_addr(&__init_begin),
+                              linker_addr(&__init_end),
+                              load_addr(&__init_begin),
+                              true);
+    _setup_initial_pagetables(second, first, zeroeth,
+                              linker_addr(&_srodata),
+                              linker_addr(&_erodata),
+                              load_addr(&_srodata),
+                              false);
+    _setup_initial_pagetables(second, first, zeroeth,
+                              linker_addr_start,
+                              linker_addr_end,
+                              load_addr_start,
+                              true);
+
+    /*
+     * Create an identity mapping of the load-time address range, i.e. map the
+     * load-time addresses to themselves.  This mapping is used at boot time
+     * only.
+     */
+    _setup_initial_pagetables(second, first, zeroeth,
+                              load_addr_start,
+                              load_addr_end,
+                              load_addr_start,
+                              true);
+
+    /* Ensure page table writes precede loading the SATP */
+    asm volatile("sfence.vma");
+
+    /* Enable the MMU and load the new pagetable for Xen */
+    csr_write(CSR_SATP,
+              (load_addr(xen_second_pagetable) >> PAGE_SHIFT) |
+              (SATP_MODE_SV39 << SATP_MODE_SHIFT));
+
+    phys_offset = load_addr_start - linker_addr_start;
+}
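
For reference, the SATP value composed above breaks down as follows (field layout per the RISC-V privileged specification; CSR_SATP, SATP_MODE_SV39 and SATP_MODE_SHIFT are taken as provided by the headers this file already includes):

/*
 * satp = (PPN of the root table, i.e. the load address of
 *         xen_second_pagetable shifted right by PAGE_SHIFT)
 *      | (SATP_MODE_SV39 << SATP_MODE_SHIFT)   -- MODE field, top bits
 *
 * Writing a MODE other than Bare enables address translation; the identity
 * mapping of the load-time range set up above keeps the currently-executing
 * code reachable once translation is on.
 */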