
[v4,2/2] xen/riscv: introduce identity mapping

Message ID a8ab1829ab718dda869db3df3348ded211e81967.1690191480.git.oleksii.kurochko@gmail.com (mailing list archive)
State Superseded
Series xen/riscv: introduce identity mapping

Commit Message

Oleksii Kurochko July 24, 2023, 9:42 a.m. UTC
The way the switch to virtual addresses was implemented in
commit e66003e7be ("xen/riscv: introduce setup_initial_pages")
isn't safe enough:
* enable_mmu() depends on hooking all exceptions
  and the page fault handler.
* Any exception other than a page fault, or failing to take a
  page fault, causes it to malfunction, meaning the boot will fail
  depending on where Xen was loaded into memory.

Instead of the previously proposed way of switching to virtual addresses,
it was decided to identity-map the entire Xen image and, after the switch
to virtual addresses, remove the identity mapping from the page tables as
follows: recursively visit all PTEs related to the identity mapping and
remove them.

Fixes: e66003e7be ("xen/riscv: introduce setup_initial_pages")
Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
Changes in V4:
 - remove the definitions of ARRAY_SIZE and ROUNDUP, as <xen/macros.h> was introduced and these macros now live there.
 - update definition of PGTBL_INITIAL_COUNT
 - update the commit message
 - update the algorithm for removing the identity mapping
---
Changes in V3:
 - remove changes unrelated to the patch ( SPDX tags in config.h ).
 - update definition of PGTBL_INITIAL_COUNT taking into account identity mapping.
 - refactor remove_identity_mapping() function.
 - add explanatory comments in xen.lds.S and mm.c.
 - update commit message.
 - move save/restore of a0/a1 registers to [PATCH v2 2/3] xen/riscv: introduce
   function for physical offset calculation.
---
Changes in V2:
  - update definition of PGTBL_INITIAL_COUNT and the comment above.
  - code style fixes.
  - 1:1 mapping for entire Xen.
  - remove id_addrs array because the entire Xen image is mapped.
  - reverse the loop condition inside remove_identity_mapping().
  - fix page table walk in remove_identity_mapping().
  - update the commit message.
  - add Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
  - save hart_id and dtb_addr before calling MMU-related C functions.
  - use the phys_offset variable instead of calculating the physical offset
    in head.S. ( it can easily be done as the entire Xen image is 1:1 mapped )
  - declare enable_mmu() as __init.
---
 xen/arch/riscv/include/asm/mm.h |   3 +-
 xen/arch/riscv/mm.c             | 101 ++++++++++++++++++++------------
 xen/arch/riscv/riscv64/head.S   |  22 +++++++
 xen/arch/riscv/setup.c          |  14 +----
 xen/arch/riscv/xen.lds.S        |   4 ++
 5 files changed, 93 insertions(+), 51 deletions(-)

Comments

Jan Beulich July 24, 2023, 2:11 p.m. UTC | #1
On 24.07.2023 11:42, Oleksii Kurochko wrote:
> @@ -35,8 +36,10 @@ unsigned long __ro_after_init phys_offset;
>   *
>   * It might be needed one more page table in case when Xen load address
>   * isn't 2 MB aligned.
> + *
> + * CONFIG_PAGING_LEVELS page tables are needed for identity mapping.
>   */
> -#define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) + 1)
> +#define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) * 2) + 1

Comment addition and code change are at least apparently out of sync:
With such a comment and without thinking much one would expect the
constant to be bumped by CONFIG_PAGING_LEVELS. It is true that you
only need CONFIG_PAGING_LEVELS - 1, because the root table is shared,
but that would then be nice to also clarify in the comment. E.g.

"CONFIG_PAGING_LEVELS page tables are needed for the identity mapping,
 except that the root page table is shared with the initial mapping."

Also - where did the outermost pair of parentheses go? (Really you
don't need to parenthesize the multiplication, so the last closing
one can simply move last.)
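
Something along these lines then, just to illustrate what I mean
(combining both points):

    /*
     * CONFIG_PAGING_LEVELS page tables are needed for the identity mapping,
     * except that the root page table is shared with the initial mapping.
     */
    #define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) * 2 + 1)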

> @@ -75,10 +78,11 @@ static void __init setup_initial_mapping(struct mmu_desc *mmu_desc,
>      unsigned int index;
>      pte_t *pgtbl;
>      unsigned long page_addr;
> +    bool is_identity_mapping = map_start == pa_start;
>  
> -    if ( (unsigned long)_start % XEN_PT_LEVEL_SIZE(0) )
> +    if ( !IS_ALIGNED((unsigned long)_start, KB(4)) )
>      {
> -        early_printk("(XEN) Xen should be loaded at 4k boundary\n");
> +        early_printk("(XEN) Xen should be loaded at 4KB boundary\n");

The change to the message looks unrelated.

> @@ -255,25 +261,44 @@ void __init noreturn noinline enable_mmu()
>      csr_write(CSR_SATP,
>                PFN_DOWN((unsigned long)stage1_pgtbl_root) |
>                RV_STAGE1_MODE << SATP_MODE_SHIFT);
> +}
>  
> -    asm volatile ( ".p2align 2" );
> - mmu_is_enabled:
> -    /*
> -     * Stack should be re-inited as:
> -     * 1. Right now an address of the stack is relative to load time
> -     *    addresses what will cause an issue in case of load start address
> -     *    isn't equal to linker start address.
> -     * 2. Addresses in stack are all load time relative which can be an
> -     *    issue in case when load start address isn't equal to linker
> -     *    start address.
> -     *
> -     * We can't return to the caller because the stack was reseted
> -     * and it may have stash some variable on the stack.
> -     * Jump to a brand new function as the stack was reseted
> -     */
> +void __init remove_identity_mapping(void)
> +{
> +    static pte_t *pgtbl = stage1_pgtbl_root;
> +    static unsigned long load_start = XEN_VIRT_START;
> +    static unsigned int pt_level = CONFIG_PAGING_LEVELS - 1;

These all want to be __initdata, but personally I find this way of
recursing a little odd. Let's see what the maintainers say.

> +    unsigned long load_end = LINK_TO_LOAD(_end);
> +    unsigned long xen_size;
> +    unsigned long pt_level_size = XEN_PT_LEVEL_SIZE(pt_level);
> +    unsigned long pte_nums;
> +
> +    unsigned long virt_indx = pt_index(pt_level, XEN_VIRT_START);
> +    unsigned long indx;
> +
> +    if ( load_start == XEN_VIRT_START )
> +        load_start = LINK_TO_LOAD(_start);
> +
> +    xen_size = load_end - load_start;

When you come here recursively, don't you need to limit this
instance of the function to a single page table's worth of address
space (at the given level), using load_end only if that's yet
lower?

> +    pte_nums = ROUNDUP(xen_size, pt_level_size) / pt_level_size;
> +
> +    while ( pte_nums-- )
> +    {
> +        indx = pt_index(pt_level, load_start);
>  
> -    switch_stack_and_jump((unsigned long)cpu0_boot_stack + STACK_SIZE,
> -                          cont_after_mmu_is_enabled);
> +        if ( virt_indx != indx )
> +        {
> +            pgtbl[indx].pte = 0;
> +            load_start += XEN_PT_LEVEL_SIZE(pt_level);
> +        }
> +        else
> +        {
> +            pgtbl =  (pte_t *)LOAD_TO_LINK(pte_to_paddr(pgtbl[indx]));

Nit: Stray double blank.

> +            pt_level--;
> +            remove_identity_mapping();

Don't you need to restore pgtbl and pt_level here before the loop
can continue? And because of depending on load_start, which would
have moved, xen_size also needs suitably reducing, I think. Plus
pte_nums as well, since that in turn was calculated from xen_size.

Jan
Oleksii Kurochko July 24, 2023, 4 p.m. UTC | #2
On Mon, 2023-07-24 at 16:11 +0200, Jan Beulich wrote:
> On 24.07.2023 11:42, Oleksii Kurochko wrote:
> > @@ -35,8 +36,10 @@ unsigned long __ro_after_init phys_offset;
> >   *
> >   * It might be needed one more page table in case when Xen load
> > address
> >   * isn't 2 MB aligned.
> > + *
> > + * CONFIG_PAGING_LEVELS page tables are needed for identity
> > mapping.
> >   */
> > -#define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) + 1)
> > +#define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) * 2) + 1
> 
> Comment addition and code change are at least apparently out of sync:
> With such a comment and without thinking much one would expect the
> constant to be bumped by CONFIG_PAGING_LEVELS. It is true that you
> only need CONFIG_PAGING_LEVELS - 1, because the root table is shared,
> but that would then be nice to also clarify in the comment. E.g.
> 
> "CONFIG_PAGING_LEVELS page tables are needed for the identity
> mapping,
>  except that the root page table is shared with the initial mapping."
Thanks. I'll take that into account in the next patch version.

> 
> Also - where did the outermost pair of parentheses go? (Really you
> don't need to parenthesize the multiplication, so the last closing
> one can simply move last.)
I missed that. Thanks. I'll move the last parenthesis to the end.
> 
> > @@ -75,10 +78,11 @@ static void __init setup_initial_mapping(struct
> > mmu_desc *mmu_desc,
> >      unsigned int index;
> >      pte_t *pgtbl;
> >      unsigned long page_addr;
> > +    bool is_identity_mapping = map_start == pa_start;
> >  
> > -    if ( (unsigned long)_start % XEN_PT_LEVEL_SIZE(0) )
> > +    if ( !IS_ALIGNED((unsigned long)_start, KB(4)) )
> >      {
> > -        early_printk("(XEN) Xen should be loaded at 4k
> > boundary\n");
> > +        early_printk("(XEN) Xen should be loaded at 4KB
> > boundary\n");
> 
> The change to the message looks unrelated.
Yes, you are right. I'll remove that change to the message.

> 
> > @@ -255,25 +261,44 @@ void __init noreturn noinline enable_mmu()
> >      csr_write(CSR_SATP,
> >                PFN_DOWN((unsigned long)stage1_pgtbl_root) |
> >                RV_STAGE1_MODE << SATP_MODE_SHIFT);
> > +}
> >  
> > -    asm volatile ( ".p2align 2" );
> > - mmu_is_enabled:
> > -    /*
> > -     * Stack should be re-inited as:
> > -     * 1. Right now an address of the stack is relative to load
> > time
> > -     *    addresses what will cause an issue in case of load start
> > address
> > -     *    isn't equal to linker start address.
> > -     * 2. Addresses in stack are all load time relative which can
> > be an
> > -     *    issue in case when load start address isn't equal to
> > linker
> > -     *    start address.
> > -     *
> > -     * We can't return to the caller because the stack was reseted
> > -     * and it may have stash some variable on the stack.
> > -     * Jump to a brand new function as the stack was reseted
> > -     */
> > +void __init remove_identity_mapping(void)
> > +{
> > +    static pte_t *pgtbl = stage1_pgtbl_root;
> > +    static unsigned long load_start = XEN_VIRT_START;
> > +    static unsigned int pt_level = CONFIG_PAGING_LEVELS - 1;
> 
> These all want to be __initdata, but personally I find this way of
> recursing a little odd. Let's see what the maintainers say.
I'm not completely happy either. Initially I thought it would be
better to pass all of this as function arguments.

But then access to stage1_pgtbl_root needs to be provided ( a
get_root_pt() function? ). So remove_identity_mapping() would be called
as remove_identity_mapping(get_root_pt(), _start, CONFIG_PAGING_LEVELS
- 1 ), or as remove_identity_mapping(NULL, _start, CONFIG_PAGING_LEVELS
- 1 ) with a check that, if the first argument is NULL, it gets
initialized with stage1_pgtbl_root.
Also I am not sure that a 'user' should have to provide all this
information to such a function.

Could you recommend something better?

> 
> > +    unsigned long load_end = LINK_TO_LOAD(_end);
> > +    unsigned long xen_size;
> > +    unsigned long pt_level_size = XEN_PT_LEVEL_SIZE(pt_level);
> > +    unsigned long pte_nums;
> > +
> > +    unsigned long virt_indx = pt_index(pt_level, XEN_VIRT_START);
> > +    unsigned long indx;
> > +
> > +    if ( load_start == XEN_VIRT_START )
> > +        load_start = LINK_TO_LOAD(_start);
> > +
> > +    xen_size = load_end - load_start;
> 
> When you come here recursively, don't you need to limit this
> instance of the function to a single page table's worth of address
> space (at the given level), using load_end only if that's yet
> lower?
Do you mean the case when load_start > load_end? If so, then I missed
that.
> 
> > +    pte_nums = ROUNDUP(xen_size, pt_level_size) / pt_level_size;
> > +
> > +    while ( pte_nums-- )
> > +    {
> > +        indx = pt_index(pt_level, load_start);
> >  
> > -    switch_stack_and_jump((unsigned long)cpu0_boot_stack +
> > STACK_SIZE,
> > -                          cont_after_mmu_is_enabled);
> > +        if ( virt_indx != indx )
> > +        {
> > +            pgtbl[indx].pte = 0;
> > +            load_start += XEN_PT_LEVEL_SIZE(pt_level);
> > +        }
> > +        else
> > +        {
> > +            pgtbl =  (pte_t
> > *)LOAD_TO_LINK(pte_to_paddr(pgtbl[indx]));
> 
> Nit: Stray double blank.
Thanks.
> 
> > +            pt_level--;
> > +            remove_identity_mapping();
> 
> Don't you need to restore pgtbl and pt_level here before the loop
> can continue? And because of depending on load_start, which would
> have moved, xen_size also needs suitably reducing, I think. Plus
> pte_nums as well, since that in turn was calculated from xen_size.
I forgot to restore pgtbl and pt_level because initially I used
function arguments to pass that information, so there was no need to
restore them.
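
For reference, a minimal sketch of restoring the state around the
recursive call (keeping the static variables; pgtbl_saved is just an
illustrative name):

        else
        {
            pte_t *pgtbl_saved = pgtbl;

            pgtbl = (pte_t *)LOAD_TO_LINK(pte_to_paddr(pgtbl[indx]));
            pt_level--;
            remove_identity_mapping();

            /* restore the state the remaining iterations expect */
            pgtbl = pgtbl_saved;
            pt_level++;
        }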

Regarding xen_size and pte_nums, it looks like they only need to be
initialized once per page table level.
For example, suppose we have the following situation:
  ----------------------
   non-identity-mapping
   identity-mapping
  ---------------------- C
   identity-mapping
  ---------------------- B
   identity-mapping
  ---------------------- A
So we calculated that we need to remove 3 PTEs. The first two PTEs ( which
contain only identity mapping ) are removed without any issue, then
load_addr is moved to C and we recurse
on the PTE 'C' to go to the next page table level.
At the new level we calculate how many PTEs need to be removed
and remove only that many.
When we get back to the previous page table level, pte_num will be 1, so we
go back to the head of the loop, decrease pte_num to 0 and exit.

The same applies to the case when the non-identity mapping is lower than
the identity mapping ( but it looks like this can't happen, because the
XEN_VIRT_START address is the highest address by definition. Probably
a check needs to be added ):
we know that pte_num = 3 at some level. Then we go to the next level
and remove only the identity-mapping PTEs there, come back to the previous
level, decrease pte_num to 2 and remove only the 2 remaining PTEs.

~ Oleksii
Jan Beulich July 25, 2023, 8:16 a.m. UTC | #3
On 24.07.2023 18:00, Oleksii wrote:
> On Mon, 2023-07-24 at 16:11 +0200, Jan Beulich wrote:
>> On 24.07.2023 11:42, Oleksii Kurochko wrote:
>>> +void __init remove_identity_mapping(void)
>>> +{
>>> +    static pte_t *pgtbl = stage1_pgtbl_root;
>>> +    static unsigned long load_start = XEN_VIRT_START;
>>> +    static unsigned int pt_level = CONFIG_PAGING_LEVELS - 1;
>>
>> These all want to be __initdata, but personally I find this way of
>> recursing a little odd. Let's see what the maintainers say.
> I'm not completely happy either. Initially I thought that it would be
> better to pass all this stuff as function's arguments.
> 
> But then it is needed to provide an access to stage1_pgtbl_root (
> get_root_pt() function ? ). So remove_identity_mapping() will be called
> as remove_identity_mapping(get_root_pt(), _start, CONFIG_PAGING_LELVELS
> -1 ) or remove_identity_mapping(NULL, _start, CONFIG_PAGING_LELVELS -1
> ) and then check if first argument is NULL then initialize it with
> stage1_pgtbl_root.
> Also I am not sure that an 'user' should provide all this information
> to such function.
> 
> Could you recommend something better?

Well, to avoid the mess you are describing I would consider making
remove_identity_mapping() itself a thin wrapper around the actual
function which then invokes itself recursively. That'll keep the
top-level call site tidy, and all the technical details confined to
the (then) two functions.
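
Roughly (just a sketch, names purely illustrative):

    static void __init __remove_identity_mapping(pte_t *pgtbl,
                                                 unsigned long load_start,
                                                 unsigned int pt_level)
    {
        /* ... the recursive part, invoking itself with the child page
           table, the advanced load_start, and pt_level - 1 ... */
    }

    void __init remove_identity_mapping(void)
    {
        __remove_identity_mapping(stage1_pgtbl_root,
                                  LINK_TO_LOAD(_start),
                                  CONFIG_PAGING_LEVELS - 1);
    }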

>>> +    unsigned long load_end = LINK_TO_LOAD(_end);
>>> +    unsigned long xen_size;
>>> +    unsigned long pt_level_size = XEN_PT_LEVEL_SIZE(pt_level);
>>> +    unsigned long pte_nums;
>>> +
>>> +    unsigned long virt_indx = pt_index(pt_level, XEN_VIRT_START);
>>> +    unsigned long indx;
>>> +
>>> +    if ( load_start == XEN_VIRT_START )
>>> +        load_start = LINK_TO_LOAD(_start);
>>> +
>>> +    xen_size = load_end - load_start;
>>
>> When you come here recursively, don't you need to limit this
>> instance of the function to a single page table's worth of address
>> space (at the given level), using load_end only if that's yet
>> lower?
> Do you mean a case when load_start > load_end? If yes then I missed
> that.

No, my concern is with the page table presently acted upon covering
only a limited part of the address space. That "limited" is the full
address space only for the top level table. You won't observe this
though unless the Xen image crosses a 2Mb boundary. But if it does
(and it may help if you arranged for big enough an image just for
the purpose of debugging, simply by inserting enough dead code or
data), and if all mappings are 4k ones, you'd run past the first L1
table's worth of mappings on the first invocation of this function
with a L1 table. (Of course hitting the condition may further
require Xen and 1:1 mappings to be sufficiently close to one another,
so that the zapping doesn't already occur at higher levels. But then
the same situation could arise at higher levels when the image
crosses a 1Gb or 512Gb boundary.)

>>> +    pte_nums = ROUNDUP(xen_size, pt_level_size) / pt_level_size;
>>> +
>>> +    while ( pte_nums-- )
>>> +    {
>>> +        indx = pt_index(pt_level, load_start);
>>>  
>>> -    switch_stack_and_jump((unsigned long)cpu0_boot_stack +
>>> STACK_SIZE,
>>> -                          cont_after_mmu_is_enabled);
>>> +        if ( virt_indx != indx )
>>> +        {
>>> +            pgtbl[indx].pte = 0;
>>> +            load_start += XEN_PT_LEVEL_SIZE(pt_level);
>>> +        }
>>> +        else
>>> +        {
>>> +            pgtbl =  (pte_t
>>> *)LOAD_TO_LINK(pte_to_paddr(pgtbl[indx]));
>>
>> Nit: Stray double blank.
> Thanks.
>>
>>> +            pt_level--;
>>> +            remove_identity_mapping();
>>
>> Don't you need to restore pgtbl and pt_level here before the loop
>> can continue? And because of depending on load_start, which would
>> have moved, xen_size also needs suitably reducing, I think. Plus
>> pte_nums as well, since that in turn was calculated from xen_size.
> I forgot to restore pgtbl and pt_level because initially I used a
> function arguments to pass that information so it wasn't needed to
> restore them.
> 
> Regarding xen_size and pte_nums it looks like it is needed to init only
> once on each page table level.
> For example we have the following situation:
>   ----------------------
>    non-identity-mapping
>    identity-mapping
>   ---------------------- C
>    identity-mapping
>   ---------------------- B
>    identity-mapping
>   ---------------------- A
> So we calculated that we need to remove 3 ptes, for first two ptes ( as
> only identity mapping is there) are removed without any issue, then
> move load_addr to C and run recursion
> for the pte 'C' to go to next page table level.
> At new level we are calculating how many ptes are needed to be removed
> and remove only necessary amount of ptes.
> When we will back to prev page table level pte_num will be 1 then we
> will go to the head of the cycle, decrease pte_num to 0 and exit.

I think I agree that this case is okay.

> The same is with the case when non-idenitity-mapping is lower than
> identity mapping ( but it looks like it is not a case becase
> XEN_VIRT_START addr is the highest address by its defintion. Probably
> it is needed to add a check ):

And it looks like this case being impossible is what voids my respective
remark. (Might be worth adding a comment to this effect.)

Jan

> we know that pte_num = 3 at some level. Then we go to the next level
> and remove there only identity map ptes, back to previous level,
> decrease pte_num to 2 and remove only 2 remaining ptes.
> 
> ~ Oleksii
>
Oleksii Kurochko July 25, 2023, 12:42 p.m. UTC | #4
On Tue, 2023-07-25 at 10:16 +0200, Jan Beulich wrote:
> On 24.07.2023 18:00, Oleksii wrote:
> > On Mon, 2023-07-24 at 16:11 +0200, Jan Beulich wrote:
> > > On 24.07.2023 11:42, Oleksii Kurochko wrote:
> > > > +void __init remove_identity_mapping(void)
> > > > +{
> > > > +    static pte_t *pgtbl = stage1_pgtbl_root;
> > > > +    static unsigned long load_start = XEN_VIRT_START;
> > > > +    static unsigned int pt_level = CONFIG_PAGING_LEVELS - 1;
> > > 
> > > These all want to be __initdata, but personally I find this way
> > > of
> > > recursing a little odd. Let's see what the maintainers say.
> > I'm not completely happy either. Initially I thought that it would
> > be
> > better to pass all this stuff as function's arguments.
> > 
> > But then it is needed to provide an access to stage1_pgtbl_root (
> > get_root_pt() function ? ). So remove_identity_mapping() will be
> > called
> > as remove_identity_mapping(get_root_pt(), _start,
> > CONFIG_PAGING_LELVELS
> > -1 ) or remove_identity_mapping(NULL, _start, CONFIG_PAGING_LELVELS
> > -1
> > ) and then check if first argument is NULL then initialize it with
> > stage1_pgtbl_root.
> > Also I am not sure that an 'user' should provide all this
> > information
> > to such function.
> > 
> > Could you recommend something better?
> 
> Well, to avoid the mess you are describing I would consider making
> remove_identity_mapping() itself a thin wrapper around the actual
> function which then invokes itself recursively. That'll keep the
> top-level call site tidy, and all the technical details confined to
> the (then) two functions.
Thanks a lot for the recommendation.

> 
> > > > +    unsigned long load_end = LINK_TO_LOAD(_end);
> > > > +    unsigned long xen_size;
> > > > +    unsigned long pt_level_size = XEN_PT_LEVEL_SIZE(pt_level);
> > > > +    unsigned long pte_nums;
> > > > +
> > > > +    unsigned long virt_indx = pt_index(pt_level,
> > > > XEN_VIRT_START);
> > > > +    unsigned long indx;
> > > > +
> > > > +    if ( load_start == XEN_VIRT_START )
> > > > +        load_start = LINK_TO_LOAD(_start);
> > > > +
> > > > +    xen_size = load_end - load_start;
> > > 
> > > When you come here recursively, don't you need to limit this
> > > instance of the function to a single page table's worth of
> > > address
> > > space (at the given level), using load_end only if that's yet
> > > lower?
> > Do you mean a case when load_start > load_end? If yes then I missed
> > that.
> 
> No, my concern is with the page table presently acted upon covering
> only a limited part of the address space. That "limited" is the full
> address space only for the top level table. You won't observe this
> though unless the Xen image crosses a 2Mb boundary. But if it does
> (and it may help if you arranged for big enough an image just for
> the purpose of debugging, simply by inserting enough dead code or
> data), and if all mappings are 4k ones, you'd run past the first L1
> table's worth of mappings on the first invocation of this function
> with a L1 table. (Of course hitting the condition may further
> require Xen and 1:1 mappings to be sufficiently close to one another,
> so that the zapping doesn't already occur at higher levels. But then
> the same situation could arise at higher levels when the image
> crosses a 1Gb or 512Gb boundary.)
In this case, all should be fine, even if we put the identity and
non-identity mappings as close to each other as possible.
I tested with the following input:
   XEN_VIRT_START = 0x80670000
   load start = 0x80200000
   Xen size = 4648960

So now the PTE at the L2 level is shared, so we go to L1 and calculate
the number of PTEs at the L1 level. In the current case it is
3, so it deletes the first two L1 PTEs ( as they have an index different
from the index of XEN_VIRT_START at the L1 level, we are sure that
these PTEs are used only for the identity mapping and can be removed ),
moves load_start forward by 4 MB, and for the last L1 PTE, which is shared
with the non-identity mapping, it goes to the L0 table and calculates the
number of PTEs that need to be cleared based on the 'new' load_start and
page level ( it is 0x6f in the current case ) to finish removing the
identity mapping.

I added some prints inside:
    if ( virt_indx != indx )
    { ....
And received the expected output described above
(the number in parentheses is the page table level, '-' means removed):

(1) -
(1) -
(0) -
... 0x6f times
(0) -

> 
> > > > +    pte_nums = ROUNDUP(xen_size, pt_level_size) /
> > > > pt_level_size;
> > > > +
> > > > +    while ( pte_nums-- )
> > > > +    {
> > > > +        indx = pt_index(pt_level, load_start);
> > > >  
> > > > -    switch_stack_and_jump((unsigned long)cpu0_boot_stack +
> > > > STACK_SIZE,
> > > > -                          cont_after_mmu_is_enabled);
> > > > +        if ( virt_indx != indx )
> > > > +        {
> > > > +            pgtbl[indx].pte = 0;
> > > > +            load_start += XEN_PT_LEVEL_SIZE(pt_level);
> > > > +        }
> > > > +        else
> > > > +        {
> > > > +            pgtbl =  (pte_t
> > > > *)LOAD_TO_LINK(pte_to_paddr(pgtbl[indx]));
> > > 
> > > Nit: Stray double blank.
> > Thanks.
> > > 
> > > > +            pt_level--;
> > > > +            remove_identity_mapping();
> > > 
> > > Don't you need to restore pgtbl and pt_level here before the loop
> > > can continue? And because of depending on load_start, which would
> > > have moved, xen_size also needs suitably reducing, I think. Plus
> > > pte_nums as well, since that in turn was calculated from
> > > xen_size.
> > I forgot to restore pgtbl and pt_level because initially I used a
> > function arguments to pass that information so it wasn't needed to
> > restore them.
> > 
> > Regarding xen_size and pte_nums it looks like it is needed to init
> > only
> > once on each page table level.
> > For example we have the following situation:
> >   ----------------------
> >    non-identity-mapping
> >    identity-mapping
> >   ---------------------- C
> >    identity-mapping
> >   ---------------------- B
> >    identity-mapping
> >   ---------------------- A
> > So we calculated that we need to remove 3 ptes, for first two ptes
> > ( as
> > only identity mapping is there) are removed without any issue, then
> > move load_addr to C and run recursion
> > for the pte 'C' to go to next page table level.
> > At new level we are calculating how many ptes are needed to be
> > removed
> > and remove only necessary amount of ptes.
> > When we will back to prev page table level pte_num will be 1 then
> > we
> > will go to the head of the cycle, decrease pte_num to 0 and exit.
> 
> I think I agree that this case is okay.
> 
> > The same is with the case when non-idenitity-mapping is lower than
> > identity mapping ( but it looks like it is not a case becase
> > XEN_VIRT_START addr is the highest address by its defintion.
> > Probably
> > it is needed to add a check ):
> 
> And it looks like this case being impossible is what voids my
> respective
> remark. (Might be worth adding a comment to this effect.)
I'll add a comment in remove_identity_mapping().

~ Oleksii
Oleksii Kurochko July 26, 2023, 11:23 a.m. UTC | #5
Hi all,

I would like to ask for advice on whether it would be easier and less
bug-prone ( compared to identity-mapping and then removing the whole of
Xen ) to have a separate identity section that won't be more than PAGE_SIZE.

Please take a look at the changes below. Comments are welcome.

diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index 7d1a8beba8..ba4af48fc6 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -26,6 +26,8 @@ static unsigned long __ro_after_init phys_offset;
 #define LOAD_TO_LINK(addr) ((unsigned long)(addr) - phys_offset)
 #define LINK_TO_LOAD(addr) ((unsigned long)(addr) + phys_offset)
 
+extern char _ident_start[], _ident_end[];
+
 /*
  * It is expected that Xen won't be more then 2 MB.
  * The check in xen.lds.S guarantees that.
@@ -112,7 +114,9 @@ static void __init setup_initial_mapping(struct mmu_desc *mmu_desc,
         case 1: /* Level 0 */
             {
                 unsigned long paddr = (page_addr - map_start) + pa_start;
-                unsigned int permissions = PTE_LEAF_DEFAULT;
+                unsigned int permissions = is_identity_mapping
+                                           ? PTE_LEAF_DEFAULT | PTE_EXECUTABLE
+                                           : PTE_LEAF_DEFAULT ;
                 unsigned long addr = is_identity_mapping
                                      ? page_addr : LINK_TO_LOAD(page_addr);
                 pte_t pte_to_be_written;
@@ -248,9 +252,9 @@ void __init setup_initial_pagetables(void)
         return;
 
     setup_initial_mapping(&mmu_desc,
-                          load_start,
-                          load_end,
-                          load_start);
+                          (unsigned long)_ident_start,
+                          (unsigned long)_ident_end,
+                          (unsigned long)_ident_start);
 }
 
 void __init enable_mmu(void)
@@ -264,6 +268,19 @@ void __init enable_mmu(void)
               RV_STAGE1_MODE << SATP_MODE_SHIFT);
 }
 
+void __attribute__((naked)) __section(".ident") turn_on_mmu(unsigned long ra)
+{
+    /* Ensure page table writes precede loading the SATP */
+    sfence_vma();
+
+    /* Enable the MMU and load the new pagetable for Xen */
+    csr_write(CSR_SATP,
+              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
+              RV_STAGE1_MODE << SATP_MODE_SHIFT);
+
+    asm volatile( "jr %0\n" : : "r"(ra) );
+}
+
 static void __init __remove_identity_mapping(pte_t *pgtbl,
                                              unsigned long load_start,
                                              unsigned int pt_level)
@@ -297,20 +314,42 @@ static void __init __remove_identity_mapping(pte_t *pgtbl,
 
 void __init remove_identity_mapping(void)
 {
-    unsigned long load_start = LINK_TO_LOAD(_start);
+    unsigned int i;
+    pte_t *pgtbl;
+    unsigned int index, xen_index;
+    unsigned long ident_start = LINK_TO_LOAD(_ident_start);
 
-    if ( XEN_VIRT_START <= load_start )
+    for ( pgtbl = stage1_pgtbl_root, i = CONFIG_PAGING_LEVELS; i; i-- )
     {
-        early_printk("remove identity mapping algo expects that"
-                     "XEN_VIRT_START > load_start\n");
-        die();
-    }
+        index = pt_index(i - 1, ident_start);
+        xen_index = pt_index(i - 1, XEN_VIRT_START);
 
-    __remove_identity_mapping(stage1_pgtbl_root,
-                              LINK_TO_LOAD(_start),
-                              CONFIG_PAGING_LEVELS - 1);
+        if ( index != xen_index )
+        {
+            pgtbl[index].pte = 0;
+            break;
+        }
+
+        pgtbl = (pte_t *)pte_to_paddr(pgtbl[index]);
+    }
 }
 
+// void __init remove_identity_mapping(void)
+// {
+//     unsigned long load_start = LINK_TO_LOAD(_start);
+
+//     if ( XEN_VIRT_START <= load_start )
+//     {
+//         early_printk("remove identity mapping algo expects that"
+//                      "XEN_VIRT_START > load_start\n");
+//         die();
+//     }
+
+//     __remove_identity_mapping(stage1_pgtbl_root,
+//                               LINK_TO_LOAD(_start),
+//                               CONFIG_PAGING_LEVELS - 1);
+// }
+
 /*
  * calc_phys_offset() should be used before MMU is enabled because access to
  * start() is PC-relative and in case when load_addr != linker_addr phys_offset
diff --git a/xen/arch/riscv/riscv64/head.S b/xen/arch/riscv/riscv64/head.S
index 613e25ea6f..bb529f6a11 100644
--- a/xen/arch/riscv/riscv64/head.S
+++ b/xen/arch/riscv/riscv64/head.S
@@ -41,14 +41,12 @@ ENTRY(start)
 
         jal     setup_initial_pagetables
 
-        jal     enable_mmu
-
         /* Calculate proper VA after jump from 1:1 mapping */
         la      t0, .L_primary_switched
         sub     t0, t0, s2
 
-        /* Jump from 1:1 mapping world */
-        jr      t0
+        mv      a0, t0
+        jal     turn_on_mmu
 
 .L_primary_switched:
         /*
diff --git a/xen/arch/riscv/xen.lds.S b/xen/arch/riscv/xen.lds.S
index 31ccebadcb..ffa0225332 100644
--- a/xen/arch/riscv/xen.lds.S
+++ b/xen/arch/riscv/xen.lds.S
@@ -37,6 +37,13 @@ SECTIONS
         _etext = .;             /* End of text section */
     } :text
 
+    .ident : {
+        . = ALIGN(PAGE_SIZE);
+        _ident_start = .;
+        *(.ident)
+        _ident_end = .;
+    } :text
+
     . = ALIGN(PAGE_SIZE);
     .rodata : {
         _srodata = .;          /* Read-only data */
@@ -178,3 +185,5 @@ ASSERT(!SIZEOF(.got.plt),  ".got.plt non-empty")
  * PGTBL_INITIAL_COUNT.
  */
ASSERT(_end - _start <= MB(2), "Xen too large for early-boot assumptions")
+
ASSERT(_ident_end - _ident_start <= KB(4), "Identity section is bigger than 4Kb")

~ Oleksii
Jan Beulich July 26, 2023, 11:58 a.m. UTC | #6
On 26.07.2023 13:23, Oleksii wrote:
> I would like to ask for advice on whether it would be easier, less bug-
> provoking ( during identity mapping to remove of whole Xen ) to have a
> separate identity section that won't be more than PAGE_SIZE.

I'm afraid you can't safely do this in C, or at least not without
further checking on what the compiler actually did.

> @@ -264,6 +268,19 @@ void __init enable_mmu(void)
>                RV_STAGE1_MODE << SATP_MODE_SHIFT);
>  }
>  
> +void __attribute__((naked)) __section(".ident") turn_on_mmu(unsigned
> long ra)

Did you read what gcc doc says about "naked"? Extended asm() isn't
supported there. Since ...

> +{
> +    /* Ensure page table writes precede loading the SATP */
> +    sfence_vma();
> +
> +    /* Enable the MMU and load the new pagetable for Xen */
> +    csr_write(CSR_SATP,
> +              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
> +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
> +
> +    asm volatile( "jr %0\n" : : "r"(ra) );
> +}

... none of this really requires C, I think we're at the point where
(iirc) Andrew's and my suggestion wants following, moving this to
assembly code (at which point it doesn't need to be a separate
function). You can still build page tables in C, of course. (Likely
you then also won't need a separate section; some minimal alignment
guarantees ought to suffice to make sure the critical code is
confined to a single page.)
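
Roughly along these lines then, I suppose (purely a sketch based on the
existing head.S flow; label and constant names are assumptions, and
whether e.g. PAGE_SHIFT is usable from assembly would need checking):

        /* enable the MMU; keep this chunk within a single page */
        .balign 64
        sfence.vma
        la      t0, stage1_pgtbl_root
        srli    t0, t0, PAGE_SHIFT
        li      t1, RV_STAGE1_MODE
        slli    t1, t1, SATP_MODE_SHIFT
        or      t0, t0, t1
        csrw    satp, t0

        /* calculate proper VA and leave the 1:1 mapping */
        la      t0, .L_primary_switched
        sub     t0, t0, s2
        jr      t0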

Jan
Oleksii Kurochko July 26, 2023, 1:12 p.m. UTC | #7
On Wed, 2023-07-26 at 13:58 +0200, Jan Beulich wrote:
> On 26.07.2023 13:23, Oleksii wrote:
> > I would like to ask for advice on whether it would be easier, less
> > bug-
> > provoking ( during identity mapping to remove of whole Xen ) to
> > have a
> > separate identity section that won't be more than PAGE_SIZE.
> 
> I'm afraid you can't safely do this in C, or at least not without
> further checking on what the compiler actually did.
> 
> > @@ -264,6 +268,19 @@ void __init enable_mmu(void)
> >                RV_STAGE1_MODE << SATP_MODE_SHIFT);
> >  }
> >  
> > +void __attribute__((naked)) __section(".ident")
> > turn_on_mmu(unsigned
> > long ra)
> 
> Did you read what gcc doc says about "naked"? Extended asm() isn't
> supported there. Since ...
> 
> > +{
> > +    /* Ensure page table writes precede loading the SATP */
> > +    sfence_vma();
> > +
> > +    /* Enable the MMU and load the new pagetable for Xen */
> > +    csr_write(CSR_SATP,
> > +              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
> > +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
> > +
> > +    asm volatile( "jr %0\n" : : "r"(ra) );
> > +}
> 
> ... none of this really requires C, I think we're at the point where
> (iirc) Andrew's and my suggestion wants following, moving this to
> assembly code (at which point it doesn't need to be a separate
> function). You can still build page tables in C, of course. (Likely
> you then also won't need a separate section; some minimal alignment
> guarantees ought to suffice to make sure the critical code is
> confined to a single page.)

Thanks. I'll move all of this to assembly code.
Regarding alignment, it is needed at the start and end of the
function:
    .balign PAGE_SIZE
    GLOBAL(turn_on_mmu)
        ...
    .balign PAGE_SIZE
    ENDPROC(turn_on_mmu)

Does a better way exist?

~ Oleksii
Jan Beulich July 26, 2023, 3 p.m. UTC | #8
On 26.07.2023 15:12, Oleksii wrote:
> On Wed, 2023-07-26 at 13:58 +0200, Jan Beulich wrote:
>> On 26.07.2023 13:23, Oleksii wrote:
>>> I would like to ask for advice on whether it would be easier, less
>>> bug-
>>> provoking ( during identity mapping to remove of whole Xen ) to
>>> have a
>>> separate identity section that won't be more than PAGE_SIZE.
>>
>> I'm afraid you can't safely do this in C, or at least not without
>> further checking on what the compiler actually did.
>>
>>> @@ -264,6 +268,19 @@ void __init enable_mmu(void)
>>>                RV_STAGE1_MODE << SATP_MODE_SHIFT);
>>>  }
>>>  
>>> +void __attribute__((naked)) __section(".ident")
>>> turn_on_mmu(unsigned
>>> long ra)
>>
>> Did you read what gcc doc says about "naked"? Extended asm() isn't
>> supported there. Since ...
>>
>>> +{
>>> +    /* Ensure page table writes precede loading the SATP */
>>> +    sfence_vma();
>>> +
>>> +    /* Enable the MMU and load the new pagetable for Xen */
>>> +    csr_write(CSR_SATP,
>>> +              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
>>> +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
>>> +
>>> +    asm volatile( "jr %0\n" : : "r"(ra) );
>>> +}
>>
>> ... none of this really requires C, I think we're at the point where
>> (iirc) Andrew's and my suggestion wants following, moving this to
>> assembly code (at which point it doesn't need to be a separate
>> function). You can still build page tables in C, of course. (Likely
>> you then also won't need a separate section; some minimal alignment
>> guarantees ought to suffice to make sure the critical code is
>> confined to a single page.)
> 
> Thanks. I'll move all of this to assembly code.
> Regarding alignment it is needed alignment on start and end of
> function:
>     .balign PAGE_SIZE
>     GLOBAL(turn_on_mmu)
>         ...
>     .balign PAGE_SIZE
>     ENDPROC(turn_on_mmu)
> 
> Does the better way exist?

The function is only going to be a handful of instructions. Its
alignment doesn't need to be larger than the next power of 2. I
expect you'll be good with 64-byte alignment. (In no case do you
need to align the end of the function: Putting other stuff there
is not a problem at all.) What you want in any event is a build
time check that the within-a-page constraint is met.
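
E.g. a xen.lds.S assertion along the lines of the existing ones
(turn_on_mmu_end being a symbol that would need introducing, and 64
matching whatever alignment is chosen):

    ASSERT(turn_on_mmu_end - turn_on_mmu <= 64,
           "turn_on_mmu() doesn't fit the within-a-page constraint")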

Jan
Oleksii Kurochko July 26, 2023, 3:54 p.m. UTC | #9
On Wed, 2023-07-26 at 17:00 +0200, Jan Beulich wrote:
> On 26.07.2023 15:12, Oleksii wrote:
> > On Wed, 2023-07-26 at 13:58 +0200, Jan Beulich wrote:
> > > On 26.07.2023 13:23, Oleksii wrote:
> > > > I would like to ask for advice on whether it would be easier,
> > > > less
> > > > bug-
> > > > provoking ( during identity mapping to remove of whole Xen ) to
> > > > have a
> > > > separate identity section that won't be more than PAGE_SIZE.
> > > 
> > > I'm afraid you can't safely do this in C, or at least not without
> > > further checking on what the compiler actually did.
> > > 
> > > > @@ -264,6 +268,19 @@ void __init enable_mmu(void)
> > > >                RV_STAGE1_MODE << SATP_MODE_SHIFT);
> > > >  }
> > > >  
> > > > +void __attribute__((naked)) __section(".ident")
> > > > turn_on_mmu(unsigned
> > > > long ra)
> > > 
> > > Did you read what gcc doc says about "naked"? Extended asm()
> > > isn't
> > > supported there. Since ...
> > > 
> > > > +{
> > > > +    /* Ensure page table writes precede loading the SATP */
> > > > +    sfence_vma();
> > > > +
> > > > +    /* Enable the MMU and load the new pagetable for Xen */
> > > > +    csr_write(CSR_SATP,
> > > > +              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
> > > > +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
> > > > +
> > > > +    asm volatile( "jr %0\n" : : "r"(ra) );
> > > > +}
> > > 
> > > ... none of this really requires C, I think we're at the point
> > > where
> > > (iirc) Andrew's and my suggestion wants following, moving this to
> > > assembly code (at which point it doesn't need to be a separate
> > > function). You can still build page tables in C, of course.
> > > (Likely
> > > you then also won't need a separate section; some minimal
> > > alignment
> > > guarantees ought to suffice to make sure the critical code is
> > > confined to a single page.)
> > 
> > Thanks. I'll move all of this to assembly code.
> > Regarding alignment it is needed alignment on start and end of
> > function:
> >     .balign PAGE_SIZE
> >     GLOBAL(turn_on_mmu)
> >         ...
> >     .balign PAGE_SIZE
> >     ENDPROC(turn_on_mmu)
> > 
> > Does the better way exist?
> 
> The function is only going to be a handful of instructions. Its
> alignment doesn't need to be larger than the next power of 2. I
> expect you'll be good with 64-byte alignment. (In no case do you
> need to align the end of the function: Putting other stuff there
> is not a problem at all.) What you want in any event is a build
> time check that the within-a-page constraint is met.
But shouldn't the address be aligned to a boundary equal to the page
size?

According to the RISC-V privileged spec:
Any level of PTE may be a leaf PTE, so in addition to 4 KiB pages, Sv39
supports 2 MiB megapages and 1 GiB gigapages, each of which must be
virtually and physically aligned to a boundary equal to its size. A
page-fault exception is raised if the physical address is insufficiently
aligned.

~ Oleksii
Jan Beulich July 26, 2023, 3:59 p.m. UTC | #10
On 26.07.2023 17:54, Oleksii wrote:
> On Wed, 2023-07-26 at 17:00 +0200, Jan Beulich wrote:
>> On 26.07.2023 15:12, Oleksii wrote:
>>> On Wed, 2023-07-26 at 13:58 +0200, Jan Beulich wrote:
>>>> On 26.07.2023 13:23, Oleksii wrote:
>>>>> I would like to ask for advice on whether it would be easier,
>>>>> less
>>>>> bug-
>>>>> provoking ( during identity mapping to remove of whole Xen ) to
>>>>> have a
>>>>> separate identity section that won't be more than PAGE_SIZE.
>>>>
>>>> I'm afraid you can't safely do this in C, or at least not without
>>>> further checking on what the compiler actually did.
>>>>
>>>>> @@ -264,6 +268,19 @@ void __init enable_mmu(void)
>>>>>                RV_STAGE1_MODE << SATP_MODE_SHIFT);
>>>>>  }
>>>>>  
>>>>> +void __attribute__((naked)) __section(".ident")
>>>>> turn_on_mmu(unsigned
>>>>> long ra)
>>>>
>>>> Did you read what gcc doc says about "naked"? Extended asm()
>>>> isn't
>>>> supported there. Since ...
>>>>
>>>>> +{
>>>>> +    /* Ensure page table writes precede loading the SATP */
>>>>> +    sfence_vma();
>>>>> +
>>>>> +    /* Enable the MMU and load the new pagetable for Xen */
>>>>> +    csr_write(CSR_SATP,
>>>>> +              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
>>>>> +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
>>>>> +
>>>>> +    asm volatile( "jr %0\n" : : "r"(ra) );
>>>>> +}
>>>>
>>>> ... none of this really requires C, I think we're at the point
>>>> where
>>>> (iirc) Andrew's and my suggestion wants following, moving this to
>>>> assembly code (at which point it doesn't need to be a separate
>>>> function). You can still build page tables in C, of course.
>>>> (Likely
>>>> you then also won't need a separate section; some minimal
>>>> alignment
>>>> guarantees ought to suffice to make sure the critical code is
>>>> confined to a single page.)
>>>
>>> Thanks. I'll move all of this to assembly code.
>>> Regarding alignment it is needed alignment on start and end of
>>> function:
>>>     .balign PAGE_SIZE
>>>     GLOBAL(turn_on_mmu)
>>>         ...
>>>     .balign PAGE_SIZE
>>>     ENDPROC(turn_on_mmu)
>>>
>>> Does the better way exist?
>>
>> The function is only going to be a handful of instructions. Its
>> alignment doesn't need to be larger than the next power of 2. I
>> expect you'll be good with 64-byte alignment. (In no case do you
>> need to align the end of the function: Putting other stuff there
>> is not a problem at all.) What you want in any event is a build
>> time check that the within-a-page constraint is met.
> But shouldn't be an address be aligned to a boundary equal to page
> size?
> 
> According to the RISC-V privileged spec:
> Any level of PTE may be a leaf PTE, so in addition to 4 KiB pages, Sv39
> supports 2 MiB megapages
> and 1 GiB gigapages, each of which must be virtually and physically
> aligned to a boundary equal
> to its size. A page-fault exception is raised if the physical address
> is insufficiently aligned.

You'd simply map the page containing the chunk, i.e. masking off the
low 12 bits. If far enough away from the Xen virtual range, you could
as well map a 2M page masking off the low 21 bits, or a 1G page with
the low 30 bits of the address cleared.
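
For illustration only (the variable name is a placeholder), the 4k case
would amount to something like:

    /* identity-map just the 4k page containing turn_on_mmu() */
    unsigned long ident_va = (unsigned long)turn_on_mmu & ~(PAGE_SIZE - 1UL);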

Jan
Oleksii Kurochko July 26, 2023, 6:39 p.m. UTC | #11
On Wed, 2023-07-26 at 17:59 +0200, Jan Beulich wrote:
> On 26.07.2023 17:54, Oleksii wrote:
> > On Wed, 2023-07-26 at 17:00 +0200, Jan Beulich wrote:
> > > On 26.07.2023 15:12, Oleksii wrote:
> > > > On Wed, 2023-07-26 at 13:58 +0200, Jan Beulich wrote:
> > > > > On 26.07.2023 13:23, Oleksii wrote:
> > > > > > I would like to ask for advice on whether it would be
> > > > > > easier,
> > > > > > less
> > > > > > bug-
> > > > > > provoking ( during identity mapping to remove of whole Xen
> > > > > > ) to
> > > > > > have a
> > > > > > separate identity section that won't be more than
> > > > > > PAGE_SIZE.
> > > > > 
> > > > > I'm afraid you can't safely do this in C, or at least not
> > > > > without
> > > > > further checking on what the compiler actually did.
> > > > > 
> > > > > > @@ -264,6 +268,19 @@ void __init enable_mmu(void)
> > > > > >                RV_STAGE1_MODE << SATP_MODE_SHIFT);
> > > > > >  }
> > > > > >  
> > > > > > +void __attribute__((naked)) __section(".ident")
> > > > > > turn_on_mmu(unsigned
> > > > > > long ra)
> > > > > 
> > > > > Did you read what gcc doc says about "naked"? Extended asm()
> > > > > isn't
> > > > > supported there. Since ...
> > > > > 
> > > > > > +{
> > > > > > +    /* Ensure page table writes precede loading the SATP
> > > > > > */
> > > > > > +    sfence_vma();
> > > > > > +
> > > > > > +    /* Enable the MMU and load the new pagetable for Xen
> > > > > > */
> > > > > > +    csr_write(CSR_SATP,
> > > > > > +              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
> > > > > > +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
> > > > > > +
> > > > > > +    asm volatile( "jr %0\n" : : "r"(ra) );
> > > > > > +}
> > > > > 
> > > > > ... none of this really requires C, I think we're at the
> > > > > point
> > > > > where
> > > > > (iirc) Andrew's and my suggestion wants following, moving
> > > > > this to
> > > > > assembly code (at which point it doesn't need to be a
> > > > > separate
> > > > > function). You can still build page tables in C, of course.
> > > > > (Likely
> > > > > you then also won't need a separate section; some minimal
> > > > > alignment
> > > > > guarantees ought to suffice to make sure the critical code is
> > > > > confined to a single page.)
> > > > 
> > > > Thanks. I'll move all of this to assembly code.
> > > > Regarding alignment it is needed alignment on start and end of
> > > > function:
> > > >     .balign PAGE_SIZE
> > > >     GLOBAL(turn_on_mmu)
> > > >         ...
> > > >     .balign PAGE_SIZE
> > > >     ENDPROC(turn_on_mmu)
> > > > 
> > > > Does the better way exist?
> > > 
> > > The function is only going to be a handful of instructions. Its
> > > alignment doesn't need to be larger than the next power of 2. I
> > > expect you'll be good with 64-byte alignment. (In no case do you
> > > need to align the end of the function: Putting other stuff there
> > > is not a problem at all.) What you want in any event is a build
> > > time check that the within-a-page constraint is met.
> > But shouldn't be an address be aligned to a boundary equal to page
> > size?
> > 
> > According to the RISC-V privileged spec:
> > Any level of PTE may be a leaf PTE, so in addition to 4 KiB pages,
> > Sv39
> > supports 2 MiB megapages
> > and 1 GiB gigapages, each of which must be virtually and physically
> > aligned to a boundary equal
> > to its size. A page-fault exception is raised if the physical
> > address
> > is insufficiently aligned.
> 
> You'd simply map the page containing the chunk, i.e. masking off the
> low 12 bits. If far enough away from the Xen virtual range, you could
> as well map a 2M page masking off the low 21 bits, or a 1G page with
> the low 30 bits of the address cleared.
Agreed, then it will work.

But it is still not clear what to do if turn_on_mmu() becomes bigger
than 64 bytes ( ASSERT( (turn_on_mmu_end - turn_on_mmu) <= 64 ) somewhere in
xen.lds.S ). Right now the turn_on_mmu() function is 0x22 bytes, which is
small enough ( we are sure that we don't cross a 4k boundary ) for it to be
64-byte aligned. But if the size grows to more than 64 bytes, then the
alignment needs to be changed to 0x128.
Am I right?


~ Oleksii
Jan Beulich July 27, 2023, 7:25 a.m. UTC | #12
On 26.07.2023 20:39, Oleksii wrote:
> On Wed, 2023-07-26 at 17:59 +0200, Jan Beulich wrote:
>> On 26.07.2023 17:54, Oleksii wrote:
>>> On Wed, 2023-07-26 at 17:00 +0200, Jan Beulich wrote:
>>>> On 26.07.2023 15:12, Oleksii wrote:
>>>>> On Wed, 2023-07-26 at 13:58 +0200, Jan Beulich wrote:
>>>>>> On 26.07.2023 13:23, Oleksii wrote:
>>>>>>> I would like to ask for advice on whether it would be
>>>>>>> easier,
>>>>>>> less
>>>>>>> bug-
>>>>>>> provoking ( during identity mapping to remove of whole Xen
>>>>>>> ) to
>>>>>>> have a
>>>>>>> separate identity section that won't be more than
>>>>>>> PAGE_SIZE.
>>>>>>
>>>>>> I'm afraid you can't safely do this in C, or at least not
>>>>>> without
>>>>>> further checking on what the compiler actually did.
>>>>>>
>>>>>>> @@ -264,6 +268,19 @@ void __init enable_mmu(void)
>>>>>>>                RV_STAGE1_MODE << SATP_MODE_SHIFT);
>>>>>>>  }
>>>>>>>  
>>>>>>> +void __attribute__((naked)) __section(".ident")
>>>>>>> turn_on_mmu(unsigned
>>>>>>> long ra)
>>>>>>
>>>>>> Did you read what gcc doc says about "naked"? Extended asm()
>>>>>> isn't
>>>>>> supported there. Since ...
>>>>>>
>>>>>>> +{
>>>>>>> +    /* Ensure page table writes precede loading the SATP
>>>>>>> */
>>>>>>> +    sfence_vma();
>>>>>>> +
>>>>>>> +    /* Enable the MMU and load the new pagetable for Xen
>>>>>>> */
>>>>>>> +    csr_write(CSR_SATP,
>>>>>>> +              PFN_DOWN((unsigned long)stage1_pgtbl_root) |
>>>>>>> +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
>>>>>>> +
>>>>>>> +    asm volatile( "jr %0\n" : : "r"(ra) );
>>>>>>> +}
>>>>>>
>>>>>> ... none of this really requires C, I think we're at the
>>>>>> point
>>>>>> where
>>>>>> (iirc) Andrew's and my suggestion wants following, moving
>>>>>> this to
>>>>>> assembly code (at which point it doesn't need to be a
>>>>>> separate
>>>>>> function). You can still build page tables in C, of course.
>>>>>> (Likely
>>>>>> you then also won't need a separate section; some minimal
>>>>>> alignment
>>>>>> guarantees ought to suffice to make sure the critical code is
>>>>>> confined to a single page.)
>>>>>
>>>>> Thanks. I'll move all of this to assembly code.
>>>>> Regarding alignment it is needed alignment on start and end of
>>>>> function:
>>>>>     .balign PAGE_SIZE
>>>>>     GLOBAL(turn_on_mmu)
>>>>>         ...
>>>>>     .balign PAGE_SIZE
>>>>>     ENDPROC(turn_on_mmu)
>>>>>
>>>>> Does the better way exist?
>>>>
>>>> The function is only going to be a handful of instructions. Its
>>>> alignment doesn't need to be larger than the next power of 2. I
>>>> expect you'll be good with 64-byte alignment. (In no case do you
>>>> need to align the end of the function: Putting other stuff there
>>>> is not a problem at all.) What you want in any event is a build
>>>> time check that the within-a-page constraint is met.
>>> But shouldn't be an address be aligned to a boundary equal to page
>>> size?
>>>
>>> According to the RISC-V privileged spec:
>>> Any level of PTE may be a leaf PTE, so in addition to 4 KiB pages,
>>> Sv39
>>> supports 2 MiB megapages
>>> and 1 GiB gigapages, each of which must be virtually and physically
>>> aligned to a boundary equal
>>> to its size. A page-fault exception is raised if the physical
>>> address
>>> is insufficiently aligned.
>>
>> You'd simply map the page containing the chunk, i.e. masking off the
>> low 12 bits. If far enough away from the Xen virtual range, you could
>> as well map a 2M page masking off the low 21 bits, or a 1G page with
>> the low 30 bits of the address cleared.
> Agree, then it will work.
> 
> But still it doesn't clear what to do if turn_on_mmu will be bigger
> then 64 ( ASSERT( (turn_on_mmu_end - turn_on_mmu) <= 64 ) somewhere in
> xen.lds.S ). Right now turn_on_mmu() function is 0x22 bytes and it is
> enough ( we are sure that we don't cross 4k boundary ) to be 64-byte
> aligned. But if the size will be more then 64 bytes then the alignment
> need to be changed to 0x128.
> Am i right?

Well, to 128 (without 0x), but yes. That function isn't very likely to
change much, though.

Jan
Oleksii Kurochko July 27, 2023, 7:48 a.m. UTC | #13
On Thu, 2023-07-27 at 09:25 +0200, Jan Beulich wrote:
> On 26.07.2023 20:39, Oleksii wrote:
> > On Wed, 2023-07-26 at 17:59 +0200, Jan Beulich wrote:
> > > On 26.07.2023 17:54, Oleksii wrote:
> > > > On Wed, 2023-07-26 at 17:00 +0200, Jan Beulich wrote:
> > > > > On 26.07.2023 15:12, Oleksii wrote:
> > > > > > On Wed, 2023-07-26 at 13:58 +0200, Jan Beulich wrote:
> > > > > > > On 26.07.2023 13:23, Oleksii wrote:
> > > > > > > > I would like to ask for advice on whether it would be
> > > > > > > > easier,
> > > > > > > > less
> > > > > > > > bug-
> > > > > > > > provoking ( during identity mapping to remove of whole
> > > > > > > > Xen
> > > > > > > > ) to
> > > > > > > > have a
> > > > > > > > separate identity section that won't be more than
> > > > > > > > PAGE_SIZE.
> > > > > > > 
> > > > > > > I'm afraid you can't safely do this in C, or at least not
> > > > > > > without
> > > > > > > further checking on what the compiler actually did.
> > > > > > > 
> > > > > > > > @@ -264,6 +268,19 @@ void __init enable_mmu(void)
> > > > > > > >                RV_STAGE1_MODE << SATP_MODE_SHIFT);
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +void __attribute__((naked)) __section(".ident")
> > > > > > > > turn_on_mmu(unsigned
> > > > > > > > long ra)
> > > > > > > 
> > > > > > > Did you read what gcc doc says about "naked"? Extended
> > > > > > > asm()
> > > > > > > isn't
> > > > > > > supported there. Since ...
> > > > > > > 
> > > > > > > > +{
> > > > > > > > +    /* Ensure page table writes precede loading the
> > > > > > > > SATP
> > > > > > > > */
> > > > > > > > +    sfence_vma();
> > > > > > > > +
> > > > > > > > +    /* Enable the MMU and load the new pagetable for
> > > > > > > > Xen
> > > > > > > > */
> > > > > > > > +    csr_write(CSR_SATP,
> > > > > > > > +              PFN_DOWN((unsigned
> > > > > > > > long)stage1_pgtbl_root) |
> > > > > > > > +              RV_STAGE1_MODE << SATP_MODE_SHIFT);
> > > > > > > > +
> > > > > > > > +    asm volatile( "jr %0\n" : : "r"(ra) );
> > > > > > > > +}
> > > > > > > 
> > > > > > > ... none of this really requires C, I think we're at the
> > > > > > > point
> > > > > > > where
> > > > > > > (iirc) Andrew's and my suggestion wants following, moving
> > > > > > > this to
> > > > > > > assembly code (at which point it doesn't need to be a
> > > > > > > separate
> > > > > > > function). You can still build page tables in C, of
> > > > > > > course.
> > > > > > > (Likely
> > > > > > > you then also won't need a separate section; some minimal
> > > > > > > alignment
> > > > > > > guarantees ought to suffice to make sure the critical
> > > > > > > code is
> > > > > > > confined to a single page.)
> > > > > > 
> > > > > > Thanks. I'll move all of this to assembly code.
> > > > > > Regarding alignment it is needed alignment on start and end
> > > > > > of
> > > > > > function:
> > > > > >     .balign PAGE_SIZE
> > > > > >     GLOBAL(turn_on_mmu)
> > > > > >         ...
> > > > > >     .balign PAGE_SIZE
> > > > > >     ENDPROC(turn_on_mmu)
> > > > > > 
> > > > > > Does the better way exist?
> > > > > 
> > > > > The function is only going to be a handful of instructions.
> > > > > Its
> > > > > alignment doesn't need to be larger than the next power of 2.
> > > > > I
> > > > > expect you'll be good with 64-byte alignment. (In no case do
> > > > > you
> > > > > need to align the end of the function: Putting other stuff
> > > > > there
> > > > > is not a problem at all.) What you want in any event is a
> > > > > build
> > > > > time check that the within-a-page constraint is met.
> > > > But shouldn't be an address be aligned to a boundary equal to
> > > > page
> > > > size?
> > > > 
> > > > According to the RISC-V privileged spec:
> > > > Any level of PTE may be a leaf PTE, so in addition to 4 KiB
> > > > pages, Sv39 supports 2 MiB megapages and 1 GiB gigapages, each
> > > > of which must be virtually and physically aligned to a boundary
> > > > equal to its size. A page-fault exception is raised if the
> > > > physical address is insufficiently aligned.
> > > 
> > > You'd simply map the page containing the chunk, i.e. masking off
> > > the low 12 bits. If far enough away from the Xen virtual range,
> > > you could as well map a 2M page masking off the low 21 bits, or a
> > > 1G page with the low 30 bits of the address cleared.
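[ A hypothetical C illustration of that masking; turn_on_mmu stands in
  for "the chunk", and LINK_TO_LOAD(), PAGE_SIZE, MB() and GB() are
  assumed to be available as elsewhere in this series: ]

    unsigned long pa      = LINK_TO_LOAD((unsigned long)turn_on_mmu);
    unsigned long base_4k = pa & ~(PAGE_SIZE - 1); /* clear low 12 bits */
    unsigned long base_2m = pa & ~(MB(2) - 1);     /* clear low 21 bits */
    unsigned long base_1g = pa & ~(GB(1) - 1);     /* clear low 30 bits */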
> > Agreed, then it will work.
> > 
> > But it's still not clear what to do if turn_on_mmu becomes bigger
> > than 64 bytes ( ASSERT( (turn_on_mmu_end - turn_on_mmu) <= 64 )
> > somewhere in xen.lds.S ). Right now the turn_on_mmu() function is
> > 0x22 bytes, so 64-byte alignment is enough ( we are sure we don't
> > cross a 4k boundary ). But if the size grows beyond 64 bytes then
> > the alignment would need to be changed to 0x128.
> > Am I right?
> 
> Well, to 128 (without 0x), but yes. That function isn't very likely
> to change much, though.
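[ A sketch of that build-time check, assuming a turn_on_mmu_end marker
  symbol is emitted right after the routine (both names illustrative): ]

In head.S:

        .balign 64
GLOBAL(turn_on_mmu)
        /* ... the handful of instructions ... */
GLOBAL(turn_on_mmu_end)

and in xen.lds.S:

ASSERT(turn_on_mmu_end - turn_on_mmu <= 64,
       "turn_on_mmu doesn't fit in its 64-byte alignment window")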
Thanks.

~ Oleksii

Patch

diff --git a/xen/arch/riscv/include/asm/mm.h b/xen/arch/riscv/include/asm/mm.h
index d9c4205103..085eaab7fb 100644
--- a/xen/arch/riscv/include/asm/mm.h
+++ b/xen/arch/riscv/include/asm/mm.h
@@ -13,7 +13,8 @@  extern unsigned char cpu0_boot_stack[];
 void setup_initial_pagetables(void);
 
 void enable_mmu(void);
-void cont_after_mmu_is_enabled(void);
+
+void remove_identity_mapping(void);
 
 void calc_phys_offset(void);
 
diff --git a/xen/arch/riscv/mm.c b/xen/arch/riscv/mm.c
index c84a8a7c3c..aae24f3a54 100644
--- a/xen/arch/riscv/mm.c
+++ b/xen/arch/riscv/mm.c
@@ -4,6 +4,7 @@ 
 #include <xen/compiler.h>
 #include <xen/init.h>
 #include <xen/kernel.h>
+#include <xen/macros.h>
 #include <xen/pfn.h>
 
 #include <asm/early_printk.h>
@@ -35,8 +36,10 @@  unsigned long __ro_after_init phys_offset;
  *
  * It might be needed one more page table in case when Xen load address
  * isn't 2 MB aligned.
+ *
+ * CONFIG_PAGING_LEVELS page tables are needed for identity mapping.
  */
-#define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) + 1)
+#define PGTBL_INITIAL_COUNT ((CONFIG_PAGING_LEVELS - 1) * 2) + 1
 
 pte_t __section(".bss.page_aligned") __aligned(PAGE_SIZE)
 stage1_pgtbl_root[PAGETABLE_ENTRIES];
@@ -75,10 +78,11 @@  static void __init setup_initial_mapping(struct mmu_desc *mmu_desc,
     unsigned int index;
     pte_t *pgtbl;
     unsigned long page_addr;
+    bool is_identity_mapping = map_start == pa_start;
 
-    if ( (unsigned long)_start % XEN_PT_LEVEL_SIZE(0) )
+    if ( !IS_ALIGNED((unsigned long)_start, KB(4)) )
     {
-        early_printk("(XEN) Xen should be loaded at 4k boundary\n");
+        early_printk("(XEN) Xen should be loaded at 4KB boundary\n");
         die();
     }
 
@@ -108,16 +112,18 @@  static void __init setup_initial_mapping(struct mmu_desc *mmu_desc,
             {
                 unsigned long paddr = (page_addr - map_start) + pa_start;
                 unsigned int permissions = PTE_LEAF_DEFAULT;
+                unsigned long addr = is_identity_mapping
+                                     ? page_addr : LINK_TO_LOAD(page_addr);
                 pte_t pte_to_be_written;
 
                 index = pt_index(0, page_addr);
 
-                if ( is_kernel_text(LINK_TO_LOAD(page_addr)) ||
-                     is_kernel_inittext(LINK_TO_LOAD(page_addr)) )
-                    permissions =
-                        PTE_EXECUTABLE | PTE_READABLE | PTE_VALID;
+                if ( is_kernel_text(addr) ||
+                     is_kernel_inittext(addr) )
+                        permissions =
+                            PTE_EXECUTABLE | PTE_READABLE | PTE_VALID;
 
-                if ( is_kernel_rodata(LINK_TO_LOAD(page_addr)) )
+                if ( is_kernel_rodata(addr) )
                     permissions = PTE_READABLE | PTE_VALID;
 
                 pte_to_be_written = paddr_to_pte(paddr, permissions);
@@ -211,6 +217,10 @@  void __init setup_initial_pagetables(void)
     unsigned long linker_start  = LOAD_TO_LINK(load_start);
     unsigned long linker_end    = LOAD_TO_LINK(load_end);
 
+    /*
+     * If the overlapping check will be removed then remove_identity_mapping()
+     * logic should be updated.
+     */
     if ( (linker_start != load_start) &&
          (linker_start <= load_end) && (load_start <= linker_end) )
     {
@@ -232,22 +242,18 @@  void __init setup_initial_pagetables(void)
                           linker_start,
                           linker_end,
                           load_start);
+
+    if ( linker_start == load_start )
+        return;
+
+    setup_initial_mapping(&mmu_desc,
+                          load_start,
+                          load_end,
+                          load_start);
 }
 
-void __init noreturn noinline enable_mmu()
+void __init enable_mmu(void)
 {
-    /*
-     * Calculate a linker time address of the mmu_is_enabled
-     * label and update CSR_STVEC with it.
-     * MMU is configured in a way where linker addresses are mapped
-     * on load addresses so in a case when linker addresses are not equal
-     * to load addresses, after MMU is enabled, it will cause
-     * an exception and jump to linker time addresses.
-     * Otherwise if load addresses are equal to linker addresses the code
-     * after mmu_is_enabled label will be executed without exception.
-     */
-    csr_write(CSR_STVEC, LOAD_TO_LINK((unsigned long)&&mmu_is_enabled));
-
     /* Ensure page table writes precede loading the SATP */
     sfence_vma();
 
@@ -255,25 +261,44 @@  void __init noreturn noinline enable_mmu()
     csr_write(CSR_SATP,
               PFN_DOWN((unsigned long)stage1_pgtbl_root) |
               RV_STAGE1_MODE << SATP_MODE_SHIFT);
+}
 
-    asm volatile ( ".p2align 2" );
- mmu_is_enabled:
-    /*
-     * Stack should be re-inited as:
-     * 1. Right now an address of the stack is relative to load time
-     *    addresses what will cause an issue in case of load start address
-     *    isn't equal to linker start address.
-     * 2. Addresses in stack are all load time relative which can be an
-     *    issue in case when load start address isn't equal to linker
-     *    start address.
-     *
-     * We can't return to the caller because the stack was reseted
-     * and it may have stash some variable on the stack.
-     * Jump to a brand new function as the stack was reseted
-     */
+void __init remove_identity_mapping(void)
+{
+    static pte_t *pgtbl = stage1_pgtbl_root;
+    static unsigned long load_start = XEN_VIRT_START;
+    static unsigned int pt_level = CONFIG_PAGING_LEVELS - 1;
+
+    unsigned long load_end = LINK_TO_LOAD(_end);
+    unsigned long xen_size;
+    unsigned long pt_level_size = XEN_PT_LEVEL_SIZE(pt_level);
+    unsigned long pte_nums;
+
+    unsigned long virt_indx = pt_index(pt_level, XEN_VIRT_START);
+    unsigned long indx;
+
+    if ( load_start == XEN_VIRT_START )
+        load_start = LINK_TO_LOAD(_start);
+
+    xen_size = load_end - load_start;
+    pte_nums = ROUNDUP(xen_size, pt_level_size) / pt_level_size;
+
+    while ( pte_nums-- )
+    {
+        indx = pt_index(pt_level, load_start);
 
-    switch_stack_and_jump((unsigned long)cpu0_boot_stack + STACK_SIZE,
-                          cont_after_mmu_is_enabled);
+        if ( virt_indx != indx )
+        {
+            pgtbl[indx].pte = 0;
+            load_start += XEN_PT_LEVEL_SIZE(pt_level);
+        }
+        else
+        {
+            pgtbl =  (pte_t *)LOAD_TO_LINK(pte_to_paddr(pgtbl[indx]));
+            pt_level--;
+            remove_identity_mapping();
+        }
+    }
 }
 
 /*
diff --git a/xen/arch/riscv/riscv64/head.S b/xen/arch/riscv/riscv64/head.S
index a28714e0ef..d74412351e 100644
--- a/xen/arch/riscv/riscv64/head.S
+++ b/xen/arch/riscv/riscv64/head.S
@@ -38,6 +38,28 @@  ENTRY(start)
 
         jal     calc_phys_offset
 
+        jal     setup_initial_pagetables
+
+        jal     enable_mmu
+
+        la      t1, phys_offset
+        REG_L   t1, (t1)
+
+        /* Calculate proper VA after jump from 1:1 mapping */
+        la      t0, .L_primary_switched
+        sub     t0, t0, t1
+
+        /* Jump from 1:1 mapping world */
+        jr      t0
+
+.L_primary_switched:
+        /*
+         * cpu0_boot_stack address is 1:1 mapping related so it should be
+         * recalculated after jump from 1:1 mapping world as 1:1 mapping
+         * will be removed soon in start_xen().
+         */
+        jal     reset_stack
+
         /* restore hart_id ( bootcpu_id ) and dtb address */
         mv      a0, s0
         mv      a1, s1
diff --git a/xen/arch/riscv/setup.c b/xen/arch/riscv/setup.c
index dde8fb898b..6593f601c1 100644
--- a/xen/arch/riscv/setup.c
+++ b/xen/arch/riscv/setup.c
@@ -13,20 +13,10 @@  unsigned char __initdata cpu0_boot_stack[STACK_SIZE]
 void __init noreturn start_xen(unsigned long bootcpu_id,
                                paddr_t dtb_addr)
 {
-    early_printk("Hello from C env\n");
-
-    setup_initial_pagetables();
-
-    enable_mmu();
-
-    for ( ;; )
-        asm volatile ("wfi");
+    remove_identity_mapping();
 
-    unreachable();
-}
+    early_printk("Hello from C env\n");
 
-void __init noreturn cont_after_mmu_is_enabled(void)
-{
     early_printk("All set up\n");
 
     for ( ;; )
diff --git a/xen/arch/riscv/xen.lds.S b/xen/arch/riscv/xen.lds.S
index 9064852173..31ccebadcb 100644
--- a/xen/arch/riscv/xen.lds.S
+++ b/xen/arch/riscv/xen.lds.S
@@ -173,4 +173,8 @@  ASSERT(IS_ALIGNED(__bss_end,        POINTER_ALIGN), "__bss_end is misaligned")
 ASSERT(!SIZEOF(.got),      ".got non-empty")
 ASSERT(!SIZEOF(.got.plt),  ".got.plt non-empty")
 
+/*
+ * Changing the size of Xen binary can require an update of
+ * PGTBL_INITIAL_COUNT.
+ */
 ASSERT(_end - _start <= MB(2), "Xen too large for early-boot assumptions")