[v4,6/7] xen/riscv: page table handling

Message ID	8362795280a48702bef6f01d41d148edcf299935.1723214540.git.oleksii.kurochko@gmail.com (mailing list archive)
State	Superseded
Headers
  Return-Path: <xen-devel-bounces@lists.xenproject.org>
  Errors-To: xen-devel-bounces@lists.xenproject.org
  Precedence: list
  Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org>
  From: Oleksii Kurochko <oleksii.kurochko@gmail.com>
  To: xen-devel@lists.xenproject.org
  Cc: Oleksii Kurochko <oleksii.kurochko@gmail.com>, Alistair Francis <alistair.francis@wdc.com>, Bob Eshleman <bobbyeshleman@gmail.com>, Connor Davis <connojdavis@gmail.com>, Andrew Cooper <andrew.cooper3@citrix.com>, Jan Beulich <jbeulich@suse.com>, Julien Grall <julien@xen.org>, Stefano Stabellini <sstabellini@kernel.org>
  Subject: [PATCH v4 6/7] xen/riscv: page table handling
  Date: Fri, 9 Aug 2024 18:19:44 +0200
  Message-ID: <8362795280a48702bef6f01d41d148edcf299935.1723214540.git.oleksii.kurochko@gmail.com>
  In-Reply-To: <cover.1723214540.git.oleksii.kurochko@gmail.com>
  References: <cover.1723214540.git.oleksii.kurochko@gmail.com>
  MIME-Version: 1.0
  Content-Type: text/plain; charset=UTF-8
  Content-Transfer-Encoding: 8bit
Series	RISCV device tree mapping
  [v4,0/7] RISCV device tree mapping
  [v4,1/7] xen/riscv: enable CONFIG_HAS_DEVICE_TREE
  [v4,2/7] xen/riscv: set up fixmap mappings
  [v4,3/7] xen/riscv: introduce asm/pmap.h header
  [v4,4/7] xen/riscv: introduce functionality to work with CPU info
  [v4,5/7] xen/riscv: introduce and initialize SBI RFENCE extension
  [v4,6/7] xen/riscv: page table handling
  [v4,7/7] xen/riscv: introduce early_fdt_map()

Oleksii Kurochko Aug. 9, 2024, 4:19 p.m. UTC

Implement map_pages_to_xen() which requires several
functions to manage page tables and entries:
- pt_update()
- pt_mapping_level()
- pt_update_entry()
- pt_next_level()
- pt_check_entry()

To support these operations, add functions for creating,
mapping, and unmapping Xen tables:
- create_xen_table()
- xen_map_table()
- xen_unmap_table()

Introduce internal macros starting with PTE_* for convenience.
These macros closely resemble PTE bits, with the exception of
PTE_BLOCK, which indicates that a page larger than 4KB is
needed.
RISC-V detects superpages using `pte.x` and `pte.r`, as there
is no specific bit in the PTE for this purpose. From the RISC-V spec:
```
  ...
  4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 5.
     Otherwise, this PTE is a pointer to the next level of the page table.
     ... .
  5. A leaf PTE has been found.
     ...
  ...
```
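
A minimal sketch of the rule quoted above, not the code from this patch: the
bit positions (V=0, R=1, W=2, X=3) are those of the RISC-V privileged spec,
while the sk_-prefixed names are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PTE permission bits as laid out in the RISC-V privileged spec. */
#define SK_PTE_V (UINT64_C(1) << 0)
#define SK_PTE_R (UINT64_C(1) << 1)
#define SK_PTE_W (UINT64_C(1) << 2)
#define SK_PTE_X (UINT64_C(1) << 3)

/* Step 4/5 of the walk: a valid PTE with R or X set is a leaf. */
static bool sk_pte_is_leaf(uint64_t pte)
{
    return (pte & SK_PTE_V) && (pte & (SK_PTE_R | SK_PTE_X));
}

/* A valid PTE with R, W and X all clear points to the next-level table. */
static bool sk_pte_is_table(uint64_t pte)
{
    return (pte & (SK_PTE_V | SK_PTE_R | SK_PTE_W | SK_PTE_X)) == SK_PTE_V;
}
```

The same test applies at every level, so a leaf found above level 0 is a
superpage.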

The code doesn’t support superpage shattering, so 4KB pages are used by
default.
Additionally, as mentioned in the RISC-V privileged spec:
```
 After much deliberation, we have settled on a conventional page size of
 4 KiB for both RV32 and RV64. We expect this decision to ease the porting
 of low-level runtime software and device drivers.

 The TLB reach problem is ameliorated by transparent superpage support in
 modern operating systems [2]. Additionally, multi-level TLB hierarchies
 are quite inexpensive relative to the multi-level cache hierarchies whose
 address space they map.

 [2] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. Practical,
     transparent operating system support for superpages.
     SIGOPS Oper. Syst. Rev., 36(SI):89–104, December 2002.
```

In addition, introduce flush_tlb_range_va() for TLB flushing across
CPUs after updating the PTE for the requested mapping.

Signed-off-by: Oleksii Kurochko <oleksii.kurochko@gmail.com>
---
Changes in V4:
 - update the commit message.
 - drop xen_ prefix for functions: xen_pt_update(), xen_pt_mapping_level(),
   xen_pt_update_entry(), xen_pt_next_level(), xen_pt_check_entry().
 - drop 'select GENERIC_PT' for CONFIG_RISCV. There is no GENERIC_PT anymore.
 - update implementation of flush_xen_tlb_range_va and s/flush_xen_tlb_range_va/flush_tlb_range_va
 - s/pte_get_mfn/mfn_from_pte. Other similar definitions I decided not to touch, as
   they were introduced earlier; this naming pattern will be applied to
   newly introduced macros.
 - drop _PAGE_* definitions and use analogues of PTE_*.
 - introduce PTE_{W,X,R}_MASK and drop PAGE_{XN,W,X}_MASK. Also drop _PAGE_{*}_BIT
 - introduce PAGE_HYPERVISOR_RX.
 - drop unused now l3_table_offset.
 - drop struct pt_t as it was used only for one function. If it is needed in the future,
   pt_t will be re-introduced.
 - code style fixes in pte_is_table(). drop level argument from it.
 - update implementation and prototype of pte_is_mapping().
 - drop level argument from pt_next_level().
 - introduce definition of SATP_PPN_MASK.
 - isolate PPN of CSR_SATP before shift by PAGE_SHIFT.
 - drop set_permission() function as it is not used more than once.
 - update prototype of pt_check_entry(): drop level argument as it is not used.
 - pt_check_entry():
   - code style fixes
   - update the sanity check when modifying an entry
   - update the sanity check when removing a mapping.
 - s/read_only/alloc_only.
 - code style fixes for pt_next_level().
 - pt_update_entry() changes:
   - drop arch_level variable inside pt_update_entry()
   - drop conversion near virt to paddr_t in DECLARE_OFFSETS(offsets, virt);
   - pull out "goto out" inside first 'for' cycle.
   - drop braces for 'if' cases which have only one line.
   - indent 'out' label with one blank.
   - update the comment above alloc_only and also the definition to take into
     account whether pte population was requested or not.
   - drop target variable and rename arch_target argument of the function to
     target.
 - pt_mapping_level() changes:
   - move the check if PTE_BLOCK should be mapped on the top of the function.
   - change int i to unsigned int and update 'for' cycle correspondingly.
 - update prototype of pt_update():
   - drop the comment above nr_mfns and drop const to be consistent with other
     arguments.
   - always flush TLB at the end of the function as non-present entries can be put
     in the TLB.
   - add fence before TLB flush to ensure that PTEs are all updated before flushing.
 - s/XEN_TABLE_NORMAL_PAGE/XEN_TABLE_NORMAL
 - add a check in map_pages_to_xen() that the mfn is not INVALID_MFN.
 - add the comment on top of pt_update() how mfn = INVALID_MFN is considered.
 - s/_PAGE_BLOCK/PTE_BLOCK.
 - add the comment with additional explanation for PTE_BLOCK.
 - drop definition of FIRST_SIZE as it isn't used.
---
Changes in V3:
 - new patch. ( Technically it is reworked version of the generic approach
   which I tried to suggest in the previous version )
---

 xen/arch/riscv/Makefile                     |   1 +
 xen/arch/riscv/include/asm/flushtlb.h       |  13 +
 xen/arch/riscv/include/asm/mm.h             |   2 +
 xen/arch/riscv/include/asm/page.h           |  76 ++++
 xen/arch/riscv/include/asm/riscv_encoding.h |   1 +
 xen/arch/riscv/mm.c                         |   9 -
 xen/arch/riscv/pt.c                         | 408 ++++++++++++++++++++
 7 files changed, 501 insertions(+), 9 deletions(-)
 create mode 100644 xen/arch/riscv/pt.c

Jan Beulich Aug. 13, 2024, 10:31 a.m. UTC | #1

On 09.08.2024 18:19, Oleksii Kurochko wrote:
> Implement map_pages_to_xen() which requires several
> functions to manage page tables and entries:
> - pt_update()
> - pt_mapping_level()
> - pt_update_entry()
> - pt_next_level()
> - pt_check_entry()
> 
> To support these operations, add functions for creating,
> mapping, and unmapping Xen tables:
> - create_xen_table()
> - xen_map_table()
> - xen_unmap_table()

I think I commented on this before: Everything is "Xen" in hypervisor
code. What I think you mean is to map/unmap Xen's own page tables.
Naming-wise that would be {,un}map_xen_table(), though. Since they
are static, just {,un}map_table() ought to be unambiguous, too.

> Introduce internal macros starting with PTE_* for convenience.
> These macros closely resemble PTE bits, with the exception of
> PTE_BLOCK, which indicates that a page larger than 4KB is
> needed.

I did comment on this too, iirc: Is there going to be any case where
large pages are going to be "needed", i.e. not just preferred? If not,
giving the caller control over things the other way around (requesting
4k mappings are needed, as we have it in x86) may be preferable.

Hmm, but then ...

> RISC-V detects superpages using `pte.x` and `pte.r`, as there
> is no specific bit in the PTE for this purpose. From the RISC-V spec:
> ```
>   ...
>   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 5.
>      Otherwise, this PTE is a pointer to the next level of the page table.
>      ... .
>   5. A leaf PTE has been found.
>      ...
>   ...
> ```
> 
> The code doesn’t support super page shattering so 4KB pages are used as
> default.

... you have this. Yet still callers expecting re-mapping in the (large)
range they map can request small-page mappings right away.
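
To make the trade-off concrete, here is a hedged sketch of how a mapping
level could be chosen when the caller opts in to large mappings. The body of
pt_mapping_level() is not part of the hunks quoted in this reply, so
everything below (the sk_ names, the alignment rule, Sv39 with root level 2)
is an assumption for illustration, not the patch's implementation.

```c
#include <assert.h>

#define SK_PTE_BLOCK       (1U << 10)  /* software flag, bit 10 as in the patch */
#define SK_PAGETABLE_ORDER 9           /* 512 entries per table */
#define SK_ROOT_LEVEL      2           /* assumed: Sv39 */
#define SK_INVALID_MFN     (~0UL)

/*
 * Pick the highest level whose page size fits the request: both frame
 * numbers must be aligned to it and enough pages must be left to map.
 * Without SK_PTE_BLOCK the answer is always level 0 (4 KiB pages).
 */
static unsigned int sk_mapping_level(unsigned long vfn, unsigned long mfn,
                                     unsigned long nr, unsigned int flags)
{
    unsigned long mask = vfn;
    unsigned int level;

    if ( !(flags & SK_PTE_BLOCK) )
        return 0;

    /* Only a real MFN constrains the alignment. */
    if ( mfn != SK_INVALID_MFN )
        mask |= mfn;

    for ( level = SK_ROOT_LEVEL; level > 0; level-- )
    {
        unsigned long pages = 1UL << (level * SK_PAGETABLE_ORDER);

        if ( !(mask & (pages - 1)) && nr >= pages )
            return level;
    }

    return 0;
}
```

Inverting the flag (large pages by default, with a flag forcing 4k mappings)
would only change the first check in this sketch.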

> --- a/xen/arch/riscv/include/asm/flushtlb.h
> +++ b/xen/arch/riscv/include/asm/flushtlb.h
> @@ -5,12 +5,25 @@
>  #include <xen/bug.h>
>  #include <xen/cpumask.h>
>  
> +#include <asm/sbi.h>
> +
>  /* Flush TLB of local processor for address va. */
>  static inline void flush_tlb_one_local(vaddr_t va)
>  {
>      asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
>  }
>  
> +/*
> + * Flush a range of VA's hypervisor mappings from the TLB of all
> + * processors in the inner-shareable domain.
> + */
> +static inline void flush_tlb_range_va(vaddr_t va,
> +                                      size_t size)

No need for line wrapping here?

> @@ -33,15 +38,72 @@
>  #define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
>  #define PTE_TABLE                   (PTE_VALID)
>  
> +#define PAGE_HYPERVISOR_RO          (PTE_VALID | PTE_READABLE)
>  #define PAGE_HYPERVISOR_RW          (PTE_VALID | PTE_READABLE | PTE_WRITABLE)
> +#define PAGE_HYPERVISOR_RX          (PTE_VALID | PTE_READABLE | PTE_EXECUTABLE)
>  
>  #define PAGE_HYPERVISOR             PAGE_HYPERVISOR_RW
>  
> +
> +/*

Nit: As before, no double blank lines please.

> + * There are no such bits in PTE format for RISC-V.

This is an odd way to start a comment: There's nothing for "such" to refer
to.

> + * The code doesn’t support super page shattering so at the moment superpages
> + * can't be used as a default so PTE_BLOCK is introduced to have ability to
> + * tell that superpage should be allocated.
> + * Additionaly as mentioed in RISC-V priviliged spec:
> + * ```
> + *  After much deliberation, we have settled on a conventional page size of
> + *  4 KiB for both RV32 and RV64. We expect this decision to ease the porting
> + *  of low-level runtime software and device drivers.
> + *
> + *  The TLB reach problem is ameliorated by transparent superpage support in
> + *  modern operating systems [2]. Additionally, multi-level TLB hierarchies
> + *  are quite inexpensive relative to the multi-level cache hierarchies whose
> + *  address space they map.
> + *
> + *  [2] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. Practical,
> + *      transparent operating system support for superpages.
> + *      SIGOPS Oper. Syst. Rev., 36(SI):89–104, December 2002.
> + * ```
> + *
> + * PTE_POPULATE is introduced to have ability to tell that page tables
> + * shoud be populated.
> + */
> +#define PTE_BLOCK       BIT(10, UL)
> +#define PTE_POPULATE    BIT(11, UL)
> +
> +#define PTE_R_MASK(x)   ((x) & PTE_READABLE)
> +#define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
> +#define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
> +
> +#define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE | PTE_EXECUTABLE))
> +
>  /* Calculate the offsets into the pagetables for a given VA */
>  #define pt_linear_offset(lvl, va)   ((va) >> XEN_PT_LEVEL_SHIFT(lvl))
>  
>  #define pt_index(lvl, va) (pt_linear_offset((lvl), (va)) & VPN_MASK)
>  
> +#define TABLE_OFFSET(offs) (_AT(unsigned int, offs) & ((_AC(1, U) << PAGETABLE_ORDER) - 1))

Nit: Too long line.

> +#if RV_STAGE1_MODE > SATP_MODE_SV48

SV48? Isn't ...

> +#error "need to to update DECLARE_OFFSETS macros"
> +#else
> +
> +#define l0_table_offset(va) TABLE_OFFSET(pt_linear_offset(0, va))
> +#define l1_table_offset(va) TABLE_OFFSET(pt_linear_offset(1, va))
> +#define l2_table_offset(va) TABLE_OFFSET(pt_linear_offset(2, va))
> +
> +/* Generate an array @var containing the offset for each level from @addr */
> +#define DECLARE_OFFSETS(var, addr)          \
> +    const unsigned int var[] = {            \
> +        l0_table_offset(addr),              \
> +        l1_table_offset(addr),              \
> +        l2_table_offset(addr),              \
> +    }

... this for SV39?
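
For reference, Sv39 translates through three levels of 9-bit indices above
the 12-bit page offset, so an array of three offsets covers Sv39 only. The
sketch below illustrates that split; the sk_ names are hypothetical, and the
shift formula (page shift plus level times table order) is assumed to match
what XEN_PT_LEVEL_SHIFT() computes.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT      12  /* 4 KiB base pages */
#define SK_PAGETABLE_ORDER 9   /* 512 entries per table */
#define SK_SV39_LEVELS     3   /* levels 0 (leaf) to 2 (root) */

/* Bit position where the VPN field of the given level starts. */
static unsigned int sk_level_shift(unsigned int level)
{
    return SK_PAGE_SHIFT + level * SK_PAGETABLE_ORDER;
}

/* Index into the page table at `level` for virtual address `va`. */
static unsigned int sk_table_offset(unsigned int level, uint64_t va)
{
    return (unsigned int)(va >> sk_level_shift(level)) &
           ((1U << SK_PAGETABLE_ORDER) - 1);
}

/* Fill offsets[0..2], lowest level first, like DECLARE_OFFSETS() does. */
static void sk_decl_offsets(uint64_t va, unsigned int offsets[SK_SV39_LEVELS])
{
    unsigned int level;

    for ( level = 0; level < SK_SV39_LEVELS; level++ )
        offsets[level] = sk_table_offset(level, va);
}
```

Sv48 would need a fourth entry (a level-3 offset), which is why the guard
has to trip for anything above Sv39.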

> @@ -67,6 +129,20 @@ static inline bool pte_is_valid(pte_t p)
>      return p.pte & PTE_VALID;
>  }
>  
> +inline bool pte_is_table(const pte_t p)
> +{
> +    return ((p.pte & (PTE_VALID |
> +                      PTE_READABLE |
> +                      PTE_WRITABLE |
> +                      PTE_EXECUTABLE)) == PTE_VALID);
> +}
> +
> +static inline bool pte_is_mapping(const pte_t p)
> +{
> +    return (p.pte & PTE_VALID) &&
> +           ((p.pte & PTE_WRITABLE) || (p.pte & PTE_EXECUTABLE));

Shorter as (p.pte & (PTE_WRITABLE | PTE_EXECUTABLE)) ?
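
A small sketch of why the two spellings are interchangeable: it compares
them over every combination of the V, R, W and X bits. Bit positions follow
the RISC-V privileged spec; the sk_ names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_PTE_V (UINT64_C(1) << 0)
#define SK_PTE_R (UINT64_C(1) << 1)
#define SK_PTE_W (UINT64_C(1) << 2)
#define SK_PTE_X (UINT64_C(1) << 3)

/* The form used in the quoted hunk. */
static bool sk_is_mapping_long(uint64_t pte)
{
    return (pte & SK_PTE_V) && ((pte & SK_PTE_W) || (pte & SK_PTE_X));
}

/* The shorter form with a combined mask. */
static bool sk_is_mapping_short(uint64_t pte)
{
    return (pte & SK_PTE_V) && (pte & (SK_PTE_W | SK_PTE_X));
}

/* Exhaustively compare both forms over the low four PTE bits. */
static bool sk_forms_agree(void)
{
    uint64_t pte;

    for ( pte = 0; pte < 16; pte++ )
        if ( sk_is_mapping_long(pte) != sk_is_mapping_short(pte) )
            return false;

    return true;
}
```

Both forms return false for a valid read-only entry (V and R only), since
neither looks at the R bit.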

> --- /dev/null
> +++ b/xen/arch/riscv/pt.c
> @@ -0,0 +1,408 @@
> +#include <xen/bug.h>
> +#include <xen/domain_page.h>
> +#include <xen/errno.h>
> +#include <xen/mm.h>
> +#include <xen/mm-frame.h>
> +#include <xen/pmap.h>
> +#include <xen/spinlock.h>
> +
> +#include <asm/flushtlb.h>
> +#include <asm/page.h>
> +
> +static inline const mfn_t get_root_page(void)
> +{
> +    unsigned long root_maddr =

maddr_t or paddr_t?

> +        (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
> +
> +    return maddr_to_mfn(root_maddr);
> +}
> +
> +/*
> + * Sanity check of the entry
> + * mfn is not valid and we are not populating page table. This means

How does this fit with ...

> + * we either modify entry or remove an entry.
> + */
> +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int flags)
> +{
> +    /* Sanity check when modifying an entry. */
> +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )

... the MFN check here? And why is (valid,INVALID_MFN) an indication
of a modification? But then ...

> +    {
> +        /* We don't allow modifying an invalid entry. */
> +        if ( !pte_is_valid(entry) )
> +        {
> +            printk("Modifying invalid entry is not allowed.\n");
> +            return false;
> +        }

... I also don't understand what this is about. IOW I'm afraid I'm
still confused about the purpose of this function as well as the
transitions you want to permit / reject. I wonder whether the
flags & PTE_VALID and pte_is_valid(entry) aren't in need of swapping.
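
As far as the quoted pt_update_entry() body shows, the requested operation
is derived from (flags & PTE_VALID) and whether mfn is INVALID_MFN. The
sketch below only restates that decoding; the enum and the sk_ names are
hypothetical, and PTE_POPULATE is left out because its handling is not
visible in the quoted hunks.

```c
#include <assert.h>

#define SK_PTE_VALID   (1U << 0)
#define SK_INVALID_MFN (~0UL)

enum sk_pt_op {
    SK_PT_REMOVE,  /* PTE_VALID clear: the entry is zeroed */
    SK_PT_INSERT,  /* PTE_VALID set, real MFN: a new PTE is built */
    SK_PT_MODIFY,  /* PTE_VALID set, INVALID_MFN: permissions are updated */
};

/* Decode the operation the same way the quoted if/else chain does. */
static enum sk_pt_op sk_classify(unsigned long mfn, unsigned int flags)
{
    if ( !(flags & SK_PTE_VALID) )
        return SK_PT_REMOVE;

    if ( mfn != SK_INVALID_MFN )
        return SK_PT_INSERT;

    return SK_PT_MODIFY;
}
```

With the operation named explicitly, each sanity check can state which
existing-entry states it accepts (for example, modify requires a valid
entry).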

> +/* Update an entry at the level @target. */
> +static int pt_update_entry(mfn_t root, unsigned long virt,
> +                           mfn_t mfn, unsigned int target,
> +                           unsigned int flags)
> +{
> +    int rc;
> +    unsigned int level = HYP_PT_ROOT_LEVEL;
> +    pte_t *table;
> +    /*
> +     * The intermediate page tables are read-only when the MFN is not valid
> +     * and we are not populating page table.

The way flags are handled in PTEs, how can page tables be read-only?

> +     * This means we either modify permissions or remove an entry.

From all I can determine we also get here when making brand new entries.
What am I overlooking?

> +     */
> +    bool alloc_only = mfn_eq(mfn, INVALID_MFN) && !(flags & PTE_POPULATE);
> +    pte_t pte, *entry;
> +
> +    /* convenience aliases */
> +    DECLARE_OFFSETS(offsets, virt);
> +
> +    table = xen_map_table(root);
> +    for ( ; level > target; level-- )
> +    {
> +        rc = pt_next_level(alloc_only, &table, offsets[level]);
> +        if ( rc == XEN_TABLE_MAP_FAILED )
> +        {
> +            rc = 0;
> +
> +            /*
> +             * We are here because pt_next_level has failed to map
> +             * the intermediate page table (e.g the table does not exist
> +             * and the pt is read-only). It is a valid case when
> +             * removing a mapping as it may not exist in the page table.
> +             * In this case, just ignore it.
> +             */
> +            if ( flags & PTE_VALID )
> +            {
> +                printk("%s: Unable to map level %u\n", __func__, level);
> +                rc = -ENOENT;
> +            }
> +
> +            goto out;
> +        }
> +        else if ( rc != XEN_TABLE_NORMAL )
> +            break;
> +    }
> +
> +    if ( level != target )
> +    {
> +        printk("%s: Shattering superpage is not supported\n", __func__);
> +        rc = -EOPNOTSUPP;
> +        goto out;
> +    }
> +
> +    entry = table + offsets[level];
> +
> +    rc = -EINVAL;
> +    if ( !pt_check_entry(*entry, mfn, flags) )
> +        goto out;
> +
> +    /* We are removing the page */
> +    if ( !(flags & PTE_VALID) )
> +        memset(&pte, 0x00, sizeof(pte));
> +    else
> +    {
> +        /* We are inserting a mapping => Create new pte. */
> +        if ( !mfn_eq(mfn, INVALID_MFN) )
> +            pte = pte_from_mfn(mfn, PTE_VALID);
> +        else /* We are updating the permission => Copy the current pte. */
> +            pte = *entry;
> +
> +        /* update permission according to the flags */
> +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
> +    }
> +
> +    write_pte(entry, pte);
> +
> +    rc = 0;
> +
> + out:
> +    xen_unmap_table(table);
> +
> +    return rc;
> +}
> +
> +static DEFINE_SPINLOCK(xen_pt_lock);

If you put this in the middle of the file (which is fine), I think it
wants putting immediately ahead of the (first) function using it, not
at some seemingly random place.

> +/*
> + * If `mfn` equals `INVALID_MFN`, it indicates that the following page table
> + * update operation might be related to either populating the table,
> + * destroying a mapping, or modifying an existing mapping.
> + */
> +static int pt_update(unsigned long virt,
> +                     mfn_t mfn,
> +                     unsigned long nr_mfns,
> +                     unsigned int flags)
> +{
> +    int rc = 0;
> +    unsigned long vfn = virt >> PAGE_SHIFT;
> +    unsigned long left = nr_mfns;
> +
> +    const mfn_t root = get_root_page();
> +
> +    /*
> +     * It is bad idea to have mapping both writeable and
> +     * executable.
> +     * When modifying/creating mapping (i.e PTE_VALID is set),
> +     * prevent any update if this happen.
> +     */
> +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) && PTE_X_MASK(flags) )
> +    {
> +        printk("Mappings should not be both Writeable and Executable.\n");
> +        return -EINVAL;
> +    }
> +
> +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
> +    {
> +        printk("The virtual address is not aligned to the page-size.\n");
> +        return -EINVAL;
> +    }
> +
> +    spin_lock(&xen_pt_lock);
> +
> +    while ( left )
> +    {
> +        unsigned int order, level;
> +
> +        level = pt_mapping_level(vfn, mfn, left, flags);
> +        order = XEN_PT_LEVEL_ORDER(level);
> +
> +        ASSERT(left >= BIT(order, UL));
> +
> +        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level,
> +                                    flags);

Indentation.

> +        if ( rc )
> +            break;
> +
> +        vfn += 1U << order;
> +        if ( !mfn_eq(mfn, INVALID_MFN) )
> +            mfn = mfn_add(mfn, 1U << order);
> +
> +        left -= (1U << order);

To be on the safe side, 1UL everywhere?
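
A short sketch of the concern: with a 32-bit unsigned int, `1U << order` is
undefined once order reaches 32. The Sv39 orders (0, 9, 18) stay well below
that, but a deeper mode could exceed it (order 36 for a level-4 mapping in
Sv57). The helper name is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Number of 4 KiB pages covered by a mapping of the given order. */
static unsigned long sk_pages_for_order(unsigned int order)
{
    /* The shift is only defined below the width of unsigned long. */
    assert(order < sizeof(unsigned long) * CHAR_BIT);

    return 1UL << order;
}
```

Using one helper for the vfn, mfn and left updates would also keep the three
expressions from drifting apart.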

> +        if ( rc )
> +            break;

There was such a check already a few lines up from here.

> +    }
> +
> +
> +    /* ensure that PTEs are all updated before flushing */

Again: No double blank lines please.

> +    RISCV_FENCE(rw, rw);
> +
> +    /*
> +     * always flush TLB at the end of the function as non-present entries
> +     * can be put in the TLB
> +     */
> +    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);

Coming back to "negative" TLB entries: Assuming RISC-V, just like other
architectures, also permits intermediate page table entries to be cached,
the affected VA range may be much larger than the original request, if
intermediate page tables needed creating.

> +    spin_unlock(&xen_pt_lock);

Does this really need to come after fence and flush?

> +    return rc;
> +}
> +
> +int map_pages_to_xen(unsigned long virt,
> +                     mfn_t mfn,
> +                     unsigned long nr_mfns,
> +                     unsigned int flags)
> +{
> +    /*
> +     * Ensure that we have a valid MFN before proceeding.
> +     *
> +     * If the MFN is invalid, pt_update() might misinterpret the operation,
> +     * treating it as either a population, a mapping destruction,
> +     * or a mapping modification.
> +     */
> +    ASSERT(!mfn_eq(mfn, INVALID_MFN));

But flags with PTE_VALID not set are fine to come into here?

> +    return pt_update(virt, mfn, nr_mfns, flags);
> +}

Jan

Oleksii Kurochko Aug. 14, 2024, 4:50 p.m. UTC | #2

On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > Implement map_pages_to_xen() which requires several
> > functions to manage page tables and entries:
> > - pt_update()
> > - pt_mapping_level()
> > - pt_update_entry()
> > - pt_next_level()
> > - pt_check_entry()
> > 
> > To support these operations, add functions for creating,
> > mapping, and unmapping Xen tables:
> > - create_xen_table()
> > - xen_map_table()
> > - xen_unmap_table()
> 
> I think I commented on this before: Everything is "Xen" in hypervisor
> code. What I think you mean is to map/unmap Xen's own page tables.
> Naming-wise that would be {,un}map_xen_table(), though. Since they
> are static, just {,un}map_table() ought to be unambiguous, too.
I thought that your comment was about the pt_*() functions, but thanks for
the explanation again.

> 
> > Introduce internal macros starting with PTE_* for convenience.
> > These macros closely resemble PTE bits, with the exception of
> > PTE_BLOCK, which indicates that a page larger than 4KB is
> > needed.
> 
> I did comment on this too, iirc: Is there going to be any case where
> large pages are going to be "needed", i.e. not just preferred? If
> not,
> giving the caller control over things the other way around
> (requesting
> 4k mappings are needed, as we have it in x86) may be preferable.
Yes, you did comment on this, but I thought it would be enough to
mention that shattering isn't supported now. Also, since
pt_update_entry() is used to unmap as well, there could be a need to
unmap e.g. a 4K page from a 2M block mapping, which would be a little bit
harder than just having 4K mappings by default.

> 
> Hmm, but then ...
> 
> > RISC-V detects superpages using `pte.x` and `pte.r`, as there
> > is no specific bit in the PTE for this purpose. From the RISC-V
> > spec:
> > ```
> >   ...
> >   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to
> > step 5.
> >      Otherwise, this PTE is a pointer to the next level of the page
> > table.
> >      ... .
> >   5. A leaf PTE has been found.
> >      ...
> >   ...
> > ```
> > 
> > The code doesn’t support super page shattering so 4KB pages are
> > used as
> > default.
> 
> ... you have this. Yet still callers expecting re-mapping in the
> (large)
> range they map can request small-page mappings right away.
I am not sure that I fully understand what you mean by "expecting
re-mapping".

> 
> > --- a/xen/arch/riscv/include/asm/flushtlb.h
> > +++ b/xen/arch/riscv/include/asm/flushtlb.h
> > @@ -5,12 +5,25 @@
> >  #include <xen/bug.h>
> >  #include <xen/cpumask.h>
> >  
> > +#include <asm/sbi.h>
> > +
> >  /* Flush TLB of local processor for address va. */
> >  static inline void flush_tlb_one_local(vaddr_t va)
> >  {
> >      asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
> >  }
> >  
> > +/*
> > + * Flush a range of VA's hypervisor mappings from the TLB of all
> > + * processors in the inner-shareable domain.
> > + */
> > +static inline void flush_tlb_range_va(vaddr_t va,
> > +                                      size_t size)
> 
> No need for line wrapping here?
What is line wrapping here? Do you mean that size_t size should be on
the previous line?

> 
> > @@ -33,15 +38,72 @@
> >  #define PTE_LEAF_DEFAULT            (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE)
> >  #define PTE_TABLE                   (PTE_VALID)
> >  
> > +#define PAGE_HYPERVISOR_RO          (PTE_VALID | PTE_READABLE)
> >  #define PAGE_HYPERVISOR_RW          (PTE_VALID | PTE_READABLE |
> > PTE_WRITABLE)
> > +#define PAGE_HYPERVISOR_RX          (PTE_VALID | PTE_READABLE |
> > PTE_EXECUTABLE)
> >  
> >  #define PAGE_HYPERVISOR             PAGE_HYPERVISOR_RW
> >  
> > +
> > +/*
> 
> Nit: As before, no double blank lines please.
> 
> > + * There are no such bits in PTE format for RISC-V.
> 
> This is an odd way to start a comment: There's nothing for "such" to
> refer
> to.
> 
> > + * The code doesn’t support super page shattering so at the moment
> > superpages
> > + * can't be used as a default so PTE_BLOCK is introduced to have
> > ability to
> > + * tell that superpage should be allocated.
> > + * Additionaly as mentioed in RISC-V priviliged spec:
> > + * ```
> > + *  After much deliberation, we have settled on a conventional
> > page size of
> > + *  4 KiB for both RV32 and RV64. We expect this decision to ease
> > the porting
> > + *  of low-level runtime software and device drivers.
> > + *
> > + *  The TLB reach problem is ameliorated by transparent superpage
> > support in
> > + *  modern operating systems [2]. Additionally, multi-level TLB
> > hierarchies
> > + *  are quite inexpensive relative to the multi-level cache
> > hierarchies whose
> > + *  address space they map.
> > + *
> > + *  [2] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox.
> > Practical,
> > + *      transparent operating system support for superpages.
> > + *      SIGOPS Oper. Syst. Rev., 36(SI):89–104, December 2002.
> > + * ```
> > + *
> > + * PTE_POPULATE is introduced to have ability to tell that page
> > tables
> > + * shoud be populated.
> > + */
> > +#define PTE_BLOCK       BIT(10, UL)
> > +#define PTE_POPULATE    BIT(11, UL)
> > +
> > +#define PTE_R_MASK(x)   ((x) & PTE_READABLE)
> > +#define PTE_W_MASK(x)   ((x) & PTE_WRITABLE)
> > +#define PTE_X_MASK(x)   ((x) & PTE_EXECUTABLE)
> > +
> > +#define PTE_RWX_MASK(x) ((x) & (PTE_READABLE | PTE_WRITABLE |
> > PTE_EXECUTABLE))
> > +
> >  /* Calculate the offsets into the pagetables for a given VA */
> >  #define pt_linear_offset(lvl, va)   ((va) >>
> > XEN_PT_LEVEL_SHIFT(lvl))
> >  
> >  #define pt_index(lvl, va) (pt_linear_offset((lvl), (va)) &
> > VPN_MASK)
> >  
> > +#define TABLE_OFFSET(offs) (_AT(unsigned int, offs) & ((_AC(1, U)
> > << PAGETABLE_ORDER) - 1))
> 
> Not: Too long line.
> 
> > +#if RV_STAGE1_MODE > SATP_MODE_SV48
> 
> SV48? Isn't ...
> 
> > +#error "need to to update DECLARE_OFFSETS macros"
> > +#else
> > +
> > +#define l0_table_offset(va) TABLE_OFFSET(pt_linear_offset(0, va))
> > +#define l1_table_offset(va) TABLE_OFFSET(pt_linear_offset(1, va))
> > +#define l2_table_offset(va) TABLE_OFFSET(pt_linear_offset(2, va))
> > +
> > +/* Generate an array @var containing the offset for each level
> > from @addr */
> > +#define DECLARE_OFFSETS(var, addr)          \
> > +    const unsigned int var[] = {            \
> > +        l0_table_offset(addr),              \
> > +        l1_table_offset(addr),              \
> > +        l2_table_offset(addr),              \
> > +    }
> 
> ... this for SV39?
Agree, the check above isn't correct. It should be "RV_STAGE1_MODE >
SATP_MODE_SV39".


> 
> > --- /dev/null
> > +++ b/xen/arch/riscv/pt.c
> > @@ -0,0 +1,408 @@
> > +#include <xen/bug.h>
> > +#include <xen/domain_page.h>
> > +#include <xen/errno.h>
> > +#include <xen/mm.h>
> > +#include <xen/mm-frame.h>
> > +#include <xen/pmap.h>
> > +#include <xen/spinlock.h>
> > +
> > +#include <asm/flushtlb.h>
> > +#include <asm/page.h>
> > +
> > +static inline const mfn_t get_root_page(void)
> > +{
> > +    unsigned long root_maddr =
> 
> maddr_t or paddr_t?
> 
> > +        (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
> > +
> > +    return maddr_to_mfn(root_maddr);
> > +}
> > +
> > +/*
> > + * Sanity check of the entry
> > + * mfn is not valid and we are not populating page table. This
> > means
> 
> How does this fit with ...
> 
> > + * we either modify entry or remove an entry.
> > + */
> > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
> > flags)
> > +{
> > +    /* Sanity check when modifying an entry. */
> > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> 
> ... the MFN check here?
The comment is incorrect and should be corrected:
" if mfn is valid or ... "

>  And why is (valid,INVALID_MFN) an indication
> of a modification? But then ...
the explanation is in the comment to pt_update():
   /*
    * If `mfn` equals `INVALID_MFN`, it indicates that the following page table
    * update operation might be related to either populating the table,
    * destroying a mapping, or modifying an existing mapping.
    */
   static int pt_update(unsigned long virt,

> 
> > +    {
> > +        /* We don't allow modifying an invalid entry. */
> > +        if ( !pte_is_valid(entry) )
> > +        {
> > +            printk("Modifying invalid entry is not allowed.\n");
> > +            return false;
> > +        }
> 
> ... I also don't understand what this is about. IOW I'm afraid I'm
> still confused about the purpose of this function as well as the
> transitions you want to permit / reject. I wonder whether the
> flags & PTE_VALID and pte_is_valid(entry) aren't in need of swapping.
> 
> > +/* Update an entry at the level @target. */
> > +static int pt_update_entry(mfn_t root, unsigned long virt,
> > +                           mfn_t mfn, unsigned int target,
> > +                           unsigned int flags)
> > +{
> > +    int rc;
> > +    unsigned int level = HYP_PT_ROOT_LEVEL;
> > +    pte_t *table;
> > +    /*
> > +     * The intermediate page tables are read-only when the MFN is
> > not valid
> > +     * and we are not populating page table.
> 
> The way flags are handled in PTEs, how can page tables be read-only?
I am starting to get confused myself. I probably have to rewrite some code
and also drop almost the whole function xen_pt_check_entry().

> 
> > +     * This means we either modify permissions or remove an entry.
> 
> From all I can determine we also get here when making brand new
> entries.
> What am I overlooking?
Nothing. Then it means the intermediate page table won't be read-only.

> 
> > +     */
> > +    bool alloc_only = mfn_eq(mfn, INVALID_MFN) && !(flags &
> > PTE_POPULATE);
> > +    pte_t pte, *entry;
> > +
> > +    /* convenience aliases */
> > +    DECLARE_OFFSETS(offsets, virt);
> > +
> > +    table = xen_map_table(root);
> > +    for ( ; level > target; level-- )
> > +    {
> > +        rc = pt_next_level(alloc_only, &table, offsets[level]);
> > +        if ( rc == XEN_TABLE_MAP_FAILED )
> > +        {
> > +            rc = 0;
> > +
> > +            /*
> > +             * We are here because pt_next_level has failed to map
> > +             * the intermediate page table (e.g the table does not
> > exist
> > +             * and the pt is read-only). It is a valid case when
> > +             * removing a mapping as it may not exist in the page
> > table.
> > +             * In this case, just ignore it.
> > +             */
> > +            if ( flags & PTE_VALID )
> > +            {
> > +                printk("%s: Unable to map level %u\n", __func__,
> > level);
> > +                rc = -ENOENT;
> > +            }
> > +
> > +            goto out;
> > +        }
> > +        else if ( rc != XEN_TABLE_NORMAL )
> > +            break;
> > +    }
> > +
> > +    if ( level != target )
> > +    {
> > +        printk("%s: Shattering superpage is not supported\n",
> > __func__);
> > +        rc = -EOPNOTSUPP;
> > +        goto out;
> > +    }
> > +
> > +    entry = table + offsets[level];
> > +
> > +    rc = -EINVAL;
> > +    if ( !pt_check_entry(*entry, mfn, flags) )
> > +        goto out;
> > +
> > +    /* We are removing the page */
> > +    if ( !(flags & PTE_VALID) )
> > +        memset(&pte, 0x00, sizeof(pte));
> > +    else
> > +    {
> > +        /* We are inserting a mapping => Create new pte. */
> > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > +            pte = pte_from_mfn(mfn, PTE_VALID);
> > +        else /* We are updating the permission => Copy the current
> > pte. */
> > +            pte = *entry;
> > +
> > +        /* update permission according to the flags */
> > +        pte.pte |= PTE_RWX_MASK(flags) | PTE_ACCESSED | PTE_DIRTY;
> > +    }
> > +
> > +    write_pte(entry, pte);
> > +
> > +    rc = 0;
> > +
> > + out:
> > +    xen_unmap_table(table);
> > +
> > +    return rc;
> > +}
> > +
> > +static DEFINE_SPINLOCK(xen_pt_lock);
> 
> If you put this in the middle of the file (which is fine), I think it
> wants putting immediately ahead of the (first) function using it, not
> at some seemingly random place.
> 
> > +/*
> > + * If `mfn` equals `INVALID_MFN`, it indicates that the following
> > page table
> > + * update operation might be related to either populating the
> > table,
> > + * destroying a mapping, or modifying an existing mapping.
> > + */
> > +static int pt_update(unsigned long virt,
> > +                     mfn_t mfn,
> > +                     unsigned long nr_mfns,
> > +                     unsigned int flags)
> > +{
> > +    int rc = 0;
> > +    unsigned long vfn = virt >> PAGE_SHIFT;
> > +    unsigned long left = nr_mfns;
> > +
> > +    const mfn_t root = get_root_page();
> > +
> > +    /*
> > +     * It is bad idea to have mapping both writeable and
> > +     * executable.
> > +     * When modifying/creating mapping (i.e PTE_VALID is set),
> > +     * prevent any update if this happen.
> > +     */
> > +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) &&
> > PTE_X_MASK(flags) )
> > +    {
> > +        printk("Mappings should not be both Writeable and
> > Executable.\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
> > +    {
> > +        printk("The virtual address is not aligned to the page-
> > size.\n");
> > +        return -EINVAL;
> > +    }
> > +
> > +    spin_lock(&xen_pt_lock);
> > +
> > +    while ( left )
> > +    {
> > +        unsigned int order, level;
> > +
> > +        level = pt_mapping_level(vfn, mfn, left, flags);
> > +        order = XEN_PT_LEVEL_ORDER(level);
> > +
> > +        ASSERT(left >= BIT(order, UL));
> > +
> > +        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level,
> > +                                    flags);
> 
> Indentation.
> 
> > +        if ( rc )
> > +            break;
> > +
> > +        vfn += 1U << order;
> > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > +            mfn = mfn_add(mfn, 1U << order);
> > +
> > +        left -= (1U << order);
> 
> To be on thje safe side, 1UL everywhere?
> 
> > +        if ( rc )
> > +            break;
> 
> There was such a check already a few lines up from here.
> 
> > +    }
> > +
> > +
> > +    /* ensure that PTEs are all updated before flushing */
> 
> Again: No double blank lines please.
> 
> > +    RISCV_FENCE(rw, rw);
> > +
> > +    /*
> > +     * always flush TLB at the end of the function as non-present
> > entries
> > +     * can be put in the TLB
> > +     */
> > +    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
> 
> Coming back to "negative" TLB entries: Assuming RISC-V, just like
> other
> architectures, also permits intermediate page table entries to be
> cached,
> the affected VA range may be much larger than the original request,
> if
> intermediate page tables needed creating.
It could be an issue. Could we somehow calculate the proper range, or is
flushing everything the only option we have?
For some reason it isn't an issue for Arm:
   
       /*
        * The TLBs flush can be safely skipped when a mapping is inserted
        * as we don't allow mapping replacement (see xen_pt_check_entry()).
        * Although we still need an ISB to ensure any DSB in
        * write_pte() will complete because the mapping may be used soon
        * after.
        *
        * For all the other cases, the TLBs will be flushed unconditionally
        * even if the mapping has failed. This is because we may have
        * partially modified the PT. This will prevent any unexpected
        * behavior afterwards.
        */
       if ( !((flags & _PAGE_PRESENT) && !mfn_eq(mfn, INVALID_MFN)) )
           flush_xen_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
       else
           isb();
   

> 
> > +    spin_unlock(&xen_pt_lock);
> 
> Does this really need to come after fence and flush?
I think so, as the page table should be updated by only one CPU at a
time. Before giving another CPU the ability to update the page table, we
have to finish the work on the current CPU.

> 
> > +    return rc;
> > +}
> > +
> > +int map_pages_to_xen(unsigned long virt,
> > +                     mfn_t mfn,
> > +                     unsigned long nr_mfns,
> > +                     unsigned int flags)
> > +{
> > +    /*
> > +     * Ensure that we have a valid MFN before proceeding.
> > +     *
> > +     * If the MFN is invalid, pt_update() might misinterpret the
> > operation,
> > +     * treating it as either a population, a mapping destruction,
> > +     * or a mapping modification.
> > +     */
> > +    ASSERT(!mfn_eq(mfn, INVALID_MFN));
> 
> But flags with PTE_VALID not set are fine to come into here?
It is fine for pt_update() but I don't know if it is fine for
map_pages_to_xen(). I see that other architectures don't check that.

~ Oleksii

> 
> > +    return pt_update(virt, mfn, nr_mfns, flags);
> > +}
> 
> Jan

Jan Beulich Aug. 15, 2024, 8:09 a.m. UTC | #3

On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>> Introduce internal macros starting with PTE_* for convenience.
>>> These macros closely resemble PTE bits, with the exception of
>>> PTE_BLOCK, which indicates that a page larger than 4KB is
>>> needed.
>>
>> I did comment on this too, iirc: Is there going to be any case where
>> large pages are going to be "needed", i.e. not just preferred? If
>> not,
>> giving the caller control over things the other way around
>> (requesting
>> 4k mappings are needed, as we have it in x86) may be preferable.
> Yes, you did the comment but I thought that it will be enough to
> mention that shattering isn't supported now and  also since
> pt_update_entry()is used to unmap as well, there could be a need to
> unmap e.g. 4K page from 2M block mapping what will a little bit harder
> then just having 4k by default.

Shattering isn't supported now, but that's going to change at some point,
I suppose. Where possible the long-term behavior wants taking into account
right away, to avoid having to e.g. touch all callers again later on.

>> Hmm, but then ...
>>
>>> RISC-V detects superpages using `pte.x` and `pte.r`, as there
>>> is no specific bit in the PTE for this purpose. From the RISC-V
>>> spec:
>>> ```
>>>   ...
>>>   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to
>>> step 5.
>>>      Otherwise, this PTE is a pointer to the next level of the page
>>> table.
>>>      ... .
>>>   5. A leaf PTE has been found.
>>>      ...
>>>   ...
>>> ```
>>>
>>> The code doesn’t support super page shattering so 4KB pages are
>>> used as
>>> default.
>>
>> ... you have this. Yet still callers expecting re-mapping in the
>> (large)
>> range they map can request small-page mappings right away.
> I am not sure that I fully understand what do you mean by "expcting re-
> mapping".

Right now you have callers pass PTE_BLOCK when they know that no small
page re-mappings are going to occur for an area. What I'm suggesting is
that you invert this logic: Have callers pass PTE_SMALL when there is
a possibility that re-mapping requests may be issued later. Then,
later, by simply grep-ing for PTE_SMALL you'll be able to easily find
all candidates that possibly can be relaxed when super-page shattering
starts being supported. That's going to be easier than finding all
instances where PTE_BLOCK is _not_ used.
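
For illustration only, the inverted logic could look roughly like the
sketch below. It is not the code from the patch: the flag values, the
Sv39-style constants, and the helper names level_order() and
mapping_level() are all assumptions made up for this example. The
largest level that fits is chosen by default, and a caller that expects
later small-page re-mappings passes PTE_SMALL to force 4K mappings.

```c
#include <assert.h>

/* Hypothetical flag values, chosen for this example only. */
#define PTE_VALID  (1U << 0)
#define PTE_SMALL  (1U << 10)

#define PAGETABLE_ORDER 9   /* 512 entries per table, as in Sv39 */
#define MAX_LEVEL       2   /* Sv39: level 0 = 4K, 1 = 2M, 2 = 1G */

/* Number of address bits covered by one entry at 'level', in pages. */
static unsigned int level_order(unsigned int level)
{
    return level * PAGETABLE_ORDER;
}

/*
 * Pick the largest level whose alignment and size constraints are met,
 * unless the caller asked for 4K mappings by passing PTE_SMALL.
 */
static unsigned int mapping_level(unsigned long vfn, unsigned long mfn,
                                  unsigned long nr, unsigned int flags)
{
    unsigned int level;

    assert(nr > 0);

    if ( flags & PTE_SMALL )
        return 0;

    for ( level = MAX_LEVEL; level > 0; level-- )
    {
        /* 1UL rather than 1U, so larger orders cannot overflow. */
        unsigned long mask = (1UL << level_order(level)) - 1;

        if ( !((vfn | mfn) & mask) && nr >= mask + 1 )
            return level;
    }

    return 0;
}
```

With this shape, grep-ing for PTE_SMALL finds every caller that opted
out of superpages, which is the property described above.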

>>> --- a/xen/arch/riscv/include/asm/flushtlb.h
>>> +++ b/xen/arch/riscv/include/asm/flushtlb.h
>>> @@ -5,12 +5,25 @@
>>>  #include <xen/bug.h>
>>>  #include <xen/cpumask.h>
>>>  
>>> +#include <asm/sbi.h>
>>> +
>>>  /* Flush TLB of local processor for address va. */
>>>  static inline void flush_tlb_one_local(vaddr_t va)
>>>  {
>>>      asm volatile ( "sfence.vma %0" :: "r" (va) : "memory" );
>>>  }
>>>  
>>> +/*
>>> + * Flush a range of VA's hypervisor mappings from the TLB of all
>>> + * processors in the inner-shareable domain.
>>> + */
>>> +static inline void flush_tlb_range_va(vaddr_t va,
>>> +                                      size_t size)
>>
>> No need for line wrapping here?
> What is line wrapping here? Do you mean that size_t size should be on
> the previous line?

Yes. Everything will fit on one line quite nicely.

>>> --- /dev/null
>>> +++ b/xen/arch/riscv/pt.c
>>> @@ -0,0 +1,408 @@
>>> +#include <xen/bug.h>
>>> +#include <xen/domain_page.h>
>>> +#include <xen/errno.h>
>>> +#include <xen/mm.h>
>>> +#include <xen/mm-frame.h>
>>> +#include <xen/pmap.h>
>>> +#include <xen/spinlock.h>
>>> +
>>> +#include <asm/flushtlb.h>
>>> +#include <asm/page.h>
>>> +
>>> +static inline const mfn_t get_root_page(void)
>>> +{
>>> +    unsigned long root_maddr =
>>
>> maddr_t or paddr_t?
>>
>>> +        (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
>>> +
>>> +    return maddr_to_mfn(root_maddr);
>>> +}
>>> +
>>> +/*
>>> + * Sanity check of the entry
>>> + * mfn is not valid and we are not populating page table. This
>>> means
>>
>> How does this fit with ...
>>
>>> + * we either modify entry or remove an entry.
>>> + */
>>> +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
>>> flags)
>>> +{
>>> +    /* Sanity check when modifying an entry. */
>>> +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
>>
>> ... the MFN check here?
> The comment is incorrect and should be corrected:
> " if mfn is valid or ... "
> 
>>  And why is (valid,INVALID_MFN) an indication
>> of a modification? But then ...
> the explanation is in the comment to pt_update():
>    /*
>     * If `mfn` equals `INVALID_MFN`, it indicates that the following page
>    table
>     * update operation might be related to either populating the table,
>     * destroying a mapping, or modifying an existing mapping.
>     */
>    static int pt_update(unsigned long virt,

And how do readers know that comments in pt_update() are crucial for
understanding what pt_check_entry() does? You certainly don't need to
have the same comment in two places, but you at least want to refer
to a relevant comment when that lives elsewhere.

>>> +static int pt_update(unsigned long virt,
>>> +                     mfn_t mfn,
>>> +                     unsigned long nr_mfns,
>>> +                     unsigned int flags)
>>> +{
>>> +    int rc = 0;
>>> +    unsigned long vfn = virt >> PAGE_SHIFT;
>>> +    unsigned long left = nr_mfns;
>>> +
>>> +    const mfn_t root = get_root_page();
>>> +
>>> +    /*
>>> +     * It is bad idea to have mapping both writeable and
>>> +     * executable.
>>> +     * When modifying/creating mapping (i.e PTE_VALID is set),
>>> +     * prevent any update if this happen.
>>> +     */
>>> +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) &&
>>> PTE_X_MASK(flags) )
>>> +    {
>>> +        printk("Mappings should not be both Writeable and
>>> Executable.\n");
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
>>> +    {
>>> +        printk("The virtual address is not aligned to the page-
>>> size.\n");
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    spin_lock(&xen_pt_lock);
>>> +
>>> +    while ( left )
>>> +    {
>>> +        unsigned int order, level;
>>> +
>>> +        level = pt_mapping_level(vfn, mfn, left, flags);
>>> +        order = XEN_PT_LEVEL_ORDER(level);
>>> +
>>> +        ASSERT(left >= BIT(order, UL));
>>> +
>>> +        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn, level,
>>> +                                    flags);
>>
>> Indentation.
>>
>>> +        if ( rc )
>>> +            break;
>>> +
>>> +        vfn += 1U << order;
>>> +        if ( !mfn_eq(mfn, INVALID_MFN) )
>>> +            mfn = mfn_add(mfn, 1U << order);
>>> +
>>> +        left -= (1U << order);
>>
>> To be on thje safe side, 1UL everywhere?
>>
>>> +        if ( rc )
>>> +            break;
>>
>> There was such a check already a few lines up from here.
>>
>>> +    }
>>> +
>>> +
>>> +    /* ensure that PTEs are all updated before flushing */
>>
>> Again: No double blank lines please.
>>
>>> +    RISCV_FENCE(rw, rw);
>>> +
>>> +    /*
>>> +     * always flush TLB at the end of the function as non-present
>>> entries
>>> +     * can be put in the TLB
>>> +     */
>>> +    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
>>
>> Coming back to "negative" TLB entries: Assuming RISC-V, just like
>> other
>> architectures, also permits intermediate page table entries to be
>> cached,
>> the affected VA range may be much larger than the original request,
>> if
>> intermediate page tables needed creating.
> It could be an issue. Could we some how  to calculate the proper range
> or the only option we have is to flush all.

Right - either you maintain state to know the biggest possible range
that can be affected, or you flush all when a new intermediate page
table needed inserting.
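
A minimal sketch of the second option, under stated assumptions: the
struct pt_update_state, the flush_request type and choose_flush() are
hypothetical names invented for this example and do not exist in the
patch. The idea is only that the update path records whether it had to
create an intermediate table, and the flush decision is taken from that
record.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of what one page-table update did. */
struct pt_update_state {
    bool new_table_inserted;  /* an intermediate table was created */
};

enum flush_kind { FLUSH_RANGE, FLUSH_ALL };

/* Hypothetical description of the flush to issue afterwards. */
struct flush_request {
    enum flush_kind kind;
    unsigned long va;   /* only meaningful for FLUSH_RANGE */
    size_t size;        /* only meaningful for FLUSH_RANGE */
};

/*
 * Flush only the requested range unless a new intermediate table was
 * inserted, in which case cached non-leaf entries may cover a much
 * larger VA range and everything is flushed instead.
 */
static struct flush_request choose_flush(const struct pt_update_state *st,
                                         unsigned long va, size_t size)
{
    struct flush_request req = { FLUSH_RANGE, va, size };

    assert(st != NULL);

    if ( st->new_table_inserted )
    {
        req.kind = FLUSH_ALL;
        req.va = 0;
        req.size = 0;
    }

    return req;
}
```

The first option (tracking the biggest affected range) would replace the
boolean with a start/end pair that grows whenever a table is created.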

>>> +    spin_unlock(&xen_pt_lock);
>>
>> Does this really need to come after fence and flush?
> I think yes, as page table should be updated only by 1 CPU at the same
> time. And before give ability to other CPU to update page table we have
> to finish a work on current CPU.

Can you then explain to me, perhaps by way of an example, what will go
wrong if the unlock is ahead of the flush? (I'm less certain about the
fence, and that's also less expensive.)

>>> +int map_pages_to_xen(unsigned long virt,
>>> +                     mfn_t mfn,
>>> +                     unsigned long nr_mfns,
>>> +                     unsigned int flags)
>>> +{
>>> +    /*
>>> +     * Ensure that we have a valid MFN before proceeding.
>>> +     *
>>> +     * If the MFN is invalid, pt_update() might misinterpret the
>>> operation,
>>> +     * treating it as either a population, a mapping destruction,
>>> +     * or a mapping modification.
>>> +     */
>>> +    ASSERT(!mfn_eq(mfn, INVALID_MFN));
>>
>> But flags with PTE_VALID not set are fine to come into here?
> It is fine for pt_update() but I don't know if it is fine for
> map_pages_to_xen(). I see that other architectures don't check that.

That's not my point here. It's rather along the lines of an earlier
comment that I gave here: Since pt_update() is a pretty generic function,
will flags not having PTE_VALID set perhaps result in pt_update() doing
something that's not what the caller might expect?

And yes, there's a special case of map_pages_to_xen() being called with
_PAGE_NONE (if an arch defines such). That special case plays into here:
If an arch doesn't define it, unmap requests ought to exclusively come
through destroy_xen_mappings().

Jan

Oleksii Kurochko Aug. 15, 2024, 11:21 a.m. UTC | #4

On Thu, 2024-08-15 at 10:09 +0200, Jan Beulich wrote:
> On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
> > On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> > > On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > > > Introduce internal macros starting with PTE_* for convenience.
> > > > These macros closely resemble PTE bits, with the exception of
> > > > PTE_BLOCK, which indicates that a page larger than 4KB is
> > > > needed.
> > > 
> > > I did comment on this too, iirc: Is there going to be any case
> > > where
> > > large pages are going to be "needed", i.e. not just preferred? If
> > > not,
> > > giving the caller control over things the other way around
> > > (requesting
> > > 4k mappings are needed, as we have it in x86) may be preferable.
> > Yes, you did the comment but I thought that it will be enough to
> > mention that shattering isn't supported now and  also since
> > pt_update_entry()is used to unmap as well, there could be a need to
> > unmap e.g. 4K page from 2M block mapping what will a little bit
> > harder
> > then just having 4k by default.
> 
> Shattering isn't supported now, but that's going to change at some
> point,
> I suppose. Where possible the long-term behavior wants taking into
> account
> right away, to avoid having to e.g. touch all callers again later on.
Arm still lives without shattering support for Xen pages:
https://gitlab.com/xen-project/xen/-/blob/staging/xen/arch/arm/mmu/pt.c?ref_type=heads#L454

So it can be pretty long-term behaviour.

> 
> > > Hmm, but then ...
> > > 
> > > > RISC-V detects superpages using `pte.x` and `pte.r`, as there
> > > > is no specific bit in the PTE for this purpose. From the RISC-V
> > > > spec:
> > > > ```
> > > >   ...
> > > >   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go
> > > > to
> > > > step 5.
> > > >      Otherwise, this PTE is a pointer to the next level of the
> > > > page
> > > > table.
> > > >      ... .
> > > >   5. A leaf PTE has been found.
> > > >      ...
> > > >   ...
> > > > ```
> > > > 
> > > > The code doesn’t support super page shattering so 4KB pages are
> > > > used as
> > > > default.
> > > 
> > > ... you have this. Yet still callers expecting re-mapping in the
> > > (large)
> > > range they map can request small-page mappings right away.
> > I am not sure that I fully understand what do you mean by "expcting
> > re-
> > mapping".
> 
> Right now you have callers pass PTE_BLOCK when they know that no
> small
> page re-mappings are going to occur for an area. What I'm suggesting
> is
> that you invert this logic: Have callers pass PTE_SMALL when there is
> a possibility that re-mapping requests may be issued later. Then,
> later, by simply grep-ing for PTE_SMALL you'll be able to easily find
> all candidates that possibly can be relaxed when super-page
> shattering
> starts being supported. That's going to be easier than finding all
> instances where PTE_BLOCK is _not_used.
So, if I understand correctly, nothing actually changes in the algorithm
of pt_update(); only PTE_SMALL should be introduced instead of PTE_BLOCK.
And if I know that something is better mapped as PTE_SMALL, to avoid the
need for shattering on a later unmap (for example), I can just map this
memory with PTE_SMALL and that is it?

> > > > --- /dev/null
> > > > +++ b/xen/arch/riscv/pt.c
> > > > @@ -0,0 +1,408 @@
> > > > +#include <xen/bug.h>
> > > > +#include <xen/domain_page.h>
> > > > +#include <xen/errno.h>
> > > > +#include <xen/mm.h>
> > > > +#include <xen/mm-frame.h>
> > > > +#include <xen/pmap.h>
> > > > +#include <xen/spinlock.h>
> > > > +
> > > > +#include <asm/flushtlb.h>
> > > > +#include <asm/page.h>
> > > > +
> > > > +static inline const mfn_t get_root_page(void)
> > > > +{
> > > > +    unsigned long root_maddr =
> > > 
> > > maddr_t or paddr_t?
> > > 
> > > > +        (csr_read(CSR_SATP) & SATP_PPN_MASK) << PAGE_SHIFT;
> > > > +
> > > > +    return maddr_to_mfn(root_maddr);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Sanity check of the entry
> > > > + * mfn is not valid and we are not populating page table. This
> > > > means
> > > 
> > > How does this fit with ...
> > > 
> > > > + * we either modify entry or remove an entry.
> > > > + */
> > > > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned
> > > > int
> > > > flags)
> > > > +{
> > > > +    /* Sanity check when modifying an entry. */
> > > > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> > > 
> > > ... the MFN check here?
> > The comment is incorrect and should be corrected:
> > " if mfn is valid or ... "
> > 
> > >  And why is (valid,INVALID_MFN) an indication
> > > of a modification? But then ...
> > the explanation is in the comment to pt_update():
> >    /*
> >     * If `mfn` equals `INVALID_MFN`, it indicates that the
> > following page
> >    table
> >     * update operation might be related to either populating the
> > table,
> >     * destroying a mapping, or modifying an existing mapping.
> >     */
> >    static int pt_update(unsigned long virt,
> 
> And how do readers know that comments in pt_update() are crucial for
> understanding what pt_check_entry() does? You certainly don't need to
> have the same comment in two places, but you at least want to refer
> to a relevant comment when that lives elsewhere.
Sure, I will update the comment in pt_check_entry() properly if this
function still makes any sense.

> 
> > > > +static int pt_update(unsigned long virt,
> > > > +                     mfn_t mfn,
> > > > +                     unsigned long nr_mfns,
> > > > +                     unsigned int flags)
> > > > +{
> > > > +    int rc = 0;
> > > > +    unsigned long vfn = virt >> PAGE_SHIFT;
> > > > +    unsigned long left = nr_mfns;
> > > > +
> > > > +    const mfn_t root = get_root_page();
> > > > +
> > > > +    /*
> > > > +     * It is bad idea to have mapping both writeable and
> > > > +     * executable.
> > > > +     * When modifying/creating mapping (i.e PTE_VALID is set),
> > > > +     * prevent any update if this happen.
> > > > +     */
> > > > +    if ( (flags & PTE_VALID) && PTE_W_MASK(flags) &&
> > > > PTE_X_MASK(flags) )
> > > > +    {
> > > > +        printk("Mappings should not be both Writeable and
> > > > Executable.\n");
> > > > +        return -EINVAL;
> > > > +    }
> > > > +
> > > > +    if ( !IS_ALIGNED(virt, PAGE_SIZE) )
> > > > +    {
> > > > +        printk("The virtual address is not aligned to the
> > > > page-
> > > > size.\n");
> > > > +        return -EINVAL;
> > > > +    }
> > > > +
> > > > +    spin_lock(&xen_pt_lock);
> > > > +
> > > > +    while ( left )
> > > > +    {
> > > > +        unsigned int order, level;
> > > > +
> > > > +        level = pt_mapping_level(vfn, mfn, left, flags);
> > > > +        order = XEN_PT_LEVEL_ORDER(level);
> > > > +
> > > > +        ASSERT(left >= BIT(order, UL));
> > > > +
> > > > +        rc = pt_update_entry(root, vfn << PAGE_SHIFT, mfn,
> > > > level,
> > > > +                                    flags);
> > > 
> > > Indentation.
> > > 
> > > > +        if ( rc )
> > > > +            break;
> > > > +
> > > > +        vfn += 1U << order;
> > > > +        if ( !mfn_eq(mfn, INVALID_MFN) )
> > > > +            mfn = mfn_add(mfn, 1U << order);
> > > > +
> > > > +        left -= (1U << order);
> > > 
> > > To be on thje safe side, 1UL everywhere?
> > > 
> > > > +        if ( rc )
> > > > +            break;
> > > 
> > > There was such a check already a few lines up from here.
> > > 
> > > > +    }
> > > > +
> > > > +
> > > > +    /* ensure that PTEs are all updated before flushing */
> > > 
> > > Again: No double blank lines please.
> > > 
> > > > +    RISCV_FENCE(rw, rw);
> > > > +
> > > > +    /*
> > > > +     * always flush TLB at the end of the function as non-
> > > > present
> > > > entries
> > > > +     * can be put in the TLB
> > > > +     */
> > > > +    flush_tlb_range_va(virt, PAGE_SIZE * nr_mfns);
> > > 
> > > Coming back to "negative" TLB entries: Assuming RISC-V, just like
> > > other
> > > architectures, also permits intermediate page table entries to be
> > > cached,
> > > the affected VA range may be much larger than the original
> > > request,
> > > if
> > > intermediate page tables needed creating.
> > It could be an issue. Could we some how  to calculate the proper
> > range
> > or the only option we have is to flush all.
> 
> Right - either you maintain state to know the biggest possible range
> that can be affected, or you flush all when a new intermediate page
> table needed inserting.
I think the second option will be easier to implement in the current
code. It is not an issue for now, as fixmap, FDT and Xen are in the same
slot, so no new intermediate page tables are needed.

> 
> > > > +    spin_unlock(&xen_pt_lock);
> > > 
> > > Does this really need to come after fence and flush?
> > I think yes, as page table should be updated only by 1 CPU at the
> > same
> > time. And before give ability to other CPU to update page table we
> > have
> > to finish a work on current CPU.
> 
> Can you then explain to me, perhaps by way of an example, what will
> go
> wrong if the unlock is ahead of the flush? (I'm less certain about
> the
> fence, and that's also less expensive.)
pt_update() could be called for overlapping regions by different CPUs,
for example:

                     pt_update():
CPU1:                              CPU2:
 ...                               spin_lock(&xen_pt_lock);
RISCV_FENCE(rw, rw);               ....

/* After this function is
   executed the following
   can happen ------------------>  start to update page table
*/                                 entries which were partially
spin_unlock(&xen_pt_lock);         created by CPU1, but CPU2
....                               doesn't know about them yet
....                               because flush_tlb() ( sfence.vma )
....                               wasn't done
....
flush_tlb_range_va();

And that can be an issue, if I understand correctly.
> 
> > > > +int map_pages_to_xen(unsigned long virt,
> > > > +                     mfn_t mfn,
> > > > +                     unsigned long nr_mfns,
> > > > +                     unsigned int flags)
> > > > +{
> > > > +    /*
> > > > +     * Ensure that we have a valid MFN before proceeding.
> > > > +     *
> > > > +     * If the MFN is invalid, pt_update() might misinterpret
> > > > the
> > > > operation,
> > > > +     * treating it as either a population, a mapping
> > > > destruction,
> > > > +     * or a mapping modification.
> > > > +     */
> > > > +    ASSERT(!mfn_eq(mfn, INVALID_MFN));
> > > 
> > > But flags with PTE_VALID not set are fine to come into here?
> > It is fine for pt_update() but I don't know if it is fine for
> > map_pages_to_xen(). I see that other architectures don't check
> > that.
> 
> That's not my point here. It's rather along the lines of an earlier
> that I gave here: Since pte_update() is a pretty generic function,
> will
> flags not having PTE_VALID set perhaps result in pte_update() doing
> something that's not what the caller might expect?
I think that everything will be okay: if PTE_VALID is set, it means
that pt_update() should update (modify/remove/insert) a page table
entry, and all the cases which aren't expected by the logic should be
covered by pt_check_entry().

There is also the case where the page table couldn't be mapped:
```
           rc = pt_next_level(alloc_only, &table, offsets[level]);
           if ( rc == XEN_TABLE_MAP_FAILED )
           {
               rc = 0;

               /*
                * We are here because pt_next_level has failed to map
                * the intermediate page table (e.g the table does not exist
                * and the pt is read-only). It is a valid case when
                * removing a mapping as it may not exist in the page table.
                * In this case, just ignore it.
                */
               if ( flags & PTE_VALID )
               {
                   printk("%s: Unable to map level %u\n", __func__, level);
                   rc = -ENOENT;
               }
```
> 
> And yes, there's a special case of map_pages_to_xen() being called
> with
> _PAGE_NONE (if an arch defines such). That special case plays into
> here:
> If an arch doesn't define it, unmap requests ought to exclusively
> come
> through destroy_xen_mappings().
I thought that it should always be done through destroy_xen_mappings().

Arm doesn't introduce _PAGE_NONE and pt_update() is based on Arm's
version of xen_pt_update() so this special case should be covered
properly.

And it seems to me (if I am not confusing something) that when it is
necessary to unmap pages mapped by map_pages_to_xen(),
destroy_xen_mappings() is used, which is defined using xen_pt_update():
   int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
   {
       ASSERT(IS_ALIGNED(s, PAGE_SIZE));
       ASSERT(IS_ALIGNED(e, PAGE_SIZE));
       ASSERT(s <= e);
       return xen_pt_update(s, INVALID_MFN, (e - s) >> PAGE_SHIFT, nf);
   }

~ Oleksii

Jan Beulich Aug. 15, 2024, 12:16 p.m. UTC | #5

On 15.08.2024 13:21, oleksii.kurochko@gmail.com wrote:
> On Thu, 2024-08-15 at 10:09 +0200, Jan Beulich wrote:
>> On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
>>> On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
>>>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>>>> RISC-V detects superpages using `pte.x` and `pte.r`, as there
>>>>> is no specific bit in the PTE for this purpose. From the RISC-V
>>>>> spec:
>>>>> ```
>>>>>   ...
>>>>>   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go
>>>>> to
>>>>> step 5.
>>>>>      Otherwise, this PTE is a pointer to the next level of the
>>>>> page
>>>>> table.
>>>>>      ... .
>>>>>   5. A leaf PTE has been found.
>>>>>      ...
>>>>>   ...
>>>>> ```
>>>>>
>>>>> The code doesn’t support super page shattering so 4KB pages are
>>>>> used as
>>>>> default.
>>>>
>>>> ... you have this. Yet still callers expecting re-mapping in the
>>>> (large)
>>>> range they map can request small-page mappings right away.
>>> I am not sure that I fully understand what do you mean by "expcting
>>> re-
>>> mapping".
>>
>> Right now you have callers pass PTE_BLOCK when they know that no
>> small
>> page re-mappings are going to occur for an area. What I'm suggesting
>> is
>> that you invert this logic: Have callers pass PTE_SMALL when there is
>> a possibility that re-mapping requests may be issued later. Then,
>> later, by simply grep-ing for PTE_SMALL you'll be able to easily find
>> all candidates that possibly can be relaxed when super-page
>> shattering
>> starts being supported. That's going to be easier than finding all
>> instances where PTE_BLOCK is _not_used.
> So if I understand correctly. Actually nothing will change in algorithm
> of pt_update() and only PTE_SMALL should be introduced instead of
> PTE_BLOCK. And if I will know that something will be better to map as
> PTE_SMALL to not face shattering in case of unmap (for example) I just
> can map this memory as PTE_SMALL and that is it?

That is it.

>>>>> +    spin_unlock(&xen_pt_lock);
>>>>
>>>> Does this really need to come after fence and flush?
>>> I think yes, as page table should be updated only by 1 CPU at the
>>> same
>>> time. And before give ability to other CPU to update page table we
>>> have
>>> to finish a work on current CPU.
>>
>> Can you then explain to me, perhaps by way of an example, what will
>> go
>> wrong if the unlock is ahead of the flush? (I'm less certain about
>> the
>> fence, and that's also less expensive.)
> pt_update() will be called for interleaved region, for example, by
> different CPUs:
> 
>                      pt_update():
> CPU1:                                    CPU2:
>  ...                                spin_lock(&xen_pt_lock);
> RISCV_FENCE(rw, rw);                 ....
> 
> /* After this function will be
>    executed the following thing
>    can happen ------------------>  start to update page table
> */                                 entries which was partially      
> spin_unlock(&xen_pt_lock);         created during CPU1 but CPU2       
> ....                               doesn't know about them yet        
> ....                               because flush_tlb() ( sfence.vma ) 
> ....                               wasn't done      
> ....                                                                  
> flush_tlb_range_va();

Not exactly: CPU2 knows about them as far as the memory used / modified
goes, and that's all that matters for further page table modifications.
CPU2 only doesn't know about the new page table entries yet when it comes
to using them for a translation (by the hardware page walker). Yet this
aspect is irrelevant here, if I'm not mistaken.

Jan
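
The ordering under discussion can be sketched as follows. This is only a model: the pt_*() helpers are hypothetical stand-ins that record the order of events, not the real spin_lock(), RISCV_FENCE() or flush_tlb_range_va(). The fence is kept ahead of the unlock here because that part was left open above; only the TLB flush is moved after the unlock.

```c
#include <assert.h>
#include <string.h>

static char trace[8];              /* NUL-terminated log of events */
static unsigned int trace_len;

static void record(char ev)
{
    assert(trace_len < sizeof(trace) - 1);
    trace[trace_len++] = ev;
    trace[trace_len] = '\0';
}

/* Hypothetical stand-ins for the real primitives. */
static void pt_lock(void)          { record('L'); }
static void pt_write_entries(void) { record('W'); }
static void pt_fence(void)         { record('F'); }
static void pt_unlock(void)        { record('U'); }
static void pt_flush_tlb(void)     { record('T'); }

/* Returns non-zero if event a was recorded before event b. */
static int happens_before(char a, char b)
{
    const char *pa = strchr(trace, a), *pb = strchr(trace, b);

    return pa && pb && pa < pb;
}

static void pt_update_sketch(void)
{
    pt_lock();
    pt_write_entries();   /* PTE stores, serialized by the lock */
    pt_fence();           /* stores are ordered as plain memory writes */
    pt_unlock();          /* the next updater reads PTEs from memory */
    pt_flush_tlb();       /* only affects translation by the page walker */
}
```

The next lock holder inspects and modifies the tables through ordinary memory accesses, so under the reasoning above it does not depend on the flush having completed.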

Oleksii Kurochko Aug. 15, 2024, 1:34 p.m. UTC | #6

On Thu, 2024-08-15 at 14:16 +0200, Jan Beulich wrote:
> On 15.08.2024 13:21, oleksii.kurochko@gmail.com wrote:
> > On Thu, 2024-08-15 at 10:09 +0200, Jan Beulich wrote:
> > > On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
> > > > On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> > > > > On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > > > > > RISC-V detects superpages using `pte.x` and `pte.r`, as
> > > > > > there
> > > > > > is no specific bit in the PTE for this purpose. From the
> > > > > > RISC-V
> > > > > > spec:
> > > > > > ```
> > > > > >   ...
> > > > > >   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x =
> > > > > > 1, go
> > > > > > to
> > > > > > step 5.
> > > > > >      Otherwise, this PTE is a pointer to the next level of
> > > > > > the
> > > > > > page
> > > > > > table.
> > > > > >      ... .
> > > > > >   5. A leaf PTE has been found.
> > > > > >      ...
> > > > > >   ...
> > > > > > ```
> > > > > > 
> > > > > > The code doesn’t support super page shattering so 4KB pages
> > > > > > are
> > > > > > used as
> > > > > > default.
> > > > > 
> > > > > ... you have this. Yet still callers expecting re-mapping in
> > > > > the
> > > > > (large)
> > > > > range they map can request small-page mappings right away.
> > > > I am not sure that I fully understand what do you mean by
> > > > "expcting
> > > > re-
> > > > mapping".
> > > 
> > > Right now you have callers pass PTE_BLOCK when they know that no
> > > small
> > > page re-mappings are going to occur for an area. What I'm
> > > suggesting
> > > is
> > > that you invert this logic: Have callers pass PTE_SMALL when
> > > there is
> > > a possibility that re-mapping requests may be issued later. Then,
> > > later, by simply grep-ing for PTE_SMALL you'll be able to easily
> > > find
> > > all candidates that possibly can be relaxed when super-page
> > > shattering
> > > starts being supported. That's going to be easier than finding
> > > all
> > > instances where PTE_BLOCK is _not_used.
> > So if I understand correctly. Actually nothing will change in
> > algorithm
> > of pt_update() and only PTE_SMALL should be introduced instead of
> > PTE_BLOCK. And if I will know that something will be better to map
> > as
> > PTE_SMALL to not face shattering in case of unmap (for example) I
> > just
> > can map this memory as PTE_SMALL and that is it?
> 
> That is it.
> 
> > > > > > +    spin_unlock(&xen_pt_lock);
> > > > > 
> > > > > Does this really need to come after fence and flush?
> > > > I think yes, as page table should be updated only by 1 CPU at
> > > > the
> > > > same
> > > > time. And before give ability to other CPU to update page table
> > > > we
> > > > have
> > > > to finish a work on current CPU.
> > > 
> > > Can you then explain to me, perhaps by way of an example, what
> > > will
> > > go
> > > wrong if the unlock is ahead of the flush? (I'm less certain
> > > about
> > > the
> > > fence, and that's also less expensive.)
> > pt_update() will be called for interleaved region, for example, by
> > different CPUs:
> > 
> >                      pt_update():
> > CPU1:                                    CPU2:
> >  ...                                spin_lock(&xen_pt_lock);
> > RISCV_FENCE(rw, rw);                 ....
> > 
> > /* After this function will be
> >    executed the following thing
> >    can happen ------------------>  start to update page table
> > */                                 entries which was partially     
> > spin_unlock(&xen_pt_lock);         created during CPU1 but CPU2
> > ....                               doesn't know about them yet
> > ....                               because flush_tlb() ( sfence.vma )
> > ....                               wasn't done
> > ....
> > flush_tlb_range_va();
> 
> Not exactly: CPU2 knows about them as far as the memory used /
> modified
> goes, and that's all that matters for further page table
> modifications.
> CPU2 only doesn't know about the new page table entries yet when it
> comes
> to using them for a translation (by the hardware page walker). Yet
> this
> aspect is irrelevant here, if I'm not mistaken.
And it isn't an issue that CPU2 will add these new page table entries
again during execution of CPU2's pt_update()?

~ Oleksii

Jan Beulich Aug. 15, 2024, 3:26 p.m. UTC | #7

On 15.08.2024 15:34, oleksii.kurochko@gmail.com wrote:
> On Thu, 2024-08-15 at 14:16 +0200, Jan Beulich wrote:
>> On 15.08.2024 13:21, oleksii.kurochko@gmail.com wrote:
>>> On Thu, 2024-08-15 at 10:09 +0200, Jan Beulich wrote:
>>>> On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
>>>>> On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
>>>>>> On 09.08.2024 18:19, Oleksii Kurochko wrote:
>>>>>>> RISC-V detects superpages using `pte.x` and `pte.r`, as
>>>>>>> there
>>>>>>> is no specific bit in the PTE for this purpose. From the
>>>>>>> RISC-V
>>>>>>> spec:
>>>>>>> ```
>>>>>>>   ...
>>>>>>>   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x =
>>>>>>> 1, go
>>>>>>> to
>>>>>>> step 5.
>>>>>>>      Otherwise, this PTE is a pointer to the next level of
>>>>>>> the
>>>>>>> page
>>>>>>> table.
>>>>>>>      ... .
>>>>>>>   5. A leaf PTE has been found.
>>>>>>>      ...
>>>>>>>   ...
>>>>>>> ```
>>>>>>>
>>>>>>> The code doesn’t support super page shattering so 4KB pages
>>>>>>> are
>>>>>>> used as
>>>>>>> default.
>>>>>>
>>>>>> ... you have this. Yet still callers expecting re-mapping in
>>>>>> the
>>>>>> (large)
>>>>>> range they map can request small-page mappings right away.
>>>>> I am not sure that I fully understand what do you mean by
>>>>> "expcting
>>>>> re-
>>>>> mapping".
>>>>
>>>> Right now you have callers pass PTE_BLOCK when they know that no
>>>> small
>>>> page re-mappings are going to occur for an area. What I'm
>>>> suggesting
>>>> is
>>>> that you invert this logic: Have callers pass PTE_SMALL when
>>>> there is
>>>> a possibility that re-mapping requests may be issued later. Then,
>>>> later, by simply grep-ing for PTE_SMALL you'll be able to easily
>>>> find
>>>> all candidates that possibly can be relaxed when super-page
>>>> shattering
>>>> starts being supported. That's going to be easier than finding
>>>> all
>>>> instances where PTE_BLOCK is _not_used.
>>> So if I understand correctly. Actually nothing will change in
>>> algorithm
>>> of pt_update() and only PTE_SMALL should be introduced instead of
>>> PTE_BLOCK. And if I will know that something will be better to map
>>> as
>>> PTE_SMALL to not face shattering in case of unmap (for example) I
>>> just
>>> can map this memory as PTE_SMALL and that is it?
>>
>> That is it.
>>
>>>>>>> +    spin_unlock(&xen_pt_lock);
>>>>>>
>>>>>> Does this really need to come after fence and flush?
>>>>> I think yes, as page table should be updated only by 1 CPU at
>>>>> the
>>>>> same
>>>>> time. And before give ability to other CPU to update page table
>>>>> we
>>>>> have
>>>>> to finish a work on current CPU.
>>>>
>>>> Can you then explain to me, perhaps by way of an example, what
>>>> will
>>>> go
>>>> wrong if the unlock is ahead of the flush? (I'm less certain
>>>> about
>>>> the
>>>> fence, and that's also less expensive.)
>>> pt_update() will be called for interleaved region, for example, by
>>> different CPUs:
>>>
>>>                      pt_update():
>>> CPU1:                                    CPU2:
>>>  ...                                spin_lock(&xen_pt_lock);
>>> RISCV_FENCE(rw, rw);                 ....
>>>
>>> /* After this function will be
>>>    executed the following thing
>>>    can happen ------------------>  start to update page table
>>> */                                 entries which was partially     
>>> spin_unlock(&xen_pt_lock);         created during CPU1 but CPU2
>>> ....                               doesn't know about them yet
>>> ....                               because flush_tlb() ( sfence.vma )
>>> ....                               wasn't done
>>> ....
>>> flush_tlb_range_va();
>>
>> Not exactly: CPU2 knows about them as far as the memory used /
>> modified
>> goes, and that's all that matters for further page table
>> modifications.
>> CPU2 only doesn't know about the new page table entries yet when it
>> comes
>> to using them for a translation (by the hardware page walker). Yet
>> this
>> aspect is irrelevant here, if I'm not mistaken.
> And it isn't an issue that CPU2 will add these new page table entries
> again during execution of CPU2's pt_update()?

Add these page table entries again? That's only going to happen due to
another bug somewhere, I suppose. And it would be as much (or as little)
of an issue if that happened right after dropping the lock.

Jan

Oleksii Kurochko Aug. 16, 2024, 9:09 a.m. UTC | #8

On Thu, 2024-08-15 at 17:26 +0200, Jan Beulich wrote:
> On 15.08.2024 15:34, oleksii.kurochko@gmail.com wrote:
> > On Thu, 2024-08-15 at 14:16 +0200, Jan Beulich wrote:
> > > On 15.08.2024 13:21, oleksii.kurochko@gmail.com wrote:
> > > > On Thu, 2024-08-15 at 10:09 +0200, Jan Beulich wrote:
> > > > > On 14.08.2024 18:50, oleksii.kurochko@gmail.com wrote:
> > > > > > On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> > > > > > > On 09.08.2024 18:19, Oleksii Kurochko wrote:
> > > > > > > > RISC-V detects superpages using `pte.x` and `pte.r`, as
> > > > > > > > there
> > > > > > > > is no specific bit in the PTE for this purpose. From
> > > > > > > > the
> > > > > > > > RISC-V
> > > > > > > > spec:
> > > > > > > > ```
> > > > > > > >   ...
> > > > > > > >   4. Otherwise, the PTE is valid. If pte.r = 1 or pte.x
> > > > > > > > =
> > > > > > > > 1, go
> > > > > > > > to
> > > > > > > > step 5.
> > > > > > > >      Otherwise, this PTE is a pointer to the next level
> > > > > > > > of
> > > > > > > > the
> > > > > > > > page
> > > > > > > > table.
> > > > > > > >      ... .
> > > > > > > >   5. A leaf PTE has been found.
> > > > > > > >      ...
> > > > > > > >   ...
> > > > > > > > ```
> > > > > > > > 
> > > > > > > > The code doesn’t support super page shattering so 4KB
> > > > > > > > pages
> > > > > > > > are
> > > > > > > > used as
> > > > > > > > default.
> > > > > > > 
> > > > > > > ... you have this. Yet still callers expecting re-mapping
> > > > > > > in
> > > > > > > the
> > > > > > > (large)
> > > > > > > range they map can request small-page mappings right
> > > > > > > away.
> > > > > > I am not sure that I fully understand what do you mean by
> > > > > > "expcting
> > > > > > re-
> > > > > > mapping".
> > > > > 
> > > > > Right now you have callers pass PTE_BLOCK when they know that
> > > > > no
> > > > > small
> > > > > page re-mappings are going to occur for an area. What I'm
> > > > > suggesting
> > > > > is
> > > > > that you invert this logic: Have callers pass PTE_SMALL when
> > > > > there is
> > > > > a possibility that re-mapping requests may be issued later.
> > > > > Then,
> > > > > later, by simply grep-ing for PTE_SMALL you'll be able to
> > > > > easily
> > > > > find
> > > > > all candidates that possibly can be relaxed when super-page
> > > > > shattering
> > > > > starts being supported. That's going to be easier than
> > > > > finding
> > > > > all
> > > > > instances where PTE_BLOCK is _not_used.
> > > > So if I understand correctly. Actually nothing will change in
> > > > algorithm
> > > > of pt_update() and only PTE_SMALL should be introduced instead
> > > > of
> > > > PTE_BLOCK. And if I will know that something will be better to
> > > > map
> > > > as
> > > > PTE_SMALL to not face shattering in case of unmap (for example)
> > > > I
> > > > just
> > > > can map this memory as PTE_SMALL and that is it?
> > > 
> > > That is it.
> > > 
> > > > > > > > +    spin_unlock(&xen_pt_lock);
> > > > > > > 
> > > > > > > Does this really need to come after fence and flush?
> > > > > > I think yes, as page table should be updated only by 1 CPU
> > > > > > at
> > > > > > the
> > > > > > same
> > > > > > time. And before give ability to other CPU to update page
> > > > > > table
> > > > > > we
> > > > > > have
> > > > > > to finish a work on current CPU.
> > > > > 
> > > > > Can you then explain to me, perhaps by way of an example,
> > > > > what
> > > > > will
> > > > > go
> > > > > wrong if the unlock is ahead of the flush? (I'm less certain
> > > > > about
> > > > > the
> > > > > fence, and that's also less expensive.)
> > > > pt_update() will be called for interleaved region, for example,
> > > > by
> > > > different CPUs:
> > > > 
> > > >                      pt_update():
> > > > CPU1:                                    CPU2:
> > > >  ...                                spin_lock(&xen_pt_lock);
> > > > RISCV_FENCE(rw, rw);                 ....
> > > > 
> > > > /* After this function will be
> > > >    executed the following thing
> > > >    can happen ------------------>  start to update page table
> > > > */                                 entries which was partially
> > > > spin_unlock(&xen_pt_lock);         created during CPU1 but CPU2
> > > > ....                               doesn't know about them yet
> > > > ....                               because flush_tlb() ( sfence.vma )
> > > > ....                               wasn't done
> > > > ....
> > > > flush_tlb_range_va();
> > > 
> > > Not exactly: CPU2 knows about them as far as the memory used /
> > > modified
> > > goes, and that's all that matters for further page table
> > > modifications.
> > > CPU2 only doesn't know about the new page table entries yet when
> > > it
> > > comes
> > > to using them for a translation (by the hardware page walker).
> > > Yet
> > > this
> > > aspect is irrelevant here, if I'm not mistaken.
> > And it isn't an issue that CPU2 will add these new page table
> > entries
> > again during execution of CPU2's pt_update()?
> 
> Add these page table entries again? That's only going to happen due
> to
> another bug somewhere, I suppose. And it would be as much (or as
> little)
> of an issue if that happened right after dropping the lock.
Yes, agree, it sounds more like a bug. Thanks.

~ Oleksii

Oleksii Kurochko Aug. 20, 2024, 1:18 p.m. UTC | #9

On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> > + * Sanity check of the entry
> > + * mfn is not valid and we are not populating page table. This
> > means
> 
> How does this fit with ...
> 
> > + * we either modify entry or remove an entry.
> > + */
> > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
> > flags)
> > +{
> > +    /* Sanity check when modifying an entry. */
> > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> 
> ... the MFN check here? And why is (valid,INVALID_MFN) an indication
> of a modification?
Because as mentioned here:
```
/*
 * If `mfn` equals `INVALID_MFN`, it indicates that the following page
table
 * update operation might be related to either populating the table,
 * destroying a mapping, or modifying an existing mapping.
 */
static int pt_update(unsigned long virt,
```
And so if requested flags are PTE_VALID ( present ) and mfn=INVALID it
will mean that we are going to modify an entry.


> But then ...
> 
> > +    {
> > +        /* We don't allow modifying an invalid entry. */
> > +        if ( !pte_is_valid(entry) )
> > +        {
> > +            printk("Modifying invalid entry is not allowed.\n");
> > +            return false;
> > +        }
> 
> ... I also don't understand what this is about. IOW I'm afraid I'm
> still confused about the purpose of this function as well as the
> transitions you want to permit / reject. 
This covers the case where the caller calls modify_xen_mappings() on a
region that doesn't exist.

> I wonder whether the
> flags & PTE_VALID and pte_is_valid(entry) aren't in need of swapping.
I am not sure that I understand what you mean.


> 
> > +/* Update an entry at the level @target. */
> > +static int pt_update_entry(mfn_t root, unsigned long virt,
> > +                           mfn_t mfn, unsigned int target,
> > +                           unsigned int flags)
> > +{
> > +    int rc;
> > +    unsigned int level = HYP_PT_ROOT_LEVEL;
> > +    pte_t *table;
> > +    /*
> > +     * The intermediate page tables are read-only when the MFN is
> > not valid
> > +     * and we are not populating page table.
> 
> The way flags are handled in PTEs, how can page tables be read-only?
This is not needed in every case. When removing an entry, an
intermediate page table should be created when the user is
trying to remove a mapping that doesn't exist.


> 
> > +     * This means we either modify permissions or remove an entry.
> 
> From all I can determine we also get here when making brand new
> entries.
> What am I overlooking?
Yes, but in this case an intermediate page tables should be read_only,
so alloc_only will be true and it will be allowed to create new
intermediate page table.


> > +    return rc;
> > +}
> > +
> > +int map_pages_to_xen(unsigned long virt,
> > +                     mfn_t mfn,
> > +                     unsigned long nr_mfns,
> > +                     unsigned int flags)
> > +{
> > +    /*
> > +     * Ensure that we have a valid MFN before proceeding.
> > +     *
> > +     * If the MFN is invalid, pt_update() might misinterpret the
> > operation,
> > +     * treating it as either a population, a mapping destruction,
> > +     * or a mapping modification.
> > +     */
> > +    ASSERT(!mfn_eq(mfn, INVALID_MFN));
> 
> But flags with PTE_VALID not set are fine to come into here?
Probably not, I will double check again and if it is not okay, I will
update the ASSERT.

Jan Beulich Aug. 20, 2024, 1:47 p.m. UTC | #10

On 20.08.2024 15:18, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
>>> + * Sanity check of the entry
>>> + * mfn is not valid and we are not populating page table. This
>>> means
>>
>> How does this fit with ...
>>
>>> + * we either modify entry or remove an entry.
>>> + */
>>> +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned int
>>> flags)
>>> +{
>>> +    /* Sanity check when modifying an entry. */
>>> +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
>>
>> ... the MFN check here? And why is (valid,INVALID_MFN) an indication
>> of a modification?
> Because as mentioned here:
> ```
> /*
>  * If `mfn` equals `INVALID_MFN`, it indicates that the following page
> table
>  * update operation might be related to either populating the table,
>  * destroying a mapping, or modifying an existing mapping.
>  */
> static int pt_update(unsigned long virt,
> ```

That's in the description of another function. How would one know that
the rules on (mfn,flags) tuples there would apply here as well, without
you saying so explicitly? It may not be necessary to repeat the other
comment, but at least you want to reference it.

> And so if requested flags are PTE_VALID ( present ) and mfn=INVALID it
> will mean that we are going to modify an entry.
> 
> 
>> But then ...
>>
>>> +    {
>>> +        /* We don't allow modifying an invalid entry. */
>>> +        if ( !pte_is_valid(entry) )
>>> +        {
>>> +            printk("Modifying invalid entry is not allowed.\n");
>>> +            return false;
>>> +        }
>>
>> ... I also don't understand what this is about. IOW I'm afraid I'm
>> still confused about the purpose of this function as well as the
>> transitions you want to permit / reject. 
> In the case if the caller call modify_xen_mappings() on a region that
> doesn't exist.

Perhaps. What I think is missing is a clear statement somewhere to describe
what the various combinations of (mfn,flags) mean, in terms of the operation
to be carried out. This may then also help with ...

>> I wonder whether the
>> flags & PTE_VALID and pte_is_valid(entry) aren't in need of swapping.
> I am not sure that I understand what you mean.

... this: It's hard to see what cannot be understood about my earlier
comment. In the code commented on you have a flags & PTE_VALID check and a
pte_is_valid(entry) one. I'm wondering whether the two simply are the wrong
way round.

>>> +/* Update an entry at the level @target. */
>>> +static int pt_update_entry(mfn_t root, unsigned long virt,
>>> +                           mfn_t mfn, unsigned int target,
>>> +                           unsigned int flags)
>>> +{
>>> +    int rc;
>>> +    unsigned int level = HYP_PT_ROOT_LEVEL;
>>> +    pte_t *table;
>>> +    /*
>>> +     * The intermediate page tables are read-only when the MFN is
>>> not valid
>>> +     * and we are not populating page table.
>>
>> The way flags are handled in PTEs, how can page tables be read-only?
> This is not needed for everyone case. In case of entry removing an
> intermediate page table should be created in case when the user is
> trying to remove a mapping that doesn't exist.

I don't follow: For one, how is this related to "read-only"-ness? And
then, why would any kind of removal, whether of a present or non-
present mapping, ever result in page tables being created?

>>> +     * This means we either modify permissions or remove an entry.
>>
>> From all I can determine we also get here when making brand new
>> entries.
>> What am I overlooking?
> Yes, but in this case an intermediate page tables should be read_only,
> so alloc_only will be true and it will be allowed to create new
> intermediate page table.

Hmm, so instead of "read-only" do you maybe mean page tables are not
supposed to be modified? There's a difference here: When they're
read-only, you can't write to them (or a fault will result). Whereas
when in principle they can be modified, there still may be a rule
saying "in this case they shouldn't be altered".

Jan
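
One way to write down the (mfn,flags) combinations asked for above is a small classifier. This is a sketch under assumptions: pt_classify(), the enum, the INVALID_MFN stand-in and the PTE_POPULATE bit value are hypothetical, and treating a valid MFN without PTE_VALID as bogus reflects the open question in this thread rather than a settled rule.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_VALID    (1u << 0)
#define PTE_POPULATE (1u << 11)    /* hypothetical software-only flag value */
#define INVALID_MFN  UINT64_MAX    /* stand-in for Xen's INVALID_MFN */

enum pt_op {
    PT_OP_MAP,        /* valid MFN,   PTE_VALID set:   create a mapping    */
    PT_OP_MODIFY,     /* INVALID_MFN, PTE_VALID set:   change permissions  */
    PT_OP_REMOVE,     /* INVALID_MFN, PTE_VALID clear: destroy a mapping   */
    PT_OP_POPULATE,   /* INVALID_MFN, PTE_POPULATE:    only build tables   */
    PT_OP_BOGUS,      /* valid MFN,   PTE_VALID clear: assumed to be a bug */
};

static enum pt_op pt_classify(uint64_t mfn, unsigned int flags)
{
    if ( mfn != INVALID_MFN )
        return (flags & PTE_VALID) ? PT_OP_MAP : PT_OP_BOGUS;

    /* Assumption: a populate request takes precedence over PTE_VALID. */
    if ( flags & PTE_POPULATE )
        return PT_OP_POPULATE;

    return (flags & PTE_VALID) ? PT_OP_MODIFY : PT_OP_REMOVE;
}
```

A table like the enum comments, placed next to pt_update(), could then be referenced from pt_check_entry() instead of being repeated there.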

Oleksii Kurochko Aug. 20, 2024, 2:42 p.m. UTC | #11

On Tue, 2024-08-20 at 15:47 +0200, Jan Beulich wrote:
> On 20.08.2024 15:18, oleksii.kurochko@gmail.com wrote:
> > On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
> > > > + * Sanity check of the entry
> > > > + * mfn is not valid and we are not populating page table. This
> > > > means
> > > 
> > > How does this fit with ...
> > > 
> > > > + * we either modify entry or remove an entry.
> > > > + */
> > > > +static bool pt_check_entry(pte_t entry, mfn_t mfn, unsigned
> > > > int
> > > > flags)
> > > > +{
> > > > +    /* Sanity check when modifying an entry. */
> > > > +    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
> > > 
> > > ... the MFN check here? And why is (valid,INVALID_MFN) an
> > > indication
> > > of a modification?
> > Because as mentioned here:
> > ```
> > /*
> >  * If `mfn` equals `INVALID_MFN`, it indicates that the following
> > page
> > table
> >  * update operation might be related to either populating the
> > table,
> >  * destroying a mapping, or modifying an existing mapping.
> >  */
> > static int pt_update(unsigned long virt,
> > ```
> 
> That's in the description of another function. How would one know
> that
> the rules on (mfn,flags) tuples there would apply here as well,
> without
> you saying so explicitly? It may not be necessary to repeat the other
> comment, but at least you want to reference it.
> 
> > And so if requested flags are PTE_VALID ( present ) and mfn=INVALID
> > it
> > will mean that we are going to modify an entry.
> > 
> > 
> > > But then ...
> > > 
> > > > +    {
> > > > +        /* We don't allow modifying an invalid entry. */
> > > > +        if ( !pte_is_valid(entry) )
> > > > +        {
> > > > +            printk("Modifying invalid entry is not
> > > > allowed.\n");
> > > > +            return false;
> > > > +        }
> > > 
> > > ... I also don't understand what this is about. IOW I'm afraid
> > > I'm
> > > still confused about the purpose of this function as well as the
> > > transitions you want to permit / reject. 
> > In the case if the caller call modify_xen_mappings() on a region
> > that
> > doesn't exist.
> 
> Perhaps. What I think is missing is a clear statement somewhere to
> describe
> what the various combinations of (mfn,flags) mean, in terms of the
> operation
> to be carried out. This may then also help with ...
> 
> > > I wonder whether the
> > > flags & PTE_VALID and pte_is_valid(entry) aren't in need of
> > > swapping.
> > I am not sure that I understand what you mean.
> 
> ... this: It's hard to see what cannot be understood about my earlier
> comment. In the code commented on you have a flags & PTE_VALID check
> and a
> pte_is_valid(entry) one. I'm wondering whether the two simply are the
> wrong
> way round.
Sure. I'll add additional comments and reference in the next patch
version to clarify that moment.

> 
> > > > +/* Update an entry at the level @target. */
> > > > +static int pt_update_entry(mfn_t root, unsigned long virt,
> > > > +                           mfn_t mfn, unsigned int target,
> > > > +                           unsigned int flags)
> > > > +{
> > > > +    int rc;
> > > > +    unsigned int level = HYP_PT_ROOT_LEVEL;
> > > > +    pte_t *table;
> > > > +    /*
> > > > +     * The intermediate page tables are read-only when the MFN
> > > > is
> > > > not valid
> > > > +     * and we are not populating page table.
> > > 
> > > The way flags are handled in PTEs, how can page tables be read-
> > > only?
> > This is not needed for everyone case. In case of entry removing an
> > intermediate page table should be created in case when the user is
> > trying to remove a mapping that doesn't exist.
> 
> I don't follow: For one, how is this related to "read-only"-ness? And
> then, why would any kind of removal, whether of a present or non-
> present mapping, ever result in page tables being created?
If the mapping doesn't exist and it was requested ( accidentally by the
caller ), then the PT update logic will try to allocate the page
table, which is actually bogus behaviour... I have to double-check
that.

> 
> > > > +     * This means we either modify permissions or remove an
> > > > entry.
> > > 
> > > From all I can determine we also get here when making brand new
> > > entries.
> > > What am I overlooking?
> > Yes, but in this case an intermediate page tables should be
> > read_only,
> > so alloc_only will be true and it will be allowed to create new
> > intermediate page table.
> 
> Hmm, so instead of "read-only" do you maybe mean page tables are not
> supposed to be modified? There's a difference here: When they're
> read-only, you can't write to them (or a fault will result). Whereas
> when in principle they can be modified, there still may be a rule
> saying "in this case they shouldn't be altered".

There is such a rule, which checks that page tables aren't supposed to
be modified ( so that is why they are read-only ):
```
    /* Sanity check when modifying an entry. */
    if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
    {
        ...

        /* We don't allow modifying a table entry */
        if ( pte_is_table(entry) )
        {
            printk("Modifying a table entry is not allowed.\n");
            return false;
        }
```

~ Oleksii
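
For reference, the table-versus-leaf distinction that the pte_is_table() check relies on follows from the spec excerpt quoted earlier in the thread (pte.r = 1 or pte.x = 1 means leaf). A minimal sketch, assuming the standard RISC-V bit positions V=0, R=1, W=2, X=3; the sketch_* names are hypothetical and not the helpers from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_V (UINT64_C(1) << 0)
#define PTE_R (UINT64_C(1) << 1)
#define PTE_W (UINT64_C(1) << 2)
#define PTE_X (UINT64_C(1) << 3)

typedef struct { uint64_t pte; } sketch_pte_t;

static bool sketch_pte_is_valid(sketch_pte_t p)
{
    return (p.pte & PTE_V) != 0;
}

/* Valid with R, W and X all clear: pointer to the next-level table. */
static bool sketch_pte_is_table(sketch_pte_t p)
{
    return sketch_pte_is_valid(p) && !(p.pte & (PTE_R | PTE_W | PTE_X));
}

/* Valid with R or X set: a leaf, i.e. a 4KB page or a superpage. */
static bool sketch_pte_is_leaf(sketch_pte_t p)
{
    return sketch_pte_is_valid(p) && (p.pte & (PTE_R | PTE_X)) != 0;
}
```

The write-only encoding (W set, R clear) is reserved by the spec, so this sketch classifies it as neither a table nor a leaf.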

Jan Beulich Aug. 20, 2024, 4:30 p.m. UTC | #12

On 20.08.2024 16:42, oleksii.kurochko@gmail.com wrote:
> On Tue, 2024-08-20 at 15:47 +0200, Jan Beulich wrote:
>> On 20.08.2024 15:18, oleksii.kurochko@gmail.com wrote:
>>> On Tue, 2024-08-13 at 12:31 +0200, Jan Beulich wrote:
>>>> From all I can determine we also get here when making brand new
>>>> entries.
>>>> What am I overlooking?
>>> Yes, but in this case an intermediate page tables should be
>>> read_only,
>>> so alloc_only will be true and it will be allowed to create new
>>> intermediate page table.
>>
>> Hmm, so instead of "read-only" do you maybe mean page tables are not
>> supposed to be modified? There's a difference here: When they're
>> read-only, you can't write to them (or a fault will result). Whereas
>> when in principle they can be modified, there still may be a rule
>> saying "in this case they shouldn't be altered".
> 
> There is such rule which checks that page tables aren't supposed to be
> modified ( so that is why they are read-only ):

Hmm, you're saying "read-only" again in reply to me explaining that this
isn't the correct term here. I find this increasingly confusing.

Jan

> ```
>     /* Sanity check when modifying an entry. */
>     if ( (flags & PTE_VALID) && mfn_eq(mfn, INVALID_MFN) )
>     {
> 	...
> 
>         /* We don't allow modifying a table entry */
>         if ( pte_is_table(entry) )
>         {
>             printk("Modifying a table entry is not allowed.\n");
>             return false;
>         }
> ```
> 
> ~ Oleksii
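
The distinction drawn above, between tables that are read-only and tables that a given operation must not alter, can be illustrated with a walk step that takes an explicit "may allocate" argument. This is a simplified model with hypothetical names (sketch_entry, sketch_next_level(), alloc_tbl), not the pt_next_level() from the patch: the tables stay writable, and the rule is enforced by the walker refusing to create a missing table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_ENTRIES 512

enum sketch_walk_rc {
    SKETCH_WALK_NORMAL,     /* descended into the next-level table      */
    SKETCH_WALK_MISSING,    /* no table present, allocation not allowed */
    SKETCH_WALK_SUPERPAGE,  /* entry is a leaf, cannot descend          */
    SKETCH_WALK_NOMEM,      /* allocation was allowed but failed        */
};

struct sketch_entry {
    bool valid;
    bool leaf;
    struct sketch_entry *next;   /* next-level table when !leaf */
};

static enum sketch_walk_rc sketch_next_level(bool alloc_tbl,
                                             struct sketch_entry *e,
                                             struct sketch_entry **table)
{
    if ( !e->valid )
    {
        /* Modify/remove requests must not create intermediate tables. */
        if ( !alloc_tbl )
            return SKETCH_WALK_MISSING;

        e->next = calloc(SKETCH_ENTRIES, sizeof(*e->next));
        if ( !e->next )
            return SKETCH_WALK_NOMEM;

        e->valid = true;
        e->leaf = false;
    }
    else if ( e->leaf )
        return SKETCH_WALK_SUPERPAGE;

    *table = e->next;
    return SKETCH_WALK_NORMAL;
}
```

Under this model a caller would pass alloc_tbl as true only when creating a mapping or populating tables, so removing a non-existent mapping never allocates anything.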
