diff mbox series

[V3,(resend),01/19] x86: Create per-domain mapping of guest_root_pt

Message ID 20240513134046.82605-2-eliasely@amazon.com (mailing list archive)
State New
Series: Remove the directmap

Commit Message

Elias El Yandouzi May 13, 2024, 1:40 p.m. UTC
From: Hongyan Xia <hongyxia@amazon.com>

Create a per-domain mapping of the PV guest_root_pt, as the direct map
is being removed.

Note that we do not map and unmap root_pgt for now since it is still a
xenheap page.

Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>
Signed-off-by: Elias El Yandouzi <eliasely@amazon.com>

----
    Changes in V3:
        * Rename SHADOW_ROOT
        * Haven't addressed the potential over-allocation issue as I don't get it

    Changes in V2:
        * Rework the shadow perdomain mapping solution in the follow-up patches

    Changes since Hongyan's version:
        * Remove the final dot in the commit title

Comments

Jan Beulich May 14, 2024, 2:51 p.m. UTC | #1
On 13.05.2024 15:40, Elias El Yandouzi wrote:
> From: Hongyan Xia <hongyxia@amazon.com>
> 
> Create a per-domain mapping of PV guest_root_pt as direct map is being
> removed.
> 
> Note that we do not map and unmap root_pgt for now since it is still a
> xenheap page.
> 
> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
> Signed-off-by: Julien Grall <jgrall@amazon.com>
> Signed-off-by: Elias El Yandouzi <eliasely@amazon.com>
> 
> ----
>     Changes in V3:
>         * Rename SHADOW_ROOT
>         * Haven't addressed the potentially over-allocation issue as I don't get it

I thought I had explained in enough detail that the GDT/LDT area needs
quite a bit more space (2 times 64k per vCPU) than the root PT one (4k
per vCPU). Thus while d->arch.pv.gdt_ldt_l1tab really needs to point at
a full page (as long as not taking into account dynamic domain
properties), d->arch.pv.root_pt_l1tab doesn't need to (and hence might
better be allocated using xzalloc() / xzalloc_array(), even when also
not taking into account dynamic domain properties, i.e. vCPU count).

Jan
Elias El Yandouzi May 15, 2024, 6:25 p.m. UTC | #2
Hi Jan,

On 14/05/2024 15:51, Jan Beulich wrote:
> On 13.05.2024 15:40, Elias El Yandouzi wrote:
>> From: Hongyan Xia <hongyxia@amazon.com>
>>
>> Create a per-domain mapping of PV guest_root_pt as direct map is being
>> removed.
>>
>> Note that we do not map and unmap root_pgt for now since it is still a
>> xenheap page.
>>
>> Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
>> Signed-off-by: Julien Grall <jgrall@amazon.com>
>> Signed-off-by: Elias El Yandouzi <eliasely@amazon.com>
>>
>> ----
>>      Changes in V3:
>>          * Rename SHADOW_ROOT
>>          * Haven't addressed the potentially over-allocation issue as I don't get it
> 
> I thought I had explained in enough detail that the GDT/LDT area needs
> quite a bit more space (2 times 64k per vCPU) than the root PT one (4k
> per vCPU). Thus while d->arch.pv.gdt_ldt_l1tab really needs to point at
> a full page (as long as not taking into account dynamic domain
> properties), d->arch.pv.root_pt_l1tab doesn't need to (and hence might
> better be allocated using xzalloc() / xzalloc_array(), even when also
> not taking into account dynamic domain properties, i.e. vCPU count).

I just understood your point and, yes, you're correct: I was 
over-allocating. Sorry it took me so long to get it.

I'll go instead with:

@@ -371,6 +396,12 @@ int pv_domain_initialise(struct domain *d)
          goto fail;
      clear_page(d->arch.pv.gdt_ldt_l1tab);

+    d->arch.pv.root_pt_l1tab =
+        xzalloc_array(l1_pgentry_t *,
+                      DIV_ROUND_UP(d->max_vcpus, L1_PAGETABLE_ENTRIES));
+    if ( !d->arch.pv.root_pt_l1tab )
+        goto fail;
+
      if ( levelling_caps & ~LCAP_faulting &&
           (d->arch.pv.cpuidmasks = xmemdup(&cpuidmask_defaults)) == NULL )
          goto fail;

However, I noticed quite a weird bug while doing some testing. I may 
need your expertise to find the root cause.

In the case where I have more vCPUs than pCPUs (and let's consider we 
have one pCPU for two vCPUs), I noticed that I would always get a page 
fault in dom0 kernel (5.10.0-13-amd64) at the exact same location. I did 
a bit of investigation but I couldn't come to a clear conclusion. 
Looking at the stack trace [1], I have the feeling the crash occurs in a 
loop or a recursive call.

I tried to identify where the crash occurred using addr2line:

 > addr2line -e vmlinux-5.10.0-29-amd64 0xffffffff810218a0
debian/build/build_amd64_none_amd64/arch/x86/xen/mmu_pv.c:880

It turns out to point at the closing bracket of the function 
xen_mm_unpin_all() [2].

I thought the crash could happen while returning from the function in 
the assembly epilogue but the output of objdump doesn't even show the 
address.

The only theory I could think of is that, because we only have one pCPU, 
we may never run one of the two vCPUs, and so never set up the mapping 
to the guest_root_pt in write_ptbase(), hence the page fault. This is 
just a theory; I couldn't find any hint suggesting it is actually the 
case, though. Any idea how I could debug this?

[1] https://pastebin.com/UaGRaV6a
[2] https://github.com/torvalds/linux/blob/v5.10/arch/x86/xen/mmu_pv.c#L880

Elias
Jan Beulich May 16, 2024, 7:17 a.m. UTC | #3
On 15.05.2024 20:25, Elias El Yandouzi wrote:
> However, I noticed quite a weird bug while doing some testing. I may 
> need your expertise to find the root cause.

Looks like you've overflowed the dom0 kernel stack, most likely because
of recurring nested exceptions.

> In the case where I have more vCPUs than pCPUs (and let's consider we 
> have one pCPU for two vCPUs), I noticed that I would always get a page 
> fault in dom0 kernel (5.10.0-13-amd64) at the exact same location. I did 
> a bit of investigation but I couldn't come to a clear conclusion. 
> Looking at the stack trace [1], I have the feeling the crash occurs in a 
> loop or a recursive call.
> 
> I tried to identify where the crash occurred using addr2line:
> 
>  > addr2line -e vmlinux-5.10.0-29-amd64 0xffffffff810218a0
> debian/build/build_amd64_none_amd64/arch/x86/xen/mmu_pv.c:880
> 
> It turns out to point on the closing bracket of the function 
> xen_mm_unpin_all()[2].
> 
> I thought the crash could happen while returning from the function in 
> the assembly epilogue but the output of objdump doesn't even show the 
> address.
> 
> The only theory I could think of was that because we only have one pCPU, 
> we may never execute one of the two vCPUs, and never setup the mapping 
> to the guest_root_pt in write_ptbase(), hence the page fault. This is 
> just a random theory, I couldn't find any hint suggesting it would be 
> the case though. Any idea how I could debug this?

I guess you want to instrument Xen enough to catch the top level fault (or
the 2nd from top, depending on where the nesting actually starts) to see
why that happens. Quite likely some guest mapping isn't set up properly.

Jan

> [1] https://pastebin.com/UaGRaV6a
> [2] https://github.com/torvalds/linux/blob/v5.10/arch/x86/xen/mmu_pv.c#L880
> 
> Elias
>

Patch

diff --git a/xen/arch/x86/include/asm/config.h b/xen/arch/x86/include/asm/config.h
index ab7288cb36..5d710384df 100644
--- a/xen/arch/x86/include/asm/config.h
+++ b/xen/arch/x86/include/asm/config.h
@@ -203,7 +203,7 @@  extern unsigned char boot_edid_info[128];
 /* Slot 260: per-domain mappings (including map cache). */
 #define PERDOMAIN_VIRT_START    (PML4_ADDR(260))
 #define PERDOMAIN_SLOT_MBYTES   (PML4_ENTRY_BYTES >> (20 + PAGETABLE_ORDER))
-#define PERDOMAIN_SLOTS         3
+#define PERDOMAIN_SLOTS         4
 #define PERDOMAIN_VIRT_SLOT(s)  (PERDOMAIN_VIRT_START + (s) * \
                                  (PERDOMAIN_SLOT_MBYTES << 20))
 /* Slot 4: mirror of per-domain mappings (for compat xlat area accesses). */
@@ -317,6 +317,14 @@  extern unsigned long xen_phys_start;
 #define ARG_XLAT_START(v)        \
     (ARG_XLAT_VIRT_START + ((v)->vcpu_id << ARG_XLAT_VA_SHIFT))
 
+/* pv_root_pt mapping area. The fourth per-domain-mapping sub-area */
+#define PV_ROOT_PT_MAPPING_VIRT_START   PERDOMAIN_VIRT_SLOT(3)
+#define PV_ROOT_PT_MAPPING_ENTRIES      MAX_VIRT_CPUS
+
+/* The address of a particular VCPU's PV_ROOT_PT */
+#define PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v) \
+    (PV_ROOT_PT_MAPPING_VIRT_START + ((v)->vcpu_id * PAGE_SIZE))
+
 #define ELFSIZE 64
 
 #define ARCH_CRASH_SAVE_VMCOREINFO
diff --git a/xen/arch/x86/include/asm/domain.h b/xen/arch/x86/include/asm/domain.h
index f5daeb182b..8a97530607 100644
--- a/xen/arch/x86/include/asm/domain.h
+++ b/xen/arch/x86/include/asm/domain.h
@@ -272,6 +272,7 @@  struct time_scale {
 struct pv_domain
 {
     l1_pgentry_t **gdt_ldt_l1tab;
+    l1_pgentry_t **root_pt_l1tab;
 
     atomic_t nr_l4_pages;
 
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d968bbbc73..efdf20f775 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -505,6 +505,13 @@  void share_xen_page_with_guest(struct page_info *page, struct domain *d,
     nrspin_unlock(&d->page_alloc_lock);
 }
 
+#define pv_root_pt_idx(v) \
+    ((v)->vcpu_id >> PAGETABLE_ORDER)
+
+#define pv_root_pt_pte(v) \
+    ((v)->domain->arch.pv.root_pt_l1tab[pv_root_pt_idx(v)] + \
+     ((v)->vcpu_id & (L1_PAGETABLE_ENTRIES - 1)))
+
 void make_cr3(struct vcpu *v, mfn_t mfn)
 {
     struct domain *d = v->domain;
@@ -524,6 +531,13 @@  void write_ptbase(struct vcpu *v)
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv.xpti )
     {
+        mfn_t guest_root_pt = _mfn(MASK_EXTR(v->arch.cr3, PAGE_MASK));
+        l1_pgentry_t *pte = pv_root_pt_pte(v);
+
+        ASSERT(v == current);
+
+        l1e_write(pte, l1e_from_mfn(guest_root_pt, __PAGE_HYPERVISOR_RO));
+
         cpu_info->root_pgt_changed = true;
         cpu_info->pv_cr3 = __pa(this_cpu(root_pgt));
         if ( new_cr4 & X86_CR4_PCIDE )
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 2a445bb17b..1b025986f7 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -288,6 +288,21 @@  static void pv_destroy_gdt_ldt_l1tab(struct vcpu *v)
                               1U << GDT_LDT_VCPU_SHIFT);
 }
 
+static int pv_create_root_pt_l1tab(struct vcpu *v)
+{
+    return create_perdomain_mapping(v->domain,
+                                    PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v),
+                                    1, v->domain->arch.pv.root_pt_l1tab,
+                                    NULL);
+}
+
+static void pv_destroy_root_pt_l1tab(struct vcpu *v)
+
+{
+    destroy_perdomain_mapping(v->domain,
+                              PV_ROOT_PT_MAPPING_VCPU_VIRT_START(v), 1);
+}
+
 void pv_vcpu_destroy(struct vcpu *v)
 {
     if ( is_pv_32bit_vcpu(v) )
@@ -297,6 +312,7 @@  void pv_vcpu_destroy(struct vcpu *v)
     }
 
     pv_destroy_gdt_ldt_l1tab(v);
+    pv_destroy_root_pt_l1tab(v);
     XFREE(v->arch.pv.trap_ctxt);
 }
 
@@ -311,6 +327,13 @@  int pv_vcpu_initialise(struct vcpu *v)
     if ( rc )
         return rc;
 
+    if ( v->domain->arch.pv.xpti )
+    {
+        rc = pv_create_root_pt_l1tab(v);
+        if ( rc )
+            goto done;
+    }
+
     BUILD_BUG_ON(X86_NR_VECTORS * sizeof(*v->arch.pv.trap_ctxt) >
                  PAGE_SIZE);
     v->arch.pv.trap_ctxt = xzalloc_array(struct trap_info, X86_NR_VECTORS);
@@ -346,10 +369,12 @@  void pv_domain_destroy(struct domain *d)
 
     destroy_perdomain_mapping(d, GDT_LDT_VIRT_START,
                               GDT_LDT_MBYTES << (20 - PAGE_SHIFT));
+    destroy_perdomain_mapping(d, PV_ROOT_PT_MAPPING_VIRT_START, PV_ROOT_PT_MAPPING_ENTRIES);
 
     XFREE(d->arch.pv.cpuidmasks);
 
     FREE_XENHEAP_PAGE(d->arch.pv.gdt_ldt_l1tab);
+    FREE_XENHEAP_PAGE(d->arch.pv.root_pt_l1tab);
 }
 
 void noreturn cf_check continue_pv_domain(void);
@@ -371,6 +396,12 @@  int pv_domain_initialise(struct domain *d)
         goto fail;
     clear_page(d->arch.pv.gdt_ldt_l1tab);
 
+    d->arch.pv.root_pt_l1tab =
+        alloc_xenheap_pages(0, MEMF_node(domain_to_node(d)));
+    if ( !d->arch.pv.root_pt_l1tab )
+        goto fail;
+    clear_page(d->arch.pv.root_pt_l1tab);
+
     if ( levelling_caps & ~LCAP_faulting &&
          (d->arch.pv.cpuidmasks = xmemdup(&cpuidmask_defaults)) == NULL )
         goto fail;
@@ -381,6 +412,11 @@  int pv_domain_initialise(struct domain *d)
     if ( rc )
         goto fail;
 
+    rc = create_perdomain_mapping(d, PV_ROOT_PT_MAPPING_VIRT_START,
+                                  PV_ROOT_PT_MAPPING_ENTRIES, NULL, NULL);
+    if ( rc )
+        goto fail;
+
     d->arch.ctxt_switch = &pv_csw;
 
     d->arch.pv.xpti = is_hardware_domain(d) ? opt_xpti_hwdom : opt_xpti_domu;
diff --git a/xen/arch/x86/x86_64/asm-offsets.c b/xen/arch/x86/x86_64/asm-offsets.c
index 630bdc3945..c1ae5013af 100644
--- a/xen/arch/x86/x86_64/asm-offsets.c
+++ b/xen/arch/x86/x86_64/asm-offsets.c
@@ -80,6 +80,7 @@  void __dummy__(void)
 
 #undef OFFSET_EF
 
+    OFFSET(VCPU_id, struct vcpu, vcpu_id);
     OFFSET(VCPU_processor, struct vcpu, processor);
     OFFSET(VCPU_domain, struct vcpu, domain);
     OFFSET(VCPU_vcpu_info, struct vcpu, vcpu_info_area.map);
diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S
index df015589ce..c1377da7a5 100644
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -162,7 +162,15 @@  FUNC_LOCAL(restore_all_guest)
         and   %rsi, %rdi
         and   %r9, %rsi
         add   %rcx, %rdi
+
+        /*
+         * The address in the vCPU cr3 is always mapped in the per-domain
+         * pv_root_pt virt area.
+         */
+        imul  $PAGE_SIZE, VCPU_id(%rbx), %esi
+        movabs $PV_ROOT_PT_MAPPING_VIRT_START, %rcx
         add   %rcx, %rsi
+
         mov   $ROOT_PAGETABLE_FIRST_XEN_SLOT, %ecx
         mov   root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rsi), %r8
         mov   %r8, root_table_offset(SH_LINEAR_PT_VIRT_START)*8(%rdi)