
[08/22] x86/pv: rewrite how building PV dom0 handles domheap mappings

Message ID 20221216114853.8227-9-julien@xen.org (mailing list archive)
State New, archived
Series: Remove the directmap

Commit Message

Julien Grall Dec. 16, 2022, 11:48 a.m. UTC
From: Hongyan Xia <hongyxia@amazon.com>

Building a PV dom0 allocates from the domheap but uses it like the
xenheap. This is clearly wrong. Fix.

Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Julien Grall <jgrall@amazon.com>

---

    Changes since Hongyan's version:
        * Rebase
        * Remove spurious newline
---
 xen/arch/x86/pv/dom0_build.c | 56 +++++++++++++++++++++++++++---------
 1 file changed, 42 insertions(+), 14 deletions(-)

Comments

Jan Beulich Dec. 22, 2022, 11:48 a.m. UTC | #1
On 16.12.2022 12:48, Julien Grall wrote:
> From: Hongyan Xia <hongyxia@amazon.com>
> 
> Building a PV dom0 allocates from the domheap but uses it like the
> xenheap. This is clearly wrong. Fix.

"Clearly wrong" would mean there's a bug here, at lest under certain
conditions. But there isn't: Even on huge systems, due to running on
idle page tables, all memory is mapped at present.

> @@ -711,22 +715,32 @@ int __init dom0_construct_pv(struct domain *d,
>          v->arch.pv.event_callback_cs    = FLAT_COMPAT_KERNEL_CS;
>      }
>  
> +#define UNMAP_MAP_AND_ADVANCE(mfn_var, virt_var, maddr) \
> +do {                                                    \
> +    UNMAP_DOMAIN_PAGE(virt_var);                        \

Not much point using the macro when ...

> +    mfn_var = maddr_to_mfn(maddr);                      \
> +    maddr += PAGE_SIZE;                                 \
> +    virt_var = map_domain_page(mfn_var);                \

... the variable gets reset again to non-NULL unconditionally right
away.

> +} while ( false )

This being a local macro and all use sites passing mpt_alloc as the
last argument, I think that parameter wants dropping, which would
improve readability.
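E.g. (sketch only, folding in both remarks above - plain
unmap_domain_page(), since the pointer gets re-assigned right away
anyway, and mpt_alloc referenced directly):

#define UNMAP_MAP_AND_ADVANCE(mfn_var, virt_var)        \
do {                                                    \
    unmap_domain_page(virt_var);                        \
    mfn_var = maddr_to_mfn(mpt_alloc);                  \
    mpt_alloc += PAGE_SIZE;                             \
    virt_var = map_domain_page(mfn_var);                \
} while ( false )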

> @@ -792,9 +808,9 @@ int __init dom0_construct_pv(struct domain *d,
>              if ( !l3e_get_intpte(*l3tab) )
>              {
>                  maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l2_page_table;
> -                l2tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
> -                clear_page(l2tab);
> -                *l3tab = l3e_from_paddr(__pa(l2tab), L3_PROT);
> +                UNMAP_MAP_AND_ADVANCE(l2start_mfn, l2start, mpt_alloc);
> +                clear_page(l2start);
> +                *l3tab = l3e_from_mfn(l2start_mfn, L3_PROT);

The l2start you map on the last iteration here can be re-used ...

> @@ -805,9 +821,17 @@ int __init dom0_construct_pv(struct domain *d,
>          unmap_domain_page(l2t);
>      }

... in the code the tail of which is visible here, eliminating a
redundant map/unmap pair.

> @@ -977,8 +1001,12 @@ int __init dom0_construct_pv(struct domain *d,
>       * !CONFIG_VIDEO case so the logic here can be simplified.
>       */
>      if ( pv_shim )
> +    {
> +        l4start = map_domain_page(l4start_mfn);
>          pv_shim_setup_dom(d, l4start, v_start, vxenstore_start, vconsole_start,
>                            vphysmap_start, si);
> +        UNMAP_DOMAIN_PAGE(l4start);
> +    }

The, at first glance, redundant re-mapping of the L4 table here could
do with explaining in the description. However, I further wonder to what
extent eliminating the direct map is actually useful in shim mode. Which
is to say that I question the need for this change in the first place. Or
wait - isn't this (unlike the rest of this patch) actually a bug fix? At
this point we're on the domain's page tables, which may not cover the
page the L4 is allocated at (if a truly huge shim was configured). So I
guess the change is needed but wants breaking out, allowing us to at
least consider whether to backport it.
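
A minimal standalone fix could then, I suppose, map the L4 by its MFN
around the call rather than relying on a directmap address (sketch
only, untested; assumes v->arch.guest_table still holds the root table
at this point):

    if ( pv_shim )
    {
        /* Map the L4 by MFN instead of using its directmap address. */
        l4_pgentry_t *l4 =
            map_domain_page(pagetable_get_mfn(v->arch.guest_table));

        pv_shim_setup_dom(d, l4, v_start, vxenstore_start, vconsole_start,
                          vphysmap_start, si);
        unmap_domain_page(l4);
    }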

Jan
Elias El Yandouzi Jan. 10, 2024, 12:50 p.m. UTC | #2
Hi Jan,

I have been looking at this series recently and have tried my best
to address your comments. I'll get to the other patches shortly too.

On 22/12/2022 11:48, Jan Beulich wrote:
> On 16.12.2022 12:48, Julien Grall wrote:
>> From: Hongyan Xia <hongyxia@amazon.com>
>>
>> Building a PV dom0 allocates from the domheap but uses it like the
>> xenheap. This is clearly wrong. Fix.
> 
> "Clearly wrong" would mean there's a bug here, at lest under certain
> conditions. But there isn't: Even on huge systems, due to running on
> idle page tables, all memory is mapped at present.

I agree with you, I'll rephrase the commit message.

> 
>> @@ -711,22 +715,32 @@ int __init dom0_construct_pv(struct domain *d,
>>           v->arch.pv.event_callback_cs    = FLAT_COMPAT_KERNEL_CS;
>>       }
>>   
>> +#define UNMAP_MAP_AND_ADVANCE(mfn_var, virt_var, maddr) \
>> +do {                                                    \
>> +    UNMAP_DOMAIN_PAGE(virt_var);                        \
> 
> Not much point using the macro when ...
> 
>> +    mfn_var = maddr_to_mfn(maddr);                      \
>> +    maddr += PAGE_SIZE;                                 \
>> +    virt_var = map_domain_page(mfn_var);                \
> 
> ... the variable gets reset again to non-NULL unconditionally right
> away.

Sure, I'll change that.

> 
>> +} while ( false )
> 
> This being a local macro and all use sites passing mpt_alloc as the
> last argument, I think that parameter wants dropping, which would
> improve readability.

I have to disagree. It wouldn't improve readability but would only make
things more obscure. I'll keep the macro as is.
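
With the extra operand each use site spells out which allocation cursor
gets advanced, e.g.

    UNMAP_MAP_AND_ADVANCE(l2start_mfn, l2start, mpt_alloc);

whereas hiding mpt_alloc inside the macro would leave that side effect
invisible at the call sites.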

> 
>> @@ -792,9 +808,9 @@ int __init dom0_construct_pv(struct domain *d,
>>               if ( !l3e_get_intpte(*l3tab) )
>>               {
>>                   maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l2_page_table;
>> -                l2tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
>> -                clear_page(l2tab);
>> -                *l3tab = l3e_from_paddr(__pa(l2tab), L3_PROT);
>> +                UNMAP_MAP_AND_ADVANCE(l2start_mfn, l2start, mpt_alloc);
>> +                clear_page(l2start);
>> +                *l3tab = l3e_from_mfn(l2start_mfn, L3_PROT);
> 
> The l2start you map on the last iteration here can be re-used ...
> 
>> @@ -805,9 +821,17 @@ int __init dom0_construct_pv(struct domain *d,
>>           unmap_domain_page(l2t);
>>       }
> 
> ... in the code the tail of which is visible here, eliminating a
> redundant map/unmap pair.

Good catch, I'll remove the redundant pair.

> 
>> @@ -977,8 +1001,12 @@ int __init dom0_construct_pv(struct domain *d,
>>        * !CONFIG_VIDEO case so the logic here can be simplified.
>>        */
>>       if ( pv_shim )
>> +    {
>> +        l4start = map_domain_page(l4start_mfn);
>>           pv_shim_setup_dom(d, l4start, v_start, vxenstore_start, vconsole_start,
>>                             vphysmap_start, si);
>> +        UNMAP_DOMAIN_PAGE(l4start);
>> +    }
> 
> The, at first glance, redundant re-mapping of the L4 table here could
> do with explaining in the description. However, I further wonder to what
> extent eliminating the direct map is actually useful in shim mode. Which
> is to say that I question the need for this change in the first place. Or
> wait - isn't this (unlike the rest of this patch) actually a bug fix? At
> this point we're on the domain's page tables, which may not cover the
> page the L4 is allocated at (if a truly huge shim was configured). So I
> guess the change is needed but wants breaking out, allowing us to at
> least consider whether to backport it.
> 

I will create a separate patch for this change.

> Jan
>

Patch

diff --git a/xen/arch/x86/pv/dom0_build.c b/xen/arch/x86/pv/dom0_build.c
index c837b2d96f89..cd60f259d1b7 100644
--- a/xen/arch/x86/pv/dom0_build.c
+++ b/xen/arch/x86/pv/dom0_build.c
@@ -383,6 +383,10 @@  int __init dom0_construct_pv(struct domain *d,
     l3_pgentry_t *l3tab = NULL, *l3start = NULL;
     l2_pgentry_t *l2tab = NULL, *l2start = NULL;
     l1_pgentry_t *l1tab = NULL, *l1start = NULL;
+    mfn_t l4start_mfn = INVALID_MFN;
+    mfn_t l3start_mfn = INVALID_MFN;
+    mfn_t l2start_mfn = INVALID_MFN;
+    mfn_t l1start_mfn = INVALID_MFN;
 
     /*
      * This fully describes the memory layout of the initial domain. All
@@ -711,22 +715,32 @@  int __init dom0_construct_pv(struct domain *d,
         v->arch.pv.event_callback_cs    = FLAT_COMPAT_KERNEL_CS;
     }
 
+#define UNMAP_MAP_AND_ADVANCE(mfn_var, virt_var, maddr) \
+do {                                                    \
+    UNMAP_DOMAIN_PAGE(virt_var);                        \
+    mfn_var = maddr_to_mfn(maddr);                      \
+    maddr += PAGE_SIZE;                                 \
+    virt_var = map_domain_page(mfn_var);                \
+} while ( false )
+
     if ( !compat )
     {
         maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l4_page_table;
-        l4start = l4tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+        UNMAP_MAP_AND_ADVANCE(l4start_mfn, l4start, mpt_alloc);
+        l4tab = l4start;
         clear_page(l4tab);
-        init_xen_l4_slots(l4tab, _mfn(virt_to_mfn(l4start)),
-                          d, INVALID_MFN, true);
-        v->arch.guest_table = pagetable_from_paddr(__pa(l4start));
+        init_xen_l4_slots(l4tab, l4start_mfn, d, INVALID_MFN, true);
+        v->arch.guest_table = pagetable_from_mfn(l4start_mfn);
     }
     else
     {
         /* Monitor table already created by switch_compat(). */
-        l4start = l4tab = __va(pagetable_get_paddr(v->arch.guest_table));
+        l4start_mfn = pagetable_get_mfn(v->arch.guest_table);
+        l4start = l4tab = map_domain_page(l4start_mfn);
         /* See public/xen.h on why the following is needed. */
         maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l3_page_table;
         l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+        UNMAP_MAP_AND_ADVANCE(l3start_mfn, l3start, mpt_alloc);
     }
 
     l4tab += l4_table_offset(v_start);
@@ -736,14 +750,16 @@  int __init dom0_construct_pv(struct domain *d,
         if ( !((unsigned long)l1tab & (PAGE_SIZE-1)) )
         {
             maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l1_page_table;
-            l1start = l1tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+            UNMAP_MAP_AND_ADVANCE(l1start_mfn, l1start, mpt_alloc);
+            l1tab = l1start;
             clear_page(l1tab);
             if ( count == 0 )
                 l1tab += l1_table_offset(v_start);
             if ( !((unsigned long)l2tab & (PAGE_SIZE-1)) )
             {
                 maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l2_page_table;
-                l2start = l2tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+                UNMAP_MAP_AND_ADVANCE(l2start_mfn, l2start, mpt_alloc);
+                l2tab = l2start;
                 clear_page(l2tab);
                 if ( count == 0 )
                     l2tab += l2_table_offset(v_start);
@@ -753,19 +769,19 @@  int __init dom0_construct_pv(struct domain *d,
                     {
                         maddr_to_page(mpt_alloc)->u.inuse.type_info =
                             PGT_l3_page_table;
-                        l3start = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
+                        UNMAP_MAP_AND_ADVANCE(l3start_mfn, l3start, mpt_alloc);
                     }
                     l3tab = l3start;
                     clear_page(l3tab);
                     if ( count == 0 )
                         l3tab += l3_table_offset(v_start);
-                    *l4tab = l4e_from_paddr(__pa(l3start), L4_PROT);
+                    *l4tab = l4e_from_mfn(l3start_mfn, L4_PROT);
                     l4tab++;
                 }
-                *l3tab = l3e_from_paddr(__pa(l2start), L3_PROT);
+                *l3tab = l3e_from_mfn(l2start_mfn, L3_PROT);
                 l3tab++;
             }
-            *l2tab = l2e_from_paddr(__pa(l1start), L2_PROT);
+            *l2tab = l2e_from_mfn(l1start_mfn, L2_PROT);
             l2tab++;
         }
         if ( count < initrd_pfn || count >= initrd_pfn + PFN_UP(initrd_len) )
@@ -792,9 +808,9 @@  int __init dom0_construct_pv(struct domain *d,
             if ( !l3e_get_intpte(*l3tab) )
             {
                 maddr_to_page(mpt_alloc)->u.inuse.type_info = PGT_l2_page_table;
-                l2tab = __va(mpt_alloc); mpt_alloc += PAGE_SIZE;
-                clear_page(l2tab);
-                *l3tab = l3e_from_paddr(__pa(l2tab), L3_PROT);
+                UNMAP_MAP_AND_ADVANCE(l2start_mfn, l2start, mpt_alloc);
+                clear_page(l2start);
+                *l3tab = l3e_from_mfn(l2start_mfn, L3_PROT);
             }
             if ( i == 3 )
                 l3e_get_page(*l3tab)->u.inuse.type_info |= PGT_pae_xen_l2;
@@ -805,9 +821,17 @@  int __init dom0_construct_pv(struct domain *d,
         unmap_domain_page(l2t);
     }
 
+#undef UNMAP_MAP_AND_ADVANCE
+
+    UNMAP_DOMAIN_PAGE(l1start);
+    UNMAP_DOMAIN_PAGE(l2start);
+    UNMAP_DOMAIN_PAGE(l3start);
+
     /* Pages that are part of page tables must be read only. */
     mark_pv_pt_pages_rdonly(d, l4start, vpt_start, nr_pt_pages, &flush_flags);
 
+    UNMAP_DOMAIN_PAGE(l4start);
+
     /* Mask all upcalls... */
     for ( i = 0; i < XEN_LEGACY_MAX_VCPUS; i++ )
         shared_info(d, vcpu_info[i].evtchn_upcall_mask) = 1;
@@ -977,8 +1001,12 @@  int __init dom0_construct_pv(struct domain *d,
      * !CONFIG_VIDEO case so the logic here can be simplified.
      */
     if ( pv_shim )
+    {
+        l4start = map_domain_page(l4start_mfn);
         pv_shim_setup_dom(d, l4start, v_start, vxenstore_start, vconsole_start,
                           vphysmap_start, si);
+        UNMAP_DOMAIN_PAGE(l4start);
+    }
 
 #ifdef CONFIG_COMPAT
     if ( compat )