diff mbox series

[v2] x86/dom0: improve PVH initrd and metadata placement

Message ID 20200303115253.47449-1-roger.pau@citrix.com (mailing list archive)
State Superseded
Headers show
Series [v2] x86/dom0: improve PVH initrd and metadata placement | expand

Commit Message

Roger Pau Monné March 3, 2020, 11:52 a.m. UTC
Don't assume there's going to be enough space at the tail of the
loaded kernel and instead try to find a suitable memory area where the
initrd and metadata can be loaded.

Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
---
Changes since v1:
 - Calculate end of e820 entry earlier.
 - Only check if the end of the region is < 1MB.
 - Check for range overlaps with the kernel region.
 - Check the region is of type RAM.
 - Fix off-by-one checks in range overlaps.
 - Add a comment about why initrd and metadata is placed together.
 - Add parentheses around size calculations.
---
 xen/arch/x86/hvm/dom0_build.c | 58 ++++++++++++++++++++++++++++++++++-
 1 file changed, 57 insertions(+), 1 deletion(-)

Comments

Jan Beulich March 3, 2020, 3:40 p.m. UTC | #1
On 03.03.2020 12:52, Roger Pau Monne wrote:
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
>  #undef MB1_PAGES
>  }
>  
> +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
> +                           size_t size)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base;
> +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
> +    unsigned int i;
> +
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
> +
> +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
> +        if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1));
> +
> +        if ( end <= kernel_start || start >= kernel_end )
> +            ; /* No overlap, nothing to do. */
> +        /* Deal with the kernel already being loaded in the region. */
> +        else if ( kernel_start <= start && kernel_end > start )

Since, according to your reply on v1, [kernel_start,kernel_end) is
a subset of [start,end), I understand that the <= could equally
well be == - do you agree? From this then ...

> +            /* Truncate the start of the region. */
> +            start = ROUNDUP(kernel_end, PAGE_SIZE);
> +        else if ( kernel_start <= end && kernel_end > end )

... it follows that you now have two off-by-1s here, as you changed
the right side of the && instead of the left one (the right side
could, as per above, use == again). Using == in both places would,
in lieu of a comment, imo make more visible to the reader that
there is this sub-range relationship between both ranges.

> +            /* Truncate the end of the region. */
> +            end = kernel_start;
> +        /* Pick the biggest of the split regions. */

Then again - wouldn't this part suffice? if start == kernel_start
or end == kernel_end, one side of the "split" region would simply
be empty.

> +        else if ( kernel_start - start > end - kernel_end )
> +            end = kernel_start;
> +        else
> +            start = ROUNDUP(kernel_end, PAGE_SIZE);
> +
> +        if ( end - start >= size )
> +            return start;
> +    }
> +
> +    return INVALID_PADDR;
> +}
> +
>  static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>                                    unsigned long image_headroom,
>                                    module_t *initrd, void *image_base,
> @@ -546,7 +585,24 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>          return rc;
>      }
>  
> -    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
> +    /*
> +     * Find a RAM region big enough (and that doesn't overlap with the loaded
> +     * kernel) in order to load the initrd and the metadata. Note it could be
> +     * split into smaller allocations, done it as a single region in order to
> +     * simplify it.

I guess either "done" without "it" or "doing it"?

Jan
Andrew Cooper March 3, 2020, 6:47 p.m. UTC | #2
On 03/03/2020 11:52, Roger Pau Monne wrote:
> Don't assume there's going to be enough space at the tail of the
> loaded kernel and instead try to find a suitable memory area where the
> initrd and metadata can be loaded.
>
> Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>

I can confirm that this fixes the "failed to boot PVH" on my Rome system.

Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>

We've still got the excessive-time-to-construct issues to look at, but
this definitely brings things to a better position.

> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> index eded87eaf5..33520ec1bc 100644
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
>  #undef MB1_PAGES
>  }
>  
> +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
> +                           size_t size)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base;
> +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
> +    unsigned int i;
> +
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
> +
> +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */

The BDA is in mfn 0 so is special for other reasons*.  The EBDA and IBFT
are the problem datastructures.

~Andrew

[*] Thinking about it, how should a PVH hardware domain reconcile its
paravirtualised boot with finding itself on a BIOS or EFI system...
Roger Pau Monné March 4, 2020, 9:53 a.m. UTC | #3
On Tue, Mar 03, 2020 at 04:40:36PM +0100, Jan Beulich wrote:
> On 03.03.2020 12:52, Roger Pau Monne wrote:
> > --- a/xen/arch/x86/hvm/dom0_build.c
> > +++ b/xen/arch/x86/hvm/dom0_build.c
> > @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
> >  #undef MB1_PAGES
> >  }
> >  
> > +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
> > +                           size_t size)
> > +{
> > +    paddr_t kernel_start = (paddr_t)elf->dest_base;
> > +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
> > +    unsigned int i;
> > +
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
> > +
> > +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
> > +        if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM )
> > +            continue;
> > +
> > +        start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1));
> > +
> > +        if ( end <= kernel_start || start >= kernel_end )
> > +            ; /* No overlap, nothing to do. */
> > +        /* Deal with the kernel already being loaded in the region. */
> > +        else if ( kernel_start <= start && kernel_end > start )
> 
> Since, according to your reply on v1, [kernel_start,kernel_end) is
> a subset of [start,end), I understand that the <= could equally
> well be == - do you agree? From this then ...
> 
> > +            /* Truncate the start of the region. */
> > +            start = ROUNDUP(kernel_end, PAGE_SIZE);
> > +        else if ( kernel_start <= end && kernel_end > end )
> 
> ... it follows that you now have two off-by-1s here, as you changed
> the right side of the && instead of the left one (the right side
> could, as per above, use == again). Using == in both places would,
> in lieu of a comment, imo make more visible to the reader that
> there is this sub-range relationship between both ranges.

Right, I agree to both the above and have adjusted the conditions.

> > +            /* Truncate the end of the region. */
> > +            end = kernel_start;
> > +        /* Pick the biggest of the split regions. */
> 
> Then again - wouldn't this part suffice? if start == kernel_start
> or end == kernel_end, one side of the "split" region would simply
> be empty.

That's why it's using an else if construct, so that we only get
here if the kernel is loaded in the middle of the region, and there
are two regions left as part of the split.

> 
> > +        else if ( kernel_start - start > end - kernel_end )
> > +            end = kernel_start;
> > +        else
> > +            start = ROUNDUP(kernel_end, PAGE_SIZE);
> > +
> > +        if ( end - start >= size )
> > +            return start;
> > +    }
> > +
> > +    return INVALID_PADDR;
> > +}
> > +
> >  static int __init pvh_load_kernel(struct domain *d, const module_t *image,
> >                                    unsigned long image_headroom,
> >                                    module_t *initrd, void *image_base,
> > @@ -546,7 +585,24 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
> >          return rc;
> >      }
> >  
> > -    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
> > +    /*
> > +     * Find a RAM region big enough (and that doesn't overlap with the loaded
> > +     * kernel) in order to load the initrd and the metadata. Note it could be
> > +     * split into smaller allocations, done it as a single region in order to
> > +     * simplify it.
> 
> I guess either "done" without "it" or "doing it"?

Fixed, thanks.

Roger.
Jan Beulich March 4, 2020, 10 a.m. UTC | #4
On 04.03.2020 10:53, Roger Pau Monné wrote:
> On Tue, Mar 03, 2020 at 04:40:36PM +0100, Jan Beulich wrote:
>> On 03.03.2020 12:52, Roger Pau Monne wrote:
>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>> @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
>>>  #undef MB1_PAGES
>>>  }
>>>  
>>> +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
>>> +                           size_t size)
>>> +{
>>> +    paddr_t kernel_start = (paddr_t)elf->dest_base;
>>> +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
>>> +    unsigned int i;
>>> +
>>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
>>> +    {
>>> +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
>>> +
>>> +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
>>> +        if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM )
>>> +            continue;
>>> +
>>> +        start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1));
>>> +
>>> +        if ( end <= kernel_start || start >= kernel_end )
>>> +            ; /* No overlap, nothing to do. */
>>> +        /* Deal with the kernel already being loaded in the region. */
>>> +        else if ( kernel_start <= start && kernel_end > start )
>>
>> Since, according to your reply on v1, [kernel_start,kernel_end) is
>> a subset of [start,end), I understand that the <= could equally
>> well be == - do you agree? From this then ...
>>
>>> +            /* Truncate the start of the region. */
>>> +            start = ROUNDUP(kernel_end, PAGE_SIZE);
>>> +        else if ( kernel_start <= end && kernel_end > end )
>>
>> ... it follows that you now have two off-by-1s here, as you changed
>> the right side of the && instead of the left one (the right side
>> could, as per above, use == again). Using == in both places would,
>> in lieu of a comment, imo make more visible to the reader that
>> there is this sub-range relationship between both ranges.
> 
> Right, I agree to both the above and have adjusted the conditions.
> 
>>> +            /* Truncate the end of the region. */
>>> +            end = kernel_start;
>>> +        /* Pick the biggest of the split regions. */
>>
>> Then again - wouldn't this part suffice? if start == kernel_start
>> or end == kernel_end, one side of the "split" region would simply
>> be empty.
> 
> That's why it's using an else if construct, so that we only get
> here if the kernel is loaded in the middle of the region, and there
> are two regions left as part of the split.

But my question is - do we really need the earlier parts of
this if/else-if chain? Won't this latter part deal find with
zero-sized ranges at head or tail of the region?

Jan
Roger Pau Monné March 4, 2020, 10:09 a.m. UTC | #5
On Wed, Mar 04, 2020 at 11:00:18AM +0100, Jan Beulich wrote:
> On 04.03.2020 10:53, Roger Pau Monné wrote:
> > On Tue, Mar 03, 2020 at 04:40:36PM +0100, Jan Beulich wrote:
> >> On 03.03.2020 12:52, Roger Pau Monne wrote:
> >>> --- a/xen/arch/x86/hvm/dom0_build.c
> >>> +++ b/xen/arch/x86/hvm/dom0_build.c
> >>> @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
> >>>  #undef MB1_PAGES
> >>>  }
> >>>  
> >>> +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
> >>> +                           size_t size)
> >>> +{
> >>> +    paddr_t kernel_start = (paddr_t)elf->dest_base;
> >>> +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
> >>> +    unsigned int i;
> >>> +
> >>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> >>> +    {
> >>> +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
> >>> +
> >>> +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
> >>> +        if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM )
> >>> +            continue;
> >>> +
> >>> +        start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1));
> >>> +
> >>> +        if ( end <= kernel_start || start >= kernel_end )
> >>> +            ; /* No overlap, nothing to do. */
> >>> +        /* Deal with the kernel already being loaded in the region. */
> >>> +        else if ( kernel_start <= start && kernel_end > start )
> >>
> >> Since, according to your reply on v1, [kernel_start,kernel_end) is
> >> a subset of [start,end), I understand that the <= could equally
> >> well be == - do you agree? From this then ...
> >>
> >>> +            /* Truncate the start of the region. */
> >>> +            start = ROUNDUP(kernel_end, PAGE_SIZE);
> >>> +        else if ( kernel_start <= end && kernel_end > end )
> >>
> >> ... it follows that you now have two off-by-1s here, as you changed
> >> the right side of the && instead of the left one (the right side
> >> could, as per above, use == again). Using == in both places would,
> >> in lieu of a comment, imo make more visible to the reader that
> >> there is this sub-range relationship between both ranges.
> > 
> > Right, I agree to both the above and have adjusted the conditions.
> > 
> >>> +            /* Truncate the end of the region. */
> >>> +            end = kernel_start;
> >>> +        /* Pick the biggest of the split regions. */
> >>
> >> Then again - wouldn't this part suffice? if start == kernel_start
> >> or end == kernel_end, one side of the "split" region would simply
> >> be empty.
> > 
> > That's why it's using an else if construct, so that we only get
> > here if the kernel is loaded in the middle of the region, and there
> > are two regions left as part of the split.
> 
> But my question is - do we really need the earlier parts of
> this if/else-if chain? Won't this latter part deal find with
> zero-sized ranges at head or tail of the region?

Oh, I misread your reply sorry. Yes you are right, I can achieve the
same just with this last part.

Thanks, Roger.
Roger Pau Monné March 4, 2020, 10:49 a.m. UTC | #6
On Tue, Mar 03, 2020 at 06:47:35PM +0000, Andrew Cooper wrote:
> On 03/03/2020 11:52, Roger Pau Monne wrote:
> > Don't assume there's going to be enough space at the tail of the
> > loaded kernel and instead try to find a suitable memory area where the
> > initrd and metadata can be loaded.
> >
> > Reported-by: Andrew Cooper <andrew.cooper3@citrix.com>
> > Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
> 
> I can confirm that this fixes the "failed to boot PVH" on my Rome system.
> 
> Tested-by: Andrew Cooper <andrew.cooper3@citrix.com>

Thanks!

> We've still got the excessive-time-to-construct issues to look at, but
> this definitely brings things to a better position.

Well, PV is always going to be faster to construct, since you only
need to allocate memory and create the initial page tables that cover
the kernel, the metadata and optionally the initrd IIRC.

On PVH we need to populate the full p2m, so I think it's safe to say
that PVH build time is always going to be worse than PV. That doesn't
mean we can't make it faster.
I have to 
Since I also have an AMD box that I can play with, how much memory are
you assigning to dom0?

> > diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> > index eded87eaf5..33520ec1bc 100644
> > --- a/xen/arch/x86/hvm/dom0_build.c
> > +++ b/xen/arch/x86/hvm/dom0_build.c
> > @@ -490,6 +490,45 @@ static int __init pvh_populate_p2m(struct domain *d)
> >  #undef MB1_PAGES
> >  }
> >  
> > +static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
> > +                           size_t size)
> > +{
> > +    paddr_t kernel_start = (paddr_t)elf->dest_base;
> > +    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
> > +    unsigned int i;
> > +
> > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > +    {
> > +        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
> > +
> > +        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
> 
> The BDA is in mfn 0 so is special for other reasons*.  The EBDA and IBFT
> are the problem datastructures.

Sure. Sorry I haven't added it to the comment.

> ~Andrew
> 
> [*] Thinking about it, how should a PVH hardware domain reconcile its
> paravirtualised boot with finding itself on a BIOS or EFI system...

I guess the same applies to PV which also boots using a PV path but
has access to firmware.

I have to admit I never looked closely at how Linux does that, for
FreeBSD it's fairly easy because at least when booting from BIOS the
kernel won't try to make any calls into the BIOS, and instead expect
the data it requires to be provided by the loader as part of the
metadata blob, together with the modules &c.

Thanks, Roger.
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index eded87eaf5..33520ec1bc 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -490,6 +490,45 @@  static int __init pvh_populate_p2m(struct domain *d)
 #undef MB1_PAGES
 }
 
+static paddr_t find_memory(const struct domain *d, const struct elf_binary *elf,
+                           size_t size)
+{
+    paddr_t kernel_start = (paddr_t)elf->dest_base;
+    paddr_t kernel_end = (paddr_t)(elf->dest_base + elf->dest_size);
+    unsigned int i;
+
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        paddr_t start, end = d->arch.e820[i].addr + d->arch.e820[i].size;
+
+        /* Don't use memory below 1MB, as it could overwrite the BDA/EBDA. */
+        if ( end <= MB(1) || d->arch.e820[i].type != E820_RAM )
+            continue;
+
+        start = MAX(ROUNDUP(d->arch.e820[i].addr, PAGE_SIZE), MB(1));
+
+        if ( end <= kernel_start || start >= kernel_end )
+            ; /* No overlap, nothing to do. */
+        /* Deal with the kernel already being loaded in the region. */
+        else if ( kernel_start <= start && kernel_end > start )
+            /* Truncate the start of the region. */
+            start = ROUNDUP(kernel_end, PAGE_SIZE);
+        else if ( kernel_start <= end && kernel_end > end )
+            /* Truncate the end of the region. */
+            end = kernel_start;
+        /* Pick the biggest of the split regions. */
+        else if ( kernel_start - start > end - kernel_end )
+            end = kernel_start;
+        else
+            start = ROUNDUP(kernel_end, PAGE_SIZE);
+
+        if ( end - start >= size )
+            return start;
+    }
+
+    return INVALID_PADDR;
+}
+
 static int __init pvh_load_kernel(struct domain *d, const module_t *image,
                                   unsigned long image_headroom,
                                   module_t *initrd, void *image_base,
@@ -546,7 +585,24 @@  static int __init pvh_load_kernel(struct domain *d, const module_t *image,
         return rc;
     }
 
-    last_addr = ROUNDUP(parms.virt_kend - parms.virt_base, PAGE_SIZE);
+    /*
+     * Find a RAM region big enough (and that doesn't overlap with the loaded
+     * kernel) in order to load the initrd and the metadata. Note it could be
+     * split into smaller allocations, done it as a single region in order to
+     * simplify it.
+     */
+    last_addr = find_memory(d, &elf, sizeof(start_info) +
+                            (initrd ? ROUNDUP(initrd->mod_end, PAGE_SIZE) +
+                                      sizeof(mod)
+                                    : 0) +
+                            (cmdline ? ROUNDUP(strlen(cmdline) + 1,
+                                               elf_64bit(&elf) ? 8 : 4)
+                                     : 0));
+    if ( last_addr == INVALID_PADDR )
+    {
+        printk("Unable to find a memory region to load initrd and metadata\n");
+        return -ENOMEM;
+    }
 
     if ( initrd != NULL )
     {