
[v2,3/3] x86/PVH: Support relocatable dom0 kernels

Message ID 20240313193021.241764-4-jason.andryuk@amd.com (mailing list archive)
State Superseded
Series x86/pvh: Support relocating dom0 kernel

Commit Message

Jason Andryuk March 13, 2024, 7:30 p.m. UTC
Xen tries to load a PVH dom0 kernel at the fixed guest physical address
from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
it can be configured.

Unfortunately there exist firmwares that have reserved regions at this
address, so Xen fails to load the dom0 kernel since it's not RAM.

The PVH entry code is not relocatable - it loads from absolute
addresses, which fail when the kernel is loaded at a different address.
With a suitably modified kernel, a relocatable entry point is possible.

Add XEN_ELFNOTE_PVH_RELOCATION which specifies the minimum, maximum and
alignment needed for the kernel.  The presence of the NOTE indicates the
kernel supports a relocatable entry path.

Change the loading to check for an acceptable load address.  If the
kernel is relocatable, support finding an alternate load address.

Link: https://gitlab.com/xen-project/xen/-/issues/180
Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
---
ELF Note printing looks like:
(XEN) ELF: note: PVH_RELOCATION = min: 0x1000000 max: 0xffffffff align: 0x200000
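
For reference, a minimal sketch of the note's descriptor as this patch
parses it (hypothetical struct name; three little-endian 64-bit values
in min/max/align order, so descsz == 3 * sizeof(uint64_t)):

    /* Hypothetical C view of the XEN_ELFNOTE_PVH_RELOCATION payload. */
    struct pvh_relocation_note {
        uint64_t phys_min;    /* lowest acceptable kernel load address */
        uint64_t phys_max;    /* image must fit entirely below this */
        uint64_t phys_align;  /* required load address alignment */
    };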

v2:
Use elfnote for min, max & align - use 64bit values.
Print original and relocated memory addresses
Use check_and_adjust_load_address() name
Return relocated base instead of offset
Use PAGE_ALIGN
Don't load above max_phys (expected to be 4GB in kernel elf note)
Use single line comments
Exit check_load_address loop earlier
Add __init to find_kernel_memory()
---
 xen/arch/x86/hvm/dom0_build.c      | 108 +++++++++++++++++++++++++++++
 xen/common/libelf/libelf-dominfo.c |  13 ++++
 xen/include/public/elfnote.h       |  11 +++
 xen/include/xen/libelf.h           |   3 +
 4 files changed, 135 insertions(+)

Comments

Jason Andryuk March 13, 2024, 9:02 p.m. UTC | #1
On 2024-03-13 15:30, Jason Andryuk wrote:
> +/* Find an e820 RAM region that fits the kernel at a suitable alignment. */
> +static paddr_t __init find_kernel_memory(
> +    const struct domain *d, struct elf_binary *elf,
> +    const struct elf_dom_parms *parms)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
> +    paddr_t kernel_size = kernel_end - kernel_start;
> +    unsigned int i;
> +
> +    /*
> +     * The memory map is sorted and all RAM region starts and sizes are
> +     * aligned to page boundaries.
> +     */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start = d->arch.e820[i].addr;
> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
> +        paddr_t kstart, kend;
> +
> +        if ( d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        if ( d->arch.e820[i].size < kernel_size )
> +            continue;
> +
> +        kstart = ROUNDUP(start, parms->phys_align);
> +        kstart = kstart < parms->phys_min ? parms->phys_min : kstart;

This should be
         kstart = MAX(kstart, parms->phys_min);

Regards,
Jason

> +        kend = kstart + kernel_size;
> +
> +        if ( kend > parms->phys_max )
> +            return 0;
> +
> +        if ( kend <= end )
> +            return kstart;
> +    }
> +
> +    return 0;
> +}
Jan Beulich March 14, 2024, 7:12 a.m. UTC | #2
On 13.03.2024 22:02, Jason Andryuk wrote:
> On 2024-03-13 15:30, Jason Andryuk wrote:
>> +/* Find an e820 RAM region that fits the kernel at a suitable alignment. */
>> +static paddr_t __init find_kernel_memory(
>> +    const struct domain *d, struct elf_binary *elf,
>> +    const struct elf_dom_parms *parms)
>> +{
>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
>> +    paddr_t kernel_size = kernel_end - kernel_start;
>> +    unsigned int i;
>> +
>> +    /*
>> +     * The memory map is sorted and all RAM region starts and sizes are
>> +     * aligned to page boundaries.
>> +     */
>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
>> +    {
>> +        paddr_t start = d->arch.e820[i].addr;
>> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
>> +        paddr_t kstart, kend;
>> +
>> +        if ( d->arch.e820[i].type != E820_RAM )
>> +            continue;
>> +
>> +        if ( d->arch.e820[i].size < kernel_size )
>> +            continue;
>> +
>> +        kstart = ROUNDUP(start, parms->phys_align);
>> +        kstart = kstart < parms->phys_min ? parms->phys_min : kstart;
> 
> This should be
>          kstart = MAX(kstart, parms->phys_min);

Except that MAX() really ought to be the last resort; max() and max_t()
want considering in preference.
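
A minimal sketch with the names from the patch (max_t() shown in case
the types of kstart and parms->phys_min don't match exactly; plain
max() suffices when they do):

    kstart = max_t(paddr_t, kstart, parms->phys_min);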

Jan
Roger Pau Monne March 14, 2024, 9:48 a.m. UTC | #3
On Wed, Mar 13, 2024 at 03:30:21PM -0400, Jason Andryuk wrote:
> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
> it can be configured.
> 
> Unfortunately there exist firmwares that have reserved regions at this
> address, so Xen fails to load the dom0 kernel since it's not RAM.
> 
> The PVH entry code is not relocatable - it loads from absolute
> addresses, which fail when the kernel is loaded at a different address.
> With a suitably modified kernel, a relocatable entry point is possible.
> 
> Add XEN_ELFNOTE_PVH_RELOCATION which specifies the minimum, maximum and
> alignment needed for the kernel.  The presence of the NOTE indicates the
> kernel supports a relocatable entry path.
> 
> Change the loading to check for an acceptable load address.  If the
> kernel is relocatable, support finding an alternate load address.
> 
> Link: https://gitlab.com/xen-project/xen/-/issues/180
> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
> ---
> ELF Note printing looks like:
> (XEN) ELF: note: PVH_RELOCATION = min: 0x1000000 max: 0xffffffff align: 0x200000
> 
> v2:
> Use elfnote for min, max & align - use 64bit values.
> Print original and relocated memory addresses
> Use check_and_adjust_load_address() name
> Return relocated base instead of offset
> Use PAGE_ALIGN
> Don't load above max_phys (expected to be 4GB in kernel elf note)
> Use single line comments
> Exit check_load_address loop earlier
> Add __init to find_kernel_memory()
> ---
>  xen/arch/x86/hvm/dom0_build.c      | 108 +++++++++++++++++++++++++++++
>  xen/common/libelf/libelf-dominfo.c |  13 ++++
>  xen/include/public/elfnote.h       |  11 +++
>  xen/include/xen/libelf.h           |   3 +
>  4 files changed, 135 insertions(+)
> 
> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> index 0ceda4140b..5c6c0d2db3 100644
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -537,6 +537,108 @@ static paddr_t __init find_memory(
>      return INVALID_PADDR;
>  }
>  
> +static bool __init check_load_address(
> +    const struct domain *d, const struct elf_binary *elf)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;

Are you sure this is correct?  If a program header specifies a non-4K
aligned load address we should still try to honor it.  I think this is
very unlikely, but still we shouldn't apply non-requested alignments
to addresses coming from the ELF headers.

> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
> +    unsigned int i;
> +
> +    /*
> > +     * The memory map is sorted and all RAM region starts and sizes are
> +     * aligned to page boundaries.

Relying on sizes to be page aligned seems fragile: it might work now
because of the order in which pvh_setup_vmx_realmode_helpers() first
reserves memory for the TSS and afterwards for the identity page
tables, but it's not a property this code should assume.
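
One way to avoid that assumption, as a sketch (clamp each RAM region
inward to page boundaries instead of trusting the entries to be
aligned; PAGE_ALIGN rounds up, the mask rounds down):

    paddr_t start = PAGE_ALIGN(d->arch.e820[i].addr);
    paddr_t end = (d->arch.e820[i].addr + d->arch.e820[i].size) &
                  PAGE_MASK;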

> +     */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start = d->arch.e820[i].addr;
> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
> +
> +        if ( start >= kernel_end )
> +            return false;
> +
> +        if ( start <= kernel_start &&
> +             end >= kernel_end &&
> +             d->arch.e820[i].type == E820_RAM )

Nit: I would maybe do the type check before the boundary ones, as it's
pointless to do boundary checking on a region of a non-RAM type.  But
I guess you could also see it the other way around.

> +            return true;
> +    }
> +
> +    return false;
> +}
> +
> +/* Find an e820 RAM region that fits the kernel at a suitable alignment. */
> +static paddr_t __init find_kernel_memory(
> +    const struct domain *d, struct elf_binary *elf,

const for elf also.

> +    const struct elf_dom_parms *parms)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
> +    paddr_t kernel_size = kernel_end - kernel_start;

Hm, I'm again unsure about the alignments applied here.

I think if anything we want to assert that dest_base is aligned to phys_align.

> +    unsigned int i;
> +
> +    /*
> > +     * The memory map is sorted and all RAM region starts and sizes are
> +     * aligned to page boundaries.
> +     */
> +    for ( i = 0; i < d->arch.nr_e820; i++ )
> +    {
> +        paddr_t start = d->arch.e820[i].addr;
> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
> +        paddr_t kstart, kend;
> +
> +        if ( d->arch.e820[i].type != E820_RAM )
> +            continue;
> +
> +        if ( d->arch.e820[i].size < kernel_size )
> +            continue;

You can unify both checks in a single condition.

> +
> +        kstart = ROUNDUP(start, parms->phys_align);
> +        kstart = kstart < parms->phys_min ? parms->phys_min : kstart;
> +        kend = kstart + kernel_size;
> +
> +        if ( kend > parms->phys_max )
> +            return 0;
> +
> +        if ( kend <= end )
> +            return kstart;
> +    }
> +
> +    return 0;
> +}
> +
> +/* Check the kernel load address, and adjust if necessary and possible. */
> +static bool __init check_and_adjust_load_address(
> +    const struct domain *d, struct elf_binary *elf, struct elf_dom_parms *parms)
> +{
> +    paddr_t reloc_base;
> +
> +    if ( check_load_address(d, elf) )
> +        return true;
> +
> +    if ( parms->phys_align == UNSET_ADDR )
> +    {
> +        printk("Address conflict and %pd kernel is not relocatable\n", d);
> +        return false;
> +    }
> +
> +    reloc_base = find_kernel_memory(d, elf, parms);
> +    if ( reloc_base == 0 )
> +    {
> +        printk("Failed to find a load address for the kernel\n");

Since you print the domain in the error message prior to this one, I
would consider also printing it here (or not printing it in both
cases).

> +        return false;
> +    }
> +
> +    if ( opt_dom0_verbose )
> +        printk("Relocating kernel from [%lx, %lx] -> [%lx, %lx]\n",
> +               (paddr_t)elf->dest_base,
> +               (paddr_t)elf->dest_base + elf->dest_size,
> +               reloc_base, reloc_base + elf->dest_size);

I think you need `- 1` for the end calculation if you use inclusive
ranges [, ].  Otherwise [, ) should be used.
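
A sketch of the inclusive-range variant (patch context assumed):

    printk("Relocating kernel from [%lx, %lx] -> [%lx, %lx]\n",
           (paddr_t)elf->dest_base,
           (paddr_t)elf->dest_base + elf->dest_size - 1,
           reloc_base, reloc_base + elf->dest_size - 1);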

> +
> +    parms->phys_entry += (reloc_base - (paddr_t)elf->dest_base);

This seems to assume that the image is always relocated at a higher
address that the original one?

parms->phys_entry = reloc_base + (parms->phys_entry - elf->dest_base);

> +    elf->dest_base = (char *)reloc_base;
> +
> +    return true;
> +}
> +
>  static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>                                    unsigned long image_headroom,
>                                    module_t *initrd, void *image_base,
> @@ -585,6 +687,12 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>      elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
>      elf.dest_size = parms.virt_kend - parms.virt_kstart;
>  
> +    if ( !check_and_adjust_load_address(d, &elf, &parms) )
> +    {
> +        printk("Unable to load kernel\n");

check_and_adjust_load_address() already prints an error message,
probably no need to print another message.

> +        return -ENOMEM;
> +    }
> +
>      elf_set_vcpu(&elf, v);
>      rc = elf_load_binary(&elf);
>      if ( rc < 0 )
> diff --git a/xen/common/libelf/libelf-dominfo.c b/xen/common/libelf/libelf-dominfo.c
> index 7cc7b18a51..837a1b0f21 100644
> --- a/xen/common/libelf/libelf-dominfo.c
> +++ b/xen/common/libelf/libelf-dominfo.c
> @@ -125,6 +125,7 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
>          [XEN_ELFNOTE_SUSPEND_CANCEL] = { "SUSPEND_CANCEL", ELFNOTE_INT },
>          [XEN_ELFNOTE_MOD_START_PFN] = { "MOD_START_PFN", ELFNOTE_INT },
>          [XEN_ELFNOTE_PHYS32_ENTRY] = { "PHYS32_ENTRY", ELFNOTE_INT },
> +        [XEN_ELFNOTE_PVH_RELOCATION] = { "PVH_RELOCATION", ELFNOTE_OTHER },
>      };
>  /* *INDENT-ON* */
>  
> @@ -234,6 +235,17 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
>                  elf_note_numeric_array(elf, note, 8, 0),
>                  elf_note_numeric_array(elf, note, 8, 1));
>          break;
> +
> +    case XEN_ELFNOTE_PVH_RELOCATION:
> +        if ( elf_uval(elf, note, descsz) != 3 * sizeof(uint64_t) )
> +            return -1;
> +
> +        parms->phys_min = elf_note_numeric_array(elf, note, 8, 0);
> +        parms->phys_max = elf_note_numeric_array(elf, note, 8, 1);
> +        parms->phys_align = elf_note_numeric_array(elf, note, 8, 2);

Size for those needs to be 4 (32bits) as the entry point is in 32bit
mode?  I don't see how we can start past the 4GB boundary.

> +        elf_msg(elf, "min: %#"PRIx64" max: %#"PRIx64" align: %#"PRIx64"\n",
> +                parms->phys_min, parms->phys_max, parms->phys_align);
> +        break;
>      }
>      return 0;
>  }
> @@ -545,6 +557,7 @@ elf_errorstatus elf_xen_parse(struct elf_binary *elf,
>      parms->p2m_base = UNSET_ADDR;
>      parms->elf_paddr_offset = UNSET_ADDR;
>      parms->phys_entry = UNSET_ADDR32;
> +    parms->phys_align = UNSET_ADDR;

For correctness I would also init phys_{min,max}.

Thanks, Roger.
Jason Andryuk March 14, 2024, 12:46 p.m. UTC | #4
On 2024-03-14 03:12, Jan Beulich wrote:
> On 13.03.2024 22:02, Jason Andryuk wrote:
>> On 2024-03-13 15:30, Jason Andryuk wrote:
>>> +/* Find an e820 RAM region that fits the kernel at a suitable alignment. */
>>> +static paddr_t __init find_kernel_memory(
>>> +    const struct domain *d, struct elf_binary *elf,
>>> +    const struct elf_dom_parms *parms)
>>> +{
>>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>>> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
>>> +    paddr_t kernel_size = kernel_end - kernel_start;
>>> +    unsigned int i;
>>> +
>>> +    /*
>>> +     * The memory map is sorted and all RAM region starts and sizes are
>>> +     * aligned to page boundaries.
>>> +     */
>>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
>>> +    {
>>> +        paddr_t start = d->arch.e820[i].addr;
>>> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
>>> +        paddr_t kstart, kend;
>>> +
>>> +        if ( d->arch.e820[i].type != E820_RAM )
>>> +            continue;
>>> +
>>> +        if ( d->arch.e820[i].size < kernel_size )
>>> +            continue;
>>> +
>>> +        kstart = ROUNDUP(start, parms->phys_align);
>>> +        kstart = kstart < parms->phys_min ? parms->phys_min : kstart;
>>
>> This should be
>>           kstart = MAX(kstart, parms->phys_min);
> 
> Except that MAX() really ought to be the last resort; max() and max_t()
> want considering in preference.

Yes, thanks.  max() will do.

Regards,
Jason
Jan Beulich March 14, 2024, 1:21 p.m. UTC | #5
On 13.03.2024 20:30, Jason Andryuk wrote:
> --- a/xen/include/public/elfnote.h
> +++ b/xen/include/public/elfnote.h
> @@ -194,6 +194,17 @@
>   */
>  #define XEN_ELFNOTE_PHYS32_ENTRY 18
>  
> +/*
> + * Physical loading constraints for PVH kernels
> + *
> + * Used to place constraints on the guest physical loading addresses and
> + * alignment for a PVH kernel.  This note's value is 3 64bit values in
> + * the following order: minimum, maximum and alignment.

Along the lines of what I said on another sub-thread, I think at least
alignment wants to be optional here. Perhaps, with max going first, min
could also be optional.

As indicated in different context by Roger, the values being uniformly
64-bit ones also is questionable.

> + * The presence of this note indicates the kernel is relocatable.

I think it wants making explicit here that the act of relocating is still
left to the kernel.

Jan
Jan Beulich March 14, 2024, 1:31 p.m. UTC | #6
On 13.03.2024 20:30, Jason Andryuk wrote:
> --- a/xen/arch/x86/hvm/dom0_build.c
> +++ b/xen/arch/x86/hvm/dom0_build.c
> @@ -537,6 +537,108 @@ static paddr_t __init find_memory(
>      return INVALID_PADDR;
>  }
>  
> +static bool __init check_load_address(
> +    const struct domain *d, const struct elf_binary *elf)
> +{
> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);

Both casts act on a pointer value. Such cannot legitimately be converted
to paddr_t; it only so happens that paddr_t is effectively the same as
uintptr_t right now. (Same issue again further down.) That said, I notice
we have pre-existing examples of this ...

> +/* Check the kernel load address, and adjust if necessary and possible. */
> +static bool __init check_and_adjust_load_address(
> +    const struct domain *d, struct elf_binary *elf, struct elf_dom_parms *parms)
> +{
> +    paddr_t reloc_base;
> +
> +    if ( check_load_address(d, elf) )
> +        return true;
> +
> +    if ( parms->phys_align == UNSET_ADDR )
> +    {
> +        printk("Address conflict and %pd kernel is not relocatable\n", d);
> +        return false;
> +    }
> +
> +    reloc_base = find_kernel_memory(d, elf, parms);
> +    if ( reloc_base == 0 )
> +    {
> > +        printk("Failed to find a load address for the kernel\n");
> +        return false;
> +    }
> +
> +    if ( opt_dom0_verbose )
> +        printk("Relocating kernel from [%lx, %lx] -> [%lx, %lx]\n",
> +               (paddr_t)elf->dest_base,
> +               (paddr_t)elf->dest_base + elf->dest_size,

By using %p you clearly can avoid the casts here.

> +               reloc_base, reloc_base + elf->dest_size);

I'm not convinced %lx is really appropriate for paddr_t.
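
A sketch combining both points, assuming the tree's PRIpaddr format
macro for paddr_t (and folding in the `- 1` suggested elsewhere):

    printk("Relocating kernel from [%p, %p] -> [%"PRIpaddr", %"PRIpaddr"]\n",
           elf->dest_base, elf->dest_base + elf->dest_size - 1,
           reloc_base, reloc_base + elf->dest_size - 1);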

Jan
Jason Andryuk March 14, 2024, 1:51 p.m. UTC | #7
On 2024-03-14 05:48, Roger Pau Monné wrote:
> On Wed, Mar 13, 2024 at 03:30:21PM -0400, Jason Andryuk wrote:
>> Xen tries to load a PVH dom0 kernel at the fixed guest physical address
>> from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
>> it can be configured.
>>
>> Unfortunately there exist firmwares that have reserved regions at this
>> address, so Xen fails to load the dom0 kernel since it's not RAM.
>>
>> The PVH entry code is not relocatable - it loads from absolute
>> addresses, which fail when the kernel is loaded at a different address.
>> With a suitably modified kernel, a relocatable entry point is possible.
>>
>> Add XEN_ELFNOTE_PVH_RELOCATION which specifies the minimum, maximum and
>> alignment needed for the kernel.  The presence of the NOTE indicates the
>> kernel supports a relocatable entry path.
>>
>> Change the loading to check for an acceptable load address.  If the
>> kernel is relocatable, support finding an alternate load address.
>>
>> Link: https://gitlab.com/xen-project/xen/-/issues/180
>> Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
>> ---
>> ELF Note printing looks like:
>> (XEN) ELF: note: PVH_RELOCATION = min: 0x1000000 max: 0xffffffff align: 0x200000
>>
>> v2:
>> Use elfnote for min, max & align - use 64bit values.
>> Print original and relocated memory addresses
>> Use check_and_adjust_load_address() name
>> Return relocated base instead of offset
>> Use PAGE_ALIGN
>> Don't load above max_phys (expected to be 4GB in kernel elf note)
>> Use single line comments
>> Exit check_load_address loop earlier
>> Add __init to find_kernel_memory()
>> ---
>>   xen/arch/x86/hvm/dom0_build.c      | 108 +++++++++++++++++++++++++++++
>>   xen/common/libelf/libelf-dominfo.c |  13 ++++
>>   xen/include/public/elfnote.h       |  11 +++
>>   xen/include/xen/libelf.h           |   3 +
>>   4 files changed, 135 insertions(+)
>>
>> diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
>> index 0ceda4140b..5c6c0d2db3 100644
>> --- a/xen/arch/x86/hvm/dom0_build.c
>> +++ b/xen/arch/x86/hvm/dom0_build.c
>> @@ -537,6 +537,108 @@ static paddr_t __init find_memory(
>>       return INVALID_PADDR;
>>   }
>>   
>> +static bool __init check_load_address(
>> +    const struct domain *d, const struct elf_binary *elf)
>> +{
>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
> 
> Are you sure this is correct?  If a program header specifies a non-4K
> aligned load address we should still try to honor it.  I think this is
> very unlikely, but still we shouldn't apply non-requested alignments
> to addresses coming from the ELF headers.

I think it's correct in terms of checking the e820 table.  Since the 
memory map is limited to 4k granularity, the bounds need to be rounded 
accordingly.

>> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
>> +    unsigned int i;
>> +
>> +    /*
>> +     * The memory map is sorted and all RAM region starts and sizes are
>> +     * aligned to page boundaries.
> 
> Relying on sizes to be page aligned seems fragile: it might work now
> because of the order in which pvh_setup_vmx_realmode_helpers() first
> reserves memory for the TSS and afterwards for the identity page
> tables, but it's not a property this code should assume.

That can be removed.  It would just eliminate the early exit...

>> +     */
>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
>> +    {
>> +        paddr_t start = d->arch.e820[i].addr;
>> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
>> +
>> +        if ( start >= kernel_end )
>> +            return false;

... here.

>> +        if ( start <= kernel_start &&
>> +             end >= kernel_end &&
>> +             d->arch.e820[i].type == E820_RAM )
> 
> Nit: I would maybe do the type check before the boundary ones, as it's
> pointless to do boundary checking on a region of a non-RAM type.  But
> I guess you could also see it the other way around.
> 
>> +            return true;
>> +    }
>> +
>> +    return false;
>> +}
>> +
>> +/* Find an e820 RAM region that fits the kernel at a suitable alignment. */
>> +static paddr_t __init find_kernel_memory(
>> +    const struct domain *d, struct elf_binary *elf,
> 
> const for elf also.

Yes, thanks.

>> +    const struct elf_dom_parms *parms)
>> +{
>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
>> +    paddr_t kernel_size = kernel_end - kernel_start;
> 
> Hm, I'm again unsure about the alignments applied here.

Same as above regarding 4k granularity.

> I think if anything we want to assert that dest_base is aligned to phys_align.

That would indicate the kernel image is inconsistent.

>> +    unsigned int i;
>> +
>> +    /*
>> +     * The memory map is sorted and all RAM region starts and sizes are
>> +     * aligned to page boundaries.
>> +     */
>> +    for ( i = 0; i < d->arch.nr_e820; i++ )
>> +    {
>> +        paddr_t start = d->arch.e820[i].addr;
>> +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
>> +        paddr_t kstart, kend;
>> +
>> +        if ( d->arch.e820[i].type != E820_RAM )
>> +            continue;
>> +
>> +        if ( d->arch.e820[i].size < kernel_size )
>> +            continue;
> 
> You can unify both checks in a single condition.

Ok.

>> +
>> +        kstart = ROUNDUP(start, parms->phys_align);
>> +        kstart = kstart < parms->phys_min ? parms->phys_min : kstart;
>> +        kend = kstart + kernel_size;
>> +
>> +        if ( kend > parms->phys_max )
>> +            return 0;
>> +
>> +        if ( kend <= end )
>> +            return kstart;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/* Check the kernel load address, and adjust if necessary and possible. */
>> +static bool __init check_and_adjust_load_address(
>> +    const struct domain *d, struct elf_binary *elf, struct elf_dom_parms *parms)
>> +{
>> +    paddr_t reloc_base;
>> +
>> +    if ( check_load_address(d, elf) )
>> +        return true;
>> +
>> +    if ( parms->phys_align == UNSET_ADDR )
>> +    {
>> +        printk("Address conflict and %pd kernel is not relocatable\n", d);
>> +        return false;
>> +    }
>> +
>> +    reloc_base = find_kernel_memory(d, elf, parms);
>> +    if ( reloc_base == 0 )
>> +    {
>> +        printk("Failed to find a load address for the kernel\n");
> 
> Since you print the domain in the error message prior to this one, I
> would consider also printing it here (or not printing it in both
> cases).

I'll add the %pd.

>> +        return false;
>> +    }
>> +
>> +    if ( opt_dom0_verbose )
>> +        printk("Relocating kernel from [%lx, %lx] -> [%lx, %lx]\n",
>> +               (paddr_t)elf->dest_base,
>> +               (paddr_t)elf->dest_base + elf->dest_size,
>> +               reloc_base, reloc_base + elf->dest_size);
> 
> I think you need `- 1` for the end calculation if you use inclusive
> ranges [, ].  Otherwise [, ) should be used.

Ok, I'll do [, ] with the -1.

>> +
>> +    parms->phys_entry += (reloc_base - (paddr_t)elf->dest_base);
> 
> This seems to assume that the image is always relocated at a higher
> address that the original one?
> 
> parms->phys_entry = reloc_base + (parms->phys_entry - elf->dest_base);

Ok

>> +    elf->dest_base = (char *)reloc_base;
>> +
>> +    return true;
>> +}
>> +
>>   static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>>                                     unsigned long image_headroom,
>>                                     module_t *initrd, void *image_base,
>> @@ -585,6 +687,12 @@ static int __init pvh_load_kernel(struct domain *d, const module_t *image,
>>       elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
>>       elf.dest_size = parms.virt_kend - parms.virt_kstart;
>>   
>> +    if ( !check_and_adjust_load_address(d, &elf, &parms) )
>> +    {
>> +        printk("Unable to load kernel\n");
> 
> check_and_adjust_load_address() already prints an error message,
> probably no need to print another message.

Ok

>> +        return -ENOMEM;
>> +    }
>> +
>>       elf_set_vcpu(&elf, v);
>>       rc = elf_load_binary(&elf);
>>       if ( rc < 0 )
>> diff --git a/xen/common/libelf/libelf-dominfo.c b/xen/common/libelf/libelf-dominfo.c
>> index 7cc7b18a51..837a1b0f21 100644
>> --- a/xen/common/libelf/libelf-dominfo.c
>> +++ b/xen/common/libelf/libelf-dominfo.c
>> @@ -125,6 +125,7 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
>>           [XEN_ELFNOTE_SUSPEND_CANCEL] = { "SUSPEND_CANCEL", ELFNOTE_INT },
>>           [XEN_ELFNOTE_MOD_START_PFN] = { "MOD_START_PFN", ELFNOTE_INT },
>>           [XEN_ELFNOTE_PHYS32_ENTRY] = { "PHYS32_ENTRY", ELFNOTE_INT },
>> +        [XEN_ELFNOTE_PVH_RELOCATION] = { "PVH_RELOCATION", ELFNOTE_OTHER },
>>       };
>>   /* *INDENT-ON* */
>>   
>> @@ -234,6 +235,17 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
>>                   elf_note_numeric_array(elf, note, 8, 0),
>>                   elf_note_numeric_array(elf, note, 8, 1));
>>           break;
>> +
>> +    case XEN_ELFNOTE_PVH_RELOCATION:
>> +        if ( elf_uval(elf, note, descsz) != 3 * sizeof(uint64_t) )
>> +            return -1;
>> +
>> +        parms->phys_min = elf_note_numeric_array(elf, note, 8, 0);
>> +        parms->phys_max = elf_note_numeric_array(elf, note, 8, 1);
>> +        parms->phys_align = elf_note_numeric_array(elf, note, 8, 2);
> 
> Size for those needs to be 4 (32bits) as the entry point is in 32bit
> mode?  I don't see how we can start past the 4GB boundary.

I specified the note as 3x 64bit values.  It seemed simpler than trying 
to support both 32bit and 64bit depending on the kernel arch.  Also, 
just using 64bit provides room in case it is needed in the future.

Do you want the note to be changed to 3x 32bit values?

>> +        elf_msg(elf, "min: %#"PRIx64" max: %#"PRIx64" align: %#"PRIx64"\n",
>> +                parms->phys_min, parms->phys_max, parms->phys_align);
>> +        break;
>>       }
>>       return 0;
>>   }
>> @@ -545,6 +557,7 @@ elf_errorstatus elf_xen_parse(struct elf_binary *elf,
>>       parms->p2m_base = UNSET_ADDR;
>>       parms->elf_paddr_offset = UNSET_ADDR;
>>       parms->phys_entry = UNSET_ADDR32;
>> +    parms->phys_align = UNSET_ADDR;
> 
> For correctness I would also init phys_{min,max}.

There is a memset() out of context above to zero the structure.  I 
thought leaving them both 0 would be fine.

I chose to initialize phys_align as the 64bit UNSET_ADDR since that is 
clearly invalid...  Though we don't have any checking that phys_align is 
a power of 2.
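
A sketch of what such a check could look like at note-parsing time
(hypothetical placement, next to the existing descsz check):

    /* Reject an alignment that is zero or not a power of 2. */
    if ( !parms->phys_align ||
         (parms->phys_align & (parms->phys_align - 1)) )
        return -1;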

Regards,
Jason
Jason Andryuk March 14, 2024, 2:13 p.m. UTC | #8
On 2024-03-14 09:21, Jan Beulich wrote:
> On 13.03.2024 20:30, Jason Andryuk wrote:
>> --- a/xen/include/public/elfnote.h
>> +++ b/xen/include/public/elfnote.h
>> @@ -194,6 +194,17 @@
>>    */
>>   #define XEN_ELFNOTE_PHYS32_ENTRY 18
>>   
>> +/*
>> + * Physical loading constraints for PVH kernels
>> + *
>> + * Used to place constraints on the guest physical loading addresses and
>> + * alignment for a PVH kernel.  This note's value is 3 64bit values in
>> + * the following order: minimum, maximum and alignment.
> 
> Along the lines of what I said on another sub-thread, I think at least
> alignment wants to be optional here. Perhaps, with max going first, min
> could also be optional.

Interesting idea.

> As indicated in different context by Roger, the values being uniformly
> 64-bit ones also is questionable.
> 
>> + * The presence of this note indicates the kernel is relocatable.
> 
> I think it wants making explicit here that the act of relocating is still
> left to the kernel.

Ok.

How is this for a new description?

"""
Physical loading constraints for PVH kernels

Used to place constraints on the guest physical loading addresses and 
alignment for a PVH kernel.

The presence of this note indicates the kernel supports relocating itself.

The note may include up to three 32bit values.
  - a maximum address for the entire image to be loaded below (default 
0xfffffff)
  - a minimum address for the start of the image (default 0)
  - a required start alignment (default 1)
"""

I think if we can agree on the ELF note, the rest will fall into place.
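
To illustrate, a sketch of how the parsing side might handle the
optional fields under that description (hypothetical code, reusing
elf_note_numeric_array() from the patch with 4-byte elements; the max
default assumes the intended 4GB - 1 value):

    unsigned int elems = elf_uval(elf, note, descsz) / 4;

    parms->phys_max = elems > 0 ? elf_note_numeric_array(elf, note, 4, 0)
                                : 0xffffffffU;
    parms->phys_min = elems > 1 ? elf_note_numeric_array(elf, note, 4, 1)
                                : 0;
    parms->phys_align = elems > 2 ? elf_note_numeric_array(elf, note, 4, 2)
                                  : 1;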

Thanks,
Jason
Jan Beulich March 14, 2024, 2:19 p.m. UTC | #9
On 14.03.2024 15:13, Jason Andryuk wrote:
> On 2024-03-14 09:21, Jan Beulich wrote:
>> On 13.03.2024 20:30, Jason Andryuk wrote:
>>> --- a/xen/include/public/elfnote.h
>>> +++ b/xen/include/public/elfnote.h
>>> @@ -194,6 +194,17 @@
>>>    */
>>>   #define XEN_ELFNOTE_PHYS32_ENTRY 18
>>>   
>>> +/*
>>> + * Physical loading constraints for PVH kernels
>>> + *
>>> + * Used to place constraints on the guest physical loading addresses and
>>> + * alignment for a PVH kernel.  This note's value is 3 64bit values in
>>> + * the following order: minimum, maximum and alignment.
>>
>> Along the lines of what I said on another sub-thread, I think at least
>> alignment wants to be optional here. Perhaps, with max going first, min
>> could also be optional.
> 
> Interesting idea.
> 
>> As indicated in different context by Roger, the values being uniformly
>> 64-bit ones also is questionable.
>>
>>> + * The presence of this note indicates the kernel is relocatable.
>>
>> I think it wants making explicit here that the act of relocating is still
>> left to the kernel.
> 
> Ok.
> 
> How is this for a new description?
> 
> """
> Physical loading constraints for PVH kernels
> 
> Used to place constraints on the guest physical loading addresses and 
> alignment for a PVH kernel.
> 
> The presence of this note indicates the kernel supports relocating itself.
> 
> The note may include up to three 32bit values.

I'm as unsure about always 32-bit as I am on it being uniformly 64-bit.
One question here is whether this note is intended to be x86-specific.

>   - a maximum address for the entire image to be loaded below (default 
> 0xfffffff)

One f too few?

>   - a minimum address for the start of the image (default 0)
>   - a required start alignment (default 1)
> """
> 
> I think if we can agree on the ELF note, the rest will fall into place.

Presumably, yes.

Jan
Roger Pau Monne March 14, 2024, 2:33 p.m. UTC | #10
On Thu, Mar 14, 2024 at 09:51:22AM -0400, Jason Andryuk wrote:
> On 2024-03-14 05:48, Roger Pau Monné wrote:
> > On Wed, Mar 13, 2024 at 03:30:21PM -0400, Jason Andryuk wrote:
> > > Xen tries to load a PVH dom0 kernel at the fixed guest physical address
> > > from the elf headers.  For Linux, this defaults to 0x1000000 (16MB), but
> > > it can be configured.
> > > 
> > > Unfortunately there exist firmwares that have reserved regions at this
> > > address, so Xen fails to load the dom0 kernel since it's not RAM.
> > > 
> > > The PVH entry code is not relocatable - it loads from absolute
> > > addresses, which fail when the kernel is loaded at a different address.
> > > With a suitably modified kernel, a relocatable entry point is possible.
> > > 
> > > Add XEN_ELFNOTE_PVH_RELOCATION which specifies the minimum, maximum and
> > > alignment needed for the kernel.  The presence of the NOTE indicates the
> > > kernel supports a relocatable entry path.
> > > 
> > > Change the loading to check for an acceptable load address.  If the
> > > kernel is relocatable, support finding an alternate load address.
> > > 
> > > Link: https://gitlab.com/xen-project/xen/-/issues/180
> > > Signed-off-by: Jason Andryuk <jason.andryuk@amd.com>
> > > ---
> > > ELF Note printing looks like:
> > > (XEN) ELF: note: PVH_RELOCATION = min: 0x1000000 max: 0xffffffff align: 0x200000
> > > 
> > > v2:
> > > Use elfnote for min, max & align - use 64bit values.
> > > Print original and relocated memory addresses
> > > Use check_and_adjust_load_address() name
> > > Return relocated base instead of offset
> > > Use PAGE_ALIGN
> > > Don't load above max_phys (expected to be 4GB in kernel elf note)
> > > Use single line comments
> > > Exit check_load_address loop earlier
> > > Add __init to find_kernel_memory()
> > > ---
> > >   xen/arch/x86/hvm/dom0_build.c      | 108 +++++++++++++++++++++++++++++
> > >   xen/common/libelf/libelf-dominfo.c |  13 ++++
> > >   xen/include/public/elfnote.h       |  11 +++
> > >   xen/include/xen/libelf.h           |   3 +
> > >   4 files changed, 135 insertions(+)
> > > 
> > > diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
> > > index 0ceda4140b..5c6c0d2db3 100644
> > > --- a/xen/arch/x86/hvm/dom0_build.c
> > > +++ b/xen/arch/x86/hvm/dom0_build.c
> > > @@ -537,6 +537,108 @@ static paddr_t __init find_memory(
> > >       return INVALID_PADDR;
> > >   }
> > > +static bool __init check_load_address(
> > > +    const struct domain *d, const struct elf_binary *elf)
> > > +{
> > > +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
> > 
> > Are you sure this is correct?  If a program header specifies a non-4K
> > aligned load address we should still try to honor it.  I think this is
> > very unlikely, but still we shouldn't apply non-requested alignments
> > to addresses coming from the ELF headers.
> 
> I think it's correct in terms of checking the e820 table.  Since the memory
> map is limited to 4k granularity, the bounds need to be rounded accordingly.

That's for populating the p2m, but I don't see why the kernel load
area should be limited by this?

There's AFAICt no issue from a kernel requesting that its start load
address is not page aligned (granted that's very unlikely), but I
don't see why we would impose an unneeded restriction here.

The kernel load area doesn't affect how the p2m is populated, that's
mandated by the e820.

> > > +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
> > > +    unsigned int i;
> > > +
> > > +    /*
> > > +     * The memory map is sorted and all RAM region starts and sizes are
> > > +     * aligned to page boundaries.
> > 
> > Relying on sizes to be page aligned seems fragile: it might work now
> > because of the order in which pvh_setup_vmx_realmode_helpers() first
> > reserves memory for the TSS and afterwards for the identity page
> > tables, but it's not a property this code should assume.
> 
> That can be removed.  It would just eliminate the early exit...
> 
> > > +     */
> > > +    for ( i = 0; i < d->arch.nr_e820; i++ )
> > > +    {
> > > +        paddr_t start = d->arch.e820[i].addr;
> > > +        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
> > > +
> > > +        if ( start >= kernel_end )
> > > +            return false;
> 
> ... here.

I think the sorted aspect is fine, the aligned part is the one I'm
complaining about, so the check above can stay.

> > > +    const struct elf_dom_parms *parms)
> > > +{
> > > +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
> > > +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
> > > +    paddr_t kernel_size = kernel_end - kernel_start;
> > 
> > Hm, I'm again unsure about the alignments applied here.
> 
> Same as above regarding 4k granularity.
> 
> > I think if anything we want to assert that dest_base is aligned to phys_align.
> 
> That would indicate the kernel image is inconsistent.

Indeed.  I think doing that sanity check would be worth.
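
Something along these lines, perhaps (sketch; names as used in the
patch's check_and_adjust_load_address(), with UNSET_ADDR meaning the
note was absent):

    /* A kernel whose load address violates its own note is inconsistent. */
    if ( parms->phys_align != UNSET_ADDR &&
         ((paddr_t)elf->dest_base & (parms->phys_align - 1)) )
    {
        printk("%pd kernel load address violates its alignment\n", d);
        return false;
    }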

> > > diff --git a/xen/common/libelf/libelf-dominfo.c b/xen/common/libelf/libelf-dominfo.c
> > > index 7cc7b18a51..837a1b0f21 100644
> > > --- a/xen/common/libelf/libelf-dominfo.c
> > > +++ b/xen/common/libelf/libelf-dominfo.c
> > > @@ -125,6 +125,7 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
> > >           [XEN_ELFNOTE_SUSPEND_CANCEL] = { "SUSPEND_CANCEL", ELFNOTE_INT },
> > >           [XEN_ELFNOTE_MOD_START_PFN] = { "MOD_START_PFN", ELFNOTE_INT },
> > >           [XEN_ELFNOTE_PHYS32_ENTRY] = { "PHYS32_ENTRY", ELFNOTE_INT },
> > > +        [XEN_ELFNOTE_PVH_RELOCATION] = { "PVH_RELOCATION", ELFNOTE_OTHER },
> > >       };
> > >   /* *INDENT-ON* */
> > > @@ -234,6 +235,17 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
> > >                   elf_note_numeric_array(elf, note, 8, 0),
> > >                   elf_note_numeric_array(elf, note, 8, 1));
> > >           break;
> > > +
> > > +    case XEN_ELFNOTE_PVH_RELOCATION:
> > > +        if ( elf_uval(elf, note, descsz) != 3 * sizeof(uint64_t) )
> > > +            return -1;
> > > +
> > > +        parms->phys_min = elf_note_numeric_array(elf, note, 8, 0);
> > > +        parms->phys_max = elf_note_numeric_array(elf, note, 8, 1);
> > > +        parms->phys_align = elf_note_numeric_array(elf, note, 8, 2);
> > 
> > Size for those needs to be 4 (32bits) as the entry point is in 32bit
> > mode?  I don't see how we can start past the 4GB boundary.
> 
> I specified the note as 3x 64bit values.  It seemed simpler than trying to
> support both 32bit and 64bit depending on the kernel arch.  Also, just using
> 64bit provides room in case it is needed in the future.

Why do you say depending on the kernel arch?

PVH doesn't know the bitness of the kernel, as the kernel entry point
is always started in protected 32bit mode.  We should just support
32bit values, regardless of the kernel bitness, because that's the
only range that's suitable in order to jump into the entry point.

Note how XEN_ELFNOTE_PHYS32_ENTRY is also unconditionally a 32bit
integer.

> Do you want the note to be changed to 3x 32bit values?

Unless anyone objects, yes, that would be my preference.

> > > +        elf_msg(elf, "min: %#"PRIx64" max: %#"PRIx64" align: %#"PRIx64"\n",
> > > +                parms->phys_min, parms->phys_max, parms->phys_align);
> > > +        break;
> > >       }
> > >       return 0;
> > >   }
> > > @@ -545,6 +557,7 @@ elf_errorstatus elf_xen_parse(struct elf_binary *elf,
> > >       parms->p2m_base = UNSET_ADDR;
> > >       parms->elf_paddr_offset = UNSET_ADDR;
> > >       parms->phys_entry = UNSET_ADDR32;
> > > +    parms->phys_align = UNSET_ADDR;
> > 
> > For correctness I would also init phys_{min,max}.
> 
> There is a memset() out of context above to zero the structure.  I thought
> leaving them both 0 would be fine.

0 would be a valid value, hence it's best to use UNSET_ADDR to clearly
notice when a value has been provided by the parsed binary or not.
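
I.e. something like (sketch):

    parms->phys_min = UNSET_ADDR;
    parms->phys_max = UNSET_ADDR;
    parms->phys_align = UNSET_ADDR;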

Thanks, Roger.
Jan Beulich March 14, 2024, 3:30 p.m. UTC | #11
On 14.03.2024 15:33, Roger Pau Monné wrote:
> On Thu, Mar 14, 2024 at 09:51:22AM -0400, Jason Andryuk wrote:
>> On 2024-03-14 05:48, Roger Pau Monné wrote:
>>> On Wed, Mar 13, 2024 at 03:30:21PM -0400, Jason Andryuk wrote:
>>>> @@ -234,6 +235,17 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
>>>>                   elf_note_numeric_array(elf, note, 8, 0),
>>>>                   elf_note_numeric_array(elf, note, 8, 1));
>>>>           break;
>>>> +
>>>> +    case XEN_ELFNOTE_PVH_RELOCATION:
>>>> +        if ( elf_uval(elf, note, descsz) != 3 * sizeof(uint64_t) )
>>>> +            return -1;
>>>> +
>>>> +        parms->phys_min = elf_note_numeric_array(elf, note, 8, 0);
>>>> +        parms->phys_max = elf_note_numeric_array(elf, note, 8, 1);
>>>> +        parms->phys_align = elf_note_numeric_array(elf, note, 8, 2);
>>>
>>> Size for those needs to be 4 (32bits) as the entry point is in 32bit
>>> mode?  I don't see how we can start past the 4GB boundary.
>>
>> I specified the note as 3x 64bit values.  It seemed simpler than trying to
>> support both 32bit and 64bit depending on the kernel arch.  Also, just using
>> 64bit provides room in case it is needed in the future.
> 
> Why do you say depending on the kernel arch?
> 
> PVH doesn't know the bitness of the kernel, as the kernel entry point
> is always started in protected 32bit mode.  We should just support
> 32bit values, regardless of the kernel bitness, because that's the
> only range that's suitable in order to jump into the entry point.
> 
> Note how XEN_ELFNOTE_PHYS32_ENTRY is also unconditionally a 32bit
> integer.
> 
>> Do you want the note to be changed to 3x 32bit values?
> 
> > Unless anyone objects, yes, that would be my preference.

As mentioned elsewhere, unless the entire note is meant to be x86-specific,
this fixed-32-bit property then would want limiting to x86.

Jan
Roger Pau Monne March 14, 2024, 4:48 p.m. UTC | #12
On Thu, Mar 14, 2024 at 04:30:05PM +0100, Jan Beulich wrote:
> On 14.03.2024 15:33, Roger Pau Monné wrote:
> > On Thu, Mar 14, 2024 at 09:51:22AM -0400, Jason Andryuk wrote:
> >> On 2024-03-14 05:48, Roger Pau Monné wrote:
> >>> On Wed, Mar 13, 2024 at 03:30:21PM -0400, Jason Andryuk wrote:
> >>>> @@ -234,6 +235,17 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
> >>>>                   elf_note_numeric_array(elf, note, 8, 0),
> >>>>                   elf_note_numeric_array(elf, note, 8, 1));
> >>>>           break;
> >>>> +
> >>>> +    case XEN_ELFNOTE_PVH_RELOCATION:
> >>>> +        if ( elf_uval(elf, note, descsz) != 3 * sizeof(uint64_t) )
> >>>> +            return -1;
> >>>> +
> >>>> +        parms->phys_min = elf_note_numeric_array(elf, note, 8, 0);
> >>>> +        parms->phys_max = elf_note_numeric_array(elf, note, 8, 1);
> >>>> +        parms->phys_align = elf_note_numeric_array(elf, note, 8, 2);
> >>>
> >>> Size for those needs to be 4 (32bits) as the entry point is in 32bit
> >>> mode?  I don't see how we can start past the 4GB boundary.
> >>
> >> I specified the note as 3x 64bit values.  It seemed simpler than trying to
> >> support both 32bit and 64bit depending on the kernel arch.  Also, just using
> >> 64bit provides room in case it is needed in the future.
> > 
> > Why do you say depending on the kernel arch?
> > 
> > PVH doesn't know the bitness of the kernel, as the kernel entry point
> > is always started in protected 32bit mode.  We should just support
> > 32bit values, regardless of the kernel bitness, because that's the
> > only range that's suitable in order to jump into the entry point.
> > 
> > Note how XEN_ELFNOTE_PHYS32_ENTRY is also unconditionally a 32bit
> > integer.
> > 
> >> Do you want the note to be changed to 3x 32bit values?
> > 
> > Unless anyone objects, yes, that would be my preference.
> 
> As mentioned elsewhere, unless the entire note is meant to be x86-specific,
> this fixed-32-bit property then would want limiting to x86.

Elfnotes are used only on x86 so far.  I don't see why it would be an
issue if/when another architecture wants to use the same elfnote names
with different field sizes.  When such a need arises we could clarify
that the 32-bit size is only for x86 and also specify the size for the
other architecture.

Thanks, Roger.
Jason Andryuk March 14, 2024, 4:59 p.m. UTC | #13
On 2024-03-14 11:30, Jan Beulich wrote:
> On 14.03.2024 15:33, Roger Pau Monné wrote:
>> On Thu, Mar 14, 2024 at 09:51:22AM -0400, Jason Andryuk wrote:
>>> On 2024-03-14 05:48, Roger Pau Monné wrote:
>>>> On Wed, Mar 13, 2024 at 03:30:21PM -0400, Jason Andryuk wrote:
>>>>> @@ -234,6 +235,17 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
>>>>>                    elf_note_numeric_array(elf, note, 8, 0),
>>>>>                    elf_note_numeric_array(elf, note, 8, 1));
>>>>>            break;
>>>>> +
>>>>> +    case XEN_ELFNOTE_PVH_RELOCATION:
>>>>> +        if ( elf_uval(elf, note, descsz) != 3 * sizeof(uint64_t) )
>>>>> +            return -1;
>>>>> +
>>>>> +        parms->phys_min = elf_note_numeric_array(elf, note, 8, 0);
>>>>> +        parms->phys_max = elf_note_numeric_array(elf, note, 8, 1);
>>>>> +        parms->phys_align = elf_note_numeric_array(elf, note, 8, 2);
>>>>
>>>> Size for those needs to be 4 (32bits) as the entry point is in 32bit
>>>> mode?  I don't see how we can start past the 4GB boundary.
>>>
>>> I specified the note as 3x 64bit values.  It seemed simpler than trying to
>>> support both 32bit and 64bit depending on the kernel arch.  Also, just using
>>> 64bit provides room in case it is needed in the future.
>>
>> Why do you say depending on the kernel arch?
>>
>> PVH doesn't know the bitness of the kernel, as the kernel entry point
>> is always started in protected 32bit mode.  We should just support
>> 32bit values, regardless of the kernel bitness, because that's the
>> only range that's suitable in order to jump into the entry point.
>>
>> Note how XEN_ELFNOTE_PHYS32_ENTRY is also unconditionally a 32bit
>> integer.

Linux defines PHYS32_ENTRY with _ASM_PTR, so it's 32 or 64 bit to match 
how the kernel is compiled.  The Xen code parses the integer according 
to the size of the note.

>>> Do you want the note to be changed to 3x 32bit values?
>>
>> Unless anyone objects, yes, that would be my preference.
> 
> As mentioned elsewhere, unless the entire note is meant to be x86-specific,
> this fixed-32-bit property then would want limiting to x86.

PVH is only implemented for x86 today.  Are you saying that the comment 
should just specify the values are 32bit for x86?  If the note is reused 
for other arches, then they can specify their usage?

If this note is to be a variably sized array of values, then the 
elements should be of fixed size.  Otherwise parsing is ambiguous 
without, say, another field specifying element size.

Maybe XEN_ELFNOTE_PHYS32_RELOC would be a better name to complement the 
PHYS32_ENTRY?

Regards,
Jason
Jan Beulich March 14, 2024, 5:02 p.m. UTC | #14
On 14.03.2024 17:59, Jason Andryuk wrote:
> On 2024-03-14 11:30, Jan Beulich wrote:
>> On 14.03.2024 15:33, Roger Pau Monné wrote:
>>> On Thu, Mar 14, 2024 at 09:51:22AM -0400, Jason Andryuk wrote:
>>>> On 2024-03-14 05:48, Roger Pau Monné wrote:
>>>>> On Wed, Mar 13, 2024 at 03:30:21PM -0400, Jason Andryuk wrote:
>>>>>> @@ -234,6 +235,17 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
>>>>>>                    elf_note_numeric_array(elf, note, 8, 0),
>>>>>>                    elf_note_numeric_array(elf, note, 8, 1));
>>>>>>            break;
>>>>>> +
>>>>>> +    case XEN_ELFNOTE_PVH_RELOCATION:
>>>>>> +        if ( elf_uval(elf, note, descsz) != 3 * sizeof(uint64_t) )
>>>>>> +            return -1;
>>>>>> +
>>>>>> +        parms->phys_min = elf_note_numeric_array(elf, note, 8, 0);
>>>>>> +        parms->phys_max = elf_note_numeric_array(elf, note, 8, 1);
>>>>>> +        parms->phys_align = elf_note_numeric_array(elf, note, 8, 2);
>>>>>
>>>>> Size for those needs to be 4 (32bits) as the entry point is in 32bit
>>>>> mode?  I don't see how we can start past the 4GB boundary.
>>>>
>>>> I specified the note as 3x 64bit values.  It seemed simpler than trying to
>>>> support both 32bit and 64bit depending on the kernel arch.  Also, just using
>>>> 64bit provides room in case it is needed in the future.
>>>
>>> Why do you say depending on the kernel arch?
>>>
>>> PVH doesn't know the bitness of the kernel, as the kernel entry point
>>> is always started in protected 32bit mode.  We should just support
>>> 32bit values, regardless of the kernel bitness, because that's the
>>> only range that's suitable in order to jump into the entry point.
>>>
>>> Note how XEN_ELFNOTE_PHYS32_ENTRY is also unconditionally a 32bit
>>> integer.
> 
> Linux defines PHYS32_ENTRY with _ASM_PTR, so it's 32 or 64 bit to match 
> how the kernel is compiled.  The Xen code parses the integer according 
> to the size of the note.
> 
>>>> Do you want the note to be changed to 3x 32bit values?
>>>
>>> Unless anyone objects, yes, that would be my preference.
>>
>> As mentioned elsewhere, unless the entire note is meant to be x86-specific,
>> this fixed-32-bit property then would want limiting to x86.
> 
> PVH is only implemented for x86 today.  Are you saying that the comment 
> should just specify the values are 32bit for x86?  If the note is reused 
> for other arches, then they can specify their usage?

Along these lines. But looks like Roger isn't concerned and would be
happy to leave that to the future.

> If this note is to be a variably sized array of values, then the 
> elements should be of fixed size.  Otherwise parsing is ambiguous 
> without, say, another field specifying element size.
> 
> Maybe XEN_ELFNOTE_PHYS32_RELOC would be a better name to complement the 
> PHYS32_ENTRY?

Perhaps, yes.

Jan
Jason Andryuk March 14, 2024, 7:19 p.m. UTC | #15
On 2024-03-14 09:31, Jan Beulich wrote:
> On 13.03.2024 20:30, Jason Andryuk wrote:
>> --- a/xen/arch/x86/hvm/dom0_build.c
>> +++ b/xen/arch/x86/hvm/dom0_build.c
>> @@ -537,6 +537,108 @@ static paddr_t __init find_memory(
>>       return INVALID_PADDR;
>>   }
>>   
>> +static bool __init check_load_address(
>> +    const struct domain *d, const struct elf_binary *elf)
>> +{
>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
> 
> Both casts act on a pointer value. Such cannot legitimately be converted
> to paddr_t; it only so happens that paddr_t is effectively the same as
> uintptr_t right now. (Same issue again further down.) That said, I notice
> we have pre-existing examples of this ...

Yes, I followed existing code.  Do you want dest_base to be switched to 
a uintptr_t?

>> +/* Check the kernel load address, and adjust if necessary and possible. */
>> +static bool __init check_and_adjust_load_address(
>> +    const struct domain *d, struct elf_binary *elf, struct elf_dom_parms *parms)
>> +{
>> +    paddr_t reloc_base;
>> +
>> +    if ( check_load_address(d, elf) )
>> +        return true;
>> +
>> +    if ( parms->phys_align == UNSET_ADDR )
>> +    {
>> +        printk("Address conflict and %pd kernel is not relocatable\n", d);
>> +        return false;
>> +    }
>> +
>> +    reloc_base = find_kernel_memory(d, elf, parms);
>> +    if ( reloc_base == 0 )
>> +    {
>>> +        printk("Failed to find a load address for the kernel\n");
>> +        return false;
>> +    }
>> +
>> +    if ( opt_dom0_verbose )
>> +        printk("Relocating kernel from [%lx, %lx] -> [%lx, %lx]\n",
>> +               (paddr_t)elf->dest_base,
>> +               (paddr_t)elf->dest_base + elf->dest_size,
> 
> By using %p you clearly can avoid the casts here.

Ok.

>> +               reloc_base, reloc_base + elf->dest_size);
> 
> I'm not convinced %lx is really appropriate for paddr_t.

PRIpaddr exists.  It's "016lx" for x86.  Using that and %p adds lots of 0s:
(XEN) Relocating kernel from [0000000001000000, 000000000202ffff] -> 
[0000000002200000, 000000000322ffff]

Regards,
Jason
Roger Pau Monne March 15, 2024, 8:45 a.m. UTC | #16
On Thu, Mar 14, 2024 at 12:59:19PM -0400, Jason Andryuk wrote:
> On 2024-03-14 11:30, Jan Beulich wrote:
> > On 14.03.2024 15:33, Roger Pau Monné wrote:
> > > On Thu, Mar 14, 2024 at 09:51:22AM -0400, Jason Andryuk wrote:
> > > > On 2024-03-14 05:48, Roger Pau Monné wrote:
> > > > > On Wed, Mar 13, 2024 at 03:30:21PM -0400, Jason Andryuk wrote:
> > > > > > @@ -234,6 +235,17 @@ elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
> > > > > >                    elf_note_numeric_array(elf, note, 8, 0),
> > > > > >                    elf_note_numeric_array(elf, note, 8, 1));
> > > > > >            break;
> > > > > > +
> > > > > > +    case XEN_ELFNOTE_PVH_RELOCATION:
> > > > > > +        if ( elf_uval(elf, note, descsz) != 3 * sizeof(uint64_t) )
> > > > > > +            return -1;
> > > > > > +
> > > > > > +        parms->phys_min = elf_note_numeric_array(elf, note, 8, 0);
> > > > > > +        parms->phys_max = elf_note_numeric_array(elf, note, 8, 1);
> > > > > > +        parms->phys_align = elf_note_numeric_array(elf, note, 8, 2);
> > > > > 
> > > > > Size for those needs to be 4 (32bits) as the entry point is in 32bit
> > > > > mode?  I don't see how we can start past the 4GB boundary.
> > > > 
> > > > I specified the note as 3x 64bit values.  It seemed simpler than trying to
> > > > support both 32bit and 64bit depending on the kernel arch.  Also, just using
> > > > 64bit provides room in case it is needed in the future.
> > > 
> > > Why do you say depending on the kernel arch?
> > > 
> > > PVH doesn't know the bitness of the kernel, as the kernel entry point
> > > is always started in protected 32bit mode.  We should just support
> > > 32bit values, regardless of the kernel bitness, because that's the
> > > only range that's suitable in order to jump into the entry point.
> > > 
> > > Note how XEN_ELFNOTE_PHYS32_ENTRY is also unconditionally a 32bit
> > > integer.
> 
> Linux defines PHYS32_ENTRY with _ASM_PTR, so it's 32 or 64 bit to match how
> the kernel is compiled.  The Xen code parses the integer according to the
> size of the note.

I think that's wrong: PHYS32_ENTRY should strictly be a 32bit integer.
In fact the field in struct elf_dom_parms is a uint32_t, so Linux
using _ASM_PTR seems bogus; it should unconditionally use .long
regardless of the kernel bitness.

> > > > Do you want the note to be changed to 3x 32bit values?
> > > 
> > > Unless anyone objects, yes, that would be my preference.
> > 
> > As mentioned elsewhere, unless the entire note is meant to be x86-specific,
> > this fixed-32-bit property then would want limiting to x86.
> 
> PVH is only implemented for x86 today.  Are you saying that the comment
> should just specify the values are 32bit for x86?  If the note is reused for
> other arches, then they can specify their usage?
> 
> If this note is to be a variably sized array of values, then the elements
> should be of fixed size.  Otherwise parsing is ambiguous without, say,
> another field specifying element size.
> 
> Maybe XEN_ELFNOTE_PHYS32_RELOC would be a better name to complement the
> PHYS32_ENTRY?

IMO the '32' part of PHYS32_ENTRY is kind of redundant, given the CPU
state when using such entry point it's impossible to use 64bit
addresses.  I would be fine with using XEN_ELFNOTE_PHYS_RELOC or some
such.  Anyway, this is just a name so I'm not going to opposed if Jan
and yourself prefer to keep using PHYS32.

Thanks, Roger.
Jan Beulich March 15, 2024, 9:48 a.m. UTC | #17
On 14.03.2024 20:19, Jason Andryuk wrote:
> On 2024-03-14 09:31, Jan Beulich wrote:
>> On 13.03.2024 20:30, Jason Andryuk wrote:
>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>> @@ -537,6 +537,108 @@ static paddr_t __init find_memory(
>>>       return INVALID_PADDR;
>>>   }
>>>   
>>> +static bool __init check_load_address(
>>> +    const struct domain *d, const struct elf_binary *elf)
>>> +{
>>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>>> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
>>
>> Both casts act on a pointer value. Such cannot legitimately be converted
>> to paddr_t; it only so happens that paddr_t is effectively the same as
>> uintptr_t right now. (Same issue again further down.) That said, I notice
>> we have pre-existing examples of this ...
> 
> Yes, I followed existing code.  Do you want dest_base to be switched to 
> a uintptr_t?

I think it was deliberately switched to being a pointer at some point,
maybe even in a security fix.

>>> +/* Check the kernel load address, and adjust if necessary and possible. */
>>> +static bool __init check_and_adjust_load_address(
>>> +    const struct domain *d, struct elf_binary *elf, struct elf_dom_parms *parms)
>>> +{
>>> +    paddr_t reloc_base;
>>> +
>>> +    if ( check_load_address(d, elf) )
>>> +        return true;
>>> +
>>> +    if ( parms->phys_align == UNSET_ADDR )
>>> +    {
>>> +        printk("Address conflict and %pd kernel is not relocatable\n", d);
>>> +        return false;
>>> +    }
>>> +
>>> +    reloc_base = find_kernel_memory(d, elf, parms);
>>> +    if ( reloc_base == 0 )
>>> +    {
>>> +        printk("Failed find a load address for the kernel\n");
>>> +        return false;
>>> +    }
>>> +
>>> +    if ( opt_dom0_verbose )
>>> +        printk("Relocating kernel from [%lx, %lx] -> [%lx, %lx]\n",
>>> +               (paddr_t)elf->dest_base,
>>> +               (paddr_t)elf->dest_base + elf->dest_size,
>>
>> By using %p you clearly can avoid the casts here.
> 
> Ok.
> 
>>> +               reloc_base, reloc_base + elf->dest_size);
>>
>> I'm not convinced %lx is really appropriate for paddr_t.
> 
> PRIpaddr exists.  It's "016lx" for x86.  Using that and %p adds lots of 0s:
> (XEN) Relocating kernel from [0000000001000000, 000000000202ffff] -> 
> [0000000002200000, 000000000322ffff]

That's to be accepted, I guess.

Looking at it again, is "Relocating" in the log message perhaps misleading?
We don't relocate it; that's something the kernel itself has to do.  We
only put it at a position other than the default.  Maybe "Moving" instead?

Jan
Jason Andryuk March 18, 2024, 9:19 p.m. UTC | #18
On 2024-03-14 10:19, Jan Beulich wrote:
> On 14.03.2024 15:13, Jason Andryuk wrote:
>> On 2024-03-14 09:21, Jan Beulich wrote:
>>> On 13.03.2024 20:30, Jason Andryuk wrote:
>>>> --- a/xen/include/public/elfnote.h
>>>> +++ b/xen/include/public/elfnote.h
>>>> @@ -194,6 +194,17 @@
>>>>     */
>>>>    #define XEN_ELFNOTE_PHYS32_ENTRY 18
>>>>    
>>>> +/*
>>>> + * Physical loading constraints for PVH kernels
>>>> + *
>>>> + * Used to place constraints on the guest physical loading addresses and
>>>> + * alignment for a PVH kernel.  This note's value is 3 64bit values in
>>>> + * the following order: minimum, maximum and alignment.
>>>
>>> Along the lines of what I said on another sub-thread, I think at least
>>> alignment wants to be optional here. Perhaps, with max going first, min
>>> could also be optional.
>>
>> Interesting idea.
>>
>>> As indicated in different context by Roger, the values being uniformly
>>> 64-bit ones also is questionable.
>>>
>>>> + * The presence of this note indicates the kernel is relocatable.
>>>
>>> I think it wants making explicit here that the act of relocating is still
>>> left to the kernel.
>>
>> Ok.
>>
>> How is this for a new description?
>>
>> """
>> Physical loading constraints for PVH kernels
>>
>> Used to place constraints on the guest physical loading addresses and
>> alignment for a PVH kernel.
>>
>> The presence of this note indicates the kernel supports relocating itself.
>>
>> The note may include up to three 32bit values.
> 
> I'm as unsure about always 32-bit as I am on it being uniformly 64-bit.
> One question here is whether this note is intended to be x86-specific.
> 
>>    - a maximum address for the entire image to be loaded below (default
>> 0xfffffff)
> 
> One f too few?

Whoops - yes.

>>    - a minimum address for the start of the image (default 0)
>>    - a required start alignment (default 1)
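
With all three values optional like that, I'd expect the parsing to end
up along these lines (untested sketch; the XEN_ELFNOTE_PHYS32_RELOC name
and the phys_reloc flag are only illustrative):

    case XEN_ELFNOTE_PHYS32_RELOC:
    {
        uint64_t sz = elf_uval(elf, note, descsz);

        /* Untested: each 32bit value is optional, with the defaults above. */
        parms->phys_max   = sz >= 1 * sizeof(uint32_t)
                            ? elf_note_numeric_array(elf, note, 4, 0)
                            : 0xffffffffU;
        parms->phys_min   = sz >= 2 * sizeof(uint32_t)
                            ? elf_note_numeric_array(elf, note, 4, 1)
                            : 0;
        parms->phys_align = sz >= 3 * sizeof(uint32_t)
                            ? elf_note_numeric_array(elf, note, 4, 2)
                            : 1;
        /* Presence of the note alone marks the kernel as relocatable. */
        parms->phys_reloc = true;
        break;
    }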

Jan, in the discussion of patch 1, you wrote "Hmm, shouldn't the order 
of attempts to figure the alignment be ELF note, ELF header, and then 
2Mb?"  My latest revision initializes phys_alignment to 1 and updates 
that if PHYS32_RELOC specifies an alignment.  Do you still want these 
other locations checked for alignment values?
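
If so, I'd imagine scanning the program headers along these lines
(untested):

    /* Untested: use the largest PT_LOAD p_align as a fallback alignment. */
    uint64_t align = 0;
    unsigned int i;

    for ( i = 0; i < elf_phdr_count(elf); i++ )
    {
        ELF_HANDLE_DECL(elf_phdr) phdr = elf_phdr_by_index(elf, i);

        if ( elf_uval(elf, phdr, p_type) == PT_LOAD )
            align = MAX(align, elf_uval(elf, phdr, p_align));
    }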

Regards,
Jason
Jason Andryuk March 18, 2024, 9:21 p.m. UTC | #19
On 2024-03-15 05:48, Jan Beulich wrote:
> On 14.03.2024 20:19, Jason Andryuk wrote:
>> On 2024-03-14 09:31, Jan Beulich wrote:
>>> On 13.03.2024 20:30, Jason Andryuk wrote:
>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>> @@ -537,6 +537,108 @@ static paddr_t __init find_memory(
>>>>        return INVALID_PADDR;
>>>>    }
>>>>    
>>>> +static bool __init check_load_address(
>>>> +    const struct domain *d, const struct elf_binary *elf)
>>>> +{
>>>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>>>> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
>>>
>>> Both casts act on a pointer value. Such cannot legitimately be converted
>>> to paddr_t; it only so happens that paddr_t is effectively the same as
>>> uintptr_t right now. (Same issue again further down.) That said, I notice
>>> we have pre-existing examples of this ...
>>
>> Yes, I followed existing code.  Do you want dest_base to be switched to
>> a uintptr_t?
> 
> I think it was deliberately switched to being a pointer at some point,
> maybe even in a security fix.

commit 65808a8ed41cc7c044f588bd6cab5af0fdc0e029 "libelf: check all 
pointer accesses", part of XSA-55, switched from a single dest pointer 
to dest_base & dest_size.

For libxenguest, it's a pointer to guest-mapped memory to copy in the 
kernel.  For PV dom0 creation, it's a pointer - Xen switches to the dom0 
page tables and performs the copy with it as-is.  For PVH dom0, it's a 
guest physical address.

Are you looking for this to be addressed in this series?

>>>> +/* Check the kernel load address, and adjust if necessary and possible. */
>>>> +static bool __init check_and_adjust_load_address(
>>>> +    const struct domain *d, struct elf_binary *elf, struct elf_dom_parms *parms)
>>>> +{
>>>> +    paddr_t reloc_base;
>>>> +
>>>> +    if ( check_load_address(d, elf) )
>>>> +        return true;
>>>> +
>>>> +    if ( parms->phys_align == UNSET_ADDR )
>>>> +    {
>>>> +        printk("Address conflict and %pd kernel is not relocatable\n", d);
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    reloc_base = find_kernel_memory(d, elf, parms);
>>>> +    if ( reloc_base == 0 )
>>>> +    {
>>>> +        printk("Failed find a load address for the kernel\n");
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    if ( opt_dom0_verbose )
>>>> +        printk("Relocating kernel from [%lx, %lx] -> [%lx, %lx]\n",
>>>> +               (paddr_t)elf->dest_base,
>>>> +               (paddr_t)elf->dest_base + elf->dest_size,
>>>
>>> By using %p you clearly can avoid the casts here.
>>
>> Ok.
>>
>>>> +               reloc_base, reloc_base + elf->dest_size);
>>>
>>> I'm not convinced %lx is really appropriate for paddr_t.
>>
>> PRIpaddr exists.  It's "016lx" for x86.  Using that and %p adds lots of 0s:
>> (XEN) Relocating kernel from [0000000001000000, 000000000202ffff] ->
>> [0000000002200000, 000000000322ffff]
> 
> That's to be accepted, I guess.
> 
> Looking at it again, is "Relocating" in the log message perhaps misleading?
> We don't relocate it; that's something the kernel itself has to do.  We
> only put it at a position other than the default.  Maybe "Moving" instead?

Yes, "Moving" sounds better.  I guess I'll drop the "from" and insert a 
%pd for:

(XEN) Moving d0 kernel [0000000001000000, 000000000202ffff] -> 
[0000000002200000, 000000000322ffff]
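
The printk would then be roughly (untested):

    if ( opt_dom0_verbose )
        printk("Moving %pd kernel [%p, %p] -> [%"PRIpaddr", %"PRIpaddr"]\n",
               d, elf->dest_base, elf->dest_base + elf->dest_size,
               reloc_base, reloc_base + elf->dest_size);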

Regards,
Jason
Jan Beulich March 19, 2024, 8:11 a.m. UTC | #20
On 18.03.2024 22:19, Jason Andryuk wrote:
> On 2024-03-14 10:19, Jan Beulich wrote:
>> On 14.03.2024 15:13, Jason Andryuk wrote:
>>> On 2024-03-14 09:21, Jan Beulich wrote:
>>>> On 13.03.2024 20:30, Jason Andryuk wrote:
>>>>> --- a/xen/include/public/elfnote.h
>>>>> +++ b/xen/include/public/elfnote.h
>>>>> @@ -194,6 +194,17 @@
>>>>>     */
>>>>>    #define XEN_ELFNOTE_PHYS32_ENTRY 18
>>>>>    
>>>>> +/*
>>>>> + * Physical loading constraints for PVH kernels
>>>>> + *
>>>>> + * Used to place constraints on the guest physical loading addresses and
>>>>> + * alignment for a PVH kernel.  This note's value is 3 64bit values in
>>>>> + * the following order: minimum, maximum and alignment.
>>>>
>>>> Along the lines of what I said on another sub-thread, I think at least
>>>> alignment wants to be optional here. Perhaps, with max going first, min
>>>> could also be optional.
>>>
>>> Interesting idea.
>>>
>>>> As indicated in different context by Roger, the values being uniformly
>>>> 64-bit ones also is questionable.
>>>>
>>>>> + * The presence of this note indicates the kernel is relocatable.
>>>>
>>>> I think it wants making explicit here that the act of relocating is still
>>>> left to the kernel.
>>>
>>> Ok.
>>>
>>> How is this for a new description?
>>>
>>> """
>>> Physical loading constraints for PVH kernels
>>>
>>> Used to place constraints on the guest physical loading addresses and
>>> alignment for a PVH kernel.
>>>
>>> The presence of this note indicates the kernel supports relocating itself.
>>>
>>> The note may include up to three 32bit values.
>>
>> I'm as unsure about always 32-bit as I am on it being uniformly 64-bit.
>> One question here is whether this note is intended to be x86-specific.
>>
>>>    - a maximum address for the entire image to be loaded below (default
>>> 0xfffffff)
>>
>> One f too few?
> 
> Whoops - yes.
> 
>>>    - a minimum address for the start of the image (default 0)
>>>    - a required start alignment (default 1)
> 
> Jan, in the discussion of patch 1, you wrote "Hmm, shouldn't the order 
> of attempts to figure the alignment be ELF note, ELF header, and then 
> 2Mb?"  My latest revision initializes phys_alignment to 1 and updates 
> that if PHYS32_RELOC specifies an alignment.  Do you still want these 
> other locations checked for alignment values?

I think it would be prudent to do so, yet at the same time I guess I won't
insist. Defaulting to 1, though, looks overly lax. In order for the
alignment value to be sensible to omit, the default needs to be sensible
(no lower than 4k, and quite likely better 2M).
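
I.e. something along the lines of (untested; the helper name is made
up):

    /* Untested: ELF note first, then the ELF header, then the 2Mb default. */
    if ( parms->phys_align == UNSET_ADDR )
    {
        uint64_t align = elf_max_phdr_align(elf); /* made-up helper */

        parms->phys_align = align >= PAGE_SIZE ? align : MB(2);
    }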

Jan
Jan Beulich March 19, 2024, 8:15 a.m. UTC | #21
On 18.03.2024 22:21, Jason Andryuk wrote:
> On 2024-03-15 05:48, Jan Beulich wrote:
>> On 14.03.2024 20:19, Jason Andryuk wrote:
>>> On 2024-03-14 09:31, Jan Beulich wrote:
>>>> On 13.03.2024 20:30, Jason Andryuk wrote:
>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>> @@ -537,6 +537,108 @@ static paddr_t __init find_memory(
>>>>>        return INVALID_PADDR;
>>>>>    }
>>>>>    
>>>>> +static bool __init check_load_address(
>>>>> +    const struct domain *d, const struct elf_binary *elf)
>>>>> +{
>>>>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>>>>> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
>>>>
>>>> Both casts act on a pointer value. Such cannot legitimately be converted
>>>> to paddr_t; it only so happens that paddr_t is effectively the same as
>>>> uintptr_t right now. (Same issue again further down.) That said, I notice
>>>> we have pre-existing examples of this ...
>>>
>>> Yes, I followed existing code.  Do you want dest_base to be switched to
>>> a uintptr_t?
>>
>> I think it was deliberately switched to being a pointer at some point,
>> maybe even in a security fix.
> 
> commit 65808a8ed41cc7c044f588bd6cab5af0fdc0e029 "libelf: check all 
> pointer accesses", part of XSA-55, switched from a single dest pointer 
> to dest_base & dest_size.
> 
> For libxenguest, it's a pointer to guest-mapped memory to copy in the 
> kernel.  For PV dom0 creation, it's a pointer - Xen switches to the dom0 
> page tables and performs the copy with it as-is.  For PVH dom0, it's a 
> guest physical address.
> 
> Are you looking for this to be addressed in this series?

I'm sorry to ask, but what is "this" here? I'd like your change to leave
all types alone. I'd further like your change to preferably avoid cloning
questionable code (i.e. casts using the wrong type in particular). Plus,
where technically possible, adhere to the general principle of avoiding
casts (for typically being at least somewhat risky towards hiding
potential bugs).

Jan
Jason Andryuk March 19, 2024, 1:50 p.m. UTC | #22
On 2024-03-19 04:15, Jan Beulich wrote:
> On 18.03.2024 22:21, Jason Andryuk wrote:
>> On 2024-03-15 05:48, Jan Beulich wrote:
>>> On 14.03.2024 20:19, Jason Andryuk wrote:
>>>> On 2024-03-14 09:31, Jan Beulich wrote:
>>>>> On 13.03.2024 20:30, Jason Andryuk wrote:
>>>>>> --- a/xen/arch/x86/hvm/dom0_build.c
>>>>>> +++ b/xen/arch/x86/hvm/dom0_build.c
>>>>>> @@ -537,6 +537,108 @@ static paddr_t __init find_memory(
>>>>>>         return INVALID_PADDR;
>>>>>>     }
>>>>>>     
>>>>>> +static bool __init check_load_address(
>>>>>> +    const struct domain *d, const struct elf_binary *elf)
>>>>>> +{
>>>>>> +    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
>>>>>> +    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
>>>>>
>>>>> Both casts act on a pointer value. Such cannot legitimately be converted
>>>>> to paddr_t; it only so happens that paddr_t is effectively the same as
>>>>> uintptr_t right now. (Same issue again further down.) That said, I notice
>>>>> we have pre-existing examples of this ...
>>>>
>>>> Yes, I followed existing code.  Do you want dest_base to be switched to
>>>> a uintptr_t?
>>>
>>> I think it was deliberately switched to being a pointer at some point,
>>> maybe even in a security fix.
>>
>> commit 65808a8ed41cc7c044f588bd6cab5af0fdc0e029 "libelf: check all
>> pointer accesses", part of XSA-55, switched from a single dest pointer
>> to dest_base & dest_size.
>>
>> For libxenguest, it's a pointer to guest-mapped memory to copy in the
>> kernel.  For PV dom0 creation, it's a pointer - Xen switches to the dom0
>> page tables and performs the copy with it as-is.  For PVH dom0, it's a
>> guest physical address.
>>
>> Are you looking for this to be addressed in this series?
> 
> I'm sorry to ask, but what is "this" here? I'd like your change to leave
> all types alone. I'd further like your change to preferably avoid cloning
> questionable code (i.e. casts using the wrong type in particular). Plus,
> where technically possible, adhere to the general principle of avoiding
> casts (for typically being at least somewhat risky towards hiding
> potential bugs).

I intended "this" to refer to changing dest_base's type.  I won't do 
that.  With some of your other suggestions, most castings are gone.

Thanks,
Jason
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/dom0_build.c b/xen/arch/x86/hvm/dom0_build.c
index 0ceda4140b..5c6c0d2db3 100644
--- a/xen/arch/x86/hvm/dom0_build.c
+++ b/xen/arch/x86/hvm/dom0_build.c
@@ -537,6 +537,108 @@  static paddr_t __init find_memory(
     return INVALID_PADDR;
 }
 
+static bool __init check_load_address(
+    const struct domain *d, const struct elf_binary *elf)
+{
+    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
+    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
+    unsigned int i;
+
+    /*
+     * The memory map is sorted and all RAM region starts and sizes are
+     * aligned to page boundaries.
+     */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        paddr_t start = d->arch.e820[i].addr;
+        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
+
+        if ( start >= kernel_end )
+            return false;
+
+        if ( start <= kernel_start &&
+             end >= kernel_end &&
+             d->arch.e820[i].type == E820_RAM )
+            return true;
+    }
+
+    return false;
+}
+
+/* Find an e820 RAM region that fits the kernel at a suitable alignment. */
+static paddr_t __init find_kernel_memory(
+    const struct domain *d, struct elf_binary *elf,
+    const struct elf_dom_parms *parms)
+{
+    paddr_t kernel_start = (paddr_t)elf->dest_base & PAGE_MASK;
+    paddr_t kernel_end = PAGE_ALIGN((paddr_t)elf->dest_base + elf->dest_size);
+    paddr_t kernel_size = kernel_end - kernel_start;
+    unsigned int i;
+
+    /*
+     * The memory map is sorted and all RAM region starts and sizes are
+     * aligned to page boundaries.
+     */
+    for ( i = 0; i < d->arch.nr_e820; i++ )
+    {
+        paddr_t start = d->arch.e820[i].addr;
+        paddr_t end = d->arch.e820[i].addr + d->arch.e820[i].size;
+        paddr_t kstart, kend;
+
+        if ( d->arch.e820[i].type != E820_RAM )
+            continue;
+
+        if ( d->arch.e820[i].size < kernel_size )
+            continue;
+
+        kstart = ROUNDUP(start, parms->phys_align);
+        kstart = MAX(kstart, parms->phys_min);
+        kend = kstart + kernel_size;
+
+        if ( kend > parms->phys_max )
+            return 0;
+
+        if ( kend <= end )
+            return kstart;
+    }
+
+    return 0;
+}
+
+/* Check the kernel load address, and adjust if necessary and possible. */
+static bool __init check_and_adjust_load_address(
+    const struct domain *d, struct elf_binary *elf, struct elf_dom_parms *parms)
+{
+    paddr_t reloc_base;
+
+    if ( check_load_address(d, elf) )
+        return true;
+
+    if ( parms->phys_align == UNSET_ADDR )
+    {
+        printk("Address conflict and %pd kernel is not relocatable\n", d);
+        return false;
+    }
+
+    reloc_base = find_kernel_memory(d, elf, parms);
+    if ( reloc_base == 0 )
+    {
+        printk("Failed find a load address for the kernel\n");
+        return false;
+    }
+
+    if ( opt_dom0_verbose )
+        printk("Relocating kernel from [%lx, %lx] -> [%lx, %lx]\n",
+               (paddr_t)elf->dest_base,
+               (paddr_t)elf->dest_base + elf->dest_size,
+               reloc_base, reloc_base + elf->dest_size);
+
+    parms->phys_entry += (reloc_base - (paddr_t)elf->dest_base);
+    elf->dest_base = (char *)reloc_base;
+
+    return true;
+}
+
 static int __init pvh_load_kernel(struct domain *d, const module_t *image,
                                   unsigned long image_headroom,
                                   module_t *initrd, void *image_base,
@@ -585,6 +687,12 @@  static int __init pvh_load_kernel(struct domain *d, const module_t *image,
     elf.dest_base = (void *)(parms.virt_kstart - parms.virt_base);
     elf.dest_size = parms.virt_kend - parms.virt_kstart;
 
+    if ( !check_and_adjust_load_address(d, &elf, &parms) )
+    {
+        printk("Unable to load kernel\n");
+        return -ENOMEM;
+    }
+
     elf_set_vcpu(&elf, v);
     rc = elf_load_binary(&elf);
     if ( rc < 0 )
diff --git a/xen/common/libelf/libelf-dominfo.c b/xen/common/libelf/libelf-dominfo.c
index 7cc7b18a51..837a1b0f21 100644
--- a/xen/common/libelf/libelf-dominfo.c
+++ b/xen/common/libelf/libelf-dominfo.c
@@ -125,6 +125,7 @@  elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
         [XEN_ELFNOTE_SUSPEND_CANCEL] = { "SUSPEND_CANCEL", ELFNOTE_INT },
         [XEN_ELFNOTE_MOD_START_PFN] = { "MOD_START_PFN", ELFNOTE_INT },
         [XEN_ELFNOTE_PHYS32_ENTRY] = { "PHYS32_ENTRY", ELFNOTE_INT },
+        [XEN_ELFNOTE_PVH_RELOCATION] = { "PVH_RELOCATION", ELFNOTE_OTHER },
     };
 /* *INDENT-ON* */
 
@@ -234,6 +235,17 @@  elf_errorstatus elf_xen_parse_note(struct elf_binary *elf,
                 elf_note_numeric_array(elf, note, 8, 0),
                 elf_note_numeric_array(elf, note, 8, 1));
         break;
+
+    case XEN_ELFNOTE_PVH_RELOCATION:
+        if ( elf_uval(elf, note, descsz) != 3 * sizeof(uint64_t) )
+            return -1;
+
+        parms->phys_min = elf_note_numeric_array(elf, note, 8, 0);
+        parms->phys_max = elf_note_numeric_array(elf, note, 8, 1);
+        parms->phys_align = elf_note_numeric_array(elf, note, 8, 2);
+        elf_msg(elf, "min: %#"PRIx64" max: %#"PRIx64" align: %#"PRIx64"\n",
+                parms->phys_min, parms->phys_max, parms->phys_align);
+        break;
     }
     return 0;
 }
@@ -545,6 +557,7 @@  elf_errorstatus elf_xen_parse(struct elf_binary *elf,
     parms->p2m_base = UNSET_ADDR;
     parms->elf_paddr_offset = UNSET_ADDR;
     parms->phys_entry = UNSET_ADDR32;
+    parms->phys_align = UNSET_ADDR;
 
     /* Find and parse elf notes. */
     count = elf_phdr_count(elf);
diff --git a/xen/include/public/elfnote.h b/xen/include/public/elfnote.h
index 8bf54d035b..70bb93499c 100644
--- a/xen/include/public/elfnote.h
+++ b/xen/include/public/elfnote.h
@@ -194,6 +194,17 @@ 
  */
 #define XEN_ELFNOTE_PHYS32_ENTRY 18
 
+/*
+ * Physical loading constraints for PVH kernels
+ *
+ * Used to place constraints on the guest physical loading addresses and
+ * alignment for a PVH kernel.  This note's value is 3 64bit values in
+ * the following order: minimum, maximum and alignment.
+ *
+ * The presence of this note indicates the kernel is relocatable.
+ */
+#define XEN_ELFNOTE_PVH_RELOCATION 19
+
 /*
  * The number of the highest elfnote defined.
  */
diff --git a/xen/include/xen/libelf.h b/xen/include/xen/libelf.h
index 1c77e3df31..d55a839411 100644
--- a/xen/include/xen/libelf.h
+++ b/xen/include/xen/libelf.h
@@ -431,6 +431,9 @@  struct elf_dom_parms {
     uint64_t virt_hv_start_low;
     uint64_t p2m_base;
     uint64_t elf_paddr_offset;
+    uint64_t phys_min;
+    uint64_t phys_max;
+    uint64_t phys_align;
     uint32_t f_supported[XENFEAT_NR_SUBMAPS];
     uint32_t f_required[XENFEAT_NR_SUBMAPS];
     uint32_t phys_entry;