
xen/arm: skip first page when RAM starts at 0x0

Message ID 1556214715-22030-1-git-send-email-sstabellini@kernel.org (mailing list archive)
State New, archived
Series xen/arm: skip first page when RAM starts at 0x0

Commit Message

Stefano Stabellini April 25, 2019, 5:51 p.m. UTC
Xen assumes that RAM starts at addresses greater than 0x0 in a few
places. The PDX code is one of them but there are more.

For now, skip the first page in memory when the start address is 0x0.

Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
---
 xen/arch/arm/bootfdt.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Julien Grall April 25, 2019, 9:27 p.m. UTC | #1
Hi Stefano,

On 25/04/2019 18:51, Stefano Stabellini wrote:
> Xen assumes that RAM starts at addresses greater than 0x0 in a few
> places. The PDX code is one of them but there are more.

A bit more context in the commit message would have been useful. Imagine 
if we have to look at the commit message in 2 years' time.

> 
> For now, skip the first page in memory when the start address is 0x0.
> 
> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
> ---
>   xen/arch/arm/bootfdt.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c
> index 891b4b6..2ae4238 100644
> --- a/xen/arch/arm/bootfdt.c
> +++ b/xen/arch/arm/bootfdt.c
> @@ -157,6 +157,13 @@ static void __init process_memory_node(const void *fdt, int node,
>           device_tree_get_reg(&cell, address_cells, size_cells, &start, &size);
>           if ( !size )
>               continue;
> +
> +        /*
> +         * Xen cannot deal with memory starting at 0x0. Burn the first
> +         * page.
> +         */
> +        if ( start == 0 )
> +            start += PAGE_SIZE;

I am afraid this is not enough. This does not cover the case where Xen 
is booting using UEFI (either DT or ACPI). The banks will be created 
from the UEFI memory map (see efi_process_memory_map_bootinfo()).

I also want to avoid mixing populating bootinfo.mem.bank and trying to 
work around the Xen allocator/PDX.

As a first step, we need to gather feedback from contributors who know a bit 
more of the code that may be affected. AFAICT, there are only two pieces 
where we hand over memory to common code:
     - PDX: The problem is that passing 0 to pdx_init_mask() will result in a 
~0 mask, defeating the compression later on.
     - Buddy allocator: Jan has been suggesting to stick a check in 
init_xenheap_pages(). This would go along with the other ugliness that exists to deal 
with the buddy allocator.

If we fix those two places, then I think we cover all the issues in 
common code and make it easier to port Xen to a new arch.
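
For reference, the mask helpers in xen/common/pdx.c are tiny; a sketch of 
the contemporary code (abridged, exact types/annotations may differ):

    static u64 __init fill_mask(u64 mask)
    {
        /* Set every bit below the most significant set bit: 0x10400 -> 0x1ffff. */
        while ( mask & (mask + 1) )
            mask |= mask + 1;
        return mask;
    }

    u64 __init pdx_init_mask(u64 base_addr)
    {
        /* A base_addr of 0 wraps to ~0 here, marking every bit "keep". */
        return fill_mask(base_addr - 1);
    }

    u64 __init pdx_region_mask(u64 base, u64 len)
    {
        /* Bits that vary between the first and the last address of the region. */
        return fill_mask(base ^ (base + len - 1));
    }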

Cheers,
Jan Beulich April 26, 2019, 9:12 a.m. UTC | #2
>>> On 25.04.19 at 23:27, <Julien.Grall@arm.com> wrote:
> On 25/04/2019 18:51, Stefano Stabellini wrote:
>> Xen assumes that RAM starts at addresses greater than 0x0 in a few
>> places. The PDX code is one of them but there are more.
> 
> A bit more context in the commit message would have been useful. Imagine 
> if we have to look at the commit message in 2 years' time.
> 
>> 
>> For now, skip the first page in memory when the start address is 0x0.
>> 
>> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>

Isn't this in response to a bug report / completely different
code change from someone at DornerWorks? In which case
shouldn't there be a Reported-by?

> As a first step, we need to gather feedback from contributors who know a bit 
> more of the code that may be affected. AFAICT, there are only two pieces 
> where we hand over memory to common code:
>      - PDX: The problem is that passing 0 to pdx_init_mask() will result in a 
> ~0 mask, defeating the compression later on.

On x86 the function gets called only for memory blocks above 4Gb.
Question is whether on Arm you also have some ad hoc boundary
below which there's no point to look for bits to compact. If not I
wonder why you call the function at all; at the very least (as you
seem to imply) it shouldn't be called when
bootinfo.mem.bank[0].start is zero.

>      - Buddy allocator: Jan has been suggesting to stick a check in 
> init_xenheap_pages(). This would go along with the other ugliness that exists to deal 
> with the buddy allocator.

And this would then also take care of future architectures Xen may
get ported to. (No idea what other ugliness you refer to.)

Jan
Julien Grall April 26, 2019, 9:19 a.m. UTC | #3
Hi,

On 26/04/2019 10:12, Jan Beulich wrote:
>>>> On 25.04.19 at 23:27, <Julien.Grall@arm.com> wrote:
>> On 25/04/2019 18:51, Stefano Stabellini wrote:
>>> Xen assumes that RAM starts at addresses greater than 0x0 in a few
>>> places. The PDX code is one of them but there are more.
>>
>> A bit more context in the commit message would have been useful. Imagine
>> if we have to look at the commit message in 2 years' time.
>>
>>>
>>> For now, skip the first page in memory when the start address is 0x0.
>>>
>>> Signed-off-by: Stefano Stabellini <stefanos@xilinx.com>
> 
> Isn't this in response to a bug report / completely different
> code change from someone at DornerWorks? In which case
> shouldn't there be a Reported-by?

There are two bugs: one in the buddy allocator (reported by DornerWorks) 
and one in the PDX.

> 
>> As a first step, we need to gather feedback from contributors who know a bit
>> more of the code that may be affected. AFAICT, there are only two pieces
>> where we hand over memory to common code:
>>       - PDX: The problem is that passing 0 to pdx_init_mask() will result in a
>> ~0 mask, defeating the compression later on.
> 
> On x86 the function gets called only for memory blocks above 4Gb.
> Question is whether on Arm you also have some ad hoc boundary
> below which there's no point to look for bits to compact. If not I
> wonder why you call the function at all; at the very least (as you
> seem to imply) it shouldn't be called when
> bootinfo.mem.bank[0].start is zero.

So how does the PDX work for memory below 4GB? Do you still call 
pfn_to_pdx for those pages?

> 
>>       - Buddy allocator: Jan has been suggesting to stick a check in
>> init_xenheap_pages(). This would go along with the other ugliness that exists to deal 
>> with the buddy allocator.
> 
> And this would then also take care of future architectures Xen may
> get ported to. (No idea what other ugliness you refer to.)

See:

/*
  * Yuk! Ensure there is a one-page buffer between Xen and Dom zones, to
  * prevent merging of power-of-two blocks across the zone boundary.
  */

And the 0 is yet another hack for the buddy allocator.
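
For context, that comment lives in the CONFIG_SEPARATE_XENHEAP variant of 
init_xenheap_pages() in xen/common/page_alloc.c; a rough sketch (abridged, 
details may differ from the tree of the day):

    void init_xenheap_pages(paddr_t ps, paddr_t pe)
    {
        ps = round_pgup(ps);
        pe = round_pgdown(pe);
        if ( pe <= ps )
            return;

        /*
         * Yuk! Ensure there is a one-page buffer between Xen and Dom zones, to
         * prevent merging of power-of-two blocks across the zone boundary.
         */
        if ( !is_xen_heap_mfn(paddr_to_pfn(ps) - 1) )
            ps += PAGE_SIZE;
        if ( !is_xen_heap_mfn(paddr_to_pfn(pe)) )
            pe -= PAGE_SIZE;

        init_heap_pages(maddr_to_page(ps), (pe - ps) >> PAGE_SHIFT);
    }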

Cheers,
Jan Beulich April 26, 2019, 9:42 a.m. UTC | #4
>>> On 26.04.19 at 11:19, <Julien.Grall@arm.com> wrote:
> On 26/04/2019 10:12, Jan Beulich wrote:
>>>>> On 25.04.19 at 23:27, <Julien.Grall@arm.com> wrote:
>>> On 25/04/2019 18:51, Stefano Stabellini wrote:
>>> As a first step, we need to gather feedback from contributors who know a bit
>>> more of the code that may be affected. AFAICT, there are only two pieces
>>> where we hand over memory to common code:
>>>       - PDX: The problem is that passing 0 to pdx_init_mask() will result in a
>>> ~0 mask, defeating the compression later on.
>> 
>> On x86 the function gets called only for memory blocks above 4Gb.
>> Question is whether on Arm you also have some ad hoc boundary
>> below which there's no point to look for bits to compact. If not I
>> wonder why you call the function at all; at the very least (as you
>> seem to imply) it shouldn't be called when
>> bootinfo.mem.bank[0].start is zero.
> 
> So how does the PDX work for memory below 4GB? Do you still call 
> pfn_to_pdx for those pages?

Of course. We merely never compress any of the low 32 address
bits, i.e. our choice is limited to address bits 32 ... 51.

>>>       - Buddy allocator: Jan has been suggesting to stick a check in
>>> init_xenheap_pages(). This would go along with the other ugliness that exists to deal 
>>> with the buddy allocator.
>> 
>> And this would then also take care of future architectures Xen may
>> get ported to. (No idea what other ugliness you refer to.)
> 
> See:
> 
> /*
>   * Yuk! Ensure there is a one-page buffer between Xen and Dom zones, to
>   * prevent merging of power-of-two blocks across the zone boundary.
>   */
> 
> And the 0 is yet another hack for the buddy allocator.

Ah, this one. Is this actually an issue on Arm32? ix86 needed
to deal with the situation because its direct map range was
12Mb in size, i.e. not a power of two. If it's not an issue (I
can't really figure it out considering there are no DIRECTMAP_*
constants for Arm32 at all, only XENHEAP_* ones), I'd suggest
putting this ugliness in an #ifdef using a CONFIG_* option not
selected by any presently supported arch.

Jan
Julien Grall April 26, 2019, 3:38 p.m. UTC | #5
Hi Jan,

On 26/04/2019 10:42, Jan Beulich wrote:
>>>> On 26.04.19 at 11:19, <Julien.Grall@arm.com> wrote:
>> On 26/04/2019 10:12, Jan Beulich wrote:
>>>>>> On 25.04.19 at 23:27, <Julien.Grall@arm.com> wrote:
>>>> On 25/04/2019 18:51, Stefano Stabellini wrote:
>>>> As a first step, we need to gather feedback from contributors who know a bit
>>>> more of the code that may be affected. AFAICT, there are only two pieces
>>>> where we hand over memory to common code:
>>>>        - PDX: The problem is that passing 0 to pdx_init_mask() will result in a
>>>> ~0 mask, defeating the compression later on.
>>>
>>> On x86 the function gets called only for memory blocks above 4Gb.
>>> Question is whether on Arm you also have some ad hoc boundary
>>> below which there's no point to look for bits to compact. If not I
>>> wonder why you call the function at all; at the very least (as you
>>> seem to imply) it shouldn't be called when
>>> bootinfo.mem.bank[0].start is zero.
>>
>> So how does the PDX work for memory below 4GB? Do you still call
>> pfn_to_pdx for those pages?
> 
> Of course. We merely never compress any of the low 32 address
> bits, i.e. our choice is limited to address bits 32 ... 51.

Ah, the code makes a bit more sense on the x86 side. Is there any reason to 
limit to address bits 32 ... 51?

For Arm, we can't compress bits 0 ... 30 due to the buddy allocator (see 
pfn_pdx_hole_setup). So we could ignore the first 1GB of RAM.

> 
>>>>        - Buddy allocator: Jan has been suggesting to stick a check in
>>>> init_xenheap_pages(). This would go along with the other ugliness that exists to deal 
>>>> with the buddy allocator.
>>>
>>> And this would then also take care of future architectures Xen may
>>> get ported to. (No idea what other ugliness you refer to.)
>>
>> See:
>>
>> /*
>>    * Yuk! Ensure there is a one-page buffer between Xen and Dom zones, to
>>    * prevent merging of power-of-two blocks across the zone boundary.
>>    */
>>
>> And the 0 is yet another hack for the buddy allocator.
> 
> Ah, this one. Is this actually an issue on Arm32? ix86 needed
> to deal with the situation because its direct map range was
> 12Mb in size, i.e. not a power of two. If it's not an issue (I
> can't really figure it out considering there are no DIRECTMAP_*
> constants for Arm32 at all, only XENHEAP_* ones), I'd suggest
> putting this ugliness in an #ifdef using a CONFIG_* option not
> selected by any presently supported arch.

On Arm32, the size of the xenheap depends on the platform. It is calculated by 
setup_mm in arch/arm/setup.c. From my understanding it may not be a power of 2, 
so we would be affected by the problem you mention here.

May I ask why it is currently not an issue on x86? Do you always pass a power 
of 2 to init_xenheap_pages?

Cheers,
Jan Beulich April 26, 2019, 3:48 p.m. UTC | #6
>>> On 26.04.19 at 17:38, <julien.grall@arm.com> wrote:
> On 26/04/2019 10:42, Jan Beulich wrote:
>>>>> On 26.04.19 at 11:19, <Julien.Grall@arm.com> wrote:
>>> So how does the PDX work for memory below 4GB? Do you still call
>>> pfn_to_pdx for those pages?
>> 
>> Of course. We merely never compress any of the low 32 address
>> bits, i.e. our choice is limited to address bits 32 ... 51.
> 
> Ah, the code makes a bit more sense on the x86 side. Is there any reason to 
> limit to address bits 32 ... 51?

Well, there being various special things immediately below 4Gb
there's simply little point in trying to squash any of the lower bits.

> For Arm, we can't compress bits 0 ... 30 due to the buddy allocator (see 
> pfn_pdx_hole_setup). So we could ignore the first 1GB of RAM.

Bits 0 ... 30 would be the first 2Gb afaict.

>>> See:
>>>
>>> /*
>>>    * Yuk! Ensure there is a one-page buffer between Xen and Dom zones, to
>>>    * prevent merging of power-of-two blocks across the zone boundary.
>>>    */
>>>
>>> And the 0 is yet another hack for the buddy allocator.
>> 
>> Ah, this one. Is this actually an issue on Arm32? ix86 needed
>> to deal with the situation because its direct map range was
>> 12Mb in size, i.e. not a power of two. If it's not an issue (I
>> can't really figure it out considering there are no DIRECTMAP_*
>> constants for Arm32 at all, only XENHEAP_* ones), I'd suggest
>> putting this ugliness in an #ifdef using a CONFIG_* option not
>> selected by any presently supported arch.
> 
> On Arm32, the size of the xenheap depends on the platform. It is calculated by 
> setup_mm in arch/arm/setup.c. From my understanding it may not be a power of 2, 
> so we would be affected by the problem you mention here.
> 
> May I ask why it is currently not an issue on x86? Do you always pass a power 
> of 2 to init_xenheap_pages?

No, it's because CONFIG_SEPARATE_XENHEAP is undefined on
(64-bit) x86, i.e. there simply is no boundary between both heaps.

Jan
Julien Grall April 26, 2019, 9:31 p.m. UTC | #7
Hi,

On 4/26/19 4:48 PM, Jan Beulich wrote:
>>>> On 26.04.19 at 17:38, <julien.grall@arm.com> wrote:
>> On 26/04/2019 10:42, Jan Beulich wrote:
>>>>>> On 26.04.19 at 11:19, <Julien.Grall@arm.com> wrote:
>>>> So how does the PDX work for memory below 4GB? Do you still call
>>>> pfn_to_pdx for those pages?
>>>
>>> Of course. We merely never compress any of the low 32 address
>>> bits, i.e. our choice is limited to address bits 32 ... 51.
>>
>> Ah, the code makes a bit more sense on the x86 side. Is there any reason to
>> limit to address bits 32 ... 51?
> 
> Well, there being various special things immediately below 4Gb
> there's simply little point in trying to squash any of the lower bits.
> 
>> For Arm, we can't compress bits 0 ... 30 due to the buddy allocator (see
>> pfn_pdx_hole_setup). So we could ignore the first 1GB of RAM.
> 
> Bits 0 ... 30 would be the first 2Gb afaict.

Sorry, I messed up my maths. I meant 0 ... 29 (MAX_ORDER = 18).

@Stefano, do you think you can have a look at it?

> 
>>>> See:
>>>>
>>>> /*
>>>>     * Yuk! Ensure there is a one-page buffer between Xen and Dom zones, to
>>>>     * prevent merging of power-of-two blocks across the zone boundary.
>>>>     */
>>>>
>>>> And the 0 is yet another hack for the buddy allocator.
>>>
>>> Ah, this one. Is this actually an issue on Arm32? ix86 needed
>>> to deal with the situation because its direct map range was
>>> 12Mb in size, i.e. not a power of two. If it's not an issue (I
>>> can't really figure it out considering there are no DIRECTMAP_*
>>> constants for Arm32 at all, only XENHEAP_* ones), I'd suggest
>>> putting this ugliness in an #ifdef using a CONFIG_* option not
>>> selected by any presently supported arch.
>>
>> On Arm32, the size of the xenheap depends on the platform. It is calculated by
>> setup_mm in arch/arm/setup.c. From my understanding it may not be a power of 2,
>> so we would be affected by the problem you mention here.
>>
>> May I ask why it is currently not an issue on x86? Do you always pass a power
>> of 2 to init_xenheap_pages?
> 
> No, it's because CONFIG_SEPARATE_XENHEAP is undefined on
> (64-bit) x86, i.e. there simply is no boundary between both heaps.

Oh, somehow I thought it was separated. Ok, so I would keep the code as 
is in init_xenheap_pages then.

But it occurred to me that we may want to put the check in 
init_heap_pages() rather than init_xenheap_pages() as we also want to 
cover domheap.
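
Something along these lines, perhaps (a hypothetical sketch of the idea, not 
an actual patch):

    static void init_heap_pages(struct page_info *pg, unsigned long nr_pages)
    {
        /*
         * Keep the page at MFN 0 away from the allocator, so that neither
         * "mfn - 1" style computations nor power-of-two merging can wrap.
         * Doing it here covers xenheap and domheap alike.
         */
        if ( !mfn_x(page_to_mfn(pg)) )
        {
            if ( nr_pages-- <= 1 )
                return;
            pg++;
        }

        /* ... existing body ... */
    }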

Cheers,
Stefano Stabellini April 26, 2019, 11:47 p.m. UTC | #8
On Fri, 26 Apr 2019, Julien Grall wrote:
> On 4/26/19 4:48 PM, Jan Beulich wrote:
> > > > > On 26.04.19 at 17:38, <julien.grall@arm.com> wrote:
> > > On 26/04/2019 10:42, Jan Beulich wrote:
> > > > > > > On 26.04.19 at 11:19, <Julien.Grall@arm.com> wrote:
> > > > > So how does the PDX work for memory below 4GB? Do you still call
> > > > > pfn_to_pdx for those pages?
> > > > 
> > > > Of course. We merely never compress any of the low 32 address
> > > > bits, i.e. our choice is limited to address bits 32 ... 51.
> > > 
> > > Ah, the code makes a bit more sense on the x86 side. Is there any reason
> > > to
> > > limit to address bits 32 ... 51?
> > 
> > Well, there being various special things immediately below 4Gb
> > there's simply little point in trying to squash any of the lower bits.
> > 
> > > For Arm, we can't compress bits 0 ... 30 due to the buddy allocator (see
> > > pfn_pdx_hole_setup). So we could ignore the first 1GB of RAM.
> > 
> > Bits 0 ... 30 would be the first 2Gb afaict.
> 
> Sorry, I messed up my maths. I meant 0 ... 29 (MAX_ORDER = 18).
> 
> @Stefano, do you think you can have a look at it?

After an extensive online debugging session with Julien, we found a fix
to the PDX code that makes the masks and nr_pdxs calculation correct.
See the appended patch below.

The mask is initialized to 0 to let the algorithm compute the right mask
based on the RAM banks. pdx_init_mask doesn't cope with a start address
of 0x0 because it does `base_addr - 1' which causes a wrap around when
base_addr is 0x0. That works OK as far as I can tell and the resulting
values of `pfn_pdx_bottom_mask', `pfn_top_mask', and
`pfn_pdx_hole_shift' seem correct (they match the case where I force
ram_start = PAGE_SIZE).
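
(To spell out the wrap-around, assuming u64 arithmetic and the usual 
fill_mask()-based implementation:

    pdx_init_mask(0) = fill_mask(0 - 1)
                     = fill_mask(0xffffffffffffffff)
                     = 0xffffffffffffffff

i.e. every bit is marked as non-compressible, so no compression can ever 
happen when bank 0 starts at 0x0.)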

The other change to nr_pdxs is less obvious. It is clear that nr_pdxs is
calculated wrongly in this case (memory 0-0x80000000,
0x800000000-0x880000000, ps=0, pe=0x880000000): nr_pdxs=524288 which is
half the number we need (the correct number is 1048575).

Taking a line from the x86 code xen/arch/x86/setup.c:setup_max_pdx
(Julien's suggestion):

  max_pdx = pfn_to_pdx(top_page - 1) + 1;

I changed nr_pdxs to
  
  nr_pdxs = pfn_to_pdx((pe >> PAGE_SHIFT) - 1) + 1;

and it works. I think the change is correct because looking at the
implementation of pfn_to_pdx it is certainly meant to operate on a pfn
masking bits on it to compensate for the holes. It is not meant to work
on a size.
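
To make the numbers concrete (an editorial reconstruction, assuming the 
usual fill_mask()-based helpers and the mask initialized to 0 as above): 
with the two banks listed, the PA mask comes out as bit 35 | bits 0..30, 
i.e. a PFN mask of bit 23 | bits 0..18, so PFN bits 19..22 are always zero 
and get compressed (hole_shift = 4):

    old: nr_pdxs = pfn_to_pdx(0x880000)     = 0x80000     = 524288
    new: nr_pdxs = pfn_to_pdx(0x87ffff) + 1 = 0xfffff + 1 = 1048576

with 0xfffff = 1048575 being the largest pdx, matching the factor-of-two 
discrepancy described above.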

Jan, does it look correct to you too?


diff --git a/xen/arch/arm/mm.c b/xen/arch/arm/mm.c
index 01ae2cc..f711eef 100644
--- a/xen/arch/arm/mm.c
+++ b/xen/arch/arm/mm.c
@@ -874,8 +874,7 @@ void __init setup_xenheap_mappings(unsigned long base_mfn,
 /* Map a frame table to cover physical addresses ps through pe */
 void __init setup_frametable_mappings(paddr_t ps, paddr_t pe)
 {
-    unsigned long nr_pages = (pe - ps) >> PAGE_SHIFT;
-    unsigned long nr_pdxs = pfn_to_pdx(nr_pages);
+    unsigned long nr_pdxs = pfn_to_pdx((pe >> PAGE_SHIFT) - 1) + 1;
     unsigned long frametable_size = nr_pdxs * sizeof(struct page_info);
     mfn_t base_mfn;
     const unsigned long mapping_size = frametable_size < MB(32) ? MB(2) : MB(32);
diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c
index ccb0f18..9e7da41 100644
--- a/xen/arch/arm/setup.c
+++ b/xen/arch/arm/setup.c
@@ -482,7 +482,7 @@ static void __init init_pdx(void)
 {
     paddr_t bank_start, bank_size, bank_end;
 
-    u64 mask = pdx_init_mask(bootinfo.mem.bank[0].start);
+    u64 mask = 0;
     int bank;
 
     for ( bank = 0 ; bank < bootinfo.mem.nr_banks; bank++ )
Julien Grall April 27, 2019, 7:43 p.m. UTC | #9
Hi,

On 27/04/2019 00:47, Stefano Stabellini wrote:
> On Fri, 26 Apr 2019, Julien Grall wrote:
>> On 4/26/19 4:48 PM, Jan Beulich wrote:
>>>>>> On 26.04.19 at 17:38, <julien.grall@arm.com> wrote:
>>>> On 26/04/2019 10:42, Jan Beulich wrote:
>>>>>>>> On 26.04.19 at 11:19, <Julien.Grall@arm.com> wrote:
>>>>>> So how does the PDX work for memory below 4GB? Do you still call
>>>>>> pfn_to_pdx for those pages?
>>>>>
>>>>> Of course. We merely never compress any of the low 32 address
>>>>> bits, i.e. our choice is limited to address bits 32 ... 51.
>>>>
>>>> Ah, the code makes a bit more sense on the x86 side. Is there any reason
>>>> to
>>>> limit to address bits 32 ... 51?
>>>
>>> Well, there being various special things immediately below 4Gb
>>> there's simply little point in trying to squash any of the lower bits.
>>>
>>>> For Arm, we can't compress bits 0 ... 30 due to the buddy allocator (see
>>>> pfn_pdx_hole_setup). So we could ignore the first 1GB of RAM.
>>>
>>> Bits 0 ... 30 would be the first 2Gb afaict.
>>
>> Sorry, I messed up my maths. I meant 0 ... 29 (MAX_ORDER = 18).
>>
>> @Stefano, do you think you can have a look at it?
> 
> After an extensive online debugging session with Julien, we found a fix
> to the PDX code that makes the masks and nr_pdxs calculation correct.
> See the appended patch below.
> 
> The mask is initialized to 0 to let the algorithm compute the right mask
> based on the RAM banks. pdx_init_mask doesn't cope with a start address
> of 0x0 because it does `base_addr - 1' which causes a wrap around when
> base_addr is 0x0. That works OK as far as I can tell and the resulting
> values of `pfn_pdx_bottom_mask', `pfn_top_mask', and
> `pfn_pdx_hole_shift' seem correct (they match the case where I force
> ram_start = PAGE_SIZE).

Well, I asked to use 0 as a mask to understand the issue in the 
algorithm a bit more. Now, we need to understand whether this is the 
right thing to do every time.

In this case, Jan has pointed out that on x86 they don't compute the PDX 
compression using any RAM below 4GB.

For Arm, there is limited reason to use any RAM below 1GB to work out 
the PDX compression, because those bits can never be compressed by the PDX 
code anyway.

So maybe a more sensible choice (and potentially requiring less brain 
work :)) is to set a boundary below which there is no point in looking 
for bits to compact.

My suggestion would be to follow x86 and use 4GB.

> The other change to nr_pdxs is less obvious. It is clear that nr_pdxs is
> calculated wrongly in this case (memory 0-0x80000000,
> 0x800000000-0x880000000, ps=0, pe=0x880000000): nr_pdxs=524288 which is
> half the number we need (the correct number is 1048575).
> 
> Taking a line from the x86 code xen/arch/x86/setup.c:setup_max_pdx
> (Julien's suggestion):
> 
>    max_pdx = pfn_to_pdx(top_page - 1) + 1;
> 
> I changed nr_pdxs to
>    
>    nr_pdxs = pfn_to_pdx((pe >> PAGE_SHIFT) - 1) + 1;
> 
> and it works. I think the change is correct because looking at the
> implementation of pfn_to_pdx it is certainly meant to operate on a pfn
> masking bits on it to compensate for the holes. It is not meant to work
> on a size.

While the change makes perfect sense to me, I still don't understand 
how this has ever worked on Arm. Surely if we passed a size rather than 
a PFN then the computation would be wrong. How do we get away with it 
today? Is it because the frametable will always be a multiple of 2MB (or 
32MB if bigger than 32MB)? But then it would mean that some page_info 
may not be initialized correctly.

I have seen a report regarding a memory issue on Xen recently [1] (data 
abort on a page_info). This seems to happen because we stopped relocating 
Xen towards the end of the memory. I am wondering whether the issue is 
because of the way we set up the frametable as well.

[1] 
https://lists.xenproject.org/archives/html/xen-devel/2019-03/msg01483.html

Cheers,
Jan Beulich April 29, 2019, 7:07 a.m. UTC | #10
>>> On 26.04.19 at 23:31, <julien.grall@arm.com> wrote:
> On 4/26/19 4:48 PM, Jan Beulich wrote:
>>>>> On 26.04.19 at 17:38, <julien.grall@arm.com> wrote:
>>> May I ask why it is currently not an issue on x86? Do you always pass a power
>>> of 2 to init_xenheap_pages?
>> 
>> No, it's because CONFIG_SEPARATE_XENHEAP is undefined on
>> (64-bit) x86, i.e. there simply is no boundary between both heaps.
> 
> Oh, somehow I thought it was separated. Ok, so I would keep the code as 
> is in init_xenheap_pages then.
> 
> But it occurred to me that we may want to put the check in 
> init_heap_pages() rather than init_xenheap_pages() as we also want to 
> cover domheap.

Indeed.

Jan
Jan Beulich April 29, 2019, 7:15 a.m. UTC | #11
>>> On 27.04.19 at 01:47, <sstabellini@kernel.org> wrote:
> The other change to nr_pdxs is less obvious. It is clear that nr_pdxs is
> calculated wrongly in this case (memory 0-0x80000000,
> 0x800000000-0x880000000, ps=0, pe=0x880000000): nr_pdxs=524288 which is
> half the number we need (the correct number is 1048575).
> 
> Taking a line from the x86 code xen/arch/x86/setup.c:setup_max_pdx
> (Julien's suggestion):
> 
>   max_pdx = pfn_to_pdx(top_page - 1) + 1;
> 
> I changed nr_pdxs to
>   
>   nr_pdxs = pfn_to_pdx((pe >> PAGE_SHIFT) - 1) + 1;
> 
> and it works. I think the change is correct because looking at the
> implementation of pfn_to_pdx it is certainly meant to operate on a pfn
> masking bits on it to compensate for the holes. It is not meant to work
> on a size.
> 
> Jan, does it look correct to you too?

Yes. pfn_to_pdx() and friends may only ever be passed actual
PFNs / PDXes, not something that's one above or one below a
valid range. I share Julien's question though: Was it really mere
luck that things have been working to date?

Jan
Julien Grall April 29, 2019, 3:54 p.m. UTC | #12
Hi,

On 29/04/2019 08:15, Jan Beulich wrote:
>>>> On 27.04.19 at 01:47, <sstabellini@kernel.org> wrote:
>> The other change to nr_pdxs is less obvious. It is clear that nr_pdxs is
>> calculated wrongly in this case (memory 0-0x80000000,
>> 0x800000000-0x880000000, ps=0, pe=0x880000000): nr_pdxs=524288 which is
>> half the number we need (the correct number is 1048575).
>>
>> Taking a line from the x86 code xen/arch/x86/setup.c:setup_max_pdx
>> (Julien's suggestion):
>>
>>    max_pdx = pfn_to_pdx(top_page - 1) + 1;
>>
>> I changed nr_pdxs to
>>    
>>    nr_pdxs = pfn_to_pdx((pe >> PAGE_SHIFT) - 1) + 1;
>>
>> and it works. I think the change is correct because looking at the
>> implementation of pfn_to_pdx it is certainly meant to operate on a pfn
>> masking bits on it to compensate for the holes. It is not meant to work
>> on a size.
>>
>> Jan, does it look correct to you too?
> 
> Yes. pfn_to_pdx() and friends may only ever be passed actual
> PFNs / PDXes, not something that's one above or one below a
> valid range. I share Julien's question though: Was it really mere
> luck that things have been working to date?

TL;DR: unexplained, not to say mere luck.

We define the size of the frametable using:

nr_pdxs = pfn_to_pdx((pe - ps) >> PAGE_SHIFT);
frametable_size = nr_pdxs * sizeof (struct page_info);

mfn_to_page is implemented the following way:

frame_table + (mfn_to_pdx(mfn) - frametable_base_pdx)

Where frametable_base_pdx = pfn_to_pdx(ps >> PAGE_SHIFT)

On the two platforms (Juno-r2 + Foundation Model) I tested today, somehow

pfn_to_pdx((pe - ps) >> PAGE_SHIFT) == (pfn_to_pdx((pe >> PAGE_SHIFT) - 1) + 1) - 
frametable_base_pdx

So the frametable is correctly sized, but honestly I have no idea why it works.

Anyway, I also tested the change suggested by Stefano. This will substantially 
increase the size of the frametable on platforms where the RAM does not start at 0.

For instance, on Foundation Model the RAM starts at 2GB. As we don't compress 
any of the first 31 bits, the frametable will now be 28MB bigger than we 
currently have (112MB up from 84MB).

So I think what we want is:

nr_pdxs = pfn_to_pdx(end - 1) - pfn_to_pdx(start) + 1;
frame_table_base_pdx = pfn_to_pdx(start);

On a side note, I noticed that the table is still much bigger than it should be. 
4GB of RAM should only require a frametable of 56MB, but it is 84MB. Anyway, 
that's probably a separate discussion.
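
(For the curious: the numbers line up if one assumes sizeof(struct 
page_info) == 56 on arm64 and the usual Foundation Model layout of 2GB at 
0x80000000 plus 2GB at 0x880000000. 4GB of RAM is 0x100000 pages, and 
0x100000 * 56 = 56MB; the current size-based computation yields 0x180000 
pdxs, i.e. 84MB; covering everything up to the last pdx without subtracting 
the base yields 0x200000 pdxs, i.e. 112MB.)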

Cheers,
Jan Beulich April 29, 2019, 4:07 p.m. UTC | #13
>>> On 29.04.19 at 17:54, <julien.grall@arm.com> wrote:
> Anyway, I also tested the change suggested by Stefano. This will substantially 
> increase the size of the frametable on platforms where the RAM does not start 
> at 0.
> 
> For instance, on Foundation Model the RAM starts at 2GB. As we don't compress 
> any of the first 31 bits, the frametable will now be 28MB bigger than we 
> currently have (112MB up from 84MB).
> 
> So I think what we want is:
> 
> nr_pdxs = pfn_to_pdx(end - 1) - pfn_to_pdx(start) + 1;
> frame_table_base_pdx = pfn_to_pdx(start);

Yes, if there's only a single memory range at 2Gb, then subtracting
the base address will of course yield better results. But if there are
multiple regions, things really depend on the placement of all of
them.

Jan
Stefano Stabellini April 29, 2019, 5:51 p.m. UTC | #14
On Mon, 29 Apr 2019, Jan Beulich wrote:
> >>> On 29.04.19 at 17:54, <julien.grall@arm.com> wrote:
> > Anyway, I also tested the change suggested by Stefano. This will substantially 
> > increase the size of the frametable on platforms where the RAM does not start 
> > at 0.
> > 
> > For instance, on Foundation Model the RAM starts at 2GB. As we don't compress 
> > any of the first 31 bits, the frametable will now be 28MB bigger than we 
> > currently have (112MB up from 84MB).
> > 
> > So I think what we want is:
> > 
> > nr_pdxs = pfn_to_pdx(end - 1) - pfn_to_pdx(start) + 1;
> > frame_table_base_pdx = pfn_to_pdx(start);
> 
> Yes, if there's only a single memory range at 2Gb, then subtracting
> the base address will of course yield better results. But if there are
> multiple regions, things really depend on the placement of all of
> them.

If we do not compress any RAM addresses below 4G, then we cannot
compensate for any holes in that range. However, at least we can safely
skip the first [0-start] like Julien suggested in the calculation of
nr_pdxs.
Stefano Stabellini May 1, 2019, 10:44 p.m. UTC | #15
On Mon, 29 Apr 2019, Stefano Stabellini wrote:
> On Mon, 29 Apr 2019, Jan Beulich wrote:
> > >>> On 29.04.19 at 17:54, <julien.grall@arm.com> wrote:
> > > Anyway, I also tested the change suggested by Stefano. This will substantially 
> > > increase the size of the frametable on platforms where the RAM does not start 
> > > at 0.
> > > 
> > > For instance, on Foundation Model the RAM starts at 2GB. As we don't compress 
> > > any of the first 31 bits, the frametable will now be 28MB bigger than we 
> > > currently have (112MB up from 84MB).
> > > 
> > > So I think what we want is:
> > > 
> > > nr_pdxs = pfn_to_pdx(end - 1) - pfn_to_pdx(start) + 1;
> > > frame_table_base_pdx = pfn_to_pdx(start);
> > 
> > Yes, if there's only a single memory range at 2Gb, then subtracting
> > the base address will of course yield better results. But if there are
> > multiple regions, things really depend on the placement of all of
> > them.
> 
> If we do not compress any RAM addresses below 4G, then we cannot
> compensate for any holes in that range. However, at least we can safely
> skip the first [0-start] like Julien suggested in the calculation of
> nr_pdxs.


Hi Jan, I have a question on the PDX code.

The PDX initialization is a bit different between x86 and ARM, but it
follows roughly the same pattern, look at
xen/arch/x86/srat.c:srat_parse_regions (I take that is where things
happen on x86) and xen/arch/arm/setup.c:init_pdx.

The mask is initialized by calling pdx_init_mask on a start address, then a
loop fills in the rest of the mask by calling pdx_region_mask, based on the
memory regions present.
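
For reference, a sketch of xen/arch/arm/setup.c:init_pdx as it stands 
(abridged):

    static void __init init_pdx(void)
    {
        u64 mask = pdx_init_mask(bootinfo.mem.bank[0].start);
        int bank;

        for ( bank = 0 ; bank < bootinfo.mem.nr_banks; bank++ )
        {
            paddr_t bank_start = bootinfo.mem.bank[bank].start;
            paddr_t bank_size = bootinfo.mem.bank[bank].size;

            /* Keep the base bits plus every bit that varies inside the bank. */
            mask |= bank_start | pdx_region_mask(bank_start, bank_size);
        }

        /* ... pfn_pdx_hole_setup(mask >> PAGE_SHIFT) etc. follow ... */
    }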

I wrote a small unit test of the ARM PDX initialization and while I was
playing with addresses and values I noticed that actually if I simply
skip pdx_init_mask and just initialize the mask to 0 (mask = 0) in
init_pdx, the rest of the function always calculates the right mask.

In fact, there are cases where initializing the mask to a value causes
the rest of the code to miss some potential compressions. While
initializing the mask to 0 leads to more optimizations. I can provide
specific examples if you are curious.

Before I make any changes to that code, I would like to understand a bit
better why things are done that way today. Do you know why the mask is
initialized to pdx_init_mask(start-of-ram)?
Jan Beulich May 2, 2019, 7:30 a.m. UTC | #16
>>> On 02.05.19 at 00:44, <sstabellini@kernel.org> wrote:
> Hi Jan, I have a question on the PDX code.
> 
> The PDX initialization is a bit different between x86 and ARM, but it
> follows roughly the same pattern, look at
> xen/arch/x86/srat.c:srat_parse_regions (I take that is where things
> happen on x86) and xen/arch/arm/setup.c:init_pdx.
> 
> Mask is initialized calling pdx_init_mask on a start address, then a
> loop fills in the rest of the mask calling pdx_region_mask, based on the
> memory regions present.
> 
> I wrote a small unit test of the ARM PDX initialization and while I was
> playing with addresses and values I noticed that actually if I simply
> skip pdx_init_mask and just initialize the mask to 0 (mask = 0) in
> init_pdx, the rest of the function always calculates the right mask.
> 
> In fact, there are cases where initializing the mask to a value causes
> the rest of the code to miss some potential compressions. While
> initializing the mask to 0 leads to more optimizations. I can provide
> specific examples if you are curious.
> 
> Before I make any changes to that code, I would like to understand a bit
> better why things are done that way today. Do you know why the mask is
> initialized to pdx_init_mask(start-of-ram)?

I'm confused, and hence I'm perhaps misunderstanding your
question. To me it looks like you're re-asking a question already
answered. On x86 we don't want to squash out any of the low
32 bits, because of the special address ranges that live below
4Gb. Hence we invoke pdx_init_mask(first-block-at-or-above-4Gb).
Note it's not start-of-ram, as you've said.

Jan
Julien Grall May 2, 2019, 9:02 a.m. UTC | #17
Hi Jan,

On 5/2/19 8:30 AM, Jan Beulich wrote:
>>>> On 02.05.19 at 00:44, <sstabellini@kernel.org> wrote:
>> Hi Jan, I have a question on the PDX code.
>>
>> The PDX initialization is a bit different between x86 and ARM, but it
>> follows roughly the same pattern, look at
>> xen/arch/x86/srat.c:srat_parse_regions (I take that is where things
>> happen on x86) and xen/arch/arm/setup.c:init_pdx.
>>
> The mask is initialized by calling pdx_init_mask on a start address, then a
> loop fills in the rest of the mask by calling pdx_region_mask, based on the
>> memory regions present.
>>
>> I wrote a small unit test of the ARM PDX initialization and while I was
>> playing with addresses and values I noticed that actually if I simply
>> skip pdx_init_mask and just initialize the mask to 0 (mask = 0) in
>> init_pdx, the rest of the function always calculates the right mask.
>>
>> In fact, there are cases where initializing the mask to a value causes
>> the rest of the code to miss some potential compressions. While
>> initializing the mask to 0 leads to more optimizations. I can provide
>> specific examples if you are curious.
>>
>> Before I make any changes to that code, I would like to understand a bit
>> better why things are done that way today. Do you know why the mask is
>> initialized to pdx_init_mask(start-of-ram)?

Well, it is not the start-of-ram on Arm. It is whatever is the start of 
bank 0. This is because the firmware table (such as DT) may not require 
ordering and we don't order banks in Xen.

So it may be possible the PDX will not compress if the banks are not 
ordered in the firmware tables.

> 
> I'm confused, and hence I'm perhaps misunderstanding your
> question. To me it looks like you're re-asking a question already
> answered. On x86 we don't want to squash out any of the low
> 32 bits, because of the special address ranges that live below
> 4Gb. Hence we invoke pdx_init_mask(first-block-at-or-above-4Gb).
> Note it's not start-of-ram, as you've said.

I think what Stefano is asking is why pdx_init_mask(...) is invoked with 
the first block address rather than 4GB (or even 0, though I don't think 
this is right).

By using the first block address, the PDX will not be able to compress 
any bits between 0 and the most significant '1' in the first block address. In other 
words, for a base address 0x200000000 (8GB), the initial mask will be 
0x1ffffffff.

Stefano and I were wondering whether it would instead be possible to 
create the initial mask with pdx_init_mask(4GB) or pdx_init_mask(1GB) 
(i.e. the maximum contiguous range the buddy allocator can allocate).
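
For concreteness, with the usual fill_mask()-based pdx_init_mask those would 
come out as:

    pdx_init_mask(1ULL << 32) = fill_mask(0xffffffff) = 0xffffffff
    pdx_init_mask(1ULL << 30) = fill_mask(0x3fffffff) = 0x3fffffff

i.e. address bits 0..31 (respectively 0..29) would always be kept, and only 
bits above the boundary would remain candidates for compression.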

Cheers,
Jan Beulich May 2, 2019, 9:20 a.m. UTC | #18
>>> On 02.05.19 at 11:02, <julien.grall@arm.com> wrote:
> On 5/2/19 8:30 AM, Jan Beulich wrote:
>>>>> On 02.05.19 at 00:44, <sstabellini@kernel.org> wrote:
>>> Hi Jan, I have a question on the PDX code.
>>>
>>> The PDX initialization is a bit different between x86 and ARM, but it
>>> follows roughly the same pattern, look at
>>> xen/arch/x86/srat.c:srat_parse_regions (I take that is where things
>>> happen on x86) and xen/arch/arm/setup.c:init_pdx.
>>>
>> The mask is initialized by calling pdx_init_mask on a start address, then a
>> loop fills in the rest of the mask by calling pdx_region_mask, based on the
>>> memory regions present.
>>>
>>> I wrote a small unit test of the ARM PDX initialization and while I was
>>> playing with addresses and values I noticed that actually if I simply
>>> skip pdx_init_mask and just initialize the mask to 0 (mask = 0) in
>>> init_pdx, the rest of the function always calculates the right mask.
>>>
>>> In fact, there are cases where initializing the mask to a value causes
>>> the rest of the code to miss some potential compressions. While
>>> initializing the mask to 0 leads to more optimizations. I can provide
>>> specific examples if you are curious.
>>>
>>> Before I make any changes to that code, I would like to understand a bit
>>> better why things are done that way today. Do you know why the mask is
>>> initialized to pdx_init_mask(start-of-ram)?
> 
> Well, it is not the start-of-ram on Arm. It is whatever is the start of 
> bank 0. This is because the firmware table (such as DT) may not require 
> ordering and we don't order banks in Xen.
> 
> So it may be possible the PDX will not compress if the banks are not 
> ordered in the firmware tables.

Even more so a reason not to use bank 0's start address.

>> I'm confused, and hence I'm perhaps misunderstanding your
>> question. To me it looks like you're re-asking a question already
>> answered. On x86 we don't want to squash out any of the low
>> 32 bits, because of the special address ranges that live below
>> 4Gb. Hence we invoke pdx_init_mask(first-block-at-or-above-4Gb).
>> Note it's not start-of-ram, as you've said.
> 
> I think what Stefano is asking is why pdx_init_mask(...) is invoked with 
> the first block address rather than 4GB (or even 0, though I don't think 
> this is right).
> 
> By using the first block address, the PDX will not be able to compress 
> any bits between 0 and the most significant '1' in the first block address. In other 
> words, for a base address 0x200000000 (8GB), the initial mask will be 
> 0x1ffffffff.
> 
> Stefano and I were wondering whether it would instead be possible to 
> create the initial mask with pdx_init_mask(4GB) or pdx_init_mask(1GB) 
> (i.e. the maximum contiguous range the buddy allocator can allocate).

That's indeed an option - it's just that I've yet to see an x86
system where there's a hole starting at 4Gb. What's better in that
case can probably be judged only once we run into such a case.

Jan
Stefano Stabellini May 2, 2019, 10:25 p.m. UTC | #19
On Thu, 2 May 2019, Jan Beulich wrote:
> >>> On 02.05.19 at 11:02, <julien.grall@arm.com> wrote:
> > On 5/2/19 8:30 AM, Jan Beulich wrote:
> >>>>> On 02.05.19 at 00:44, <sstabellini@kernel.org> wrote:
> >>> Hi Jan, I have a question on the PDX code.
> >>>
> >>> The PDX initialization is a bit different between x86 and ARM, but it
> >>> follows roughly the same pattern, look at
> >>> xen/arch/x86/srat.c:srat_parse_regions (I take that is where things
> >>> happen on x86) and xen/arch/arm/setup.c:init_pdx.
> >>>
> >>> The mask is initialized by calling pdx_init_mask on a start address, then a
> >>> loop fills in the rest of the mask by calling pdx_region_mask, based on the
> >>> memory regions present.
> >>>
> >>> I wrote a small unit test of the ARM PDX initialization and while I was
> >>> playing with addresses and values I noticed that actually if I simply
> >>> skip pdx_init_mask and just initialize the mask to 0 (mask = 0) in
> >>> init_pdx, the rest of the function always calculates the right mask.
> >>>
> >>> In fact, there are cases where initializing the mask to a value causes
> >>> the rest of the code to miss some potential compressions. While
> >>> initializing the mask to 0 leads to more optimizations. I can provide
> >>> specific examples if you are curious.
> >>>
> >>> Before I make any changes to that code, I would like to understand a bit
> >>> better why things are done that way today. Do you know why the mask is
> >>> initialized to pdx_init_mask(start-of-ram)?
> > 
> > Well, it is not the start-of-ram on Arm. It is whatever is the start of 
> > bank 0. This is because the firmware table (such as DT) may not require 
> > ordering and we don't order banks in Xen.
> > 
> > So it may be possible the PDX will not compress if the banks are not 
> > ordered in the firmware tables.
> 
> Even more so a reason not to use bank 0's start address.
> 
> >> I'm confused, and hence I'm perhaps misunderstanding your
> >> question. To me it looks like you're re-asking a question already
> >> answered. On x86 we don't want to squash out any of the low
> >> 32 bits, because of the special address ranges that live below
> >> 4Gb. Hence we invoke pdx_init_mask(first-block-at-or-above-4Gb).
> >> Note it's not start-of-ram, as you've said.
> > 
> > I think what Stefano is asking is why pdx_init_mask(...) is invoked with 
> > the first block address rather than 4GB (or even 0, though I don't think 
> > this is right).
> > 
> > By using the first block address, the PDX will not be able to compress 
> > any bits between 0 and the most significant '1' in the first block address. In other 
> > words, for a base address 0x200000000 (8GB), the initial mask will be 
> > 0x1ffffffff.
> > 
> > Stefano and I were wondering whether it would instead be possible to 
> > create the initial mask with pdx_init_mask(4GB) or pdx_init_mask(1GB) 
> > (i.e. the maximum contiguous range the buddy allocator can allocate).
> 
> That's indeed an option - it's just that I've yet to see an x86
> system where there's a hole starting at 4Gb. What's better in that
> case can probably be judged only once we run into such a case.

All right. Looking at the comment in pfn_pdx_hole_setup, it seems that
it is intending to skip the first MAX_ORDER bits, but actually it is
skipping the first MAX_ORDER-1 bits, if my calculations are correct.

MAX_ORDER is 18 on ARM, which corresponds to 1GB. With the current
implementation of pfn_pdx_hole_setup, if I pass a mask corresponding to
512MB, I can see "PFN compression on bits 17...19". So the range
512MB-1GB gets compressed.

Shouldn't it be:

diff --git a/xen/common/pdx.c b/xen/common/pdx.c
index 50c21b6..b334eb9 100644
--- a/xen/common/pdx.c
+++ b/xen/common/pdx.c
@@ -81,7 +81,7 @@ void __init pfn_pdx_hole_setup(unsigned long mask)
      * contiguous aligned ranges of 2^MAX_ORDER pages. Among others, our
      * buddy allocator relies on this assumption.
      */
-    for ( j = MAX_ORDER-1; ; )
+    for ( j = MAX_ORDER; ; )
     {
         i = find_next_zero_bit(&mask, BITS_PER_LONG, j);
         j = find_next_bit(&mask, BITS_PER_LONG, i); 


With this change, I don't see pfn_pdx_hole_setup trying to compress bit
17 anymore.
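
For reference, the loop being patched sits in pfn_pdx_hole_setup() in 
xen/common/pdx.c; roughly (abridged):

    void __init pfn_pdx_hole_setup(unsigned long mask)
    {
        unsigned int i, j, bottom_shift = 0, hole_shift = 0;

        /*
         * We skip the first MAX_ORDER bits, as we never want to compress them.
         * This guarantees that page-pointer arithmetic remains valid within
         * contiguous aligned ranges of 2^MAX_ORDER pages. Among others, our
         * buddy allocator relies on this assumption.
         */
        for ( j = MAX_ORDER-1; ; )
        {
            i = find_next_zero_bit(&mask, BITS_PER_LONG, j);
            j = find_next_bit(&mask, BITS_PER_LONG, i);
            if ( j >= BITS_PER_LONG )
                break;
            if ( j - i > hole_shift )
            {
                /* Remember the widest run of zero bits: the hole to squash. */
                hole_shift = j - i;
                bottom_shift = i;
            }
        }
        if ( !hole_shift )
            return;

        printk(KERN_INFO "PFN compression on bits %u...%u\n",
               bottom_shift, bottom_shift + hole_shift - 1);

        /* ... pfn_pdx_bottom_mask, pfn_top_mask, etc. are derived here ... */
    }
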
Jan Beulich May 3, 2019, 7:26 a.m. UTC | #20
>>> On 03.05.19 at 00:25, <sstabellini@kernel.org> wrote:
> All right. Looking at the comment in pfn_pdx_hole_setup, it seems that
> it is intending to skip the first MAX_ORDER bits, but actually it is
> skipping the first MAX_ORDER-1 bits, if my calculations are correct.
> 
> MAX_ORDER is 18 on ARM, which corresponds to 1GB. With the current
> implementation of pfn_pdx_hole_setup, if I pass a mask corresponding to
> 512MB, I can see "PFN compression on bits 17...19". So the range
> 512MB-1GB gets compressed.
> 
> Shouldn't it be:
> 
> diff --git a/xen/common/pdx.c b/xen/common/pdx.c
> index 50c21b6..b334eb9 100644
> --- a/xen/common/pdx.c
> +++ b/xen/common/pdx.c
> @@ -81,7 +81,7 @@ void __init pfn_pdx_hole_setup(unsigned long mask)
>       * contiguous aligned ranges of 2^MAX_ORDER pages. Among others, our
>       * buddy allocator relies on this assumption.
>       */
> -    for ( j = MAX_ORDER-1; ; )
> +    for ( j = MAX_ORDER; ; )
>      {
>          i = find_next_zero_bit(&mask, BITS_PER_LONG, j);
>          j = find_next_bit(&mask, BITS_PER_LONG, i); 

Yes, but. Originally we started from zero here. As a wild guess,
I think Keir may have thought the cpumask_next() way when
putting together bdb5439c3f, where an adjustment by 1 is
needed in the call to find_next_bit(). Hence it probably was
intuitive for him to have the index start at one less. I do think,
however, that with the switch away from zero, things would
better have become

    for ( j = MAX_ORDER - 1; ; )
    {
        i = find_next_zero_bit(&mask, BITS_PER_LONG, j + 1);
        j = find_next_bit(&mask, BITS_PER_LONG, i + 1);

As you can see, using j + 1 when starting from zero wouldn't
really have been correct (albeit we surely didn't expect to
compress on bit zero, so this is merely a moot consideration).

Now there's a possible caveat here: While for symmetry also
using i + 1 in the second call would seem desirable, I'm afraid
it can't be used directly that way, as find_{,next_}zero_bit(),
on x86 at least, assume their last argument to be less than
their middle one. This, in turn, may already be violated in
the general case (now that the function lives in common code):
On an architecture with all BITS_PER_LONG+PAGE_SHIFT bits usable
as physical address (x86-64 can use only up to 52, but x86-32
can in principle - possibly with some extra conditions like running
on top of a 64-bit hypervisor - use all 44 bits) the first call may
already return BITS_PER_LONG, and hence the second call
might already produce UB. As a result, to fix this other (latent
only afaict) issue at the same time, the code imo ought to
become

    for ( j = MAX_ORDER - 1; ; )
    {
        i = find_next_zero_bit(&mask, BITS_PER_LONG, j + 1);
        if ( i >= BITS_PER_LONG )
            break;
        j = find_next_bit(&mask, BITS_PER_LONG, i + 1);
        if ( j >= BITS_PER_LONG )
            break;

Jan
Stefano Stabellini May 3, 2019, 8:16 p.m. UTC | #21
On Fri, 3 May 2019, Jan Beulich wrote:
> >>> On 03.05.19 at 00:25, <sstabellini@kernel.org> wrote:
> > All right. Looking at the comment in pfn_pdx_hole_setup, it seems that
> > it is intending to skip the first MAX_ORDER bits, but actually it is
> > skipping the first MAX_ORDER-1 bits, if my calculations are correct.
> > 
> > MAX_ORDER is 18 on ARM, which corresponds to 1GB. With the current
> > implementation of pfn_pdx_hole_setup, if I pass a mask corresponding to
> > 512MB, I can see "PFN compression on bits 17...19". So the range
> > 512MB-1GB gets compressed.
> > 
> > Shouldn't it be:
> > 
> > diff --git a/xen/common/pdx.c b/xen/common/pdx.c
> > index 50c21b6..b334eb9 100644
> > --- a/xen/common/pdx.c
> > +++ b/xen/common/pdx.c
> > @@ -81,7 +81,7 @@ void __init pfn_pdx_hole_setup(unsigned long mask)
> >       * contiguous aligned ranges of 2^MAX_ORDER pages. Among others, our
> >       * buddy allocator relies on this assumption.
> >       */
> > -    for ( j = MAX_ORDER-1; ; )
> > +    for ( j = MAX_ORDER; ; )
> >      {
> >          i = find_next_zero_bit(&mask, BITS_PER_LONG, j);
> >          j = find_next_bit(&mask, BITS_PER_LONG, i); 
> 
> Yes, but. Originally we started from zero here. As a wild guess,
> I think Keir may have thought the cpumask_next() way when
> putting together bdb5439c3f, where an adjustment by 1 is
> needed in the call to find_next_bit(). Hence it probably was
> intuitive for him to have the index start at one less. I do think,
> however, that with the switch away from zero, things would
> better have become
> 
>     for ( j = MAX_ORDER - 1; ; )
>     {
>         i = find_next_zero_bit(&mask, BITS_PER_LONG, j + 1);
>         j = find_next_bit(&mask, BITS_PER_LONG, i + 1);
> 
> As you can see, using j + 1 when starting from zero wouldn't
> really have been correct (albeit we surely didn't expect to
> compress on bit zero, so this is merely a moot consideration).
> 
> Now there's a possible caveat here: While for symmetry also
> using i + 1 in the second call would seem desirable, I'm afraid
> it can't be used directly that way, as find_{,next_}zero_bit(),
> on x86 at least, assume their last argument to be less than
> their middle one. This, in turn, may already be violated in
> the general case (now that the function lives in common code):
> On an architecture with all BITS_PER_LONG+PAGE_SHIFT bits usable
> as physical address (x86-64 can use only up to 52, but x86-32
> can in principle - possibly with some extra conditions like running
> on top of a 64-bit hypervisor - use all 44 bits) the first call may
> already return BITS_PER_LONG, and hence the second call
> might already produce UB. As a result, to fix this other (latent
> only afaict) issue at the same time, the code imo ought to
> become
> 
>     for ( j = MAX_ORDER - 1; ; )
>     {
>         i = find_next_zero_bit(&mask, BITS_PER_LONG, j + 1);
>         if ( i >= BITS_PER_LONG )
>             break;
>         j = find_next_bit(&mask, BITS_PER_LONG, i + 1);
>         if ( j >= BITS_PER_LONG )
>             break;
> 

Makes sense. I tried it in my setup and it fixes the misbehavior I was
seeing. I am adding this patch to my short series of PDX fixes. I am
adding your Signed-off-by to the patch; let me know if it is a problem.

Patch

diff --git a/xen/arch/arm/bootfdt.c b/xen/arch/arm/bootfdt.c
index 891b4b6..2ae4238 100644
--- a/xen/arch/arm/bootfdt.c
+++ b/xen/arch/arm/bootfdt.c
@@ -157,6 +157,13 @@ static void __init process_memory_node(const void *fdt, int node,
         device_tree_get_reg(&cell, address_cells, size_cells, &start, &size);
         if ( !size )
             continue;
+
+        /*
+         * Xen cannot deal with memory starting at 0x0. Burn the first
+         * page.
+         */
+        if ( start == 0 )
+            start += PAGE_SIZE;
         bootinfo.mem.bank[bootinfo.mem.nr_banks].start = start;
         bootinfo.mem.bank[bootinfo.mem.nr_banks].size = size;
         bootinfo.mem.nr_banks++;