[05/12] xen: introduce reserve_heap_pages

Message ID 20200415010255.10081-5-sstabellini@kernel.org (mailing list archive)
State New, archived
Series [01/12] xen: introduce xen_dom_flags

Commit Message

Stefano Stabellini April 15, 2020, 1:02 a.m. UTC
Introduce a function named reserve_heap_pages (similar to
alloc_heap_pages) that allocates a requested memory range. Call
__alloc_heap_pages for the implementation.

Change __alloc_heap_pages so that the original page doesn't get
modified, giving back unneeded memory top to bottom rather than bottom
to top.

Also introduce a function named reserve_domheap_pages, similar to
alloc_domheap_pages, that checks memflags before calling
reserve_heap_pages. On success it also calls assign_pages to assign the
pages to the domain.

Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
CC: andrew.cooper3@citrix.com
CC: jbeulich@suse.com
CC: George Dunlap <george.dunlap@citrix.com>
CC: Ian Jackson <ian.jackson@eu.citrix.com>
CC: Wei Liu <wl@xen.org>
---
 xen/common/page_alloc.c | 72 ++++++++++++++++++++++++++++++++++++++---
 xen/include/xen/mm.h    |  2 ++
 2 files changed, 69 insertions(+), 5 deletions(-)
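
For context, a minimal sketch of how a boot-time caller (e.g. dom0less
domU construction) might use the new API -- illustrative only, not part
of this series; the start/order values are assumed to come from elsewhere
(e.g. the device tree):

    /* Hypothetical caller: reserve a fixed range for domain d. */
    struct page_info *pg = reserve_domheap_pages(d, start, order, 0);

    if ( pg == NULL )
        panic("Unable to reserve 2^%u pages at %"PRIpaddr" for %pd\n",
              order, start, d);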

Comments

Julien Grall April 15, 2020, 1:24 p.m. UTC | #1
On 15/04/2020 02:02, Stefano Stabellini wrote:
> Introduce a function named reserve_heap_pages (similar to
> alloc_heap_pages) that allocates a requested memory range. Call
> __alloc_heap_pages for the implementation.
> 
> Change __alloc_heap_pages so that the original page doesn't get
> modified, giving back unneeded memory top to bottom rather than bottom
> to top.
> 
> Also introduce a function named reserve_domheap_pages, similar to
> alloc_domheap_pages, that checks memflags before calling
> reserve_heap_pages. It also assign_pages to the domain on success.

Xen may have already allocated part of the region for its own purposes or 
for another domain, so this will not work reliably.

We have the same issue with LiveUpdate, as memory has to be preserved. 
We need to mark the pages reserved before any allocation (including early 
boot allocation) so nobody can use them. See [1].
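
(For illustration, carving the ranges out on Arm could look roughly like
the sketch below -- the helper and its name are made up and this is not
code from [1]; init_boot_pages() is the existing boot-allocator seeding
function:)

    /* Sketch: seed the boot allocator with [ps, pe) minus one reserved
     * range [rs, re). Hypothetical helper, called from setup_mm(). */
    static void __init init_boot_pages_skipping(paddr_t ps, paddr_t pe,
                                                paddr_t rs, paddr_t re)
    {
        if ( re <= ps || rs >= pe )
            init_boot_pages(ps, pe);      /* no overlap */
        else
        {
            if ( ps < rs )
                init_boot_pages(ps, rs);  /* part below the reservation */
            if ( re < pe )
                init_boot_pages(re, pe);  /* part above the reservation */
        }
    }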

Cheers,

[1]  Live update: boot memory management, data stream handling, record 
format <a92287c03fed310e08ba40063e370038569b94ac.camel@infradead.org>
Jan Beulich April 17, 2020, 10:11 a.m. UTC | #2
On 15.04.2020 03:02, Stefano Stabellini wrote:
> Introduce a function named reserve_heap_pages (similar to
> alloc_heap_pages) that allocates a requested memory range. Call
> __alloc_heap_pages for the implementation.
> 
> Change __alloc_heap_pages so that the original page doesn't get
> modified, giving back unneeded memory top to bottom rather than bottom
> to top.

While it may be less of a problem within a zone, doing so is
against our general "return high pages first" policy.

> @@ -1073,7 +1073,42 @@ static struct page_info *alloc_heap_pages(
>          return NULL;
>      }
>  
> -    __alloc_heap_pages(&pg, order, memflags, d);
> +    __alloc_heap_pages(pg, order, memflags, d);
> +    return pg;
> +}
> +
> +static struct page_info *reserve_heap_pages(struct domain *d,
> +                                            paddr_t start,
> +                                            unsigned int order,
> +                                            unsigned int memflags)
> +{
> +    nodeid_t node;
> +    unsigned int zone;
> +    struct page_info *pg;
> +
> +    if ( unlikely(order > MAX_ORDER) )
> +        return NULL;
> +
> +    spin_lock(&heap_lock);
> +
> +    /*
> +     * Claimed memory is considered unavailable unless the request
> +     * is made by a domain with sufficient unclaimed pages.
> +     */
> +    if ( (outstanding_claims + (1UL << order) > total_avail_pages) &&
> +          ((memflags & MEMF_no_refcount) ||
> +           !d || d->outstanding_pages < (1UL << order)) )
> +    {
> +        spin_unlock(&heap_lock);
> +        return NULL;
> +    }

Where would such a claim come from? Given the purpose I'd assume
the function (as well as reserve_domheap_pages()) can actually be
__init. With that I'd then also be okay with them getting built
unconditionally, i.e. even on x86.
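
(Concretely, the suggestion amounts to tagging the definition with Xen's
__init section annotation -- shown here for the signature only, with the
body unchanged from the patch below:)

    static struct page_info *__init reserve_heap_pages(struct domain *d,
                                                       paddr_t start,
                                                       unsigned int order,
                                                       unsigned int memflags)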

> +    pg = maddr_to_page(start);
> +    node = phys_to_nid(start);
> +    zone = page_to_zone(pg);
> +    page_list_del(pg, &heap(node, zone, order));
> +
> +    __alloc_heap_pages(pg, order, memflags, d);

I agree with Julien in not seeing how this can be safe / correct.

Jan
Stefano Stabellini April 29, 2020, 10:46 p.m. UTC | #3
On Fri, 17 Apr 2020, Jan Beulich wrote:
> On 15.04.2020 03:02, Stefano Stabellini wrote:
> > Introduce a function named reserve_heap_pages (similar to
> > alloc_heap_pages) that allocates a requested memory range. Call
> > __alloc_heap_pages for the implementation.
> > 
> > Change __alloc_heap_pages so that the original page doesn't get
> > modified, giving back unneeded memory top to bottom rather than bottom
> > to top.
> 
> While it may be less of a problem within a zone, doing so is
> against our general "return high pages first" policy.

Is this something you'd be OK with anyway?

If not, do you have a suggestion on how to do it better? I couldn't find
a nice way to do it without code duplication, or a big nasty 'if' in the
middle of the function.


> > @@ -1073,7 +1073,42 @@ static struct page_info *alloc_heap_pages(
> >          return NULL;
> >      }
> >  
> > -    __alloc_heap_pages(&pg, order, memflags, d);
> > +    __alloc_heap_pages(pg, order, memflags, d);
> > +    return pg;
> > +}
> > +
> > +static struct page_info *reserve_heap_pages(struct domain *d,
> > +                                            paddr_t start,
> > +                                            unsigned int order,
> > +                                            unsigned int memflags)
> > +{
> > +    nodeid_t node;
> > +    unsigned int zone;
> > +    struct page_info *pg;
> > +
> > +    if ( unlikely(order > MAX_ORDER) )
> > +        return NULL;
> > +
> > +    spin_lock(&heap_lock);
> > +
> > +    /*
> > +     * Claimed memory is considered unavailable unless the request
> > +     * is made by a domain with sufficient unclaimed pages.
> > +     */
> > +    if ( (outstanding_claims + (1UL << order) > total_avail_pages) &&
> > +          ((memflags & MEMF_no_refcount) ||
> > +           !d || d->outstanding_pages < (1UL << order)) )
> > +    {
> > +        spin_unlock(&heap_lock);
> > +        return NULL;
> > +    }
> 
> Where would such a claim come from? Given the purpose I'd assume
> the function (as well as reserve_domheap_pages()) can actually be
> __init. With that I'd then also be okay with them getting built
> unconditionally, i.e. even on x86.

Yes, you are right, I'll make the function __init and also remove this
check on claimed memory.


> > +    pg = maddr_to_page(start);
> > +    node = phys_to_nid(start);
> > +    zone = page_to_zone(pg);
> > +    page_list_del(pg, &heap(node, zone, order));
> > +
> > +    __alloc_heap_pages(pg, order, memflags, d);
> 
> I agree with Julien in not seeing how this can be safe / correct.

I haven't seen any issues so far in my testing -- I imagine it is
because there aren't many memory allocations after setup_mm() and before
create_domUs()  (which on ARM is called just before
domain_unpause_by_systemcontroller at the end of start_xen.)


I gave a quick look at David's series. Is the idea that I should add a
patch to do the following:

- avoiding adding these ranges to xenheap in setup_mm, wait for later
  (a bit like reserved_mem regions)

- in construct_domU, add the range to xenheap and reserve it with reserve_heap_pages

Is that right?
Jan Beulich April 30, 2020, 6:29 a.m. UTC | #4
On 30.04.2020 00:46, Stefano Stabellini wrote:
> On Fri, 17 Apr 2020, Jan Beulich wrote:
>> On 15.04.2020 03:02, Stefano Stabellini wrote:
>>> Introduce a function named reserve_heap_pages (similar to
>>> alloc_heap_pages) that allocates a requested memory range. Call
>>> __alloc_heap_pages for the implementation.
>>>
>>> Change __alloc_heap_pages so that the original page doesn't get
>>> modified, giving back unneeded memory top to bottom rather than bottom
>>> to top.
>>
>> While it may be less of a problem within a zone, doing so is
>> against our general "return high pages first" policy.
> 
> Is this something you'd be OK with anyway?

As a last resort, maybe. But it really depends on why it needs to be
this way.

> If not, do you have a suggestion on how to do it better? I couldn't find
> a nice way to do it without code duplication, or a big nasty 'if' in the
> middle of the function.

I'd first need to understand the problem to solve.

>>> +    pg = maddr_to_page(start);
>>> +    node = phys_to_nid(start);
>>> +    zone = page_to_zone(pg);
>>> +    page_list_del(pg, &heap(node, zone, order));
>>> +
>>> +    __alloc_heap_pages(pg, order, memflags, d);
>>
>> I agree with Julien in not seeing how this can be safe / correct.
> 
> I haven't seen any issues so far in my testing -- I imagine it is
> because there aren't many memory allocations after setup_mm() and before
> create_domUs()  (which on ARM is called just before
> domain_unpause_by_systemcontroller at the end of start_xen.)
> 
> 
> I gave a quick look at David's series. Is the idea that I should add a
> patch to do the following:
> 
> - avoiding adding these ranges to xenheap in setup_mm, wait for later
>   (a bit like reserved_mem regions)
> 
> - in construct_domU, add the range to xenheap and reserve it with reserve_heap_pages
> 
> Is that right?

This may be one way, but it may also be not the only possible one.
The main thing to arrange for is that there is either a guarantee
for these ranges to be free (which I think you want to check in
any event, rather than risk giving out something that's already
in use elsewhere), or that you skip ranges which are already in use
(potentially altering [decreasing?] what the specific domain gets
allocated).
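
For illustration, a minimal sketch of such a free-range check -- not from
the series, assuming the existing page_state_is() frametable accessor and
that the caller holds heap_lock:

    /* Sketch: succeed only if every page of the requested range is
     * currently free in the heap. The caller must hold heap_lock. */
    static bool range_is_free(paddr_t start, unsigned int order)
    {
        struct page_info *pg = maddr_to_page(start);
        unsigned long i;

        for ( i = 0; i < (1UL << order); i++ )
            if ( !page_state_is(pg + i, free) )
                return false;

        return true;
    }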

Jan
Julien Grall April 30, 2020, 2:51 p.m. UTC | #5
Hi,

On 29/04/2020 23:46, Stefano Stabellini wrote:
> On Fri, 17 Apr 2020, Jan Beulich wrote:
>> On 15.04.2020 03:02, Stefano Stabellini wrote:
>>> Introduce a function named reserve_heap_pages (similar to
>>> alloc_heap_pages) that allocates a requested memory range. Call
>>> __alloc_heap_pages for the implementation.
>>>
>>> Change __alloc_heap_pages so that the original page doesn't get
>>> modified, giving back unneeded memory top to bottom rather than bottom
>>> to top.
>>
>> While it may be less of a problem within a zone, doing so is
>> against our general "return high pages first" policy.
> 
> Is this something you'd be OK with anyway?
> 
> If not, do you have a suggestion on how to do it better? I couldn't find
> a nice way to do it without code duplication, or a big nasty 'if' in the
> middle of the function.
> 
> 
>>> @@ -1073,7 +1073,42 @@ static struct page_info *alloc_heap_pages(
>>>           return NULL;
>>>       }
>>>   
>>> -    __alloc_heap_pages(&pg, order, memflags, d);
>>> +    __alloc_heap_pages(pg, order, memflags, d);
>>> +    return pg;
>>> +}
>>> +
>>> +static struct page_info *reserve_heap_pages(struct domain *d,
>>> +                                            paddr_t start,
>>> +                                            unsigned int order,
>>> +                                            unsigned int memflags)
>>> +{
>>> +    nodeid_t node;
>>> +    unsigned int zone;
>>> +    struct page_info *pg;
>>> +
>>> +    if ( unlikely(order > MAX_ORDER) )
>>> +        return NULL;
>>> +
>>> +    spin_lock(&heap_lock);
>>> +
>>> +    /*
>>> +     * Claimed memory is considered unavailable unless the request
>>> +     * is made by a domain with sufficient unclaimed pages.
>>> +     */
>>> +    if ( (outstanding_claims + (1UL << order) > total_avail_pages) &&
>>> +          ((memflags & MEMF_no_refcount) ||
>>> +           !d || d->outstanding_pages < (1UL << order)) )
>>> +    {
>>> +        spin_unlock(&heap_lock);
>>> +        return NULL;
>>> +    }
>>
>> Where would such a claim come from? Given the purpose I'd assume
>> the function (as well as reserve_domheap_pages()) can actually be
>> __init. With that I'd then also be okay with them getting built
>> unconditionally, i.e. even on x86.
> 
> Yes, you are right, I'll make the function __init and also remove this
> check on claimed memory.
> 
> 
>>> +    pg = maddr_to_page(start);
>>> +    node = phys_to_nid(start);
>>> +    zone = page_to_zone(pg);
>>> +    page_list_del(pg, &heap(node, zone, order));
>>> +
>>> +    __alloc_heap_pages(pg, order, memflags, d);
>>
>> I agree with Julien in not seeing how this can be safe / correct.
> 
> I haven't seen any issues so far in my testing -- I imagine it is
> because there aren't many memory allocations after setup_mm() and before
> create_domUs()  (which on ARM is called just before
> domain_unpause_by_systemcontroller at the end of start_xen.)

I am not sure why you exclude setup_mm(). Any memory allocated (boot 
allocator, xenheap) can clash with your regions. The main memory 
allocations are for the frametable and dom0. I would say you were lucky 
to not hit them.

> 
> 
> I gave a quick look at David's series. Is the idea that I should add a
> patch to do the following:
> 
> - avoiding adding these ranges to xenheap in setup_mm, wait for later
>    (a bit like reserved_mem regions)

I guess by xenheap, you mean domheap? But the problem is not only for 
domheap, it is also for any memory allocated via the boot allocator. So 
you need to exclude those regions from any possible allocations.

> 
> - in construct_domU, add the range to xenheap and reserve it with reserve_heap_pages

I am afraid you can't give the regions to the allocator and then 
allocate them. The allocator is free to use any page for its own 
purposes or to exclude them.

AFAICT, the allocator doesn't have a list of pages in use; it only keeps 
track of free pages. So we can make the content of struct page_info 
look like the page was allocated by the allocator.

We would need to be careful when giving a page back to the allocator, as 
the page would need to be initialized (see [1]). This may not be a 
concern for Dom0less, as the domain may never be destroyed, but it 
matters from a correctness PoV.

For LiveUpdate, the original Xen will carve out space to be used by the 
boot allocator in the new Xen. But I think this is not necessary in your 
context.

It should be sufficient to exclude the pages from the boot allocator (as 
we do for other modules).

One potential issue is that there is no easy way today to differentiate 
between allocated pages and pages not yet initialized. To make the code 
robust, we need to prevent a page from being used in two places. So for 
LiveUpdate we are marking them with a special value; this is used 
afterwards to check that we are effectively using a reserved page.

I hope this helps.

Cheers,

[1] <20200319212150.2651419-2-dwmw2@infradead.org>
Stefano Stabellini April 30, 2020, 4:21 p.m. UTC | #6
On Thu, 30 Apr 2020, Jan Beulich wrote:
> On 30.04.2020 00:46, Stefano Stabellini wrote:
> > On Fri, 17 Apr 2020, Jan Beulich wrote:
> >> On 15.04.2020 03:02, Stefano Stabellini wrote:
> >>> Introduce a function named reserve_heap_pages (similar to
> >>> alloc_heap_pages) that allocates a requested memory range. Call
> >>> __alloc_heap_pages for the implementation.
> >>>
> >>> Change __alloc_heap_pages so that the original page doesn't get
> >>> modified, giving back unneeded memory top to bottom rather than bottom
> >>> to top.
> >>
> >> While it may be less of a problem within a zone, doing so is
> >> against our general "return high pages first" policy.
> > 
> > Is this something you'd be OK with anyway?
> 
> As a last resort, maybe. But it really depends on why it needs to be
> this way.
> 
> > If not, do you have a suggestion on how to do it better? I couldn't find
> > a nice way to do it without code duplication, or a big nasty 'if' in the
> > middle of the function.
> 
> I'd first need to understand the problem to solve.

OK, I'll make an example.

reserve_heap_pages wants to reserve the range 0x10000000 - 0x20000000.

reserve_heap_pages gets the struct page_info for 0x10000000 and calls
alloc_pages_from_buddy to allocate an order of 28.

alloc_pages_from_buddy realizes that the free memory area starting from
0x10000000 is actually of order 30, even larger than the requested order
of 28. The free area is 0x10000000 - 0x50000000.

Instead of just allocating an order of 28 starting at 0x10000000,
alloc_pages_from_buddy would allocate the "top" order-28 chunk of the
free area: 0x40000000 - 0x50000000, returning 0x40000000.

Of course, this doesn't work for reserve_heap_pages.


This patch changes the behavior of alloc_pages_from_buddy so that in a
situation like the above, it would return 0x10000000 - 0x20000000
(leaving the rest of the free area unallocated).
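
Spelled out against the patch's modified split loop in __alloc_heap_pages
(orders given in the example's byte-address convention; in page terms they
would be 18 and 16, as Jan points out below), the walk would be:

    free buddy:  0x10000000 - 0x50000000   (order 30)
    request:     0x10000000, order 28

    split 1: keep 0x10000000 - 0x30000000, give back 0x30000000 - 0x50000000 (order 29)
    split 2: keep 0x10000000 - 0x20000000, give back 0x20000000 - 0x30000000 (order 28)
    result:  0x10000000 - 0x20000000 is returned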
Stefano Stabellini April 30, 2020, 5 p.m. UTC | #7
On Thu, 30 Apr 2020, Julien Grall wrote:
> > > > +    pg = maddr_to_page(start);
> > > > +    node = phys_to_nid(start);
> > > > +    zone = page_to_zone(pg);
> > > > +    page_list_del(pg, &heap(node, zone, order));
> > > > +
> > > > +    __alloc_heap_pages(pg, order, memflags, d);
> > > 
> > > I agree with Julien in not seeing how this can be safe / correct.
> > 
> > I haven't seen any issues so far in my testing -- I imagine it is
> > because there aren't many memory allocations after setup_mm() and before
> > create_domUs()  (which on ARM is called just before
> > domain_unpause_by_systemcontroller at the end of start_xen.)
> 
> I am not sure why you exclude setup_mm(). Any memory allocated (boot
> allocator, xenheap) can clash with your regions. The main memory allocations
> are for the frametable and dom0. I would say you were lucky to not hit them.

Maybe it is because Xen typically allocates memory top-down? So if I
chose a high range then I would see a failure? But I have been mostly
testing with ranges close to the beginning of RAM (as opposed to
ranges close to the end of RAM).

 
> > I gave a quick look at David's series. Is the idea that I should add a
> > patch to do the following:
> > 
> > - avoiding adding these ranges to xenheap in setup_mm, wait for later
> >    (a bit like reserved_mem regions)
> 
> I guess by xenheap, you mean domheap? But the problem is not only for domheap,
> it is also for any memory allocated via the boot allocator. So you need to
> exclude those regions from any possible allocations.

OK, I think we are saying the same thing but let me check.

By boot allocator you mean alloc_boot_pages, right? That boot allocator
operates on ranges given to it by init_boot_pages calls.
init_boot_pages is called from setup_mm. I didn't write it clearly but
I also meant not calling init_boot_pages on them from setup_mm.

Are we saying the same thing?



> > - in construct_domU, add the range to xenheap and reserve it with
> > reserve_heap_pages
> 
> I am afraid you can't give the regions to the allocator and then allocate
> them. The allocator is free to use any page for its own purpose or exclude
> them.
>
> AFAICT, the allocator doesn't have a list of page in use. It only keeps track
> of free pages. So we can make the content of struct page_info to look like it
> was allocated by the allocator.
> 
> We would need to be careful when giving a page back to allocator as the page
> would need to be initialized (see [1]). This may not be a concern for Dom0less
> as the domain may never be destroyed but will be for correctness PoV.
> 
> For LiveUpdate, the original Xen will carve out space to use by the boot
> allocator in the new Xen. But I think this is not necessary in your context.
> 
> It should be sufficient to exclude the page from the boot allocators (as we do
> for other modules).
> 
> One potential issue that can arise is there is no easy way today to
> differentiate between pages allocated and pages not yet initialized. To make
> the code robust, we need to prevent a page to be used in two places. So for
> LiveUpdate we are marking them with a special value, this is used afterwards
> to check we are effictively using a reserved page.
> 
> I hope this helps.

Thanks for writing all of this down but I haven't understood some of it.

For the sake of this discussion let's say that we managed to "reserve"
the range early enough like we do for other modules, as you wrote.

At the point where we want to call reserve_heap_pages() we would call
init_heap_pages() just before it. We are still relatively early at boot
so there aren't any concurrent memory operations. Why doesn't this work?

If it doesn't work, I am not following your alternative suggestion
about making the content of struct page_info "look like the page was
allocated by the allocator."
Julien Grall April 30, 2020, 6:27 p.m. UTC | #8
Hi,

On 30/04/2020 18:00, Stefano Stabellini wrote:
> On Thu, 30 Apr 2020, Julien Grall wrote:
>>>>> +    pg = maddr_to_page(start);
>>>>> +    node = phys_to_nid(start);
>>>>> +    zone = page_to_zone(pg);
>>>>> +    page_list_del(pg, &heap(node, zone, order));
>>>>> +
>>>>> +    __alloc_heap_pages(pg, order, memflags, d);
>>>>
>>>> I agree with Julien in not seeing how this can be safe / correct.
>>>
>>> I haven't seen any issues so far in my testing -- I imagine it is
>>> because there aren't many memory allocations after setup_mm() and before
>>> create_domUs()  (which on ARM is called just before
>>> domain_unpause_by_systemcontroller at the end of start_xen.)
>>
>> I am not sure why you exclude setup_mm(). Any memory allocated (boot
>> allocator, xenheap) can clash with your regions. The main memory allocations
>> are for the frametable and dom0. I would say you were lucky to not hit them.
> 
> Maybe it is because Xen typically allocates memory top-down? So if I
> chose a high range then I would see a failure? But I have been mostly
> testing with ranges close to the begin of RAM (as opposed to
> ranges close to the end of RAM.)

I haven't looked at the details of the implementation, but you can try 
to specify dom0 addresses for your domU. You should see a failure.

> 
>   
>>> I gave a quick look at David's series. Is the idea that I should add a
>>> patch to do the following:
>>>
>>> - avoiding adding these ranges to xenheap in setup_mm, wait for later
>>>     (a bit like reserved_mem regions)
>>
>> I guess by xenheap, you mean domheap? But the problem is not only for domheap,
>> it is also for any memory allocated via the boot allocator. So you need to
>> exclude those regions from any possible allocations.
> 
> OK, I think we are saying the same thing but let me check.
> 
> By boot allocator you mean alloc_boot_pages, right? That boot allocator
> operates on ranges given to it by init_boot_pages calls.

That's correct.

> init_boot_pages is called from setup_mm. I didn't write it clearly but
> I also meant not calling init_boot_pages on them from setup_mm.
> 
> Are we saying the same thing?

Yes.

> 
> 
>>> - in construct_domU, add the range to xenheap and reserve it with
>>> reserve_heap_pages
>>
>> I am afraid you can't give the regions to the allocator and then allocate
>> them. The allocator is free to use any page for its own purpose or exclude
>> them.
>>
>> AFAICT, the allocator doesn't have a list of page in use. It only keeps track
>> of free pages. So we can make the content of struct page_info to look like it
>> was allocated by the allocator.
>>
>> We would need to be careful when giving a page back to allocator as the page
>> would need to be initialized (see [1]). This may not be a concern for Dom0less
>> as the domain may never be destroyed but will be for correctness PoV.
>>
>> For LiveUpdate, the original Xen will carve out space to use by the boot
>> allocator in the new Xen. But I think this is not necessary in your context.
>>
>> It should be sufficient to exclude the page from the boot allocators (as we do
>> for other modules).
>>
>> One potential issue that can arise is there is no easy way today to
>> differentiate between pages allocated and pages not yet initialized. To make
>> the code robust, we need to prevent a page to be used in two places. So for
>> LiveUpdate we are marking them with a special value, this is used afterwards
>> to check we are effictively using a reserved page.
>>
>> I hope this helps.
> 
> Thanks for writing all of this down but I haven't understood some of it.
> 
> For the sake of this discussion let's say that we managed to "reserve"
> the range early enough like we do for other modules, as you wrote.
> 
> At the point where we want to call reserve_heap_pages() we would call
> init_heap_pages() just before it. We are still relatively early at boot
> so there aren't any concurrent memory operations. Why this doesn't work?

Because init_heap_pages() may exclude some pages (for instance MFN 0 is 
carved out) or use pages for its internal structure (see 
init_node_heap()). So you can't expect to be able to allocate the exact 
same region by reserve_heap_pages().

> 
> If it doesn't work, I am not following what is your alternative
> suggestion about making "the content of struct page_info to look like it
> was allocated by the allocator."

If you look at alloc_heap_pages(), it will allocate pages; the allocator 
will initialize some fields in struct page_info before returning the 
page. We basically need to do the same thing, so that the struct page_info 
looks exactly the same whether we call alloc_heap_pages() or use memory 
that was carved out from the allocator.

David has spent more time than me on this problem, so I may be missing 
some bits. Based on what we did in the LU PoC, my suggestion would be to:
    1) Carve out the memory from any allocator (and before any memory is 
allocated).
    2) Make sure a struct page_info is allocated for those regions in 
the boot allocator
    3) Mark the regions as reserved in the frametable so we can 
differentiate them from the other pages.
    4) Allocate the region when necessary

When it is necessary to allocate the region, then for each page:
    1) Check if it is a valid page
    2) Check if the page is reserved
    3) Do the necessary preparation on struct page_info

At the moment, in the LU PoC, we are using count_info = PGC_allocated to 
mark the reserved pages. I don't particularly like it and I am not sure of 
the consequences, so I am open to a different way to mark them.
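
A rough sketch of steps 2-3 and the later check -- PGC_reserved is a
made-up marker standing in for whatever scheme is eventually chosen, and
none of this is the LU PoC code:

    /* Boot time: mark every page of a carved-out region in the
     * frametable. PGC_reserved is hypothetical, not an existing flag. */
    static void __init mark_range_reserved(mfn_t smfn, unsigned long nr)
    {
        unsigned long i;

        for ( i = 0; i < nr; i++ )
            mfn_to_page(mfn_add(smfn, i))->count_info = PGC_reserved;
    }

    /* Allocation time: refuse any page that was not marked, so a page
     * can never be handed out from two places. */
    static bool page_is_reserved(const struct page_info *pg)
    {
        return pg->count_info == PGC_reserved;
    }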

The last part we need to take care of is how to hand the pages over to 
the allocator. This may happen if your domain dies or balloons (although 
not in the direct-map case). Even without this series, this is actually 
already a problem today, because boot allocator pages may be freed 
afterwards (I think this can only happen on x86 so far). But we get away 
with it because in most cases you never carve out a full NUMA node. This 
is where David's patch should help.

Cheers,
Jan Beulich May 4, 2020, 9:16 a.m. UTC | #9
On 30.04.2020 18:21, Stefano Stabellini wrote:
> On Thu, 30 Apr 2020, Jan Beulich wrote:
>> On 30.04.2020 00:46, Stefano Stabellini wrote:
>>> On Fri, 17 Apr 2020, Jan Beulich wrote:
>>>> On 15.04.2020 03:02, Stefano Stabellini wrote:
>>>>> Introduce a function named reserve_heap_pages (similar to
>>>>> alloc_heap_pages) that allocates a requested memory range. Call
>>>>> __alloc_heap_pages for the implementation.
>>>>>
>>>>> Change __alloc_heap_pages so that the original page doesn't get
>>>>> modified, giving back unneeded memory top to bottom rather than bottom
>>>>> to top.
>>>>
>>>> While it may be less of a problem within a zone, doing so is
>>>> against our general "return high pages first" policy.
>>>
>>> Is this something you'd be OK with anyway?
>>
>> As a last resort, maybe. But it really depends on why it needs to be
>> this way.
>>
>>> If not, do you have a suggestion on how to do it better? I couldn't find
>>> a nice way to do it without code duplication, or a big nasty 'if' in the
>>> middle of the function.
>>
>> I'd first need to understand the problem to solve.
> 
> OK, I'll make an example.
> 
> reserve_heap_pages wants to reserve the range 0x10000000 - 0x20000000.
> 
> reserve_heap_pages get the struct page_info for 0x10000000 and calls
> alloc_pages_from_buddy to allocate an order of 28.
> 
> alloc_pages_from_buddy realizes that the free memory area starting from
> 0x10000000 is actually of order 30, even larger than the requested order
> of 28. The free area is 0x10000000 - 0x50000000.
> 
> alloc_pages_from_buddy instead of just allocating an order of 28
> starting from 0x10000000, it would allocate the "top" order of 28 in the
> free area. So it would allocate: 0x40000000 - 0x50000000, returning
> 0x40000000.
> 
> Of course, this doesn't work for reserve_heap_pages.
> 
> 
> This patch changes the behavior of alloc_pages_from_buddy so that in a
> situation like the above, it would return 0x10000000 - 0x20000000
> (leaving the rest of the free area unallocated.)

So what if then, for the same order-30 (really order-18 if I assume
you name addresses, not frame numbers), a reservation request came
in for the highest order-28 sub-region? You'd again be screwed if
you relied on which part of a larger buddy gets returned by the
lower level function you call. I can't help thinking that basing
reservation on allocation functions can't really be made work for
all possible cases. Instead reservation requests need to check that
the requested range is free _and_ split the potentially larger
range according to the request.
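
For illustration, such a split could reuse the existing halving loop but
keep the half containing the target rather than a fixed end -- a sketch
only, with the scrub/dirty bookkeeping elided:

    /* Sketch: pg_buddy, of order buddy_order, was already unlinked from
     * its free list and contains the target block pg_target of
     * (1U << order) pages; node and zone are as computed in
     * reserve_heap_pages(). */
    while ( buddy_order != order )
    {
        struct page_info *half;

        buddy_order--;
        half = pg_buddy + (1U << buddy_order);

        if ( pg_target < half )
            /* Target lies in the low half: give the high half back. */
            page_list_add_scrub(half, node, zone, buddy_order,
                                INVALID_DIRTY_IDX);
        else
        {
            /* Target lies in the high half: give the low half back. */
            page_list_add_scrub(pg_buddy, node, zone, buddy_order,
                                INVALID_DIRTY_IDX);
            pg_buddy = half;
        }
    }
    ASSERT(pg_buddy == pg_target);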

Jan
Stefano Stabellini May 12, 2020, 1:10 a.m. UTC | #10
On Thu, 30 Apr 2020, Julien Grall wrote:
> On 30/04/2020 18:00, Stefano Stabellini wrote:
> > On Thu, 30 Apr 2020, Julien Grall wrote:
> > > > > > +    pg = maddr_to_page(start);
> > > > > > +    node = phys_to_nid(start);
> > > > > > +    zone = page_to_zone(pg);
> > > > > > +    page_list_del(pg, &heap(node, zone, order));
> > > > > > +
> > > > > > +    __alloc_heap_pages(pg, order, memflags, d);
> > > > > 
> > > > > I agree with Julien in not seeing how this can be safe / correct.
> > > > 
> > > > I haven't seen any issues so far in my testing -- I imagine it is
> > > > because there aren't many memory allocations after setup_mm() and before
> > > > create_domUs()  (which on ARM is called just before
> > > > domain_unpause_by_systemcontroller at the end of start_xen.)
> > > 
> > > I am not sure why you exclude setup_mm(). Any memory allocated (boot
> > > allocator, xenheap) can clash with your regions. The main memory
> > > allocations
> > > are for the frametable and dom0. I would say you were lucky to not hit
> > > them.
> > 
> > Maybe it is because Xen typically allocates memory top-down? So if I
> > chose a high range then I would see a failure? But I have been mostly
> > testing with ranges close to the begin of RAM (as opposed to
> > ranges close to the end of RAM.)
> 
> I haven't looked at the details of the implementation, but you can try to
> specify dom0 addresses for your domU. You should see a failure.

I managed to reproduce a failure by choosing the top address range. On
Xilinx ZynqMP the memory is:

  reg = <0x0 0x0 0x0 0x7ff00000 0x8 0x0 0x0 0x80000000>;

And I chose:

  fdt set /chosen/domU0 direct-map <0x0 0x10000000 0x10000000 0x8 0x70000000 0x10000000>

Resulting in:

(XEN) *** LOADING DOMU cpus=1 memory=80000KB ***
(XEN) Loading d1 kernel from boot module @ 0000000007200000
(XEN) Loading ramdisk from boot module @ 0000000008200000
(XEN) direct_map start=0x00000010000000 size=0x00000010000000
(XEN) direct_map start=0x00000870000000 size=0x00000010000000
(XEN) Data Abort Trap. Syndrome=0x5
(XEN) Walking Hypervisor VA 0x2403480018 on CPU0 via TTBR 0x0000000000f05000
(XEN) 0TH[0x0] = 0x0000000000f08f7f
(XEN) 1ST[0x90] = 0x0000000000000000
(XEN) CPU0: Unexpected Trap: Data Abort

[...]

(XEN) Xen call trace:
(XEN)    [<000000000021a65c>] page_alloc.c#alloc_pages_from_buddy+0x15c/0x5d0 (PC)
(XEN)    [<000000000021b43c>] reserve_domheap_pages+0xc4/0x148 (LR)

Anything other than the very top of memory works.


> > > > - in construct_domU, add the range to xenheap and reserve it with
> > > > reserve_heap_pages
> > > 
> > > I am afraid you can't give the regions to the allocator and then allocate
> > > them. The allocator is free to use any page for its own purpose or exclude
> > > them.
> > > 
> > > AFAICT, the allocator doesn't have a list of page in use. It only keeps
> > > track
> > > of free pages. So we can make the content of struct page_info to look like
> > > it
> > > was allocated by the allocator.
> > > 
> > > We would need to be careful when giving a page back to allocator as the
> > > page
> > > would need to be initialized (see [1]). This may not be a concern for
> > > Dom0less
> > > as the domain may never be destroyed but will be for correctness PoV.
> > > 
> > > For LiveUpdate, the original Xen will carve out space to use by the boot
> > > allocator in the new Xen. But I think this is not necessary in your
> > > context.
> > > 
> > > It should be sufficient to exclude the page from the boot allocators (as
> > > we do
> > > for other modules).
> > > 
> > > One potential issue that can arise is there is no easy way today to
> > > differentiate between pages allocated and pages not yet initialized. To
> > > make
> > > the code robust, we need to prevent a page to be used in two places. So
> > > for
> > > LiveUpdate we are marking them with a special value, this is used
> > > afterwards
> > > to check we are effictively using a reserved page.
> > > 
> > > I hope this helps.
> > 
> > Thanks for writing all of this down but I haven't understood some of it.
> > 
> > For the sake of this discussion let's say that we managed to "reserve"
> > the range early enough like we do for other modules, as you wrote.
> > 
> > At the point where we want to call reserve_heap_pages() we would call
> > init_heap_pages() just before it. We are still relatively early at boot
> > so there aren't any concurrent memory operations. Why this doesn't work?
> 
> Because init_heap_pages() may exclude some pages (for instance MFN 0 is carved
> out) or use pages for its internal structure (see init_node_heap()). So you
> can't expect to be able to allocate the exact same region by
> reserve_heap_pages().

But it can't possibly use any of the pages it is trying to add to the
heap, right?

We have reserved a certain range, we tell init_heap_pages to add the
range to the heap, init_node_heap gets called and it ends up calling
xmalloc. There is no way xmalloc can use any memory from that
particular range because it is not in the heap yet. That should be safe.

The init_node_heap code is a bit hard to follow but I went through it
and couldn't spot anything that could cause any issues (MFN 0 aside,
which is a bit special). Am I missing something?
Julien Grall May 12, 2020, 8:57 a.m. UTC | #11
Hi,

On 12/05/2020 02:10, Stefano Stabellini wrote:
> On Thu, 30 Apr 2020, Julien Grall wrote:
>> On 30/04/2020 18:00, Stefano Stabellini wrote:
>>> On Thu, 30 Apr 2020, Julien Grall wrote:
>>>>>>> +    pg = maddr_to_page(start);
>>>>>>> +    node = phys_to_nid(start);
>>>>>>> +    zone = page_to_zone(pg);
>>>>>>> +    page_list_del(pg, &heap(node, zone, order));
>>>>>>> +
>>>>>>> +    __alloc_heap_pages(pg, order, memflags, d);
>>>>>>
>>>>>> I agree with Julien in not seeing how this can be safe / correct.
>>>>>
>>>>> I haven't seen any issues so far in my testing -- I imagine it is
>>>>> because there aren't many memory allocations after setup_mm() and before
>>>>> create_domUs()  (which on ARM is called just before
>>>>> domain_unpause_by_systemcontroller at the end of start_xen.)
>>>>
>>>> I am not sure why you exclude setup_mm(). Any memory allocated (boot
>>>> allocator, xenheap) can clash with your regions. The main memory
>>>> allocations
>>>> are for the frametable and dom0. I would say you were lucky to not hit
>>>> them.
>>>
>>> Maybe it is because Xen typically allocates memory top-down? So if I
>>> chose a high range then I would see a failure? But I have been mostly
>>> testing with ranges close to the begin of RAM (as opposed to
>>> ranges close to the end of RAM.)
>>
>> I haven't looked at the details of the implementation, but you can try to
>> specify dom0 addresses for your domU. You should see a failure.
> 
> I managed to reproduce a failure by choosing the top address range. On
> Xilinx ZynqMP the memory is:
> 
>    reg = <0x0 0x0 0x0 0x7ff00000 0x8 0x0 0x0 0x80000000>;
> 
> And I chose:
> 
>    fdt set /chosen/domU0 direct-map <0x0 0x10000000 0x10000000 0x8 0x70000000 0x10000000>
> 
> Resulting in:
> 
> (XEN) *** LOADING DOMU cpus=1 memory=80000KB ***
> (XEN) Loading d1 kernel from boot module @ 0000000007200000
> (XEN) Loading ramdisk from boot module @ 0000000008200000
> (XEN) direct_map start=0x00000010000000 size=0x00000010000000
> (XEN) direct_map start=0x00000870000000 size=0x00000010000000
> (XEN) Data Abort Trap. Syndrome=0x5
> (XEN) Walking Hypervisor VA 0x2403480018 on CPU0 via TTBR 0x0000000000f05000
> (XEN) 0TH[0x0] = 0x0000000000f08f7f
> (XEN) 1ST[0x90] = 0x0000000000000000
> (XEN) CPU0: Unexpected Trap: Data Abort
> 
> [...]
> 
> (XEN) Xen call trace:
> (XEN)    [<000000000021a65c>] page_alloc.c#alloc_pages_from_buddy+0x15c/0x5d0 (PC)
> (XEN)    [<000000000021b43c>] reserve_domheap_pages+0xc4/0x148 (LR)

This isn't what I was expecting. If there is any failure, I would expect 
an error message, not a data abort. However...

> 
> Anything other than the very top of memory works.

... I am very confused by this. Are you suggesting that with your series 
you can allocate the same range for Dom0 and a DomU without any trouble?

> 
>>>>> - in construct_domU, add the range to xenheap and reserve it with
>>>>> reserve_heap_pages
>>>>
>>>> I am afraid you can't give the regions to the allocator and then allocate
>>>> them. The allocator is free to use any page for its own purpose or exclude
>>>> them.
>>>>
>>>> AFAICT, the allocator doesn't have a list of page in use. It only keeps
>>>> track
>>>> of free pages. So we can make the content of struct page_info to look like
>>>> it
>>>> was allocated by the allocator.
>>>>
>>>> We would need to be careful when giving a page back to allocator as the
>>>> page
>>>> would need to be initialized (see [1]). This may not be a concern for
>>>> Dom0less
>>>> as the domain may never be destroyed but will be for correctness PoV.
>>>>
>>>> For LiveUpdate, the original Xen will carve out space to use by the boot
>>>> allocator in the new Xen. But I think this is not necessary in your
>>>> context.
>>>>
>>>> It should be sufficient to exclude the page from the boot allocators (as
>>>> we do
>>>> for other modules).
>>>>
>>>> One potential issue that can arise is there is no easy way today to
>>>> differentiate between pages allocated and pages not yet initialized. To
>>>> make
>>>> the code robust, we need to prevent a page to be used in two places. So
>>>> for
>>>> LiveUpdate we are marking them with a special value, this is used
>>>> afterwards
>>>> to check we are effictively using a reserved page.
>>>>
>>>> I hope this helps.
>>>
>>> Thanks for writing all of this down but I haven't understood some of it.
>>>
>>> For the sake of this discussion let's say that we managed to "reserve"
>>> the range early enough like we do for other modules, as you wrote.
>>>
>>> At the point where we want to call reserve_heap_pages() we would call
>>> init_heap_pages() just before it. We are still relatively early at boot
>>> so there aren't any concurrent memory operations. Why this doesn't work?
>>
>> Because init_heap_pages() may exclude some pages (for instance MFN 0 is carved
>> out) or use pages for its internal structure (see init_node_heap()). So you
>> can't expect to be able to allocate the exact same region by
>> reserve_heap_pages().
> 
> But it can't possibly use of any of pages it is trying to add to the
> heap, right?
Yes it can; there are already multiple examples in the buddy allocator.

> 
> We have reserved a certain range, we tell init_heap_pages to add the
> range to the heap, init_node_heap gets called and it ends up calling
> xmalloc. There is no way xmalloc can use any memory from that
> particular range because it is not in the heap yet. That should be safe.

If you look carefully at the code, you will notice:

     else if ( *use_tail && nr >= needed &&
               arch_mfn_in_directmap(mfn + nr) &&
               (!xenheap_bits ||
                !((mfn + nr - 1) >> (xenheap_bits - PAGE_SHIFT))) )
     {
         _heap[node] = mfn_to_virt(mfn + nr - needed);
         avail[node] = mfn_to_virt(mfn + nr - 1) +
                       PAGE_SIZE - sizeof(**avail) * NR_ZONES;
     }

This is one of the conditions where the allocator will use a few pages 
from the region for itself.

> The init_node_heap code is a bit hard to follow but I went through it
> and couldn't spot anything that could cause any issues (MFN 0 aside
> which is a bit special). Am I missing something?
Aside from what I wrote above, as soon as you give a page to an allocator, 
you waive the right to decide what the page is used for. The allocator is 
free to use the page for bookkeeping or even carve the page out because 
it can't deal with it.

So I don't really see how giving a region to the allocator and then 
expecting to get the same region back on a later call is ever going to 
be safe.

Cheers,
Patch

diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c
index 79ae64d4b8..3a9c1a291b 100644
--- a/xen/common/page_alloc.c
+++ b/xen/common/page_alloc.c
@@ -911,7 +911,7 @@  static struct page_info *get_free_buddy(unsigned int zone_lo,
     }
 }
 
-static void __alloc_heap_pages(struct page_info **pgo,
+static void __alloc_heap_pages(struct page_info *pg,
                                unsigned int order,
                                unsigned int memflags,
                                struct domain *d)
@@ -922,7 +922,7 @@  static void __alloc_heap_pages(struct page_info **pgo,
     bool need_tlbflush = false;
     uint32_t tlbflush_timestamp = 0;
     unsigned int dirty_cnt = 0;
-    struct page_info *pg = *pgo;
+    struct page_info *pg_start = pg;
 
     node = phys_to_nid(page_to_maddr(pg));
     zone = page_to_zone(pg);
@@ -934,10 +934,10 @@  static void __alloc_heap_pages(struct page_info **pgo,
     while ( buddy_order != order )
     {
         buddy_order--;
+        pg = pg_start + (1U << buddy_order);
         page_list_add_scrub(pg, node, zone, buddy_order,
                             (1U << buddy_order) > first_dirty ?
                             first_dirty : INVALID_DIRTY_IDX);
-        pg += 1U << buddy_order;
 
         if ( first_dirty != INVALID_DIRTY_IDX )
         {
@@ -948,7 +948,7 @@  static void __alloc_heap_pages(struct page_info **pgo,
                 first_dirty = 0; /* We've moved past original first_dirty */
         }
     }
-    *pgo = pg;
+    pg = pg_start;
 
     ASSERT(avail[node][zone] >= request);
     avail[node][zone] -= request;
@@ -1073,7 +1073,42 @@  static struct page_info *alloc_heap_pages(
         return NULL;
     }
 
-    __alloc_heap_pages(&pg, order, memflags, d);
+    __alloc_heap_pages(pg, order, memflags, d);
+    return pg;
+}
+
+static struct page_info *reserve_heap_pages(struct domain *d,
+                                            paddr_t start,
+                                            unsigned int order,
+                                            unsigned int memflags)
+{
+    nodeid_t node;
+    unsigned int zone;
+    struct page_info *pg;
+
+    if ( unlikely(order > MAX_ORDER) )
+        return NULL;
+
+    spin_lock(&heap_lock);
+
+    /*
+     * Claimed memory is considered unavailable unless the request
+     * is made by a domain with sufficient unclaimed pages.
+     */
+    if ( (outstanding_claims + (1UL << order) > total_avail_pages) &&
+          ((memflags & MEMF_no_refcount) ||
+           !d || d->outstanding_pages < (1UL << order)) )
+    {
+        spin_unlock(&heap_lock);
+        return NULL;
+    }
+
+    pg = maddr_to_page(start);
+    node = phys_to_nid(start);
+    zone = page_to_zone(pg);
+    page_list_del(pg, &heap(node, zone, order));
+
+    __alloc_heap_pages(pg, order, memflags, d);
     return pg;
 }
 
@@ -2385,6 +2420,33 @@  struct page_info *alloc_domheap_pages(
     return pg;
 }
 
+struct page_info *reserve_domheap_pages(
+    struct domain *d, paddr_t start, unsigned int order, unsigned int memflags)
+{
+    struct page_info *pg = NULL;
+
+    ASSERT(!in_irq());
+
+    if ( memflags & MEMF_no_owner )
+        memflags |= MEMF_no_refcount;
+    else if ( (memflags & MEMF_no_refcount) && d )
+    {
+        ASSERT(!(memflags & MEMF_no_refcount));
+        return NULL;
+    }
+
+    pg = reserve_heap_pages(d, start, order, memflags);
+
+    if ( d && !(memflags & MEMF_no_owner) &&
+         assign_pages(d, pg, order, memflags) )
+    {
+        free_heap_pages(pg, order, memflags & MEMF_no_scrub);
+        return NULL;
+    }
+
+    return pg;
+}
+
 void free_domheap_pages(struct page_info *pg, unsigned int order)
 {
     struct domain *d = page_get_owner(pg);
diff --git a/xen/include/xen/mm.h b/xen/include/xen/mm.h
index 9b62087be1..35407e1b68 100644
--- a/xen/include/xen/mm.h
+++ b/xen/include/xen/mm.h
@@ -199,6 +199,8 @@  void get_outstanding_claims(uint64_t *free_pages, uint64_t *outstanding_pages);
 void init_domheap_pages(paddr_t ps, paddr_t pe);
 struct page_info *alloc_domheap_pages(
     struct domain *d, unsigned int order, unsigned int memflags);
+struct page_info *reserve_domheap_pages(
+    struct domain *d, paddr_t start, unsigned int order, unsigned int memflags);
 void free_domheap_pages(struct page_info *pg, unsigned int order);
 unsigned long avail_domheap_pages_region(
     unsigned int node, unsigned int min_width, unsigned int max_width);