[1/5] mm: Add support for unaccepted memory

Message ID 20210810062626.1012-2-kirill.shutemov@linux.intel.com (mailing list archive)
State New
Series x86: Implement support for unaccepted memory

Commit Message

Kirill A. Shutemov Aug. 10, 2021, 6:26 a.m. UTC
UEFI Specification version 2.9 introduces the concept of memory acceptance:
some virtual machine platforms, such as Intel TDX or AMD SEV-SNP, require
memory to be accepted before it can be used by the guest. Accepting happens
via a protocol specific to the virtual machine platform.

Accepting memory is costly and it forces the VMM to allocate memory for the
accepted guest physical address range. It's better to postpone memory
acceptance until the memory is needed. This lowers boot time and reduces
memory overhead.

Support for such memory requires a few changes in core-mm code:

  - memblock has to accept memory on allocation;

  - page allocator has to accept memory on the first allocation of the
    page;

The memblock change is trivial.

The page allocator is modified to accept pages on the first allocation.
PageOffline() is used to indicate that a page requires acceptance.
The flag is currently used by memory hotplug and balloon drivers. Such
pages are not available to the page allocator.

An architecture has to provide three helpers if it wants to support
unaccepted memory:

 - accept_memory() makes a range of physical addresses accepted.

 - maybe_set_page_offline() marks a page PageOffline() if it requires
   acceptance. Used during boot to put pages on free lists.

 - clear_page_offline() makes a page accepted and clears
   PageOffline().

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/internal.h   | 14 ++++++++++++++
 mm/memblock.c   |  1 +
 mm/page_alloc.c | 13 ++++++++++++-
 3 files changed, 27 insertions(+), 1 deletion(-)

Comments

David Hildenbrand Aug. 10, 2021, 7:48 a.m. UTC | #1
On 10.08.21 08:26, Kirill A. Shutemov wrote:
> UEFI Specification version 2.9 introduces the concept of memory acceptance:
> some virtual machine platforms, such as Intel TDX or AMD SEV-SNP, require
> memory to be accepted before it can be used by the guest. Accepting happens
> via a protocol specific to the virtual machine platform.
> 
> Accepting memory is costly and it forces the VMM to allocate memory for the
> accepted guest physical address range. It's better to postpone memory
> acceptance until the memory is needed. This lowers boot time and reduces
> memory overhead.
> 
> Support for such memory requires a few changes in core-mm code:
> 
>    - memblock has to accept memory on allocation;
> 
>    - page allocator has to accept memory on the first allocation of the
>      page;
> 
> The memblock change is trivial.
> 
> The page allocator is modified to accept pages on the first allocation.
> PageOffline() is used to indicate that a page requires acceptance.
> The flag is currently used by memory hotplug and balloon drivers. Such
> pages are not available to the page allocator.
> 
> An architecture has to provide three helpers if it wants to support
> unaccepted memory:
> 
>   - accept_memory() makes a range of physical addresses accepted.
> 
>   - maybe_set_page_offline() marks a page PageOffline() if it requires
>     acceptance. Used during boot to put pages on free lists.
> 
>   - clear_page_offline() makes a page accepted and clears
>     PageOffline().
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   mm/internal.h   | 14 ++++++++++++++
>   mm/memblock.c   |  1 +
>   mm/page_alloc.c | 13 ++++++++++++-
>   3 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 31ff935b2547..d2fc8a17fbe0 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -662,4 +662,18 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>   int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>   		      unsigned long addr, int page_nid, int *flags);
>   
> +#ifndef CONFIG_UNACCEPTED_MEMORY
> +static inline void maybe_set_page_offline(struct page *page, unsigned int order)
> +{
> +}
> +
> +static inline void clear_page_offline(struct page *page, unsigned int order)
> +{
> +}
> +
> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> +{
> +}

Can we find better fitting names for the first two? The function names 
are way too generic. For example:

accept_or_set_page_offline()

accept_and_clear_page_offline()

I thought for a second if
	PAGE_TYPE_OPS(Unaccepted, offline)
makes sense as well, not sure.


Also, please update the description of PageOffline in page-flags.h to 
include the additional usage with PageBuddy set at the same time.


I assume you don't have to worry about page_offline_freeze/thaw ... as 
we only set PageOffline initially, but not later at runtime when other 
subsystems (/proc/kcore) might stumble over it.
kirill.shutemov@linux.intel.com Aug. 10, 2021, 3:02 p.m. UTC | #2
On Tue, Aug 10, 2021 at 09:48:04AM +0200, David Hildenbrand wrote:
> On 10.08.21 08:26, Kirill A. Shutemov wrote:
> > UEFI Specification version 2.9 introduces the concept of memory acceptance:
> > some virtual machine platforms, such as Intel TDX or AMD SEV-SNP, require
> > memory to be accepted before it can be used by the guest. Accepting happens
> > via a protocol specific to the virtual machine platform.
> > 
> > Accepting memory is costly and it forces the VMM to allocate memory for the
> > accepted guest physical address range. It's better to postpone memory
> > acceptance until the memory is needed. This lowers boot time and reduces
> > memory overhead.
> > 
> > Support for such memory requires a few changes in core-mm code:
> > 
> >    - memblock has to accept memory on allocation;
> > 
> >    - page allocator has to accept memory on the first allocation of the
> >      page;
> > 
> > The memblock change is trivial.
> > 
> > The page allocator is modified to accept pages on the first allocation.
> > PageOffline() is used to indicate that a page requires acceptance.
> > The flag is currently used by memory hotplug and balloon drivers. Such
> > pages are not available to the page allocator.
> > 
> > An architecture has to provide three helpers if it wants to support
> > unaccepted memory:
> > 
> >   - accept_memory() makes a range of physical addresses accepted.
> > 
> >   - maybe_set_page_offline() marks a page PageOffline() if it requires
> >     acceptance. Used during boot to put pages on free lists.
> > 
> >   - clear_page_offline() makes a page accepted and clears
> >     PageOffline().
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >   mm/internal.h   | 14 ++++++++++++++
> >   mm/memblock.c   |  1 +
> >   mm/page_alloc.c | 13 ++++++++++++-
> >   3 files changed, 27 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 31ff935b2547..d2fc8a17fbe0 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -662,4 +662,18 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
> >   int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
> >   		      unsigned long addr, int page_nid, int *flags);
> > +#ifndef CONFIG_UNACCEPTED_MEMORY
> > +static inline void maybe_set_page_offline(struct page *page, unsigned int order)
> > +{
> > +}
> > +
> > +static inline void clear_page_offline(struct page *page, unsigned int order)
> > +{
> > +}
> > +
> > +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> > +{
> > +}
> 
> Can we find better fitting names for the first two? The function names are
> way too generic. For example:
> 
> accept_or_set_page_offline()
> 
> accept_and_clear_page_offline()

Sounds good.

> I thought for a second if
> 	PAGE_TYPE_OPS(Unaccepted, offline)
> makes sense as well, not sure.

I find Offline fitting the situation. Don't see a reason to add more
terminology here.

> Also, please update the description of PageOffline in page-flags.h to
> include the additional usage with PageBuddy set at the same time.

Okay.

> I assume you don't have to worry about page_offline_freeze/thaw ... as we
> only set PageOffline initially, but not later at runtime when other
> subsystems (/proc/kcore) might stumble over it.

I think so, but I would need to look at this code once again.
David Hildenbrand Aug. 10, 2021, 3:21 p.m. UTC | #3
On 10.08.21 17:02, Kirill A. Shutemov wrote:
> On Tue, Aug 10, 2021 at 09:48:04AM +0200, David Hildenbrand wrote:
>> On 10.08.21 08:26, Kirill A. Shutemov wrote:
>>> UEFI Specification version 2.9 introduces the concept of memory acceptance:
>>> some virtual machine platforms, such as Intel TDX or AMD SEV-SNP, require
>>> memory to be accepted before it can be used by the guest. Accepting happens
>>> via a protocol specific to the virtual machine platform.
>>>
>>> Accepting memory is costly and it forces the VMM to allocate memory for the
>>> accepted guest physical address range. It's better to postpone memory
>>> acceptance until the memory is needed. This lowers boot time and reduces
>>> memory overhead.
>>>
>>> Support for such memory requires a few changes in core-mm code:
>>>
>>>     - memblock has to accept memory on allocation;
>>>
>>>     - page allocator has to accept memory on the first allocation of the
>>>       page;
>>>
>>> The memblock change is trivial.
>>>
>>> The page allocator is modified to accept pages on the first allocation.
>>> PageOffline() is used to indicate that a page requires acceptance.
>>> The flag is currently used by memory hotplug and balloon drivers. Such
>>> pages are not available to the page allocator.
>>>
>>> An architecture has to provide three helpers if it wants to support
>>> unaccepted memory:
>>>
>>>    - accept_memory() makes a range of physical addresses accepted.
>>>
>>>    - maybe_set_page_offline() marks a page PageOffline() if it requires
>>>      acceptance. Used during boot to put pages on free lists.
>>>
>>>    - clear_page_offline() makes a page accepted and clears
>>>      PageOffline().
>>>
>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> ---
>>>    mm/internal.h   | 14 ++++++++++++++
>>>    mm/memblock.c   |  1 +
>>>    mm/page_alloc.c | 13 ++++++++++++-
>>>    3 files changed, 27 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index 31ff935b2547..d2fc8a17fbe0 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -662,4 +662,18 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
>>>    int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>>>    		      unsigned long addr, int page_nid, int *flags);
>>> +#ifndef CONFIG_UNACCEPTED_MEMORY
>>> +static inline void maybe_set_page_offline(struct page *page, unsigned int order)
>>> +{
>>> +}
>>> +
>>> +static inline void clear_page_offline(struct page *page, unsigned int order)
>>> +{
>>> +}
>>> +
>>> +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
>>> +{
>>> +}
>>
>> Can we find better fitting names for the first two? The function names are
>> way too generic. For example:
>>
>> accept_or_set_page_offline()
>>
>> accept_and_clear_page_offline()
> 
> Sounds good.
> 
>> I thought for a second if
>> 	PAGE_TYPE_OPS(Unaccepted, offline)
>> makes sense as well, not sure.
> 
> I find Offline fitting the situation. Don't see a reason to add more
> terminology here.
> 
>> Also, please update the description of PageOffline in page-flags.h to
>> include the additional usage with PageBuddy set at the same time.
> 
> Okay.
> 
>> I assume you don't have to worry about page_offline_freeze/thaw ... as we
>> only set PageOffline initially, but not later at runtime when other
>> subsystems (/proc/kcore) might stumble over it.
> 
> I think so, but I would need to look at this code once again.
> 

Another thing to look into would be teaching makedumpfile via vmcoreinfo 
about these special buddy pages:

makedumpfile will naturally skip all PageOffline pages and skip 
PageBuddy pages if requested to skip free pages. It detects these pages 
via the mapcount value. You will want makedumpfile to treat them like 
PageOffline pages: kernel/crash_core.c

#define PAGE_BUDDY_MAPCOUNT_VALUE	(~PG_buddy)
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);

#define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);

We could export PAGE_BUDDY_OFFLINE_MAPCOUNT_VALUE or just compute it 
inside makedumpfile from the other two values.
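
Computed from the two values above, the combined constant would presumably
look something like this -- a sketch only, since the exact encoding depends
on how the page type bits compose in page->_mapcount:

#define PAGE_BUDDY_OFFLINE_MAPCOUNT_VALUE	(~(PG_buddy | PG_offline))
VMCOREINFO_NUMBER(PAGE_BUDDY_OFFLINE_MAPCOUNT_VALUE);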
Dave Hansen Aug. 10, 2021, 6:13 p.m. UTC | #4
> @@ -1001,6 +1004,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>  	if (page_reported(page))
>  		__ClearPageReported(page);
>  
> +	if (PageOffline(page))
> +		clear_page_offline(page, order);
> +
>  	list_del(&page->lru);
>  	__ClearPageBuddy(page);
>  	set_page_private(page, 0);

So, this is right in the fast path of the page allocator.  It's a
one-time thing per 2M page, so it's not permanent.

*But* there's both a global spinlock and a firmware call hidden in
clear_page_offline().  That's *GOT* to hurt if you were, for instance,
running a benchmark while this code path is being tickled.

That could be just downright catastrophic for scalability, albeit
temporarily.
Andi Kleen Aug. 10, 2021, 6:30 p.m. UTC | #5
> So, this is right in the fast path of the page allocator.  It's a
> one-time thing per 2M page, so it's not permanent.
>
> *But* there's both a global spinlock and a firmware call hidden in
> clear_page_offline().  That's *GOT* to hurt if you were, for instance,
> running a benchmark while this code path is being tickled.
>
> That could be just downright catastrophic for scalability, albeit
> temporarily.

This would only be a short blip at initialization until the system 
reaches steady state. So yes it would be temporary, but very short at that.

-Andi
Dave Hansen Aug. 10, 2021, 6:56 p.m. UTC | #6
On 8/10/21 11:30 AM, Andi Kleen wrote:
>> So, this is right in the fast path of the page allocator.  It's a
>> one-time thing per 2M page, so it's not permanent.
>>
>> *But* there's both a global spinlock and a firmware call hidden in
>> clear_page_offline().  That's *GOT* to hurt if you were, for instance,
>> running a benchmark while this code path is being tickled.
>>
>> That could be just downright catastrophic for scalability, albeit
>> temporarily.
> 
> This would only be a short blip at initialization until the system
> reaches steady state. So yes it would be temporary, but very short at that.

But it can't be *that* short or we wouldn't be going to all this trouble
in the first place.  This can't simultaneously be both bad enough that
this series exists, but minor enough that nobody will notice or care at
runtime.

In general, I'd rather have a system which is running userspace, slowly,
than one where I'm waiting for the kernel.  The trade-off being made is
a *good* trade-off for me.  But, not everyone is going to agree with me.

This also begs the question of how folks know when this "blip" is over.
 Do we have a counter for offline pages?  Is there any way to force page
acceptance?  Or, are we just stuck allocating a bunch of memory to warm
up the system?

How do folks who care about these new blips avoid them?

Again, I don't particularly care about how this affects the
benchmarkers.  But, I do care that they're going to hound us when these
blips start impacting their 99th percentile tail latencies.
Andi Kleen Aug. 10, 2021, 7:23 p.m. UTC | #7
> But, not everyone is going to agree with me.

Both the Intel TDX and the AMD SEV side independently came to the opposite 
conclusion. In general people care a lot about boot time of VM guests.


>
> This also begs the question of how folks know when this "blip" is over.
>   Do we have a counter for offline pages?  Is there any way to force page
> acceptance?  Or, are we just stuck allocating a bunch of memory to warm
> up the system?
>
> How do folks who care about these new blips avoid them?

It's no different than any other warmup. During warmup you always have 
lots of blips until the working set stabilizes. For example, in 
virtualization the first touch of a new page usually causes an EPT violation 
handled by the host. Or in the native case you may need to do IO or free 
memory. Everybody who bases their critical latency percentiles on a 
warming-up process would be foolish; the picture would be completely 
distorted.

So the basic operation is adding some overhead, but I don't think 
anything is that unusual compared to the state of the art.

Now perhaps the locking might be a problem if the other operations all 
run in parallel, causing unnecessary serialization. If that's really a 
problem I guess we can optimize later. I don't think there's anything 
fundamental about the current locking.


-Andi
Dave Hansen Aug. 10, 2021, 7:46 p.m. UTC | #8
On 8/10/21 12:23 PM, Andi Kleen wrote:
>> But, not everyone is going to agree with me.
> 
> Both the Intel TDX and the AMD SEV side independently came to the opposite
> conclusion. In general people care a lot about boot time of VM guests.

I was also saying that getting to userspace fast is important to me.
Almost everyone agrees there.

>> This also begs the question of how folks know when this "blip" is over.
>>   Do we have a counter for offline pages?  Is there any way to force page
>> acceptance?  Or, are we just stuck allocating a bunch of memory to warm
>> up the system?
>>
>> How do folks who care about these new blips avoid them?
> 
> It's no different than any other warmup. During warmup you always have
> lots of blips until the working set stabilizes. For example, in
> virtualization the first touch of a new page usually causes an EPT violation
> handled by the host. Or in the native case you may need to do IO or free
> memory. Everybody who bases their critical latency percentiles on a
> warming-up process would be foolish; the picture would be completely
> distorted.
> 
> So the basic operation is adding some overhead, but I don't think
> anything is that unusual compared to the state of the art.

Except that today, you can totally avoid the allocation latency (not
sure about the EPT violation/fill latency) with things like QEMU's
-mem-prealloc.

> Now perhaps the locking might be a problem if the other operations all
> run in parallel, causing unnecessary serialization. If that's really a
> problem I guess we can optimize later. I don't think there's anything
> fundamental about the current locking.

These boot blips are not the biggest issue in the world.  But, it is
fully under the guest's control and I think the guest has some
responsibility to provide *some* mitigation for it.

1. Do background acceptance, as opposed to relying 100% on demand-driven
   acceptance.  Guarantees a limited window in which blips can occur.
2. Do acceptance based on user input, like from sysfs.
3. Add a command-line argument to accept everything up front, or at
   least before userspace runs.
4. Add some statistic for how much unaccepted memory remains.

I can think of at least four ways we could mitigate it.  A sysfs
statistic file would probably take ~30 lines of code to loop over the
bitmap.  A command-line option would probably be <10 lines of code to
just short-circuit the bitmap and accept everything up front.  A file to
force acceptance would probably be pretty quick too.

Nothing there seems too onerous.
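
For instance, the statistic file could be little more than a bitmap_weight()
over the acceptance bitmap. A rough sketch -- 'accepted_bitmap' and
'nr_huge_pfns' are placeholder names, not anything from this series:

static ssize_t unaccepted_show(struct kobject *kobj,
			       struct kobj_attribute *attr, char *buf)
{
	/* 2MB units that have not been accepted yet. */
	unsigned long nr = nr_huge_pfns -
			   bitmap_weight(accepted_bitmap, nr_huge_pfns);

	return sysfs_emit(buf, "%lu\n", nr * PMD_SIZE);
}
static struct kobj_attribute unaccepted_attr = __ATTR_RO(unaccepted);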
Dave Hansen Aug. 10, 2021, 8:50 p.m. UTC | #9
On 8/10/21 11:13 AM, Dave Hansen wrote:
>> @@ -1001,6 +1004,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
>>  	if (page_reported(page))
>>  		__ClearPageReported(page);
>>  
>> +	if (PageOffline(page))
>> +		clear_page_offline(page, order);
>> +
>>  	list_del(&page->lru);
>>  	__ClearPageBuddy(page);
>>  	set_page_private(page, 0);
> So, this is right in the fast path of the page allocator.  It's a
> one-time thing per 2M page, so it's not permanent.
> 
> *But* there's both a global spinlock and a firmware call hidden in
> clear_page_offline().  That's *GOT* to hurt if you were, for instance,
> running a benchmark while this code path is being tickled.  Not just to
> 
> That could be just downright catastrophic for scalability, albeit
> temporarily.

One more thing...

How long are these calls?  You have to make at least 512 calls into the
SEAM module.  Assuming they're syscall-ish, so ~1,000 cycles each,
that's ~500,000 cycles, even if we ignore the actual time it takes to
zero that 2MB worth of memory and all other overhead within the SEAM module.

So, we're sitting on one CPU with interrupts off, blocking all the other
CPUs from doing page allocation in this zone.  Then, we're holding a
global lock which prevents any other NUMA nodes from accepting pages.
If the other node happens to *try* to do an accept, it will sit with its
zone lock held waiting for this one.

Maybe nobody will ever notice.  But, it seems like an awfully big risk
to me.  I'd at least *try* do these calls outside of the zone lock.
Then the collateral damage will at least be limited to things doing
accepts rather than all zone->lock users.

Couldn't we delay the acceptance to, say, the place where we've dropped
the zone->lock and do the __GFP_ZERO memset() like at prep_new_page()?
Or is there some concern that the page has been split at that point?

I guess that makes it more complicated because you might have a 4k page
but you need to go accept a 2M page.  You might end up having to check
the bitmap 511 more times because you might see 511 more PageOffline()
pages come through.

You shouldn't even need the bitmap lock to read since it's a one-way
trip from unaccepted->accepted.
Andi Kleen Aug. 10, 2021, 9:20 p.m. UTC | #10
> These boot blips are not the biggest issue in the world.  But, it is
> fully under the guest's control and I think the guest has some
> responsibility to provide *some* mitigation for it.

It sounds more like an exercise in premature optimization at this 
point. If it's a real problem we can still worry about it later.


> 1. Do background acceptance, as opposed to relying 100% on demand-driven
>     acceptance.  Guarantees a limited window in which blips can occur.

As Kirill wrote, this was abandoned because it always allocates all 
memory on the host even if the guest doesn't need it.


> 2. Do acceptance based on user input, like from sysfs.

You can easily do that by running "memhog" at boot. No need for anything 
in the kernel.

BTW I believe this is also configurable at the guest BIOS level.

> 3. Add a command-line argument to accept everything up front, or at
>     least before userspace runs.

Same.


> 4. Add some statistic for how much unaccepted memory remains.

Yes that makes sense. We should have statistic counters for all of this.

Also I agree with your suggestion that we should get the slow path out 
of the zone locks/interrupt disable region. That should be easy enough 
and is an obvious improvement.

-Andi
Joerg Roedel Aug. 12, 2021, 8:19 a.m. UTC | #11
On Tue, Aug 10, 2021 at 02:20:08PM -0700, Andi Kleen wrote:
> Also I agree with your suggestion that we should get the slow path out of
> the zone locks/interrupt disable region. That should be easy enough and is
> an obvious improvement.

I also agree that the slow-path needs to be outside of the memory
allocator locks. But I think this conflicts with the concept of
accepting memory in 2MB chunks even if allocation size is smaller.

If some kernel code allocates 2 pages and the allocator path starts
to validate the whole 2MB page the memory is on, then there are
potential races to take into account.

Either some other code path allocates memory from that page and returns
it before validation is finished or we end up with double validation.
Returning unvalidated memory is a guest-problem and double validation
will cause security issues for SNP guests.

Regards,

	Joerg
Dave Hansen Aug. 12, 2021, 2:14 p.m. UTC | #12
On 8/12/21 1:19 AM, Joerg Roedel wrote:
> On Tue, Aug 10, 2021 at 02:20:08PM -0700, Andi Kleen wrote:
>> Also I agree with your suggestion that we should get the slow path out of
>> the zone locks/interrupt disable region. That should be easy enough and is
>> an obvious improvement.
> 
> I also agree that the slow-path needs to be outside of the memory
> allocator locks. But I think this conflicts with the concept of
> accepting memory in 2MB chunks even if allocation size is smaller.
> 
> If some kernel code allocates 2 pages and the allocator path starts
> to validate the whole 2MB page the memory is on, then there are
> potential races to take into account.

Yeah, the PageOffline()+PageBuddy() trick breaks down as soon as
PageBuddy() gets cleared.

I'm not 100% sure we need a page flag, though.  Imagine if we just did a
static key check in prep_new_page():

	if (static_key_whatever(tdx_accept_ongoing))
		maybe_accept_page(page, order);

maybe_accept_page() could just check the acceptance bitmap and see if
the 2MB page has been accepted.  If so, just return.  If not, take the
bitmap lock, accept the 2MB page, then mark the bitmap.

maybe_accept_page()
{
	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;

	/* Test the bit before taking any locks: */
	if (test_bit(huge_pfn, &accepted_bitmap))
		return;

	spin_lock_irq();
	/* Retest inside the lock: */
	if (test_bit(huge_pfn, &accepted_bitmap)) {
		spin_unlock_irq();
		return;
	}
	tdx_accept_page(page, PMD_SIZE);
	set_bit(huge_pfn, &accepted_bitmap);
	spin_unlock_irq();
}

That's still not great.  It's still a global lock and the lock is still
held for quite a while because that accept isn't going to be lightning
fast.  But, at least it's not holding any *other* locks.  It also
doesn't take any locks in the fast path where the 2MB page was already
accepted.

The locking could be more fine-grained, for sure.  The bitmap could, for
instance, have a lock bit too.  Or we could just have an array of locks
and hash the huge_pfn to find a lock given a huge_pfn.  But, for now, I
think it's fine to just keep the global lock.
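
A sketch of the lock-array idea, with made-up names -- a small, statically
sized array of spinlocks indexed by hashing the huge_pfn (each lock would
need spin_lock_init() at boot):

#define NR_ACCEPT_LOCKS	64	/* arbitrary; power of two keeps it cheap */
static spinlock_t accept_locks[NR_ACCEPT_LOCKS];

static spinlock_t *accept_lock(unsigned long huge_pfn)
{
	return &accept_locks[hash_long(huge_pfn, ilog2(NR_ACCEPT_LOCKS))];
}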

> Either some other code path allocates memory from that page and returns
> it before validation is finished or we end up with double validation.
> Returning unvalidated memory is a guest-problem and double validation
> will cause security issues for SNP guests.

Yeah, I think the *canonical* source of information for accepts is the
bitmap.  The page flags and any static keys or whatever are
less-canonical sources that tell you when you _might_ need to consult
the bitmap.
Kirill A. Shutemov Aug. 12, 2021, 8:34 p.m. UTC | #13
On Tue, Aug 10, 2021 at 05:21:48PM +0200, David Hildenbrand wrote:
> On 10.08.21 17:02, Kirill A. Shutemov wrote:
> > On Tue, Aug 10, 2021 at 09:48:04AM +0200, David Hildenbrand wrote:
> > > On 10.08.21 08:26, Kirill A. Shutemov wrote:
> > > > UEFI Specification version 2.9 introduces the concept of memory acceptance:
> > > > some virtual machine platforms, such as Intel TDX or AMD SEV-SNP, require
> > > > memory to be accepted before it can be used by the guest. Accepting happens
> > > > via a protocol specific to the virtual machine platform.
> > > > 
> > > > Accepting memory is costly and it forces the VMM to allocate memory for the
> > > > accepted guest physical address range. It's better to postpone memory
> > > > acceptance until the memory is needed. This lowers boot time and reduces
> > > > memory overhead.
> > > > 
> > > > Support for such memory requires a few changes in core-mm code:
> > > > 
> > > >     - memblock has to accept memory on allocation;
> > > > 
> > > >     - page allocator has to accept memory on the first allocation of the
> > > >       page;
> > > > 
> > > > The memblock change is trivial.
> > > > 
> > > > The page allocator is modified to accept pages on the first allocation.
> > > > PageOffline() is used to indicate that a page requires acceptance.
> > > > The flag is currently used by memory hotplug and balloon drivers. Such
> > > > pages are not available to the page allocator.
> > > > 
> > > > An architecture has to provide three helpers if it wants to support
> > > > unaccepted memory:
> > > > 
> > > >    - accept_memory() makes a range of physical addresses accepted.
> > > > 
> > > >    - maybe_set_page_offline() marks a page PageOffline() if it requires
> > > >      acceptance. Used during boot to put pages on free lists.
> > > > 
> > > >    - clear_page_offline() makes a page accepted and clears
> > > >      PageOffline().
> > > > 
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > ---
> > > >    mm/internal.h   | 14 ++++++++++++++
> > > >    mm/memblock.c   |  1 +
> > > >    mm/page_alloc.c | 13 ++++++++++++-
> > > >    3 files changed, 27 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index 31ff935b2547..d2fc8a17fbe0 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -662,4 +662,18 @@ void vunmap_range_noflush(unsigned long start, unsigned long end);
> > > >    int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
> > > >    		      unsigned long addr, int page_nid, int *flags);
> > > > +#ifndef CONFIG_UNACCEPTED_MEMORY
> > > > +static inline void maybe_set_page_offline(struct page *page, unsigned int order)
> > > > +{
> > > > +}
> > > > +
> > > > +static inline void clear_page_offline(struct page *page, unsigned int order)
> > > > +{
> > > > +}
> > > > +
> > > > +static inline void accept_memory(phys_addr_t start, phys_addr_t end)
> > > > +{
> > > > +}
> > > 
> > > Can we find better fitting names for the first two? The function names are
> > > way too generic. For example:
> > > 
> > > accept_or_set_page_offline()
> > > 
> > > accept_and_clear_page_offline()
> > 
> > Sounds good.
> > 
> > > I thought for a second if
> > > 	PAGE_TYPE_OPS(Unaccepted, offline)
> > > makes sense as well, not sure.
> > 
> > I find Offline fitting the situation. Don't see a reason to add more
> > terminology here.
> > 
> > > Also, please update the description of PageOffline in page-flags.h to
> > > include the additional usage with PageBuddy set at the same time.
> > 
> > Okay.
> > 
> > > I assume you don't have to worry about page_offline_freeze/thaw ... as we
> > > only set PageOffline initially, but not later at runtime when other
> > > subsystems (/proc/kcore) might stumble over it.
> > 
> > I think so, but I would need to look at this code once again.
> > 
> 
> Another thing to look into would be teaching makedumpfile via vmcoreinfo
> about these special buddy pages:
> 
> makedumpfile will naturally skip all PageOffline pages and skip PageBuddy
> pages if requested to skip free pages. It detects these pages via the
> mapcount value. You will want makedumpfile to treat them like PageOffline
> pages: kernel/crash_core.c
> 
> #define PAGE_BUDDY_MAPCOUNT_VALUE	(~PG_buddy)
> VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
> 
> #define PAGE_OFFLINE_MAPCOUNT_VALUE	(~PG_offline)
> VMCOREINFO_NUMBER(PAGE_OFFLINE_MAPCOUNT_VALUE);
> 
> We could export PAGE_BUDDY_OFFLINE_MAPCOUNT_VALUE or just compute it inside
> makedumpfile from the other two values.

Thanks for digging it up. I'll look into makedumpfile, but it's not on
top of my todo list, so may take a while.
Kirill A. Shutemov Aug. 12, 2021, 8:49 p.m. UTC | #14
On Thu, Aug 12, 2021 at 07:14:20AM -0700, Dave Hansen wrote:
> On 8/12/21 1:19 AM, Joerg Roedel wrote:
> > On Tue, Aug 10, 2021 at 02:20:08PM -0700, Andi Kleen wrote:
> >> Also I agree with your suggestion that we should get the slow path out of
> >> the zone locks/interrupt disable region. That should be easy enough and is
> >> an obvious improvement.
> > 
> > I also agree that the slow-path needs to be outside of the memory
> > allocator locks. But I think this conflicts with the concept of
> > accepting memory in 2MB chunks even if allocation size is smaller.
> > 
> > If some kernel code allocates 2 pages and the allocator path starts
> > to validate the whole 2MB page the memory is on, then there are
> > potential races to take into account.
> 
> Yeah, the PageOffline()+PageBuddy() trick breaks down as soon as
> PageBuddy() gets cleared.
> 
> I'm not 100% sure we need a page flag, though.  Imagine if we just did a
> static key check in prep_new_page():
> 
> 	if (static_key_whatever(tdx_accept_ongoing))
> 		maybe_accept_page(page, order);
> 
> maybe_accept_page() could just check the acceptance bitmap and see if
> the 2MB page has been accepted.  If so, just return.  If not, take the
> bitmap lock, accept the 2MB page, then mark the bitmap.
> 
> maybe_accept_page()
> {
> 	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;
> 
> 	/* Test the bit before taking any locks: */
> 	if (test_bit(huge_pfn, &accepted_bitmap))
> 		return;
> 
> 	spin_lock_irq();
> 	/* Retest inside the lock: */
> 	if (test_bit(huge_pfn, &accepted_bitmap))
> 		return;
> 	tdx_accept_page(page, PMD_SIZE);
> 	set_bit(huge_pfn, &accepted_bitmap));
> 	spin_unlock_irq();
> }
> 
> That's still not great.  It's still a global lock and the lock is still
> held for quite a while because that accept isn't going to be lightning
> fast.  But, at least it's not holding any *other* locks.  It also
> doesn't take any locks in the fast path where the 2MB page was already
> accepted.

I expect the cache line with the bitmap to bounce around during warm-up. One
cache line covers a gig of RAM (512 bits per 64-byte line, one bit per 2MB).

It's also not clear at all at what point the static key has to be
switched. We don't have any obvious point where we can say we are done
with accepting (unless you advocate for proactive accepting).

I like PageOffline() because we only need to consult the cache the page
allocator already has in hand before looking into the bitmap.

> The locking could be more fine-grained, for sure.  The bitmap could, for
> instance, have a lock bit too.  Or we could just have an array of locks
> and hash the huge_pfn to find a lock given a huge_pfn.  But, for now, I
> think it's fine to just keep the global lock.
> 
> > Either some other code path allocates memory from that page and returns
> > it before validation is finished or we end up with double validation.
> > Returning unvalidated memory is a guest-problem and double validation
> > will cause security issues for SNP guests.
> 
> Yeah, I think the *canonical* source of information for accepts is the
> bitmap.  The page flags and any static keys or whatever are
> less-canonical sources that tell you when you _might_ need to consult
> the bitmap.

Right.
Dave Hansen Aug. 12, 2021, 8:59 p.m. UTC | #15
On 8/12/21 1:49 PM, Kirill A. Shutemov wrote:
> On Thu, Aug 12, 2021 at 07:14:20AM -0700, Dave Hansen wrote:
>> On 8/12/21 1:19 AM, Joerg Roedel wrote:
>> maybe_accept_page()
>> {
>> 	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;
>>
>> 	/* Test the bit before taking any locks: */
>> 	if (test_bit(huge_pfn, &accepted_bitmap))
>> 		return;
>>
>> 	spin_lock_irq();
>> 	/* Retest inside the lock: */
>> 	if (test_bit(huge_pfn, &accepted_bitmap)) {
>> 		spin_unlock_irq();
>> 		return;
>> 	}
>> 	tdx_accept_page(page, PMD_SIZE);
>> 	set_bit(huge_pfn, &accepted_bitmap);
>> 	spin_unlock_irq();
>> }
>>
>> That's still not great.  It's still a global lock and the lock is still
>> held for quite a while because that accept isn't going to be lightning
>> fast.  But, at least it's not holding any *other* locks.  It also
>> doesn't take any locks in the fast path where the 2MB page was already
>> accepted.
> 
> I expect the cache line with the bitmap to bounce around during warm-up. One
> cache line covers a gig of RAM (512 bits per 64-byte line, one bit per 2MB).

The bitmap bouncing around isn't going to really matter when you have a
global lock protecting it from writes.

> It's also not clear at all at what point the static key has to be
> switched. We don't have any obvious point where we can say we are done
> with accepting (unless you advocate for proactive accepting).

Two easy options:
1. Run over the bitmap and counts the bits left.  That can be done
   outside the lock even.
2. Keep a counter of the number of bits set in the bitmap.
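
Option 1 is nearly a one-liner. A sketch, reusing the placeholder names
from the pseudo-code above:

	/* Everything accepted? Then the static key can be switched off. */
	if (bitmap_weight(accepted_bitmap, nr_huge_pfns) == nr_huge_pfns)
		static_branch_disable(&tdx_accept_ongoing);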

> I like PageOffline() because we only need to consult the cache the page
> allocator already has in hand before looking into the bitmap.

I like it too.  But, it's really nasty if the value is only valid deep
in the allocator.

We could keep the PageOffline() thing and then move it to some other
field in 'struct page' that's only valid between ClearPageOffline() and
prep_new_page().   Some magic value that says: "This page has not yet
been accepted, you better check the bitmap."  Something like:

	if (TestClearPageOffline(page))
		page->private = 0xdeadbeef;

and then check page->private in prep_new_page(). There should be plenty
of 'struct page' space to hijack in that small window.
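
The prep_new_page() side of that would then be roughly (a sketch):

	/* In prep_new_page(), before the page is handed out: */
	if (unlikely(page->private == 0xdeadbeef)) {
		/* The magic value only means "go check the bitmap". */
		maybe_accept_page(page, order);
		set_page_private(page, 0);
	}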

BTW, I was going to actually try and hack something up, but I got
annoyed that your patches don't apply upstream and gave up.  A git tree
with all of the dependencies would be nice. <hint, hint>
Kirill A. Shutemov Aug. 12, 2021, 9:08 p.m. UTC | #16
On Tue, Aug 10, 2021 at 01:50:57PM -0700, Dave Hansen wrote:
> On 8/10/21 11:13 AM, Dave Hansen wrote:
> >> @@ -1001,6 +1004,9 @@ static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> >>  	if (page_reported(page))
> >>  		__ClearPageReported(page);
> >>  
> >> +	if (PageOffline(page))
> >> +		clear_page_offline(page, order);
> >> +
> >>  	list_del(&page->lru);
> >>  	__ClearPageBuddy(page);
> >>  	set_page_private(page, 0);
> > So, this is right in the fast path of the page allocator.  It's a
> > one-time thing per 2M page, so it's not permanent.
> > 
> > *But* there's both a global spinlock and a firmware call hidden in
> > clear_page_offline().  That's *GOT* to hurt if you were, for instance,
> > running a benchmark while this code path is being tickled.
> > 
> > That could be just downright catastrophic for scalability, albeit
> > temporarily.
> 
> One more thing...
> 
> How long are these calls?  You have to make at least 512 calls into the
> SEAM module.  Assuming they're syscall-ish, so ~1,000 cycles each,
> that's ~500,000 cycles, even if we ignore the actual time it takes to
> zero that 2MB worth of memory and all other overhead within the SEAM module.

I hope to get away with 2 calls per 2M: one MapGPA and one TDACCEPTPAGE
(or 3 for MAXORDER -- 4M -- pages). I don't have any numbers yet.

> So, we're sitting on one CPU with interrupts off, blocking all the other
> CPUs from doing page allocation in this zone. 

I agree that's not good. Let's see if it's going to be okay with accepting
in 2M chunks.

> Then, we're holding a global lock which prevents any other NUMA nodes
> from accepting pages.

Looking at this again, the global lock is avoidable: the caller owns the
pfn range so nobody can touch these bits in the bitmap. We can replace
bitmap_clear() with an atomic clear_bit() loop and drop the lock completely.
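
Something along these lines (a sketch; 'unaccepted_bitmap' stands in for
whatever the bitmap ends up being called):

	unsigned long i;

	/* The caller owns the pfn range, so no lock is needed for these bits. */
	for (i = start / PMD_SIZE; i < DIV_ROUND_UP(end, PMD_SIZE); i++)
		clear_bit(i, unaccepted_bitmap);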

> If the other node happens to *try* to do an
> accept, it will sit with its zone lock held waiting for this one.

> Maybe nobody will ever notice.  But, it seems like an awfully big risk
> to me.  I'd at least *try* do these calls outside of the zone lock.
> Then the collateral damage will at least be limited to things doing
> accepts rather than all zone->lock users.
> 
> Couldn't we delay the acceptance to, say the place where we've dropped
> the zone->lock and do the __GFP_ZERO memset() like at prep_new_page()?
> Or is there some concern that the page has been split at that point?

It *will* be split by that point. If you ask for an order-0 page and there
are none left, the page allocator will try higher orders until it finds
something. At order-9 it would hit an unaccepted page. At that point the
page is going to be split and put on the free lists accordingly. That all
happens under the zone lock.

  __rmqueue_smallest ->
    del_page_from_free_list()
    expand()

> I guess that makes it more complicated because you might have a 4k page
> but you need to go accept a 2M page.  You might end up having to check
> the bitmap 511 more times because you might see 511 more PageOffline()
> pages come through.
> 
> You shouldn't even need the bitmap lock to read since it's a one-way
> trip from unaccepted->accepted.

Yeah. Unless we want to flip it back when making the range shared.
I think we do. Otherwise it will cause problems for kexec.
Kirill A. Shutemov Aug. 12, 2021, 9:23 p.m. UTC | #17
On Thu, Aug 12, 2021 at 01:59:01PM -0700, Dave Hansen wrote:
> On 8/12/21 1:49 PM, Kirill A. Shutemov wrote:
> > On Thu, Aug 12, 2021 at 07:14:20AM -0700, Dave Hansen wrote:
> >> On 8/12/21 1:19 AM, Joerg Roedel wrote:
> >> maybe_accept_page()
> >> {
> >> 	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;
> >>
> >> 	/* Test the bit before taking any locks: */
> >> 	if (test_bit(huge_pfn, &accepted_bitmap))
> >> 		return;
> >>
> >> 	spin_lock_irq();
> >> 	/* Retest inside the lock: */
> >> 	if (test_bit(huge_pfn, &accepted_bitmap)) {
> >> 		spin_unlock_irq();
> >> 		return;
> >> 	}
> >> 	tdx_accept_page(page, PMD_SIZE);
> >> 	set_bit(huge_pfn, &accepted_bitmap);
> >> 	spin_unlock_irq();
> >> }
> >>
> >> That's still not great.  It's still a global lock and the lock is still
> >> held for quite a while because that accept isn't going to be lightning
> >> fast.  But, at least it's not holding any *other* locks.  It also
> >> doesn't take any locks in the fast path where the 2MB page was already
> >> accepted.
> > 
> > I expect the cache line with the bitmap to bounce around during warm-up. One
> > cache line covers a gig of RAM (512 bits per 64-byte line, one bit per 2MB).
> 
> The bitmap bouncing around isn't going to really matter when you have a
> global lock protecting it from writes.

The idea with the static key would not work if we mark shared memory as
unaccepted there.

> > It's also not clear at all at what point the static key has to be
> > switched. We don't have any obvious point where we can say we are done
> > with accepting (unless you advocate for proactive accepting).
> 
> Two easy options:
> 1. Run over the bitmap and counts the bits left.  That can be done
>    outside the lock even.
> 2. Keep a counter of the number of bits set in the bitmap.
> 
> > I like PageOffline() because we only need to consult the cache the page
> > allocator already has in hand before looking into the bitmap.
> 
> I like it too.  But, it's really nasty if the value is only valid deep
> in the allocator.
> 
> We could keep the PageOffline() thing and then move it to some other
> field in 'struct page' that's only valid between ClearPageOffline() and
> prep_new_page().   Some magic value that says: "This page has not yet
> been accepted, you better check the bitmap."  Something like:
> 
> 	if (TestClearPageOffline(page))
> 		page->private = 0xdeadbeef;
> 
> and then check page->private in prep_new_page(). There should be plenty
> of 'struct page' space to hijack in that small window.

PageOffline() is encoded in the mapcount, and check_new_page_bad() would
complain if the mapcount is not -1.

> BTW, I was going to actually try and hack something up, but I got
> annoyed that your patches don't apply upstream and gave up.  A git tree
> with all of the dependencies would be nice. <hint, hint>

Okay.
Joerg Roedel Aug. 13, 2021, 2:49 p.m. UTC | #18
Hi Dave,

On Thu, Aug 12, 2021 at 07:14:20AM -0700, Dave Hansen wrote:
> maybe_accept_page()
> {
> 	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;
> 
> 	/* Test the bit before taking any locks: */
> 	if (test_bit(huge_pfn, &accepted_bitmap))
> 		return;
> 
> 	spin_lock_irq();
> 	/* Retest inside the lock: */
> 	if (test_bit(huge_pfn, &accepted_bitmap)) {
> 		spin_unlock_irq();
> 		return;
> 	}
> 	tdx_accept_page(page, PMD_SIZE);
> 	set_bit(huge_pfn, &accepted_bitmap);
> 	spin_unlock_irq();
> }

Yeah, this could work, but the global lock is likely the show-stopper
here. For SNP we are also not allowed to double-validate, so we need
something that basically indicates 'validation-is-ongoing' on a per 2MB
basis.

I am not an mm expert, but a page flag probably doesn't work. The flag
would be on the head of the 2MB range and when that page is already used
somewhere else there is no guarantee that the flag will survive. But
correct me if I am wrong here :)

The other options I can come up with are not great either:

	1) using an AVL bit in the direct-mapping PMD of that page. The
	   page-table would only be walked if the bit in the
	   accept_bitmap is clear. But I am not sure that all memory
	   which needs to be validated is in the direct-map.

	2) Use another page-sized bitmap. If the machine has more than
	   64GB of memory the bit index is wrapped around. This
	   shouldn't be a performance problem at runtime, if this page
	   is only consulted when the valid bit is clear in the
	   accept_bitmap.
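
For the record, the arithmetic behind option 2: one 4KiB page holds 32768
bits, and at one bit per 2MB that covers 64GB, hence the wrap-around. A
sketch with invented names:

#define VALIDATING_BITS	(PAGE_SIZE * BITS_PER_BYTE)	/* 32768 bits */
static unsigned long *validating_bitmap;	/* one page, allocated at boot */

/* test_and_set_bit() marks 'validation-is-ongoing'; clear_bit() ends it. */
static inline unsigned long validating_idx(unsigned long huge_pfn)
{
	return huge_pfn % VALIDATING_BITS;	/* wraps above 64GB */
}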

MM experts could certainly come up with better ideas :)

> Yeah, I think the *canonical* source of information for accepts is the
> bitmap.  The page flags and any static keys or whatever are
> less-canonical sources that tell you when you _might_ need to consult
> the bitmap.

Right, it also helps the kexec case. The only problem left is how to
track 4kb shared pages for things like the GHCB.

Regards,

	Joerg
David Hildenbrand Aug. 17, 2021, 3 p.m. UTC | #19
On 13.08.21 16:49, Joerg Roedel wrote:
> Hi Dave,
> 
> On Thu, Aug 12, 2021 at 07:14:20AM -0700, Dave Hansen wrote:
>> maybe_accept_page()
>> {
>> 	unsigned long huge_pfn = page_to_phys(page) / PMD_SIZE;
>>
>> 	/* Test the bit before taking any locks: */
>> 	if (test_bit(huge_pfn, &accepted_bitmap))
>> 		return;
>>
>> 	spin_lock_irq();
>> 	/* Retest inside the lock: */
>> 	if (test_bit(huge_pfn, &accepted_bitmap)) {
>> 		spin_unlock_irq();
>> 		return;
>> 	}
>> 	tdx_accept_page(page, PMD_SIZE);
>> 	set_bit(huge_pfn, &accepted_bitmap);
>> 	spin_unlock_irq();
>> }
> 
> Yeah, this could work, but the global lock is likely the show-stopper
> here. For SNP we are also not allowed to double-validate, so we need
> something that basically indicates 'validation-is-ongoing' on a per 2MB
> basis.
> 
> I am not an mm expert, but a page flag probably doesn't work. The flag
> would be on the head of the 2MB range and when that page is already used
> somewhere else there is no guarantee that the flag will survive. But
> correct me if I am wrong here :)
> 
> The other options I can come up with are not great either:
> 
> 	1) using an AVL bit in the direct-mapping PMD of that page. The
> 	   page-table would only be walked if the bit in the
> 	   accept_bitmap is clear. But I am not sure that all memory
> 	   which needs to be validated is in the direct-map.
> 
> 	2) Use another page-sized bitmap. If the machine has more than
> 	   64GB of memory the bit index is wrapped around. This
> 	   shouldn't be a performance problem at runtime, if this page
> 	   is only consulted when the valid bit is clear in the
> 	   accept_bitmap.
> 
> MM experts could certainly come up with better ideas :)


Not sure if already discussed, but what about making sure that free 
pages are not a mixture (partially unaccepted, partially accepted).

You'd have to expose the pages in that granularity to the buddy 
(__free_pages_core), indicating the state. You'd have to reject merging 
pages of differing acceptance state.

Accepting a page would then be handled outside of the zone lock, 
completely controlled by the state.

So a page in the buddy would either be completely accepted or completely 
unaccepted, signaled e.g., by PageOffline().

Consequently, when allocating a 4KiB page, you'd split an unaccepted 
2MiB page into separate unaccepted pages. You'd grab one of the 
unaccepted 4KiB pages and accept it before initializing it and handing 
it out.
Joerg Roedel Aug. 19, 2021, 9:55 a.m. UTC | #20
Hi David,

On Tue, Aug 17, 2021 at 05:00:55PM +0200, David Hildenbrand wrote:
> Not sure if already discussed, but what about making sure that free pages
> are not a mixture (partially unaccepted, partially accepted).
> 
> You'd have to expose the pages in that granularity to the buddy
> (__free_pages_core), indicating the state. You'd have to reject merging
> pages of differing acceptance state.
> 
> Accepting a page would then be handled outside of the zone lock, completely
> controlled by the state.
> 
> So a page in the buddy would either be completely accepted or completely
> unaccepted, signaled e.g., by PageOffline().
> 
> Consequently, when allocating a 4KiB page, you'd split an unaccepted 2MiB
> page into separate unaccepted pages. You'd grab one of the unaccepted 4KiB
> pages and accept it before initializing it and handing it out.

Yes, that is the alternative to over-accepting memory on allocation. But
the problem here is that accepting/validating memory is an expensive
operation which also requires a hypercall. The hypercalls on SNP and TDX
can accept/validate multiple pages in one call. So the recommendation is
to accept memory in bigger chunks, like the 2MB that have been proposed.

Only accepting memory in allocation-sized chunks might be too slow, as there
is a lot of code doing order-0 allocations. I think this approach will also
be more intrusive to the page allocator, as it needs more changes on the
free path to check for acceptance states before pages can be merged.

Regards,

	Joerg
David Hildenbrand Aug. 19, 2021, 10:06 a.m. UTC | #21
On 19.08.21 11:55, Joerg Roedel wrote:
> Hi David,
> 
> On Tue, Aug 17, 2021 at 05:00:55PM +0200, David Hildenbrand wrote:
>> Not sure if already discussed, but what about making sure that free pages
>> are not a mixture (partially unaccepted, partially accepted).
>>
>> You'd have to expose the pages in that granularity to the buddy
>> (__free_pages_core), indicating the state. You'd have to reject merging
>> pages of differing acceptance state.
>>
>> Accepting a page would then be handled outside of the zone lock, completely
>> controlled by the state.
>>
>> So a page in the buddy would either be completely accepted or completely
>> unaccepted, signaled e.g., by PageOffline().
>>
>> Consequently, when allocating a 4KiB page, you'd split an unaccepted 2MiB
>> page into separate unaccepted pages. You'd grab one of the unaccepted 4KiB
>> pages and accept it before initializing it and handing it out.
> 
> Yes, that is the alternative to over-accepting memory on allocation. But
> the problem here is that accepting/validating memory is an expensive
> operation which also requires a hypercall. The hypercalls on SNP and TDX
> can accept/validate multiple pages in one call. So the recommendation is
> to accept memory in bigger chunks, like the 2MB that have been proposed.
> 

The general idea would be to have one thread scanning the free page list 
and accepting pages in the background. Take a page out, accept it, 
release it back to the buddy. If we place accepted pages to the front of 
the free list, allocations would be quite fast.

Sure, you'd have some slow early allocations, but I'd guess it really 
wouldn't make a big impact overall. We'd have to try.
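
In rough pseudo-code -- all helper names here are made up, only
accept_memory() is from this series:

static int accept_worker(void *unused)
{
	struct page *page;
	unsigned int order;

	/* Pull unaccepted pages out of the buddy, one at a time. */
	while ((page = take_unaccepted_page_from_buddy(&order))) {
		phys_addr_t start = page_to_phys(page);

		/* Accept outside of any zone lock ... */
		accept_memory(start, start + (PAGE_SIZE << order));
		/* ... and give the page back, now accepted, at the front. */
		free_accepted_page_to_front(page, order);
		cond_resched();
	}
	return 0;
}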

It's quite similar to the free page reporting infrastructure, except 
that with free page reporting we want to place pages to the tail, not to 
the front. That would be easy to add.


> Only accepting memory in allocation-sized chunks might be too slow, as there
> is a lot of code doing order-0 allocations. I think this approach will also
> be more intrusive to the page allocator, as it needs more changes on the
> free path to check for acceptance states before pages can be merged.

That's already mostly done in this patch IIRC.

Patch

diff --git a/mm/internal.h b/mm/internal.h
index 31ff935b2547..d2fc8a17fbe0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -662,4 +662,18 @@  void vunmap_range_noflush(unsigned long start, unsigned long end);
 int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
 		      unsigned long addr, int page_nid, int *flags);
 
+#ifndef CONFIG_UNACCEPTED_MEMORY
+static inline void maybe_set_page_offline(struct page *page, unsigned int order)
+{
+}
+
+static inline void clear_page_offline(struct page *page, unsigned int order)
+{
+}
+
+static inline void accept_memory(phys_addr_t start, phys_addr_t end)
+{
+}
+#endif
+
 #endif	/* __MM_INTERNAL_H */
diff --git a/mm/memblock.c b/mm/memblock.c
index 28a813d9e955..8c1bf08f2b0b 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1370,6 +1370,7 @@  phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size,
 		 */
 		kmemleak_alloc_phys(found, size, 0, 0);
 
+	accept_memory(found, found + size);
 	return found;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 856b175c15a4..892347d9a507 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -899,6 +899,9 @@  static inline bool page_is_buddy(struct page *page, struct page *buddy,
 	if (buddy_order(buddy) != order)
 		return false;
 
+	if (PageOffline(buddy) || PageOffline(page))
+		return false;
+
 	/*
 	 * zone check is done late to avoid uselessly calculating
 	 * zone/node ids for pages that could never merge.
@@ -1001,6 +1004,9 @@  static inline void del_page_from_free_list(struct page *page, struct zone *zone,
 	if (page_reported(page))
 		__ClearPageReported(page);
 
+	if (PageOffline(page))
+		clear_page_offline(page, order);
+
 	list_del(&page->lru);
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
@@ -1165,7 +1171,8 @@  static inline void __free_one_page(struct page *page,
 static inline bool page_expected_state(struct page *page,
 					unsigned long check_flags)
 {
-	if (unlikely(atomic_read(&page->_mapcount) != -1))
+	if (unlikely(atomic_read(&page->_mapcount) != -1) &&
+	    !PageOffline(page))
 		return false;
 
 	if (unlikely((unsigned long)page->mapping |
@@ -1748,6 +1755,8 @@  void __init memblock_free_pages(struct page *page, unsigned long pfn,
 {
 	if (early_page_uninitialised(pfn))
 		return;
+
+	maybe_set_page_offline(page, order);
 	__free_pages_core(page, order);
 }
 
@@ -1839,10 +1848,12 @@  static void __init deferred_free_range(unsigned long pfn,
 	if (nr_pages == pageblock_nr_pages &&
 	    (pfn & (pageblock_nr_pages - 1)) == 0) {
 		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		maybe_set_page_offline(page, pageblock_order);
 		__free_pages_core(page, pageblock_order);
 		return;
 	}
 
+	accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT);
 	for (i = 0; i < nr_pages; i++, page++, pfn++) {
 		if ((pfn & (pageblock_nr_pages - 1)) == 0)
 			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
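
For completeness, here is one shape the arch-side helpers could take with
CONFIG_UNACCEPTED_MEMORY=y. This is purely a sketch against a hypothetical
bitmap of unaccepted 2MB units ('unaccepted_bitmap' and 'arch_accept_range()'
are invented names), not what the TDX or SEV-SNP patches actually implement:

static bool range_contains_unaccepted(phys_addr_t start, phys_addr_t end)
{
	unsigned long i;

	for (i = start / PMD_SIZE; i < DIV_ROUND_UP(end, PMD_SIZE); i++)
		if (test_bit(i, unaccepted_bitmap))
			return true;
	return false;
}

void accept_memory(phys_addr_t start, phys_addr_t end)
{
	unsigned long i;

	for (i = start / PMD_SIZE; i < DIV_ROUND_UP(end, PMD_SIZE); i++)
		if (test_and_clear_bit(i, unaccepted_bitmap))
			arch_accept_range(i * PMD_SIZE, PMD_SIZE); /* hypercall */
}

void maybe_set_page_offline(struct page *page, unsigned int order)
{
	phys_addr_t start = page_to_phys(page);

	if (range_contains_unaccepted(start, start + (PAGE_SIZE << order)))
		__SetPageOffline(page);
}

void clear_page_offline(struct page *page, unsigned int order)
{
	phys_addr_t start = page_to_phys(page);

	accept_memory(start, start + (PAGE_SIZE << order));
	__ClearPageOffline(page);
}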