[v2,4/5] mm,memory_hotplug: allocate memmap from the added memory range for sparse-vmemmap

Message ID 20190625075227.15193-5-osalvador@suse.de
State New
Series
  • Allocate memmap from hotadded memory

Commit Message

Oscar Salvador June 25, 2019, 7:52 a.m. UTC
Physical memory hotadd has to allocate a memmap (struct page array) for
the newly added memory section. Currently, alloc_pages_node() is used
for those allocations.

This has some disadvantages:
 a) existing memory is consumed for that purpose
    (~2MB per 128MB memory section on x86_64)
 b) if the whole node is movable then we have off-node struct pages
    which has performance drawbacks.
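
To make the ~2MB figure in a) concrete, here is a small userspace sketch of
the arithmetic. It assumes 4KB base pages and a 64-byte struct page, which is
typical for x86_64 but is not stated in this patch; memmap_bytes() is a
made-up helper for illustration, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for x86_64; they are not taken from the patch itself. */
#define SECTION_SIZE_BYTES (128ULL << 20) /* 128MB memory section */
#define BASE_PAGE_BYTES    4096ULL        /* 4KB base page */
#define STRUCT_PAGE_BYTES  64ULL          /* assumed sizeof(struct page) */

/* Bytes of memmap needed to describe range_bytes of memory. */
static uint64_t memmap_bytes(uint64_t range_bytes, uint64_t page_bytes,
			     uint64_t struct_page_bytes)
{
	assert(page_bytes != 0);
	/* One struct page per base page in the range. */
	return (range_bytes / page_bytes) * struct_page_bytes;
}
```

Under those assumptions, one 128MB section has 32768 base pages, and 32768
struct pages of 64 bytes each add up to 2MB.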

a) has turned out to be a problem for memory-hotplug-based ballooning,
   because userspace might not react in time to online the new memory
   while the memmap allocated during physical hotadd consumes enough
   memory to push the system to OOM. 31bc3858ea3e ("memory-hotplug: add
   automatic onlining policy for the newly added memory") was added to
   work around that problem.

I have also seen hot-add operations fail on powerpc because we try to
use order-8 pages. If the base page size is 64KB, this gives us 16MB,
and if we run out of those, we simply fail.
One could argue that we can fall back to base pages as we do on x86_64, but
we can do better when CONFIG_SPARSEMEM_VMEMMAP is enabled.
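
The 16MB figure follows directly from the allocation order. A minimal
sketch of that computation, where alloc_bytes_for_order() is a hypothetical
helper and not a kernel function:

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one order-N allocation for a given base page size. */
static uint64_t alloc_bytes_for_order(unsigned int order, uint64_t page_bytes)
{
	assert(order < 32);
	/* An order-N allocation is 2^N physically contiguous base pages. */
	return page_bytes << order;
}
```

With a 64KB base page, order 8 means 256 contiguous pages, that is 16MB of
physically contiguous memory per allocation.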

Vmemmap page tables can map arbitrary memory.
That means that we can simply use the beginning of each memory section and
map struct pages there.
The struct pages that back the allocated space then just need to be treated
carefully.

Implementation-wise, we reuse the vmem_altmap infrastructure to override
the default allocator used by vmemmap_populate(). Once the memmap is
allocated, we need a way to mark the altmap pfns used for the allocation.
If one of the MHP_MEMMAP_{DEVICE,MEMBLOCK} flags was passed, we set up the
layout of the altmap structure at the beginning of __add_pages(), and then
we call mark_vmemmap_pages().

Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
mark_vmemmap_pages() gets called at a different stage.
With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
sections have been populated.
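
The difference in timing can be modelled in userspace by counting how often
the marking step would run. This is only a sketch of the loop structure in
__add_pages(); the MODEL_* flag values and count_mark_calls() are invented
for illustration and do not reflect the real flag encoding.

```c
#include <assert.h>

/* Illustrative flag values only; the real MHP_* values may differ. */
#define MODEL_MEMMAP_DEVICE   (1UL << 0)
#define MODEL_MEMMAP_MEMBLOCK (1UL << 1)

/* Counts how many times the marking step runs while adding nr_sections. */
static unsigned long count_mark_calls(unsigned long nr_sections,
				      unsigned long sections_per_block,
				      unsigned long flags)
{
	unsigned long i, sections_added = 1, calls = 0;

	assert(sections_per_block != 0);
	for (i = 0; i < nr_sections; i++, sections_added++) {
		/* MEMBLOCK: mark once per fully populated memory block. */
		if ((flags & MODEL_MEMMAP_MEMBLOCK) &&
		    !(sections_added % sections_per_block))
			calls++;
	}
	/* DEVICE: mark once, after all sections have been populated. */
	if (flags & MODEL_MEMMAP_DEVICE)
		calls++;
	return calls;
}
```

For 8 sections with one section per memory block, the model marks 8 times
in the MEMBLOCK case and once in the DEVICE case.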

mark_vmemmap_pages() marks the pages as vmemmap and sets some metadata:

The current layout of the vmemmap pages is:

	[Head->_refcount] : Nr sections used by this altmap
	[Head->private]   : Nr of vmemmap pages
	[Tail->freelist]  : Pointer to the head page

This is done to ease the computations we need in some places.
E.g.:

Example 1)
We hot-add 1GB on x86_64 (memory block 128MB) using
MHP_MEMMAP_DEVICE:

head->_refcount = 8 sections
head->private = 4096 vmemmap pages
tail's->freelist = head

Example 2)
We hot-add 1GB on x86_64 using MHP_MEMMAP_MEMBLOCK:

[at the beginning of each memblock]
head->_refcount = 1 section
head->private = 512 vmemmap pages
tail's->freelist = head
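
The numbers in both examples can be reproduced with a short userspace
sketch. It assumes 4KB base pages, a 64-byte struct page and 128MB sections
with one section per memory block; struct vmemmap_head_meta and head_meta()
are hypothetical names used only here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86_64 values; they are not defined by this patch. */
#define MODEL_SECTION_BYTES     (128ULL << 20)
#define MODEL_PAGE_BYTES        4096ULL
#define MODEL_STRUCT_PAGE_BYTES 64ULL

/* What the head vmemmap page would record for one altmap. */
struct vmemmap_head_meta {
	uint64_t nr_sections;      /* models head->_refcount */
	uint64_t nr_vmemmap_pages; /* models head->private */
};

/* Metadata for an altmap that covers covered_bytes of hot-added memory. */
static struct vmemmap_head_meta head_meta(uint64_t covered_bytes)
{
	struct vmemmap_head_meta meta;
	uint64_t memmap_bytes;

	assert(covered_bytes % MODEL_SECTION_BYTES == 0);
	memmap_bytes = (covered_bytes / MODEL_PAGE_BYTES) *
		       MODEL_STRUCT_PAGE_BYTES;
	meta.nr_sections = covered_bytes / MODEL_SECTION_BYTES;
	meta.nr_vmemmap_pages = memmap_bytes / MODEL_PAGE_BYTES;
	return meta;
}
```

head_meta(1GB) models Example 1, where one altmap covers the whole device,
and head_meta(128MB) models each memory block in Example 2.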

We need the refcount because, when using MHP_MEMMAP_DEVICE, we have to know
how long to defer the call to vmemmap_free().
The first pages of the hot-added range are used to back the memmap mapping,
so we cannot remove those first; otherwise we would blow up
when accessing the other pages.

When we hot-remove a memory range, sections are removed sequentially, so
we wait until we hit the last section and then free the whole range
backwards via vmemmap_free().
We know that it is the last section because we decrease head->_refcount
on every pass, and when it reaches 0, we have reached our last section.
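
The deferral scheme can be sketched as a small userspace model: every
section removal drops the refcount, and only the removal that brings it to
0 frees anything, walking from the last section back to the first. All
names below are invented for the model; the real code operates on
head->_refcount and calls vmemmap_free().

```c
#include <assert.h>

#define MODEL_MAX_SECTIONS 64

struct deferred_free_model {
	int refcount;                        /* stands in for head->_refcount */
	int nr_freed;                        /* sections freed so far */
	int freed_order[MODEL_MAX_SECTIONS]; /* section indexes in free order */
};

static void model_init(struct deferred_free_model *m, int nr_sections)
{
	assert(nr_sections > 0 && nr_sections <= MODEL_MAX_SECTIONS);
	m->refcount = nr_sections;
	m->nr_freed = 0;
}

/* Models removing section idx; returns 1 if this removal freed the range. */
static int model_remove_section(struct deferred_free_model *m, int idx)
{
	int i;

	assert(m->refcount > 0);
	if (--m->refcount > 0)
		return 0; /* not the last section yet: defer the free */

	/* Last section: free backwards so the head pages go last. */
	for (i = idx; i >= 0; i--)
		m->freed_order[m->nr_freed++] = i;
	return 1;
}

/* Removes sections 0..nr_sections-1 in order; returns the trigger count. */
static int model_remove_all(struct deferred_free_model *m, int nr_sections)
{
	int idx, triggers = 0;

	model_init(m, nr_sections);
	for (idx = 0; idx < nr_sections; idx++)
		triggers += model_remove_section(m, idx);
	return triggers;
}
```

For the 8 sections of Example 1, only the eighth removal triggers a free,
and section 0, which hosts the memmap, is released last.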

We also have to be careful with those pages during online and offline
operations. They are simply skipped: online keeps them reserved, and thus
unusable for any other purpose, and offline ignores them so that they do
not block the offline operation.

In the offline operation there is only one special case to check.
Depending on how large the hot-added range was, and when using
MHP_MEMMAP_DEVICE, one or more memory blocks can be filled entirely with
vmemmap pages.
We just need to check for this case and skip 1) isolating and 2) migrating,
because those pages do not need to be migrated anywhere; they are self-hosted.
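
As a rough userspace model of that check, the sketch below computes how
many vmemmap pages fall into a memory block and whether the block consists
only of them. It assumes 128MB blocks of 32768 4KB pages and a 64-byte
struct page, so a 16GB hot-add needs 65536 vmemmap pages, i.e. two full
blocks. The helper names are hypothetical.

```c
#include <assert.h>

/*
 * Number of vmemmap pages inside the block that starts at block_start,
 * capped at the block size (the patch applies a similar min()).
 */
static unsigned long vmemmap_pages_in_block(unsigned long block_start,
					    unsigned long block_pages,
					    unsigned long vmemmap_start,
					    unsigned long vmemmap_pages)
{
	unsigned long vmemmap_end = vmemmap_start + vmemmap_pages;
	unsigned long remaining;

	/* Only blocks that start on a vmemmap page are of interest. */
	if (block_start < vmemmap_start || block_start >= vmemmap_end)
		return 0;

	remaining = vmemmap_end - block_start;
	return remaining < block_pages ? remaining : block_pages;
}

/* Returns 1 when the block holds nothing but vmemmap pages. */
static int can_skip_isolation(unsigned long block_start,
			      unsigned long block_pages,
			      unsigned long vmemmap_start,
			      unsigned long vmemmap_pages)
{
	return vmemmap_pages_in_block(block_start, block_pages,
				      vmemmap_start, vmemmap_pages) ==
	       block_pages;
}
```

In the 16GB case the first two blocks can skip isolation and migration,
while the third cannot. For the 1GB hot-add of Example 1, only 4096 of the
32768 pages in the first block are vmemmap pages, so nothing is skipped.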

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/arm64/mm/mmu.c            |   5 +-
 arch/powerpc/mm/init_64.c      |   7 +++
 arch/s390/mm/init.c            |   6 ++
 arch/x86/mm/init_64.c          |  10 +++
 drivers/acpi/acpi_memhotplug.c |   2 +-
 drivers/base/memory.c          |   2 +-
 include/linux/memory_hotplug.h |   6 ++
 include/linux/memremap.h       |   2 +-
 mm/compaction.c                |   7 +++
 mm/memory_hotplug.c            | 138 +++++++++++++++++++++++++++++++++++------
 mm/page_alloc.c                |  22 ++++++-
 mm/page_isolation.c            |  14 ++++-
 mm/sparse.c                    |  93 +++++++++++++++++++++++++++
 13 files changed, 289 insertions(+), 25 deletions(-)

Comments

David Hildenbrand June 25, 2019, 8:49 a.m. UTC | #1
On 25.06.19 09:52, Oscar Salvador wrote:
> Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
> mark_vmemmap_pages() gets called at a different stage.
> With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
> fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
> sections have been populated.

So, only MHP_MEMMAP_DEVICE will be used. Would it make sense to only
implement one for now (after we decide which one to use), to make things
simpler?

Or do you have a real user in mind for the other?

> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 93ed0df4df79..d4b5661fa6b6 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -765,7 +765,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  		if (pmd_none(READ_ONCE(*pmdp))) {
>  			void *p = NULL;
>  
> -			p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> +			if (altmap)
> +				p = altmap_alloc_block_buf(PMD_SIZE, altmap);
> +			else
> +				p = vmemmap_alloc_block_buf(PMD_SIZE, node);
>  			if (!p)
>  				return -ENOMEM;
>  
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a4e17a979e45..ff9d2c245321 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -289,6 +289,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
>  
>  		if (base_pfn >= alt_start && base_pfn < alt_end) {
>  			vmem_altmap_free(altmap, nr_pages);
> +		} else if (PageVmemmap(page)) {
> +			/*
> +			 * runtime vmemmap pages are residing inside the memory
> +			 * section so they do not have to be freed anywhere.
> +			 */
> +			while (PageVmemmap(page))
> +				__ClearPageVmemmap(page++);
>  		} else if (PageReserved(page)) {
>  			/* allocated from bootmem */
>  			if (page_size < PAGE_SIZE) {
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index ffb81fe95c77..c045411552a3 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -226,6 +226,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
>  	unsigned long size_pages = PFN_DOWN(size);
>  	int rc;
>  
> +	/*
> +	 * Physical memory is added only later during the memory online so we
> +	 * cannot use the added range at this stage unfortunately.
> +	 */
> +	restrictions->flags &= ~restrictions->flags;
> +
>  	if (WARN_ON_ONCE(restrictions->altmap))
>  		return -EINVAL;
>  
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 688fb0687e55..00d17b666337 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -874,6 +874,16 @@ static void __meminit free_pagetable(struct page *page, int order)
>  	unsigned long magic;
>  	unsigned int nr_pages = 1 << order;
>  
> +	/*
> +	 * Runtime vmemmap pages are residing inside the memory section so
> +	 * they do not have to be freed anywhere.
> +	 */
> +	if (PageVmemmap(page)) {
> +		while (nr_pages--)
> +			__ClearPageVmemmap(page++);
> +		return;
> +	}
> +
>  	/* bootmem page has reserved flag */
>  	if (PageReserved(page)) {
>  		__ClearPageReserved(page);
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 860f84e82dd0..3257edb98d90 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -218,7 +218,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>  		if (node < 0)
>  			node = memory_add_physaddr_to_nid(info->start_addr);
>  
> -		result = __add_memory(node, info->start_addr, info->length, 0);
> +		result = __add_memory(node, info->start_addr, info->length, MHP_MEMMAP_DEVICE);
>  
>  		/*
>  		 * If the memory block has been used by the kernel, add_memory()
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index ad9834b8b7f7..e0ac9a3b66f8 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -32,7 +32,7 @@ static DEFINE_MUTEX(mem_sysfs_mutex);
>  
>  #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
>  
> -static int sections_per_block;
> +int sections_per_block;
>  
>  static inline int base_memory_block_id(int section_nr)
>  {
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 6fdbce9d04f9..e28e226c9a20 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -375,4 +375,10 @@ extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_
>  		int online_type);
>  extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
>  		unsigned long nr_pages);
> +
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +extern void mark_vmemmap_pages(struct vmem_altmap *self);
> +#else
> +static inline void mark_vmemmap_pages(struct vmem_altmap *self) {}
> +#endif
>  #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 1732dea030b2..6de37e168f57 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -16,7 +16,7 @@ struct device;
>   * @alloc: track pages consumed, private to vmemmap_populate()
>   */
>  struct vmem_altmap {
> -	const unsigned long base_pfn;
> +	unsigned long base_pfn;
>  	const unsigned long reserve;
>  	unsigned long free;
>  	unsigned long align;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9e1b9acb116b..40697f74b8b4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -855,6 +855,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>  		nr_scanned++;
>  
>  		page = pfn_to_page(low_pfn);
> +		/*
> +		 * Vmemmap pages do not need to be isolated.
> +		 */
> +		if (PageVmemmap(page)) {
> +			low_pfn += get_nr_vmemmap_pages(page) - 1;
> +			continue;
> +		}
>  
>  		/*
>  		 * Check if the pageblock has already been marked skipped.
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e4e3baa6eaa7..b5106cb75795 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,8 @@
>  #include "internal.h"
>  #include "shuffle.h"
>  
> +extern int sections_per_block;
> +
>  /*
>   * online_page_callback contains pointer to current page onlining function.
>   * Initially it is generic_online_page(). If it is required it could be
> @@ -279,6 +281,24 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
>  	return 0;
>  }
>  
> +static void mhp_reset_altmap(unsigned long next_pfn,
> +			     struct vmem_altmap *altmap)
> +{
> +	altmap->base_pfn = next_pfn;
> +	altmap->alloc = 0;
> +}
> +
> +static void mhp_init_altmap(unsigned long pfn, unsigned long nr_pages,
> +			    unsigned long mhp_flags,
> +			    struct vmem_altmap *altmap)
> +{
> +	if (mhp_flags & MHP_MEMMAP_DEVICE)
> +		altmap->free = nr_pages;
> +	else
> +		altmap->free = PAGES_PER_SECTION * sections_per_block;
> +	altmap->base_pfn = pfn;
> +}
> +
>  /*
>   * Reasonably generic function for adding memory.  It is
>   * expected that archs that support memory hotplug will
> @@ -290,8 +310,17 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>  {
>  	unsigned long i;
>  	int start_sec, end_sec, err;
> -	struct vmem_altmap *altmap = restrictions->altmap;
> +	struct vmem_altmap *altmap;
> +	struct vmem_altmap __memblk_altmap = {};
> +	unsigned long mhp_flags = restrictions->flags;
> +	unsigned long sections_added;
> +
> +	if (mhp_flags & MHP_VMEMMAP_FLAGS) {
> +		mhp_init_altmap(pfn, nr_pages, mhp_flags, &__memblk_altmap);
> +		restrictions->altmap = &__memblk_altmap;
> +	}
>  
> +	altmap = restrictions->altmap;
>  	if (altmap) {
>  		/*
>  		 * Validate altmap is within bounds of the total request
> @@ -308,9 +337,10 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>  	if (err)
>  		return err;
>  
> +	sections_added = 1;
>  	start_sec = pfn_to_section_nr(pfn);
>  	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
> -	for (i = start_sec; i <= end_sec; i++) {
> +	for (i = start_sec; i <= end_sec; i++, sections_added++) {
>  		unsigned long pfns;
>  
>  		pfns = min(nr_pages, PAGES_PER_SECTION
> @@ -320,9 +350,19 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>  			break;
>  		pfn += pfns;
>  		nr_pages -= pfns;
> +
> +		if (mhp_flags & MHP_MEMMAP_MEMBLOCK &&
> +		    !(sections_added % sections_per_block)) {
> +			mark_vmemmap_pages(altmap);
> +			mhp_reset_altmap(pfn, altmap);
> +		}
>  		cond_resched();
>  	}
>  	vmemmap_populate_print_last();
> +
> +	if (mhp_flags & MHP_MEMMAP_DEVICE)
> +		mark_vmemmap_pages(altmap);
> +
>  	return err;
>  }
>  
> @@ -642,6 +682,14 @@ static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
>  	while (start < end) {
>  		order = min(MAX_ORDER - 1,
>  			get_order(PFN_PHYS(end) - PFN_PHYS(start)));
> +		/*
> +		 * Check if the pfn is aligned to its order.
> +		 * If not, we decrement the order until it is,
> +		 * otherwise __free_one_page will bug us.
> +		 */
> +		while (start & ((1 << order) - 1))
> +			order--;
> +
>  		(*online_page_callback)(pfn_to_page(start), order);
>  
>  		onlined_pages += (1UL << order);
> @@ -654,13 +702,30 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
>  			void *arg)
>  {
>  	unsigned long onlined_pages = *(unsigned long *)arg;
> +	unsigned long pfn = start_pfn;
> +	unsigned long nr_vmemmap_pages = 0;
>  
> -	if (PageReserved(pfn_to_page(start_pfn)))
> -		onlined_pages += online_pages_blocks(start_pfn, nr_pages);
> +	if (PageVmemmap(pfn_to_page(pfn))) {
> +		/*
> +		 * Do not send vmemmap pages to the page allocator.
> +		 */
> +		nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
> +		nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
> +		pfn += nr_vmemmap_pages;
> +		if (nr_vmemmap_pages == nr_pages)
> +			/*
> +			 * If the entire range contains only vmemmap pages,
> +			 * there are no pages left for the page allocator.
> +			 */
> +			goto skip_online;
> +	}
>  
> +	if (PageReserved(pfn_to_page(pfn)))
> +		onlined_pages += online_pages_blocks(pfn, nr_pages - nr_vmemmap_pages);
> +skip_online:
>  	online_mem_sections(start_pfn, start_pfn + nr_pages);
>  
> -	*(unsigned long *)arg = onlined_pages;
> +	*(unsigned long *)arg = onlined_pages + nr_vmemmap_pages;
>  	return 0;
>  }
>  
> @@ -1051,6 +1116,23 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>  	return device_online(&mem->dev);
>  }
>  
> +static bool mhp_check_correct_flags(unsigned long flags)
> +{
> +	if (flags & MHP_VMEMMAP_FLAGS) {
> +		if (!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) {
> +			WARN(1, "Vmemmap capability can only be used on"
> +				"CONFIG_SPARSEMEM_VMEMMAP. Ignoring flags.\n");
> +			return false;
> +		}
> +		if ((flags & MHP_VMEMMAP_FLAGS) == MHP_VMEMMAP_FLAGS) {
> +			WARN(1, "Both MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK"
> +				"were passed. Ignoring flags.\n");
> +			return false;
> +		}
> +	}
> +	return true;
> +}
> +
>  /*
>   * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
>   * and online/offline operations (triggered e.g. by sysfs).
> @@ -1086,6 +1168,9 @@ int __ref add_memory_resource(int nid, struct resource *res, unsigned long flags
>  		goto error;
>  	new_node = ret;
>  
> +	if (mhp_check_correct_flags(flags))
> +		restrictions.flags = flags;
> +
>  	/* call arch's memory hotadd */
>  	ret = arch_add_memory(nid, start, size, &restrictions);
>  	if (ret < 0)
> @@ -1518,12 +1603,14 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  {
>  	unsigned long pfn, nr_pages;
>  	unsigned long offlined_pages = 0;
> +	unsigned long nr_vmemmap_pages = 0;
>  	int ret, node, nr_isolate_pageblock;
>  	unsigned long flags;
>  	unsigned long valid_start, valid_end;
>  	struct zone *zone;
>  	struct memory_notify arg;
>  	char *reason;
> +	bool skip = false;
>  
>  	mem_hotplug_begin();
>  
> @@ -1540,15 +1627,24 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	node = zone_to_nid(zone);
>  	nr_pages = end_pfn - start_pfn;
>  
> -	/* set above range as isolated */
> -	ret = start_isolate_page_range(start_pfn, end_pfn,
> -				       MIGRATE_MOVABLE,
> -				       SKIP_HWPOISON | REPORT_FAILURE);
> -	if (ret < 0) {
> -		reason = "failure to isolate range";
> -		goto failed_removal;
> +	if (PageVmemmap(pfn_to_page(start_pfn))) {
> +		nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
> +		nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
> +		if (nr_vmemmap_pages == nr_pages)
> +			skip = true;
> +	}
> +
> +	if (!skip) {
> +		/* set above range as isolated */
> +		ret = start_isolate_page_range(start_pfn, end_pfn,
> +					       MIGRATE_MOVABLE,
> +					       SKIP_HWPOISON | REPORT_FAILURE);
> +		if (ret < 0) {
> +			reason = "failure to isolate range";
> +			goto failed_removal;
> +		}
> +		nr_isolate_pageblock = ret;
>  	}
> -	nr_isolate_pageblock = ret;
>  
>  	arg.start_pfn = start_pfn;
>  	arg.nr_pages = nr_pages;
> @@ -1561,6 +1657,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  		goto failed_removal_isolated;
>  	}
>  
> +	if (skip)
> +		goto skip_migration;
> +
>  	do {
>  		for (pfn = start_pfn; pfn;) {
>  			if (signal_pending(current)) {
> @@ -1601,7 +1700,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	   We cannot do rollback at this point. */
>  	walk_system_ram_range(start_pfn, end_pfn - start_pfn,
>  			      &offlined_pages, offline_isolated_pages_cb);
> -	pr_info("Offlined Pages %ld\n", offlined_pages);
> +
> +skip_migration:
> +	pr_info("Offlined Pages %ld\n", offlined_pages + nr_vmemmap_pages);
>  	/*
>  	 * Onlining will reset pagetype flags and makes migrate type
>  	 * MOVABLE, so just need to decrease the number of isolated
> @@ -1612,11 +1713,12 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  
>  	/* removal success */
> -	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
> -	zone->present_pages -= offlined_pages;
> +	if (offlined_pages)
> +		adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
> +	zone->present_pages -= offlined_pages + nr_vmemmap_pages;
>  
>  	pgdat_resize_lock(zone->zone_pgdat, &flags);
> -	zone->zone_pgdat->node_present_pages -= offlined_pages;
> +	zone->zone_pgdat->node_present_pages -= offlined_pages + nr_vmemmap_pages;
>  	pgdat_resize_unlock(zone->zone_pgdat, &flags);
>  
>  	init_per_zone_wmark_min();
> @@ -1645,7 +1747,7 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	memory_notify(MEM_CANCEL_OFFLINE, &arg);
>  failed_removal:
>  	pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
> -		 (unsigned long long) start_pfn << PAGE_SHIFT,
> +		 (unsigned long long) (start_pfn - nr_vmemmap_pages) << PAGE_SHIFT,
>  		 ((unsigned long long) end_pfn << PAGE_SHIFT) - 1,
>  		 reason);
>  	/* pushback to free area */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b3266d63521..7a73a06c5730 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1282,9 +1282,14 @@ static void free_one_page(struct zone *zone,
>  static void __meminit __init_single_page(struct page *page, unsigned long pfn,
>  				unsigned long zone, int nid)
>  {
> -	mm_zero_struct_page(page);
> +	if (!__PageVmemmap(page)) {
> +		/*
> +		 * Vmemmap pages need to preserve their state.
> +		 */
> +		mm_zero_struct_page(page);
> +		init_page_count(page);
> +	}
>  	set_page_links(page, zone, nid, pfn);
> -	init_page_count(page);
>  	page_mapcount_reset(page);
>  	page_cpupid_reset_last(page);
>  	page_kasan_tag_reset(page);
> @@ -8143,6 +8148,14 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
>  
>  		page = pfn_to_page(check);
>  
> +		/*
> +		 * Vmemmap pages are not needed to be moved around.
> +		 */
> +		if (PageVmemmap(page)) {
> +			iter += get_nr_vmemmap_pages(page) - 1;
> +			continue;
> +		}
> +
>  		if (PageReserved(page))
>  			goto unmovable;
>  
> @@ -8510,6 +8523,11 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
>  			continue;
>  		}
>  		page = pfn_to_page(pfn);
> +
> +		if (PageVmemmap(page)) {
> +			pfn += get_nr_vmemmap_pages(page);
> +			continue;
> +		}
>  		/*
>  		 * The HWPoisoned page may be not in buddy system, and
>  		 * page_count() is not 0.
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index e3638a5bafff..128c47a27925 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -146,7 +146,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>  static inline struct page *
>  __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>  {
> -	int i;
> +	unsigned long i;
>  
>  	for (i = 0; i < nr_pages; i++) {
>  		struct page *page;
> @@ -154,6 +154,10 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
>  		page = pfn_to_online_page(pfn + i);
>  		if (!page)
>  			continue;
> +		if (PageVmemmap(page)) {
> +			i += get_nr_vmemmap_pages(page) - 1;
> +			continue;
> +		}
>  		return page;
>  	}
>  	return NULL;
> @@ -268,6 +272,14 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
>  			continue;
>  		}
>  		page = pfn_to_page(pfn);
> +		/*
> +		 * Vmemmap pages are not isolated. Skip them.
> +		 */
> +		if (PageVmemmap(page)) {
> +			pfn += get_nr_vmemmap_pages(page);
> +			continue;
> +		}
> +
>  		if (PageBuddy(page))
>  			/*
>  			 * If the page is on a free list, it has to be on
> diff --git a/mm/sparse.c b/mm/sparse.c
> index b77ca21a27a4..04b395fb4463 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -635,6 +635,94 @@ void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
>  #endif
>  
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
> +void mark_vmemmap_pages(struct vmem_altmap *self)
> +{
> +	unsigned long pfn = self->base_pfn + self->reserve;
> +	unsigned long nr_pages = self->alloc;
> +	unsigned long nr_sects = self->free / PAGES_PER_SECTION;
> +	unsigned long i;
> +	struct page *head;
> +
> +	if (!nr_pages)
> +		return;
> +
> +	pr_debug("%s: marking %px - %px as Vmemmap (%ld pages)\n",
> +						__func__,
> +						pfn_to_page(pfn),
> +						pfn_to_page(pfn + nr_pages - 1),
> +						nr_pages);
> +
> +	/*
> +	 * All allocations for the memory hotplug are the same sized so align
> +	 * should be 0.
> +	 */
> +	WARN_ON(self->align);
> +
> +	/*
> +	 * Layout of vmemmap pages:
> +	 * [Head->refcount] : Nr sections used by this altmap
> +	 * [Head->private]  : Nr of vmemmap pages
> +	 * [Tail->freelist] : Pointer to the head page
> +	 */
> +
> +	/*
> +	 * Head, first vmemmap page
> +	 */
> +	head = pfn_to_page(pfn);
> +	for (i = 0; i < nr_pages; i++, pfn++) {
> +		struct page *page = pfn_to_page(pfn);
> +
> +		mm_zero_struct_page(page);
> +		__SetPageVmemmap(page);
> +		page->freelist = head;
> +		init_page_count(page);
> +	}
> +	set_page_count(head, (int)nr_sects);
> +	set_page_private(head, nr_pages);
> +}
> +/*
> + * If the range we are trying to remove was hot-added with vmemmap pages
> + * using MHP_MEMMAP_DEVICE, we need to keep track of it to know how much
> + * do we have do defer the free up.
> + * Since sections are removed sequentally in __remove_pages()->
> + * __remove_section(), we just wait until we hit the last section.
> + * Once that happens, we can trigger free_deferred_vmemmap_range to actually
> + * free the whole memory-range.
> + */
> +static struct page *head_vmemmap_page = NULL;;
> +static bool freeing_vmemmap_range = false;
> +
> +static inline bool vmemmap_dec_and_test(void)
> +{
> +	return page_ref_dec_and_test(head_vmemmap_page);
> +}
> +
> +static void free_deferred_vmemmap_range(unsigned long start,
> +                                       unsigned long end)
> +{
> +	unsigned long nr_pages = end - start;
> +	unsigned long first_section = (unsigned long)head_vmemmap_page;
> +
> +	while (start >= first_section) {
> +		vmemmap_free(start, end, NULL);
> +		end = start;
> +		start -= nr_pages;
> +	}
> +	head_vmemmap_page = NULL;
> +	freeing_vmemmap_range = false;
> +}
> +
> +static void deferred_vmemmap_free(unsigned long start, unsigned long end)
> +{
> +	if (!freeing_vmemmap_range) {
> +		freeing_vmemmap_range = true;
> +		head_vmemmap_page = (struct page *)start;
> +	}
> +
> +	if (vmemmap_dec_and_test())
> +		free_deferred_vmemmap_range(start, end);
> +}
> +
>  static struct page *populate_section_memmap(unsigned long pfn,
>  		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
>  {
> @@ -647,6 +735,11 @@ static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
>  	unsigned long start = (unsigned long) pfn_to_page(pfn);
>  	unsigned long end = start + nr_pages * sizeof(struct page);
>  
> +	if (PageVmemmap((struct page *)start) || freeing_vmemmap_range) {
> +		deferred_vmemmap_free(start, end);
> +		return;
> +	}
> +
>  	vmemmap_free(start, end, altmap);
>  }
>  static void free_map_bootmem(struct page *memmap)
>
Oscar Salvador June 26, 2019, 8:13 a.m. UTC | #2
On Tue, Jun 25, 2019 at 10:49:10AM +0200, David Hildenbrand wrote:
> On 25.06.19 09:52, Oscar Salvador wrote:
> > Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
> > mark_vmemmap_pages() gets called at a different stage.
> > With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
> > fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
> > sections have been populated.
> 
> So, only MHP_MEMMAP_DEVICE will be used. Would it make sense to only
> implement one for now (after we decide which one to use), to make things
> simpler?
> 
> Or do you have a real user in mind for the other?

Currently, only MHP_MEMMAP_DEVICE will be used, as we only pass flags from
the ACPI memory-hotplug path.

All the others (Hyper-V, Xen, ...) will have to be evaluated to see which one
they want to use.

Although MHP_MEMMAP_DEVICE is the only one used right now, I introduced
MHP_MEMMAP_MEMBLOCK to give callers the choice of using it if they think
that a strategy where hot-removing works at a different granularity
makes sense.

Moreover, since they both use the same API, there is no extra code needed to
handle it (just two lines in __add_pages()).

This arose here [1].

[1] https://patchwork.kernel.org/project/linux-mm/list/?submitter=137061
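
A minimal userspace sketch of where the two modes differ inside the
__add_pages() loop, assuming the section counter behaves as in the hunk of
this patch. This is not kernel code: the MODEL_* flag values and
model_mark_points() are hypothetical names used only for illustration.

```c
/* Hypothetical stand-ins for the patch's flags; the values are illustrative. */
#define MODEL_MEMMAP_DEVICE   (1UL << 0)
#define MODEL_MEMMAP_MEMBLOCK (1UL << 1)

/*
 * Walk nr_sections sections the way the patched __add_pages() loop does and
 * record after which section (1-based) the vmemmap pages would be marked.
 * Returns the number of marking points written to out[] (capacity cap).
 */
static int model_mark_points(unsigned long flags, int nr_sections,
			     int sections_per_block, int *out, int cap)
{
	int n = 0;
	int sections_added = 1;	/* same 1-based counter as in the patch */

	for (int i = 0; i < nr_sections; i++, sections_added++) {
		/* Memblock mode: mark each time a full memory block is populated. */
		if ((flags & MODEL_MEMMAP_MEMBLOCK) &&
		    !(sections_added % sections_per_block) && n < cap)
			out[n++] = sections_added;
	}

	/* Device mode: mark once, after all sections have been populated. */
	if ((flags & MODEL_MEMMAP_DEVICE) && n < cap)
		out[n++] = nr_sections;

	return n;
}
```

With 8 sections and one section per block, the model yields a single marking
point (section 8) in device mode and eight marking points in memblock mode.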
David Hildenbrand June 26, 2019, 8:15 a.m. UTC | #3
On 26.06.19 10:13, Oscar Salvador wrote:
> On Tue, Jun 25, 2019 at 10:49:10AM +0200, David Hildenbrand wrote:
>> So, only MHP_MEMMAP_DEVICE will be used. Would it make sense to only
>> implement one for now (after we decide which one to use), to make things
>> simpler?
>>
>> Or do you have a real user in mind for the other?
> 
> Currently, only MHP_MEMMAP_DEVICE will be used, as we only pass flags from
> the ACPI memory-hotplug path.
> 
> All the others (Hyper-V, Xen, ...) will have to be evaluated to see which one
> they want to use.
> 
> Although MHP_MEMMAP_DEVICE is the only one used right now, I introduced
> MHP_MEMMAP_MEMBLOCK to give callers the choice of using it if they think
> that a strategy where hot-removing works at a different granularity
> makes sense.
> 
> Moreover, since they both use the same API, there is no extra code needed to
> handle it (just two lines in __add_pages()).
> 
> This arose here [1].
> 
> [1] https://patchwork.kernel.org/project/linux-mm/list/?submitter=137061
> 

Just noting that you can emulate MHP_MEMMAP_MEMBLOCK via
MHP_MEMMAP_DEVICE by adding memory in memory block granularity (which is
what Hyper-V and Xen do, if I am not wrong!).

I am not yet convinced that both MHP_MEMMAP_MEMBLOCK and MHP_MEMMAP_DEVICE
are needed. But we can sort that out later.
Anshuman Khandual June 26, 2019, 8:17 a.m. UTC | #4
Hello Oscar,

On 06/25/2019 01:22 PM, Oscar Salvador wrote:
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 93ed0df4df79..d4b5661fa6b6 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -765,7 +765,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>  		if (pmd_none(READ_ONCE(*pmdp))) {
>  			void *p = NULL;
>  
> -			p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> +			if (altmap)
> +				p = altmap_alloc_block_buf(PMD_SIZE, altmap);
> +			else
> +				p = vmemmap_alloc_block_buf(PMD_SIZE, node);
>  			if (!p)
>  				return -ENOMEM;

Is this really required to be part of this series? I have ongoing work
(reworked https://patchwork.kernel.org/patch/10882781/) enabling altmap
support on arm64 in the memory hot-add and hot-remove paths, which is waiting
on arm64 memory hot-remove to be merged first.

- Anshuman
Oscar Salvador June 26, 2019, 8:28 a.m. UTC | #5
On Wed, Jun 26, 2019 at 01:47:32PM +0530, Anshuman Khandual wrote:
> Hello Oscar,
> 
> On 06/25/2019 01:22 PM, Oscar Salvador wrote:
> > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> > index 93ed0df4df79..d4b5661fa6b6 100644
> > --- a/arch/arm64/mm/mmu.c
> > +++ b/arch/arm64/mm/mmu.c
> > @@ -765,7 +765,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
> >  		if (pmd_none(READ_ONCE(*pmdp))) {
> >  			void *p = NULL;
> >  
> > -			p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> > +			if (altmap)
> > +				p = altmap_alloc_block_buf(PMD_SIZE, altmap);
> > +			else
> > +				p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> >  			if (!p)
> >  				return -ENOMEM;
> 
> Is this really required to be part of this series? I have ongoing work
> (reworked https://patchwork.kernel.org/patch/10882781/) enabling altmap
> support on arm64 in the memory hot-add and hot-remove paths, which is waiting
> on arm64 memory hot-remove to be merged first.

Hi Anshuman,

I can drop this chunk in the next version.
No problem.
Williams, Dan J July 24, 2019, 9:49 p.m. UTC | #6
On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> Physical memory hotadd has to allocate a memmap (struct page array) for
> the newly added memory section. Currently, alloc_pages_node() is used
> for those allocations.
>
> This has some disadvantages:
>  a) an existing memory is consumed for that purpose
>     (~2MB per 128MB memory section on x86_64)
>  b) if the whole node is movable then we have off-node struct pages
>     which has performance drawbacks.
>
> a) has turned out to be a problem for memory hotplug based ballooning
>    because the userspace might not react in time to online memory while
>    the memory consumed during physical hotadd consumes enough memory to
>    push system to OOM. 31bc3858ea3e ("memory-hotplug: add automatic onlining
>    policy for the newly added memory") has been added to workaround that
>    problem.
>
> I have also seen hot-add operations failing on powerpc due to the fact
> that we try to use order-8 pages. If the base page size is 64KB, this
> gives us 16MB, and if we run out of those, we simply fail.
> One could arge that we can fall back to basepages as we do in x86_64, but
> we can do better when CONFIG_SPARSEMEM_VMEMMAP is enabled.
>
> Vmemap page tables can map arbitrary memory.
> That means that we can simply use the beginning of each memory section and
> map struct pages there.
> struct pages which back the allocated space then just need to be treated
> carefully.
>
> Implementation wise we reuse vmem_altmap infrastructure to override
> the default allocator used by __vmemap_populate. Once the memmap is
> allocated we need a way to mark altmap pfns used for the allocation.
> If MHP_MEMMAP_{DEVICE,MEMBLOCK} flag was passed, we set up the layout of the
> altmap structure at the beginning of __add_pages(), and then we call
> mark_vmemmap_pages().
>
> Depending on which flag is passed (MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK),
> mark_vmemmap_pages() gets called at a different stage.
> With MHP_MEMMAP_MEMBLOCK, we call it once we have populated the sections
> fitting in a single memblock, while with MHP_MEMMAP_DEVICE we wait until all
> sections have been populated.
>
> mark_vmemmap_pages() marks the pages as vmemmap and sets some metadata:
>
> The current layout of the vmemmap pages is:
>
>         [Head->refcount] : Nr sections used by this altmap
>         [Head->private]  : Nr of vmemmap pages
>         [Tail->freelist] : Pointer to the head page
>
> This is done to ease the computations we need in some places.
> E.g:
>
> Example 1)
> We hot-add 1GB on x86_64 (memory block 128MB) using
> MHP_MEMMAP_DEVICE:
>
> head->_refcount = 8 sections
> head->private = 4096 vmemmap pages
> tail's->freelist = head
>
> Example 2)
> We hot-add 1GB on x86_64 using MHP_MEMMAP_MEMBLOCK:
>
> [at the beginning of each memblock]
> head->_refcount = 1 section
> head->private = 512 vmemmap pages
> tail's->freelist = head
>
> We have the refcount because when using MHP_MEMMAP_DEVICE, we need to know
> for how long we have to defer the call to vmemmap_free().
> The first pages of the hot-added range back the memmap itself, so we cannot
> remove those first; otherwise we would blow up when accessing the struct
> pages of the remaining sections.
>
> When we hot-remove a memory range, sections are removed sequentially, so we
> wait until we hit the last section and then free the whole range backwards
> via vmemmap_free().
> We know that it is the last section because every pass decrements
> head->_refcount, and when it reaches 0 we have reached the last section.
>
> We also have to be careful about those pages during online and offline
> operations. They are simply skipped, so online will keep them
> reserved and so unusable for any other purpose and offline ignores them
> so they do not block the offline operation.
>
> In the offline operation we only have to check for one special case.
> Depending on how large the hot-added range was, and when using
> MHP_MEMMAP_DEVICE, one or more memory blocks may be filled with only
> vmemmap pages.
> We just need to check for this case and skip 1) isolating and 2) migrating,
> because those pages do not need to be migrated anywhere; they are self-hosted.

Can you rewrite the changelog without using the word 'we'? I get
confused when it seems to reference the 'we' of the current implementation
vs. the 'we' of the new implementation.
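
The numbers in the two quoted examples can be reproduced with a short
userspace sketch. It assumes 4KB base pages, a 64-byte struct page and 128MB
sections (the usual x86_64 values, consistent with the "~2MB per 128MB"
figure in the changelog). All MODEL_* and model_* names are hypothetical.

```c
/* Assumed x86_64 values; none of these names come from the kernel. */
#define MODEL_PAGE_SIZE        4096UL
#define MODEL_STRUCT_PAGE_SIZE 64UL
#define MODEL_SECTION_SIZE     (128UL << 20)

/* Number of sections covered by a hot-added range of the given size. */
static unsigned long model_nr_sections(unsigned long bytes)
{
	return bytes / MODEL_SECTION_SIZE;
}

/* Number of base pages needed to hold the memmap describing the range. */
static unsigned long model_nr_vmemmap_pages(unsigned long bytes)
{
	unsigned long nr_pages = bytes / MODEL_PAGE_SIZE;
	unsigned long memmap_bytes = nr_pages * MODEL_STRUCT_PAGE_SIZE;

	return (memmap_bytes + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}

/*
 * Deferred-free countdown: each removed section drops the reference count
 * kept in the head page. Returns 1 once the last section has been removed,
 * which is the point where the whole memmap could be freed.
 */
static int model_section_removed(unsigned long *head_refcount)
{
	return --(*head_refcount) == 0;
}
```

For a 1GB range this gives 8 sections and 4096 vmemmap pages (Example 1);
for a single 128MB memory block it gives 512 vmemmap pages (Example 2).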

>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  arch/arm64/mm/mmu.c            |   5 +-
>  arch/powerpc/mm/init_64.c      |   7 +++
>  arch/s390/mm/init.c            |   6 ++
>  arch/x86/mm/init_64.c          |  10 +++
>  drivers/acpi/acpi_memhotplug.c |   2 +-
>  drivers/base/memory.c          |   2 +-
>  include/linux/memory_hotplug.h |   6 ++
>  include/linux/memremap.h       |   2 +-
>  mm/compaction.c                |   7 +++
>  mm/memory_hotplug.c            | 138 +++++++++++++++++++++++++++++++++++------
>  mm/page_alloc.c                |  22 ++++++-
>  mm/page_isolation.c            |  14 ++++-
>  mm/sparse.c                    |  93 +++++++++++++++++++++++++++
>  13 files changed, 289 insertions(+), 25 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 93ed0df4df79..d4b5661fa6b6 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -765,7 +765,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>                 if (pmd_none(READ_ONCE(*pmdp))) {
>                         void *p = NULL;
>
> -                       p = vmemmap_alloc_block_buf(PMD_SIZE, node);
> +                       if (altmap)
> +                               p = altmap_alloc_block_buf(PMD_SIZE, altmap);
> +                       else
> +                               p = vmemmap_alloc_block_buf(PMD_SIZE, node);
>                         if (!p)
>                                 return -ENOMEM;
>
> diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
> index a4e17a979e45..ff9d2c245321 100644
> --- a/arch/powerpc/mm/init_64.c
> +++ b/arch/powerpc/mm/init_64.c
> @@ -289,6 +289,13 @@ void __ref vmemmap_free(unsigned long start, unsigned long end,
>
>                 if (base_pfn >= alt_start && base_pfn < alt_end) {
>                         vmem_altmap_free(altmap, nr_pages);
> +               } else if (PageVmemmap(page)) {
> +                       /*
> +                        * runtime vmemmap pages are residing inside the memory
> +                        * section so they do not have to be freed anywhere.
> +                        */
> +                       while (PageVmemmap(page))
> +                               __ClearPageVmemmap(page++);
>                 } else if (PageReserved(page)) {
>                         /* allocated from bootmem */
>                         if (page_size < PAGE_SIZE) {
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index ffb81fe95c77..c045411552a3 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -226,6 +226,12 @@ int arch_add_memory(int nid, u64 start, u64 size,
>         unsigned long size_pages = PFN_DOWN(size);
>         int rc;
>
> +       /*
> +        * Physical memory is added only later during the memory online so we
> +        * cannot use the added range at this stage unfortunately.
> +        */
> +       restrictions->flags &= ~restrictions->flags;
> +
>         if (WARN_ON_ONCE(restrictions->altmap))
>                 return -EINVAL;

Perhaps these per-arch changes should be pulled out into separate prep patches?

>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 688fb0687e55..00d17b666337 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -874,6 +874,16 @@ static void __meminit free_pagetable(struct page *page, int order)
>         unsigned long magic;
>         unsigned int nr_pages = 1 << order;
>
> +       /*
> +        * Runtime vmemmap pages are residing inside the memory section so
> +        * they do not have to be freed anywhere.
> +        */
> +       if (PageVmemmap(page)) {
> +               while (nr_pages--)
> +                       __ClearPageVmemmap(page++);
> +               return;
> +       }

If there is nothing to do and these pages are just going to be
released, why spend any effort clearing the vmemmap state?

> +
>         /* bootmem page has reserved flag */
>         if (PageReserved(page)) {
>                 __ClearPageReserved(page);
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 860f84e82dd0..3257edb98d90 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -218,7 +218,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>                 if (node < 0)
>                         node = memory_add_physaddr_to_nid(info->start_addr);
>
> -               result = __add_memory(node, info->start_addr, info->length, 0);
> +               result = __add_memory(node, info->start_addr, info->length, MHP_MEMMAP_DEVICE);

Why is this changed to MHP_MEMMAP_DEVICE? Where does it get the altmap?

>
>                 /*
>                  * If the memory block has been used by the kernel, add_memory()
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index ad9834b8b7f7..e0ac9a3b66f8 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -32,7 +32,7 @@ static DEFINE_MUTEX(mem_sysfs_mutex);
>
>  #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
>
> -static int sections_per_block;
> +int sections_per_block;
>
>  static inline int base_memory_block_id(int section_nr)
>  {
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 6fdbce9d04f9..e28e226c9a20 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -375,4 +375,10 @@ extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_
>                 int online_type);
>  extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
>                 unsigned long nr_pages);
> +
> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +extern void mark_vmemmap_pages(struct vmem_altmap *self);
> +#else
> +static inline void mark_vmemmap_pages(struct vmem_altmap *self) {}
> +#endif
>  #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 1732dea030b2..6de37e168f57 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -16,7 +16,7 @@ struct device;
>   * @alloc: track pages consumed, private to vmemmap_populate()
>   */
>  struct vmem_altmap {
> -       const unsigned long base_pfn;
> +       unsigned long base_pfn;
>         const unsigned long reserve;
>         unsigned long free;
>         unsigned long align;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 9e1b9acb116b..40697f74b8b4 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -855,6 +855,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>                 nr_scanned++;
>
>                 page = pfn_to_page(low_pfn);
> +               /*
> +                * Vmemmap pages do not need to be isolated.
> +                */
> +               if (PageVmemmap(page)) {
> +                       low_pfn += get_nr_vmemmap_pages(page) - 1;

I'm failing to grok the get_nr_vmemmap_pages() API. It seems this is
more of a get_next_mapped_page(), and perhaps it should VM_BUG_ON() if it
is not passed a vmemmap page.
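
A userspace sketch of the skip pattern in that hunk, under the assumption
that the helper returns the number of vmemmap pages from the given page to
the end of its run (the helper's definition is not shown here, so this
semantics is a guess). struct model_page and the model_* functions are
hypothetical, and assert() stands in for VM_BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>

struct model_page {
	bool vmemmap;		/* stand-in for PageVmemmap() */
	unsigned long run_left;	/* vmemmap pages from here to the end of the run */
};

/* Assumed semantics: only valid for vmemmap pages, as the review suggests. */
static unsigned long model_nr_vmemmap_pages_left(const struct model_page *page)
{
	assert(page->vmemmap);	/* userspace analogue of VM_BUG_ON() */
	return page->run_left;
}

/* Count the pfns in [start, end) that a scanner would actually inspect. */
static unsigned long model_scan(const struct model_page *map,
				unsigned long start, unsigned long end)
{
	unsigned long inspected = 0;

	for (unsigned long pfn = start; pfn < end; pfn++) {
		if (map[pfn].vmemmap) {
			/* "- 1" because the loop increment supplies the last step. */
			pfn += model_nr_vmemmap_pages_left(&map[pfn]) - 1;
			continue;
		}
		inspected++;
	}
	return inspected;
}
```

With three vmemmap pages at the start of an eight-page map, the scan inspects
the remaining five pages regardless of where inside the run it starts.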

> +                       continue;
> +               }
>
>                 /*
>                  * Check if the pageblock has already been marked skipped.
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index e4e3baa6eaa7..b5106cb75795 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -42,6 +42,8 @@
>  #include "internal.h"
>  #include "shuffle.h"
>
> +extern int sections_per_block;
> +
>  /*
>   * online_page_callback contains pointer to current page onlining function.
>   * Initially it is generic_online_page(). If it is required it could be
> @@ -279,6 +281,24 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
>         return 0;
>  }
>
> +static void mhp_reset_altmap(unsigned long next_pfn,
> +                            struct vmem_altmap *altmap)
> +{
> +       altmap->base_pfn = next_pfn;
> +       altmap->alloc = 0;
> +}
> +
> +static void mhp_init_altmap(unsigned long pfn, unsigned long nr_pages,
> +                           unsigned long mhp_flags,
> +                           struct vmem_altmap *altmap)
> +{
> +       if (mhp_flags & MHP_MEMMAP_DEVICE)
> +               altmap->free = nr_pages;
> +       else
> +               altmap->free = PAGES_PER_SECTION * sections_per_block;
> +       altmap->base_pfn = pfn;

The ->free member is meant to be the number of free pages in the
altmap, but this seems to be set to the number of pages being mapped. Am I
misreading?
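
A simplified userspace model of altmap accounting, treating ->free as the
budget of pfns that may be handed out for the memmap and ->alloc as the
number already consumed. It deliberately ignores the ->align field, and the
model_* names are hypothetical rather than the kernel helpers.

```c
struct model_altmap {
	unsigned long base_pfn;	/* first pfn of the range hosting the memmap */
	unsigned long reserve;	/* pfns at the start that must not be used */
	unsigned long free;	/* pfns available for memmap allocations */
	unsigned long alloc;	/* pfns already handed out */
};

/* Remaining budget: what is still available out of ->free. */
static unsigned long model_altmap_nr_free(const struct model_altmap *a)
{
	return a->free > a->alloc ? a->free - a->alloc : 0;
}

/*
 * Hand out nr pfns from the altmap. On success stores the first pfn in
 * *pfn and returns 0; returns -1 when the budget is exhausted.
 */
static int model_altmap_alloc(struct model_altmap *a, unsigned long nr,
			      unsigned long *pfn)
{
	if (nr > model_altmap_nr_free(a))
		return -1;

	*pfn = a->base_pfn + a->reserve + a->alloc;
	a->alloc += nr;
	return 0;
}
```

In this model, setting free to nr_pages makes the budget as large as the
whole hot-added range, although the memmap for a 128MB section needs only 512
of its 32768 pages under the usual x86_64 assumptions.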

> +}
> +
>  /*
>   * Reasonably generic function for adding memory.  It is
>   * expected that archs that support memory hotplug will
> @@ -290,8 +310,17 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>  {
>         unsigned long i;
>         int start_sec, end_sec, err;
> -       struct vmem_altmap *altmap = restrictions->altmap;
> +       struct vmem_altmap *altmap;
> +       struct vmem_altmap __memblk_altmap = {};
> +       unsigned long mhp_flags = restrictions->flags;
> +       unsigned long sections_added;
> +
> +       if (mhp_flags & MHP_VMEMMAP_FLAGS) {
> +               mhp_init_altmap(pfn, nr_pages, mhp_flags, &__memblk_altmap);
> +               restrictions->altmap = &__memblk_altmap;
> +       }

So this silently overrides a passed-in altmap if a flag is set? The
NVDIMM use case can't necessarily trust __memblk_altmap to be
consistent with what the nvdimm namespace has reserved.
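
One possible way to avoid a silent override is to refuse the combination
outright. The sketch below is a userspace illustration only, not what the
patch does: struct model_restrictions, MODEL_VMEMMAP_FLAGS and
model_select_altmap() are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical mask covering both memmap-from-range flags. */
#define MODEL_VMEMMAP_FLAGS 0x3UL

struct model_altmap {
	unsigned long base_pfn;
	unsigned long free;
};

struct model_restrictions {
	unsigned long flags;
	struct model_altmap *altmap;	/* caller-provided, may be NULL */
};

/*
 * Install the local altmap only when a vmemmap flag is set and the caller
 * did not provide its own altmap. A caller passing both gets -EINVAL
 * instead of having its altmap replaced.
 */
static int model_select_altmap(struct model_restrictions *r,
			       struct model_altmap *local)
{
	if (!(r->flags & MODEL_VMEMMAP_FLAGS))
		return 0;		/* keep whatever the caller passed */

	if (r->altmap)
		return -EINVAL;		/* refuse rather than override */

	r->altmap = local;
	return 0;
}
```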

>
> +       altmap = restrictions->altmap;
>         if (altmap) {
>                 /*
>                  * Validate altmap is within bounds of the total request
> @@ -308,9 +337,10 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>         if (err)
>                 return err;
>
> +       sections_added = 1;
>         start_sec = pfn_to_section_nr(pfn);
>         end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
> -       for (i = start_sec; i <= end_sec; i++) {
> +       for (i = start_sec; i <= end_sec; i++, sections_added++) {
>                 unsigned long pfns;
>
>                 pfns = min(nr_pages, PAGES_PER_SECTION
> @@ -320,9 +350,19 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
>                         break;
>                 pfn += pfns;
>                 nr_pages -= pfns;
> +
> +               if (mhp_flags & MHP_MEMMAP_MEMBLOCK &&
> +                   !(sections_added % sections_per_block)) {
> +                       mark_vmemmap_pages(altmap);
> +                       mhp_reset_altmap(pfn, altmap);
> +               }
>                 cond_resched();
>         }
>         vmemmap_populate_print_last();
> +
> +       if (mhp_flags & MHP_MEMMAP_DEVICE)
> +               mark_vmemmap_pages(altmap);
> +
>         return err;
>  }
>
> @@ -642,6 +682,14 @@ static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
>         while (start < end) {
>                 order = min(MAX_ORDER - 1,
>                         get_order(PFN_PHYS(end) - PFN_PHYS(start)));
> +               /*
> +                * Check if the pfn is aligned to its order.
> +                * If not, we decrement the order until it is,
> +                * otherwise __free_one_page will bug us.
> +                */
> +               while (start & ((1 << order) - 1))
> +                       order--;
> +

Is this a candidate for a standalone patch? It seems out of place for
this patch.
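
The order clamp in that hunk can be exercised on its own. This is a
userspace sketch with a hypothetical helper name, model_clamp_order(), that
mirrors the loop from the patch.

```c
/*
 * Lower the order until start_pfn is aligned to 2^order pages. The loop
 * always terminates because every pfn is aligned to order 0.
 */
static unsigned int model_clamp_order(unsigned long start_pfn,
				      unsigned int order)
{
	while (start_pfn & ((1UL << order) - 1))
		order--;

	return order;
}
```

For example, a range starting right after 512 vmemmap pages at pfn 0x8200 is
only order-9 aligned, so a requested order of 10 is reduced to 9.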

>                 (*online_page_callback)(pfn_to_page(start), order);
>
>                 onlined_pages += (1UL << order);
> @@ -654,13 +702,30 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
>                         void *arg)
>  {
>         unsigned long onlined_pages = *(unsigned long *)arg;
> +       unsigned long pfn = start_pfn;
> +       unsigned long nr_vmemmap_pages = 0;
>
> -       if (PageReserved(pfn_to_page(start_pfn)))
> -               onlined_pages += online_pages_blocks(start_pfn, nr_pages);
> +       if (PageVmemmap(pfn_to_page(pfn))) {
> +               /*
> +                * Do not send vmemmap pages to the page allocator.
> +                */
> +               nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
> +               nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
> +               pfn += nr_vmemmap_pages;
> +               if (nr_vmemmap_pages == nr_pages)
> +                       /*
> +                        * If the entire range contains only vmemmap pages,
> +                        * there are no pages left for the page allocator.
> +                        */
> +                       goto skip_online;
> +       }

Seems this should be the caller's (online_pages()) responsibility rather
than making this fixup internal to the helper... and if it's moved up,
can it be pushed one more level up so even online_pages() need not
worry about this fixup? It just does not seem to be an operation that
belongs to the online path. Might that eliminate the need for tracking
altmap parameters in struct page?
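
The range fixup under discussion amounts to trimming the leading vmemmap
pages before handing the rest to the page allocator. A userspace sketch of
that computation follows; struct model_online_split and
model_split_online_range() are hypothetical names.

```c
struct model_online_split {
	unsigned long first_pfn;	/* first pfn to hand to the allocator */
	unsigned long nr_pages;		/* how many pages remain to be onlined */
};

/*
 * Skip up to nr_vmemmap_pages at the start of [start_pfn, start_pfn +
 * nr_pages). If the range holds only vmemmap pages, nothing is left.
 */
static struct model_online_split
model_split_online_range(unsigned long start_pfn, unsigned long nr_pages,
			 unsigned long nr_vmemmap_pages)
{
	struct model_online_split s;
	unsigned long skip = nr_vmemmap_pages < nr_pages ?
			     nr_vmemmap_pages : nr_pages;

	s.first_pfn = start_pfn + skip;
	s.nr_pages = nr_pages - skip;
	return s;
}
```

Because the result depends only on the range and the vmemmap page count, the
same computation could in principle be done by a caller before the online
path is entered.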

Patch

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 93ed0df4df79..d4b5661fa6b6 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -765,7 +765,10 @@  int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
 		if (pmd_none(READ_ONCE(*pmdp))) {
 			void *p = NULL;
 
-			p = vmemmap_alloc_block_buf(PMD_SIZE, node);
+			if (altmap)
+				p = altmap_alloc_block_buf(PMD_SIZE, altmap);
+			else
+				p = vmemmap_alloc_block_buf(PMD_SIZE, node);
 			if (!p)
 				return -ENOMEM;
 
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a4e17a979e45..ff9d2c245321 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -289,6 +289,13 @@  void __ref vmemmap_free(unsigned long start, unsigned long end,
 
 		if (base_pfn >= alt_start && base_pfn < alt_end) {
 			vmem_altmap_free(altmap, nr_pages);
+		} else if (PageVmemmap(page)) {
+			/*
+			 * runtime vmemmap pages are residing inside the memory
+			 * section so they do not have to be freed anywhere.
+			 */
+			while (PageVmemmap(page))
+				__ClearPageVmemmap(page++);
 		} else if (PageReserved(page)) {
 			/* allocated from bootmem */
 			if (page_size < PAGE_SIZE) {
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index ffb81fe95c77..c045411552a3 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -226,6 +226,12 @@  int arch_add_memory(int nid, u64 start, u64 size,
 	unsigned long size_pages = PFN_DOWN(size);
 	int rc;
 
+	/*
+	 * Physical memory is added only later during the memory online so we
+	 * cannot use the added range at this stage unfortunately.
+	 */
+	restrictions->flags &= ~restrictions->flags;
+
 	if (WARN_ON_ONCE(restrictions->altmap))
 		return -EINVAL;
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 688fb0687e55..00d17b666337 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -874,6 +874,16 @@  static void __meminit free_pagetable(struct page *page, int order)
 	unsigned long magic;
 	unsigned int nr_pages = 1 << order;
 
+	/*
+	 * Runtime vmemmap pages are residing inside the memory section so
+	 * they do not have to be freed anywhere.
+	 */
+	if (PageVmemmap(page)) {
+		while (nr_pages--)
+			__ClearPageVmemmap(page++);
+		return;
+	}
+
 	/* bootmem page has reserved flag */
 	if (PageReserved(page)) {
 		__ClearPageReserved(page);
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 860f84e82dd0..3257edb98d90 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -218,7 +218,7 @@  static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
-		result = __add_memory(node, info->start_addr, info->length, 0);
+		result = __add_memory(node, info->start_addr, info->length, MHP_MEMMAP_DEVICE);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index ad9834b8b7f7..e0ac9a3b66f8 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -32,7 +32,7 @@  static DEFINE_MUTEX(mem_sysfs_mutex);
 
 #define to_memory_block(dev) container_of(dev, struct memory_block, dev)
 
-static int sections_per_block;
+int sections_per_block;
 
 static inline int base_memory_block_id(int section_nr)
 {
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 6fdbce9d04f9..e28e226c9a20 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -375,4 +375,10 @@  extern bool allow_online_pfn_range(int nid, unsigned long pfn, unsigned long nr_
 		int online_type);
 extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 		unsigned long nr_pages);
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+extern void mark_vmemmap_pages(struct vmem_altmap *self);
+#else
+static inline void mark_vmemmap_pages(struct vmem_altmap *self) {}
+#endif
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 1732dea030b2..6de37e168f57 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -16,7 +16,7 @@  struct device;
  * @alloc: track pages consumed, private to vmemmap_populate()
  */
 struct vmem_altmap {
-	const unsigned long base_pfn;
+	unsigned long base_pfn;
 	const unsigned long reserve;
 	unsigned long free;
 	unsigned long align;
diff --git a/mm/compaction.c b/mm/compaction.c
index 9e1b9acb116b..40697f74b8b4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -855,6 +855,13 @@  isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		nr_scanned++;
 
 		page = pfn_to_page(low_pfn);
+		/*
+		 * Vmemmap pages do not need to be isolated.
+		 */
+		if (PageVmemmap(page)) {
+			low_pfn += get_nr_vmemmap_pages(page) - 1;
+			continue;
+		}
 
 		/*
 		 * Check if the pageblock has already been marked skipped.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4e3baa6eaa7..b5106cb75795 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -42,6 +42,8 @@ 
 #include "internal.h"
 #include "shuffle.h"
 
+extern int sections_per_block;
+
 /*
  * online_page_callback contains pointer to current page onlining function.
  * Initially it is generic_online_page(). If it is required it could be
@@ -279,6 +281,24 @@  static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
 	return 0;
 }
 
+static void mhp_reset_altmap(unsigned long next_pfn,
+			     struct vmem_altmap *altmap)
+{
+	altmap->base_pfn = next_pfn;
+	altmap->alloc = 0;
+}
+
+static void mhp_init_altmap(unsigned long pfn, unsigned long nr_pages,
+			    unsigned long mhp_flags,
+			    struct vmem_altmap *altmap)
+{
+	if (mhp_flags & MHP_MEMMAP_DEVICE)
+		altmap->free = nr_pages;
+	else
+		altmap->free = PAGES_PER_SECTION * sections_per_block;
+	altmap->base_pfn = pfn;
+}
+
 /*
  * Reasonably generic function for adding memory.  It is
  * expected that archs that support memory hotplug will
@@ -290,8 +310,17 @@  int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 {
 	unsigned long i;
 	int start_sec, end_sec, err;
-	struct vmem_altmap *altmap = restrictions->altmap;
+	struct vmem_altmap *altmap;
+	struct vmem_altmap __memblk_altmap = {};
+	unsigned long mhp_flags = restrictions->flags;
+	unsigned long sections_added;
+
+	if (mhp_flags & MHP_VMEMMAP_FLAGS) {
+		mhp_init_altmap(pfn, nr_pages, mhp_flags, &__memblk_altmap);
+		restrictions->altmap = &__memblk_altmap;
+	}
 
+	altmap = restrictions->altmap;
 	if (altmap) {
 		/*
 		 * Validate altmap is within bounds of the total request
@@ -308,9 +337,10 @@  int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 	if (err)
 		return err;
 
+	sections_added = 1;
 	start_sec = pfn_to_section_nr(pfn);
 	end_sec = pfn_to_section_nr(pfn + nr_pages - 1);
-	for (i = start_sec; i <= end_sec; i++) {
+	for (i = start_sec; i <= end_sec; i++, sections_added++) {
 		unsigned long pfns;
 
 		pfns = min(nr_pages, PAGES_PER_SECTION
@@ -320,9 +350,19 @@  int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
 			break;
 		pfn += pfns;
 		nr_pages -= pfns;
+
+		if (mhp_flags & MHP_MEMMAP_MEMBLOCK &&
+		    !(sections_added % sections_per_block)) {
+			mark_vmemmap_pages(altmap);
+			mhp_reset_altmap(pfn, altmap);
+		}
 		cond_resched();
 	}
 	vmemmap_populate_print_last();
+
+	if (mhp_flags & MHP_MEMMAP_DEVICE)
+		mark_vmemmap_pages(altmap);
+
 	return err;
 }
 
@@ -642,6 +682,14 @@  static int online_pages_blocks(unsigned long start, unsigned long nr_pages)
 	while (start < end) {
 		order = min(MAX_ORDER - 1,
 			get_order(PFN_PHYS(end) - PFN_PHYS(start)));
+		/*
+		 * Check if the pfn is aligned to its order.
+		 * If not, we decrement the order until it is,
+		 * otherwise __free_one_page() will trigger a BUG.
+		 */
+		while (start & ((1 << order) - 1))
+			order--;
+
 		(*online_page_callback)(pfn_to_page(start), order);
 
 		onlined_pages += (1UL << order);
@@ -654,13 +702,30 @@  static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
 			void *arg)
 {
 	unsigned long onlined_pages = *(unsigned long *)arg;
+	unsigned long pfn = start_pfn;
+	unsigned long nr_vmemmap_pages = 0;
 
-	if (PageReserved(pfn_to_page(start_pfn)))
-		onlined_pages += online_pages_blocks(start_pfn, nr_pages);
+	if (PageVmemmap(pfn_to_page(pfn))) {
+		/*
+		 * Do not send vmemmap pages to the page allocator.
+		 */
+		nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
+		nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
+		pfn += nr_vmemmap_pages;
+		if (nr_vmemmap_pages == nr_pages)
+			/*
+			 * If the entire range contains only vmemmap pages,
+			 * there are no pages left for the page allocator.
+			 */
+			goto skip_online;
+	}
 
+	if (PageReserved(pfn_to_page(pfn)))
+		onlined_pages += online_pages_blocks(pfn, nr_pages - nr_vmemmap_pages);
+skip_online:
 	online_mem_sections(start_pfn, start_pfn + nr_pages);
 
-	*(unsigned long *)arg = onlined_pages;
+	*(unsigned long *)arg = onlined_pages + nr_vmemmap_pages;
 	return 0;
 }
 
@@ -1051,6 +1116,23 @@  static int online_memory_block(struct memory_block *mem, void *arg)
 	return device_online(&mem->dev);
 }
 
+static bool mhp_check_correct_flags(unsigned long flags)
+{
+	if (flags & MHP_VMEMMAP_FLAGS) {
+		if (!IS_ENABLED(CONFIG_SPARSEMEM_VMEMMAP)) {
+			WARN(1, "Vmemmap capability can only be used on "
+				"CONFIG_SPARSEMEM_VMEMMAP. Ignoring flags.\n");
+			return false;
+		}
+		if ((flags & MHP_VMEMMAP_FLAGS) == MHP_VMEMMAP_FLAGS) {
+			WARN(1, "Both MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK "
+				"were passed. Ignoring flags.\n");
+			return false;
+		}
+	}
+	return true;
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1086,6 +1168,9 @@  int __ref add_memory_resource(int nid, struct resource *res, unsigned long flags
 		goto error;
 	new_node = ret;
 
+	if (mhp_check_correct_flags(flags))
+		restrictions.flags = flags;
+
 	/* call arch's memory hotadd */
 	ret = arch_add_memory(nid, start, size, &restrictions);
 	if (ret < 0)
@@ -1518,12 +1603,14 @@  static int __ref __offline_pages(unsigned long start_pfn,
 {
 	unsigned long pfn, nr_pages;
 	unsigned long offlined_pages = 0;
+	unsigned long nr_vmemmap_pages = 0;
 	int ret, node, nr_isolate_pageblock;
 	unsigned long flags;
 	unsigned long valid_start, valid_end;
 	struct zone *zone;
 	struct memory_notify arg;
 	char *reason;
+	bool skip = false;
 
 	mem_hotplug_begin();
 
@@ -1540,15 +1627,24 @@  static int __ref __offline_pages(unsigned long start_pfn,
 	node = zone_to_nid(zone);
 	nr_pages = end_pfn - start_pfn;
 
-	/* set above range as isolated */
-	ret = start_isolate_page_range(start_pfn, end_pfn,
-				       MIGRATE_MOVABLE,
-				       SKIP_HWPOISON | REPORT_FAILURE);
-	if (ret < 0) {
-		reason = "failure to isolate range";
-		goto failed_removal;
+	if (PageVmemmap(pfn_to_page(start_pfn))) {
+		nr_vmemmap_pages = get_nr_vmemmap_pages(pfn_to_page(start_pfn));
+		nr_vmemmap_pages = min(nr_vmemmap_pages, nr_pages);
+		if (nr_vmemmap_pages == nr_pages)
+			skip = true;
+	}
+
+	if (!skip) {
+		/* set above range as isolated */
+		ret = start_isolate_page_range(start_pfn, end_pfn,
+					       MIGRATE_MOVABLE,
+					       SKIP_HWPOISON | REPORT_FAILURE);
+		if (ret < 0) {
+			reason = "failure to isolate range";
+			goto failed_removal;
+		}
+		nr_isolate_pageblock = ret;
 	}
-	nr_isolate_pageblock = ret;
 
 	arg.start_pfn = start_pfn;
 	arg.nr_pages = nr_pages;
@@ -1561,6 +1657,9 @@  static int __ref __offline_pages(unsigned long start_pfn,
 		goto failed_removal_isolated;
 	}
 
+	if (skip)
+		goto skip_migration;
+
 	do {
 		for (pfn = start_pfn; pfn;) {
 			if (signal_pending(current)) {
@@ -1601,7 +1700,9 @@  static int __ref __offline_pages(unsigned long start_pfn,
 	   We cannot do rollback at this point. */
 	walk_system_ram_range(start_pfn, end_pfn - start_pfn,
 			      &offlined_pages, offline_isolated_pages_cb);
-	pr_info("Offlined Pages %ld\n", offlined_pages);
+
+skip_migration:
+	pr_info("Offlined Pages %ld\n", offlined_pages + nr_vmemmap_pages);
 	/*
 	 * Onlining will reset pagetype flags and makes migrate type
 	 * MOVABLE, so just need to decrease the number of isolated
@@ -1612,11 +1713,12 @@  static int __ref __offline_pages(unsigned long start_pfn,
 	spin_unlock_irqrestore(&zone->lock, flags);
 
 	/* removal success */
-	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
-	zone->present_pages -= offlined_pages;
+	if (offlined_pages)
+		adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
+	zone->present_pages -= offlined_pages + nr_vmemmap_pages;
 
 	pgdat_resize_lock(zone->zone_pgdat, &flags);
-	zone->zone_pgdat->node_present_pages -= offlined_pages;
+	zone->zone_pgdat->node_present_pages -= offlined_pages + nr_vmemmap_pages;
 	pgdat_resize_unlock(zone->zone_pgdat, &flags);
 
 	init_per_zone_wmark_min();
@@ -1645,7 +1747,7 @@  static int __ref __offline_pages(unsigned long start_pfn,
 	memory_notify(MEM_CANCEL_OFFLINE, &arg);
 failed_removal:
 	pr_debug("memory offlining [mem %#010llx-%#010llx] failed due to %s\n",
-		 (unsigned long long) start_pfn << PAGE_SHIFT,
+		 (unsigned long long) (start_pfn - nr_vmemmap_pages) << PAGE_SHIFT,
 		 ((unsigned long long) end_pfn << PAGE_SHIFT) - 1,
 		 reason);
 	/* pushback to free area */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b3266d63521..7a73a06c5730 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1282,9 +1282,14 @@  static void free_one_page(struct zone *zone,
 static void __meminit __init_single_page(struct page *page, unsigned long pfn,
 				unsigned long zone, int nid)
 {
-	mm_zero_struct_page(page);
+	if (!__PageVmemmap(page)) {
+		/*
+		 * Vmemmap pages need to preserve their state.
+		 */
+		mm_zero_struct_page(page);
+		init_page_count(page);
+	}
 	set_page_links(page, zone, nid, pfn);
-	init_page_count(page);
 	page_mapcount_reset(page);
 	page_cpupid_reset_last(page);
 	page_kasan_tag_reset(page);
@@ -8143,6 +8148,14 @@  bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
 
 		page = pfn_to_page(check);
 
+		/*
+		 * Vmemmap pages do not need to be moved around.
+		 */
+		if (PageVmemmap(page)) {
+			iter += get_nr_vmemmap_pages(page) - 1;
+			continue;
+		}
+
 		if (PageReserved(page))
 			goto unmovable;
 
@@ -8510,6 +8523,11 @@  __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 			continue;
 		}
 		page = pfn_to_page(pfn);
+
+		if (PageVmemmap(page)) {
+			pfn += get_nr_vmemmap_pages(page);
+			continue;
+		}
 		/*
 		 * The HWPoisoned page may be not in buddy system, and
 		 * page_count() is not 0.
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index e3638a5bafff..128c47a27925 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -146,7 +146,7 @@  static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 static inline struct page *
 __first_valid_page(unsigned long pfn, unsigned long nr_pages)
 {
-	int i;
+	unsigned long i;
 
 	for (i = 0; i < nr_pages; i++) {
 		struct page *page;
@@ -154,6 +154,10 @@  __first_valid_page(unsigned long pfn, unsigned long nr_pages)
 		page = pfn_to_online_page(pfn + i);
 		if (!page)
 			continue;
+		if (PageVmemmap(page)) {
+			i += get_nr_vmemmap_pages(page) - 1;
+			continue;
+		}
 		return page;
 	}
 	return NULL;
@@ -268,6 +272,14 @@  __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
 			continue;
 		}
 		page = pfn_to_page(pfn);
+		/*
+		 * Vmemmap pages are not isolated. Skip them.
+		 */
+		if (PageVmemmap(page)) {
+			pfn += get_nr_vmemmap_pages(page);
+			continue;
+		}
+
 		if (PageBuddy(page))
 			/*
 			 * If the page is on a free list, it has to be on
diff --git a/mm/sparse.c b/mm/sparse.c
index b77ca21a27a4..04b395fb4463 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -635,6 +635,94 @@  void offline_mem_sections(unsigned long start_pfn, unsigned long end_pfn)
 #endif
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
+void mark_vmemmap_pages(struct vmem_altmap *self)
+{
+	unsigned long pfn = self->base_pfn + self->reserve;
+	unsigned long nr_pages = self->alloc;
+	unsigned long nr_sects = self->free / PAGES_PER_SECTION;
+	unsigned long i;
+	struct page *head;
+
+	if (!nr_pages)
+		return;
+
+	pr_debug("%s: marking %px - %px as Vmemmap (%ld pages)\n",
+						__func__,
+						pfn_to_page(pfn),
+						pfn_to_page(pfn + nr_pages - 1),
+						nr_pages);
+
+	/*
+	 * All allocations for memory hotplug are the same size, so align
+	 * should be 0.
+	 */
+	WARN_ON(self->align);
+
+	/*
+	 * Layout of vmemmap pages:
+	 * [Head->refcount] : Nr sections used by this altmap
+	 * [Head->private]  : Nr of vmemmap pages
+	 * [Tail->freelist] : Pointer to the head page
+	 */
+
+	/*
+	 * Head, first vmemmap page
+	 */
+	head = pfn_to_page(pfn);
+	for (i = 0; i < nr_pages; i++, pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		mm_zero_struct_page(page);
+		__SetPageVmemmap(page);
+		page->freelist = head;
+		init_page_count(page);
+	}
+	set_page_count(head, (int)nr_sects);
+	set_page_private(head, nr_pages);
+}
+/*
+ * If the range we are trying to remove was hot-added with vmemmap pages
+ * using MHP_MEMMAP_DEVICE, we need to keep track of it to know for how
+ * long we have to defer the freeing.
+ * Since sections are removed sequentially in __remove_pages()->
+ * __remove_section(), we just wait until we hit the last section.
+ * Once that happens, we can trigger free_deferred_vmemmap_range() to
+ * actually free the whole memory range.
+ */
+static struct page *head_vmemmap_page = NULL;
+static bool freeing_vmemmap_range = false;
+
+static inline bool vmemmap_dec_and_test(void)
+{
+	return page_ref_dec_and_test(head_vmemmap_page);
+}
+
+static void free_deferred_vmemmap_range(unsigned long start,
+					unsigned long end)
+{
+	unsigned long nr_pages = end - start;
+	unsigned long first_section = (unsigned long)head_vmemmap_page;
+
+	while (start >= first_section) {
+		vmemmap_free(start, end, NULL);
+		end = start;
+		start -= nr_pages;
+	}
+	head_vmemmap_page = NULL;
+	freeing_vmemmap_range = false;
+}
+
+static void deferred_vmemmap_free(unsigned long start, unsigned long end)
+{
+	if (!freeing_vmemmap_range) {
+		freeing_vmemmap_range = true;
+		head_vmemmap_page = (struct page *)start;
+	}
+
+	if (vmemmap_dec_and_test())
+		free_deferred_vmemmap_range(start, end);
+}
+
 static struct page *populate_section_memmap(unsigned long pfn,
 		unsigned long nr_pages, int nid, struct vmem_altmap *altmap)
 {
@@ -647,6 +735,11 @@  static void depopulate_section_memmap(unsigned long pfn, unsigned long nr_pages,
 	unsigned long start = (unsigned long) pfn_to_page(pfn);
 	unsigned long end = start + nr_pages * sizeof(struct page);
 
+	if (PageVmemmap((struct page *)start) || freeing_vmemmap_range) {
+		deferred_vmemmap_free(start, end);
+		return;
+	}
+
 	vmemmap_free(start, end, altmap);
 }
 static void free_map_bootmem(struct page *memmap)