diff mbox series

[RFC,v2,2/2] mm/memory_hotplug: Shrink spanned pages when offlining memory

Message ID 20180817154127.28602-3-osalvador@techadventures.net (mailing list archive)
State New, archived
Headers show
Series Do not touch pages in remove_memory path | expand

Commit Message

Oscar Salvador Aug. 17, 2018, 3:41 p.m. UTC
From: Oscar Salvador <osalvador@suse.de>

Currently, we decrement zone/node spanned_pages when we
remove memory and not when we offline it.

This, besides of not being consistent with the current code,
implies that we can access steal pages if we never get to online
that memory.

In order to prevent that, we have to move all zone/pages stuff to
the offlining memory stage.
Removing memory path should only care about memory sections and memory
blocks.

Another thing to notice here is that this is not so easy to be done
as HMM/devm have a particular handling of memory-hotplug.
They do not go through the common path, and so, they do not
call either offline_pages() nor online_pages().

All they care about is to add the sections, move the pages to
ZONE_DEVICE, and in some cases, to create the linear mapping.

In order to do this more smooth, two new functions are created
to deal with these particular cases:

del_device_memory
add_device_memory

add_device_memory is in charge of

a) calling either arch_add_memory() or add_pages(), depending on whether
   we want a linear mapping
b) online the memory sections that correspond to the pfn range
c) calling move_pfn_range_to_zone() being zone ZONE_DEVICE to
   expand zone/pgdat spanned pages and initialize its pages

del_device_memory, on the other hand, is in charge of

a) offline the memory sections that correspond to the pfn range
b) calling shrink_pages(), which shrinks node/zone spanned pages.
c) calling either arch_remove_memory() or __remove_pages(), depending on
   whether we need to tear down the linear mapping or not

These two functions are called from:

add_device_memory:
	- devm_memremap_pages()
	- hmm_devmem_pages_create()

del_device_memory:
	- devm_memremap_pages_release()
	- hmm_devmem_release()

I think that this will get easier as soon as [1] gets merged.

Finally, shrink_pages() is moved to offline_pages(), so now,
all pages/zone handling is being taken care in online/offline_pages stage.

[1] https://lkml.org/lkml/2018/6/19/110

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 arch/ia64/mm/init.c            |   4 +-
 arch/powerpc/mm/mem.c          |  10 +--
 arch/sh/mm/init.c              |   4 +-
 arch/x86/mm/init_32.c          |   4 +-
 arch/x86/mm/init_64.c          |   8 +--
 include/linux/memory_hotplug.h |   9 ++-
 kernel/memremap.c              |  14 ++--
 kernel/resource.c              |  16 +++++
 mm/hmm.c                       |  32 ++++-----
 mm/memory_hotplug.c            | 143 +++++++++++++++++++++++++++--------------
 mm/sparse.c                    |   4 +-
 11 files changed, 145 insertions(+), 103 deletions(-)

Comments

David Hildenbrand Aug. 21, 2018, 1:17 p.m. UTC | #1
> add_device_memory is in charge of

I wouldn't use the terminology of onlining/offlining here. That applies
rather to memory that is exposed to the rest of the system (e.g. buddy
allocator, has underlying memory block devices). I guess it is rather a
pure setup/teardown of that device memory.

> 
> a) calling either arch_add_memory() or add_pages(), depending on whether
>    we want a linear mapping
> b) online the memory sections that correspond to the pfn range
> c) calling move_pfn_range_to_zone() being zone ZONE_DEVICE to
>    expand zone/pgdat spanned pages and initialize its pages
> 
> del_device_memory, on the other hand, is in charge of
> 
> a) offline the memory sections that correspond to the pfn range
> b) calling shrink_pages(), which shrinks node/zone spanned pages.
> c) calling either arch_remove_memory() or __remove_pages(), depending on
>    whether we need to tear down the linear mapping or not
> 
> These two functions are called from:
> 
> add_device_memory:
> 	- devm_memremap_pages()
> 	- hmm_devmem_pages_create()
> 
> del_device_memory:
> 	- devm_memremap_pages_release()
> 	- hmm_devmem_release()
> 
> I think that this will get easier as soon as [1] gets merged.
> 
> Finally, shrink_pages() is moved to offline_pages(), so now,
> all pages/zone handling is being taken care in online/offline_pages stage.
> 
> [1] https://lkml.org/lkml/2018/6/19/110
> 

[...]

> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index f90fa077b0c4..d04338ffabec 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1152,15 +1152,9 @@ int __ref arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *a
>  {
>  	unsigned long start_pfn = start >> PAGE_SHIFT;
>  	unsigned long nr_pages = size >> PAGE_SHIFT;
> -	struct page *page = pfn_to_page(start_pfn);
> -	struct zone *zone;
>  	int ret;
>  
> -	/* With altmap the first mapped page is offset from @start */
> -	if (altmap)
> -		page += vmem_altmap_offset(altmap);
> -	zone = page_zone(page);
> -	ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
> +	ret = __remove_pages(nid, start_pfn, nr_pages, altmap);

These changes certainly look nice.

[...]

> index 7a832b844f24..95df37686196 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -119,6 +119,8 @@ static void devm_memremap_pages_release(void *data)
>  	struct dev_pagemap *pgmap = data;
>  	struct device *dev = pgmap->dev;
>  	struct resource *res = &pgmap->res;
> +	struct vmem_altmap *altmap = pgmap->altmap_valid ?
> +					&pgmap->altmap : NULL;
>  	resource_size_t align_start, align_size;
>  	unsigned long pfn;
>  	int nid;
> @@ -138,8 +140,7 @@ static void devm_memremap_pages_release(void *data)
>  	nid = dev_to_node(dev);
>  
>  	mem_hotplug_begin();

I would really like to see the mem_hotplug_begin/end also getting moved
inside add_device_memory()/del_device_memory(). (just like for
add/remove_memory)

I wonder if kasan_ stuff actually requires this lock, or if it could
also be somehow moved inside add_device_memory/del_device_memory.

> -	arch_remove_memory(nid, align_start, align_size, pgmap->altmap_valid ?
> -			&pgmap->altmap : NULL);
> +	del_device_memory(nid, align_start, align_size, altmap, true);
>  	kasan_remove_zero_shadow(__va(align_start), align_size);
>  	mem_hotplug_done();
>  
> @@ -175,7 +176,7 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>  {
>  	resource_size_t align_start, align_size, align_end;
>  	struct vmem_altmap *altmap = pgmap->altmap_valid ?
> -			&pgmap->altmap : NULL;
> +					&pgmap->altmap : NULL;
>  	struct resource *res = &pgmap->res;
>  	unsigned long pfn, pgoff, order;
>  	pgprot_t pgprot = PAGE_KERNEL;
> @@ -249,11 +250,8 @@ void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
>  		goto err_kasan;
>  	}
>  
> -	error = arch_add_memory(nid, align_start, align_size, altmap, false);
> -	if (!error)
> -		move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> -					align_start >> PAGE_SHIFT,
> -					align_size >> PAGE_SHIFT, altmap);
> +	error = add_device_memory(nid, align_start, align_size, altmap, true);
> +
>  	mem_hotplug_done();
>  	if (error)
>  		goto err_add_memory;
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 30e1bc68503b..8e68b5ca67ca 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1262,6 +1262,22 @@ int release_mem_region_adjustable(struct resource *parent,
>  			continue;
>  		}
>  
> +		/*
> +		 * All memory regions added from memory-hotplug path
> +		 * have the flag IORESOURCE_SYSTEM_RAM.
> +		 * IORESOURCE_SYSTEM_RAM = (IORESOURCE_MEM|IORESOURCE_SYSRAM)
> +		 * If the resource does not have this flag, we know that
> +		 * we are dealing with a resource coming from HMM/devm.
> +		 * HMM/devm use another mechanism to add/release a resource.
> +		 * This goes via devm_request_mem_region/devm_release_mem_region.
> +		 * HMM/devm take care to release their resources when they want, so
> +		 * if we are dealing with them, let us just back off nicely
> +		 */

Maybe shorten that a bit

"HMM/devm memory does not have IORESOURCE_SYSTEM_RAM set. They use
 devm_request_mem_region/devm_release_mem_region to add/release a
 resource. Just back off here."

> +		if (!(res->flags & IORESOURCE_SYSRAM)) {
> +			ret = 0;
> +			break;
> +		}
> +
>  		if (!(res->flags & IORESOURCE_MEM))
>  			break;
>  
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 21787e480b4a..23ce7fbdb651 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -996,6 +996,7 @@ static void hmm_devmem_release(struct device *dev, void *data)
>  	struct zone *zone;
>  	struct page *page;
>  	int nid;
> +	bool mapping;
>  
>  	if (percpu_ref_tryget_live(&devmem->ref)) {
>  		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
> @@ -1012,12 +1013,14 @@ static void hmm_devmem_release(struct device *dev, void *data)
>  
>  	mem_hotplug_begin();
>  	if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
> -		__remove_pages(zone, start_pfn, npages, NULL);
> +		mapping = false;
>  	else
> -		arch_remove_memory(nid, start_pfn << PAGE_SHIFT,
> -				   npages << PAGE_SHIFT, NULL);
> -	mem_hotplug_done();
> +		mapping = true;
>  
> +	del_device_memory(nid, start_pfn << PAGE_SHIFT, npages << PAGE_SHIFT,
> +									NULL,
> +									mapping);
> +	mem_hotplug_done();
>  	hmm_devmem_radix_release(resource);
>  }
>  
> @@ -1027,6 +1030,7 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  	struct device *device = devmem->device;
>  	int ret, nid, is_ram;
>  	unsigned long pfn;
> +	bool mapping;
>  
>  	align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
>  	align_size = ALIGN(devmem->resource->start +
> @@ -1085,7 +1089,6 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  	if (nid < 0)
>  		nid = numa_mem_id();
>  
> -	mem_hotplug_begin();
>  	/*
>  	 * For device private memory we call add_pages() as we only need to
>  	 * allocate and initialize struct page for the device memory. More-
> @@ -1096,21 +1099,18 @@ static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
>  	 * For device public memory, which is accesible by the CPU, we do
>  	 * want the linear mapping and thus use arch_add_memory().
>  	 */
> +	mem_hotplug_begin();
>  	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
> -		ret = arch_add_memory(nid, align_start, align_size, NULL,
> -				false);
> +		mapping = true;
>  	else
> -		ret = add_pages(nid, align_start >> PAGE_SHIFT,
> -				align_size >> PAGE_SHIFT, NULL, false);
> -	if (ret) {
> -		mem_hotplug_done();
> -		goto error_add_memory;
> -	}
> -	move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> -				align_start >> PAGE_SHIFT,
> -				align_size >> PAGE_SHIFT, NULL);
> +		mapping = false;
> +
> +	ret = add_device_memory(nid, align_start, align_size, NULL, mapping);
>  	mem_hotplug_done();
>  
> +	if (ret)
> +		goto error_add_memory;
> +
>  	for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
>  		struct page *page = pfn_to_page(pfn);
>  
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 9bd629944c91..60b67f09956e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -320,12 +320,10 @@ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
>  				     unsigned long start_pfn,
>  				     unsigned long end_pfn)
>  {
> -	struct mem_section *ms;
> -
>  	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
> -		ms = __pfn_to_section(start_pfn);
> +		struct mem_section *ms = __pfn_to_section(start_pfn);
>  
> -		if (unlikely(!valid_section(ms)))
> +		if (unlikely(!online_section(ms)))
>  			continue;
>  
>  		if (unlikely(pfn_to_nid(start_pfn) != nid))
> @@ -345,15 +343,14 @@ static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
>  				    unsigned long start_pfn,
>  				    unsigned long end_pfn)
>  {
> -	struct mem_section *ms;
>  	unsigned long pfn;
>  
>  	/* pfn is the end pfn of a memory section. */
>  	pfn = end_pfn - 1;
>  	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
> -		ms = __pfn_to_section(pfn);
> +		struct mem_section *ms = __pfn_to_section(pfn);
>  
> -		if (unlikely(!valid_section(ms)))
> +		if (unlikely(!online_section(ms)))
>  			continue;
>  
>  		if (unlikely(pfn_to_nid(pfn) != nid))
> @@ -415,7 +412,7 @@ static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
>  	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
>  		ms = __pfn_to_section(pfn);
>  
> -		if (unlikely(!valid_section(ms)))
> +		if (unlikely(!online_section(ms)))
>  			continue;
>  
>  		if (page_zone(pfn_to_page(pfn)) != zone)
> @@ -502,23 +499,33 @@ static void shrink_pgdat_span(struct pglist_data *pgdat,
>  	pgdat->node_spanned_pages = 0;
>  }
>  
> -static void __remove_zone(struct zone *zone, unsigned long start_pfn)
> +static void shrink_pages(struct zone *zone, unsigned long start_pfn,
> +						unsigned long end_pfn,
> +						unsigned long offlined_pages)
>  {
>  	struct pglist_data *pgdat = zone->zone_pgdat;
>  	int nr_pages = PAGES_PER_SECTION;
>  	unsigned long flags;
> +	unsigned long pfn;
>  
> -	pgdat_resize_lock(zone->zone_pgdat, &flags);
> -	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
> -	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
> -	pgdat_resize_unlock(zone->zone_pgdat, &flags);
> +	zone->present_pages -= offlined_pages;
> +
> +	clear_zone_contiguous(zone);
> +	pgdat_resize_lock(pgdat, &flags);
> +
> +	for(pfn = start_pfn; pfn < end_pfn; pfn += nr_pages) {
> +		shrink_zone_span(zone, pfn, pfn + nr_pages);
> +		shrink_pgdat_span(pgdat, pfn, pfn + nr_pages);
> +	}
> +	pgdat->node_present_pages -= offlined_pages;
> +
> +	pgdat_resize_unlock(pgdat, &flags);
> +	set_zone_contiguous(zone);
>  }
>  
> -static int __remove_section(struct zone *zone, struct mem_section *ms,
> +static int __remove_section(int nid, struct mem_section *ms,
>  		unsigned long map_offset, struct vmem_altmap *altmap)
>  {
> -	unsigned long start_pfn;
> -	int scn_nr;
>  	int ret = -EINVAL;
>  
>  	if (!valid_section(ms))
> @@ -528,17 +535,13 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
>  	if (ret)
>  		return ret;
>  
> -	scn_nr = __section_nr(ms);
> -	start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
> -	__remove_zone(zone, start_pfn);
> -
> -	sparse_remove_one_section(zone, ms, map_offset, altmap);
> +	sparse_remove_one_section(nid, ms, map_offset, altmap);
>  	return 0;
>  }
>  
>  /**
>   * __remove_pages() - remove sections of pages from a zone
> - * @zone: zone from which pages need to be removed
> + * @nid: nid from which pages belong to
>   * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
>   * @nr_pages: number of pages to remove (must be multiple of section size)
>   * @altmap: alternative device page map or %NULL if default memmap is used
> @@ -548,35 +551,27 @@ static int __remove_section(struct zone *zone, struct mem_section *ms,
>   * sure that pages are marked reserved and zones are adjust properly by
>   * calling offline_pages().
>   */
> -int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
> +int __remove_pages(int nid, unsigned long phys_start_pfn,
>  		 unsigned long nr_pages, struct vmem_altmap *altmap)
>  {
>  	unsigned long i;
>  	unsigned long map_offset = 0;
>  	int sections_to_remove, ret = 0;
> +	resource_size_t start, size;
>  
> -	/* In the ZONE_DEVICE case device driver owns the memory region */
> -	if (is_dev_zone(zone)) {
> -		if (altmap)
> -			map_offset = vmem_altmap_offset(altmap);
> -	} else {
> -		resource_size_t start, size;
> -
> -		start = phys_start_pfn << PAGE_SHIFT;
> -		size = nr_pages * PAGE_SIZE;
> +	start = phys_start_pfn << PAGE_SHIFT;
> +	size = nr_pages * PAGE_SIZE;
>  
> -		ret = release_mem_region_adjustable(&iomem_resource, start,
> -					size);
> -		if (ret) {
> -			resource_size_t endres = start + size - 1;
> +	if (altmap)
> +		map_offset = vmem_altmap_offset(altmap);
>  
> -			pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
> -					&start, &endres, ret);
> -		}
> +	ret = release_mem_region_adjustable(&iomem_resource, start, size);
> +	if (ret) {
> +		resource_size_t endres = start + size - 1;
> +		pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
> +						&start, &endres, ret);
>  	}
>  
> -	clear_zone_contiguous(zone);
> -
>  	/*
>  	 * We can only remove entire sections
>  	 */
> @@ -587,15 +582,13 @@ int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
>  	for (i = 0; i < sections_to_remove; i++) {
>  		unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
>  
> -		ret = __remove_section(zone, __pfn_to_section(pfn), map_offset,
> -				altmap);
> +		ret = __remove_section(nid, __pfn_to_section(pfn), map_offset,
> +									altmap);
>  		map_offset = 0;
>  		if (ret)
>  			break;
>  	}
>  
> -	set_zone_contiguous(zone);
> -
>  	return ret;
>  }
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> @@ -1595,7 +1588,6 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	unsigned long pfn, nr_pages;
>  	long offlined_pages;
>  	int ret, node;
> -	unsigned long flags;
>  	unsigned long valid_start, valid_end;
>  	struct zone *zone;
>  	struct memory_notify arg;
> @@ -1665,11 +1657,9 @@ static int __ref __offline_pages(unsigned long start_pfn,
>  	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
>  	/* removal success */
>  	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
> -	zone->present_pages -= offlined_pages;
>  
> -	pgdat_resize_lock(zone->zone_pgdat, &flags);
> -	zone->zone_pgdat->node_present_pages -= offlined_pages;
> -	pgdat_resize_unlock(zone->zone_pgdat, &flags);
> +	/* Here we will shrink zone/node's spanned/present_pages */
> +	shrink_pages(zone, valid_start, valid_end, offlined_pages);
>  
>  	init_per_zone_wmark_min();
>  
> @@ -1902,4 +1892,57 @@ void __ref remove_memory(int nid, u64 start, u64 size)
>  	mem_hotplug_done();
>  }
>  EXPORT_SYMBOL_GPL(remove_memory);
> +
> +static int __del_device_memory(int nid, unsigned long start, unsigned long size,
> +				struct vmem_altmap *altmap, bool mapping)
> +{
> +	int ret;
> +	unsigned long start_pfn = PHYS_PFN(start);
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
> +
> +	offline_mem_sections(start_pfn, start_pfn + nr_pages);
> +	shrink_pages(zone, start_pfn, start_pfn + nr_pages, 0);
> +
> +	if (mapping)
> +		ret = arch_remove_memory(nid, start, size, altmap);
> +	else
> +		ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
> +
> +	return ret;
> +}
> +
> +int del_device_memory(int nid, unsigned long start, unsigned long size,
> +			struct vmem_altmap *altmap, bool mapping)
> +{
> +	return __del_device_memory(nid, start, size, altmap, mapping);
> +}
>  #endif /* CONFIG_MEMORY_HOTREMOVE */
> +
> +static int __add_device_memory(int nid, unsigned long start, unsigned long size,
> +				struct vmem_altmap *altmap, bool mapping)
> +{
> +	int ret;
> +        unsigned long start_pfn = PHYS_PFN(start);
> +        unsigned long nr_pages = size >> PAGE_SHIFT;
> +
> +	if (mapping)
> +		ret = arch_add_memory(nid, start, size, altmap, false);
> +        else
> +		ret = add_pages(nid, start_pfn, nr_pages, altmap, false);
> +
> +	if (!ret) {
> +		struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
> +
> +		online_mem_sections(start_pfn, start_pfn + nr_pages);
> +		move_pfn_range_to_zone(zone, start_pfn, nr_pages, altmap);
> +	}
> +
> +	return ret;
> +}
> +
> +int add_device_memory(int nid, unsigned long start, unsigned long size,
> +				struct vmem_altmap *altmap, bool mapping)
> +{
> +	return __add_device_memory(nid, start, size, altmap, mapping);
> +}

Any reason for these indirections?

> diff --git a/mm/sparse.c b/mm/sparse.c
> index 10b07eea9a6e..016020bd20b5 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -766,12 +766,12 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap,
>  		free_map_bootmem(memmap);
>  }


I guess for readability, this patch could be split up into several
patches. E.g. factoring out of add_device_memory/del_device_memory,
release_mem_region_adjustable change ...
Oscar Salvador Aug. 22, 2018, 7:50 a.m. UTC | #2
On Tue, Aug 21, 2018 at 03:17:10PM +0200, David Hildenbrand wrote:
> > add_device_memory is in charge of
> 
> I wouldn't use the terminology of onlining/offlining here. That applies
> rather to memory that is exposed to the rest of the system (e.g. buddy
> allocator, has underlying memory block devices). I guess it is rather a
> pure setup/teardown of that device memory.

Hi David,

I am not sure if you are referring to:

"
a) calling either arch_add_memory() or add_pages(), depending on whether
   we want a linear mapping
b) online the memory sections that correspond to the pfn range
c) calling move_pfn_range_to_zone() being zone ZONE_DEVICE to
   expand zone/pgdat spanned pages and initialize its pages
"

Well, that is partialy true.
I mean, in order to make this work, we need to offline/online the memory
sections, because shrink_pages will rely on that from now on.
Is what we do when online/offline pages, but since device memory
does not go through the "official" channels, we need to do it there
as well.

Sure I can use another terminology, but since that is what
offline/online_mem_sections do, I just came up with that.

> I would really like to see the mem_hotplug_begin/end also getting moved
> inside add_device_memory()/del_device_memory(). (just like for
> add/remove_memory)
> 
> I wonder if kasan_ stuff actually requires this lock, or if it could
> also be somehow moved inside add_device_memory/del_device_memory.

Yes, that was my first approach, but then I saw that the kasan stuff is being
handled whithin those locks, so I was not sure and I backed off leaving the
mem_hotplug_begin/end where they were.

Maybe Jerome can shed some light and, and we can just handle the kasan stuff
out of the locks.

> Maybe shorten that a bit
> 
> "HMM/devm memory does not have IORESOURCE_SYSTEM_RAM set. They use
>  devm_request_mem_region/devm_release_mem_region to add/release a
>  resource. Just back off here."

Uhm, fair enough.

> Any reason for these indirections?

I wanted to hide the internals in the memory_hotplug code.
I thought about removing them, but I finally left them.
If people think that we are better off without them, I can just
remove them.

> I guess for readability, this patch could be split up into several
> patches. E.g. factoring out of add_device_memory/del_device_memory,
> release_mem_region_adjustable change ...

Yes, really true.
But I wanted first to gather feedback mainly from HMM/devm people to see
if they saw an outright bug within the series because I am not so
familiar with that part of the code.

Feedback from Jerome/Dan will be appreciate as well to see if this is a good
direction.

But you are right, in the end, this will have to be slipt up into several
parts to ease the review.

Thanks for reviewing this David!
I will try to address your concerns.

Thanks
David Hildenbrand Aug. 22, 2018, 8:11 a.m. UTC | #3
On 22.08.2018 09:50, Oscar Salvador wrote:
> On Tue, Aug 21, 2018 at 03:17:10PM +0200, David Hildenbrand wrote:
>>> add_device_memory is in charge of
>>
>> I wouldn't use the terminology of onlining/offlining here. That applies
>> rather to memory that is exposed to the rest of the system (e.g. buddy
>> allocator, has underlying memory block devices). I guess it is rather a
>> pure setup/teardown of that device memory.
> 
> Hi David,
> 
> I am not sure if you are referring to:
> 
> "
> a) calling either arch_add_memory() or add_pages(), depending on whether
>    we want a linear mapping
> b) online the memory sections that correspond to the pfn range
> c) calling move_pfn_range_to_zone() being zone ZONE_DEVICE to
>    expand zone/pgdat spanned pages and initialize its pages
> "
> 
> Well, that is partialy true.
> I mean, in order to make this work, we need to offline/online the memory
> sections, because shrink_pages will rely on that from now on.
> Is what we do when online/offline pages, but since device memory
> does not go through the "official" channels, we need to do it there
> as well.
> 
> Sure I can use another terminology, but since that is what
> offline/online_mem_sections do, I just came up with that.
> 

Okay, got it, so it is basically "mark the sections as online/offline".

>> I would really like to see the mem_hotplug_begin/end also getting moved
>> inside add_device_memory()/del_device_memory(). (just like for
>> add/remove_memory)
>>
>> I wonder if kasan_ stuff actually requires this lock, or if it could
>> also be somehow moved inside add_device_memory/del_device_memory.
> 
> Yes, that was my first approach, but then I saw that the kasan stuff is being
> handled whithin those locks, so I was not sure and I backed off leaving the
> mem_hotplug_begin/end where they were.
> 
> Maybe Jerome can shed some light and, and we can just handle the kasan stuff
> out of the locks.
> 
>> Maybe shorten that a bit
>>
>> "HMM/devm memory does not have IORESOURCE_SYSTEM_RAM set. They use
>>  devm_request_mem_region/devm_release_mem_region to add/release a
>>  resource. Just back off here."
> 
> Uhm, fair enough.
> 
>> Any reason for these indirections?
> 
> I wanted to hide the internals in the memory_hotplug code.
> I thought about removing them, but I finally left them.
> If people think that we are better off without them, I can just
> remove them.

I don't see a need for that. (everyone following the functions has to go
via one indirection that just passes on parameters). It is also not done
for other functions (a.g. add_memory)

> 
>> I guess for readability, this patch could be split up into several
>> patches. E.g. factoring out of add_device_memory/del_device_memory,
>> release_mem_region_adjustable change ...
> 
> Yes, really true.
> But I wanted first to gather feedback mainly from HMM/devm people to see
> if they saw an outright bug within the series because I am not so
> familiar with that part of the code.
> 
> Feedback from Jerome/Dan will be appreciate as well to see if this is a good
> direction.

Yes, they probably know best how this all fits together.

> 
> But you are right, in the end, this will have to be slipt up into several
> parts to ease the review.
> 
> Thanks for reviewing this David!
> I will try to address your concerns.
> 
> Thanks 
>
Pasha Tatashin Aug. 29, 2018, 11:09 p.m. UTC | #4
On 8/17/18 11:41 AM, Oscar Salvador wrote:
> From: Oscar Salvador <osalvador@suse.de>
> 
> Currently, we decrement zone/node spanned_pages when we
> remove memory and not when we offline it.
> 
> This, besides of not being consistent with the current code,
> implies that we can access steal pages if we never get to online
> that memory.
> 
> In order to prevent that, we have to move all zone/pages stuff to
> the offlining memory stage.
> Removing memory path should only care about memory sections and memory
> blocks.
> 
> Another thing to notice here is that this is not so easy to be done
> as HMM/devm have a particular handling of memory-hotplug.
> They do not go through the common path, and so, they do not
> call either offline_pages() nor online_pages().
> 
> All they care about is to add the sections, move the pages to
> ZONE_DEVICE, and in some cases, to create the linear mapping.
> 
> In order to do this more smooth, two new functions are created
> to deal with these particular cases:
> 
> del_device_memory
> add_device_memory
> 
> add_device_memory is in charge of
> 
> a) calling either arch_add_memory() or add_pages(), depending on whether
>    we want a linear mapping
> b) online the memory sections that correspond to the pfn range
> c) calling move_pfn_range_to_zone() being zone ZONE_DEVICE to
>    expand zone/pgdat spanned pages and initialize its pages
> 
> del_device_memory, on the other hand, is in charge of
> 
> a) offline the memory sections that correspond to the pfn range
> b) calling shrink_pages(), which shrinks node/zone spanned pages.
> c) calling either arch_remove_memory() or __remove_pages(), depending on
>    whether we need to tear down the linear mapping or not
> 
> These two functions are called from:
> 
> add_device_memory:
> 	- devm_memremap_pages()
> 	- hmm_devmem_pages_create()
> 
> del_device_memory:
> 	- devm_memremap_pages_release()
> 	- hmm_devmem_release()
> 
> I think that this will get easier as soon as [1] gets merged.
> 
> Finally, shrink_pages() is moved to offline_pages(), so now,
> all pages/zone handling is being taken care in online/offline_pages stage.
> 
> [1] https://lkml.org/lkml/2018/6/19/110
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  arch/ia64/mm/init.c            |   4 +-
>  arch/powerpc/mm/mem.c          |  10 +--
>  arch/sh/mm/init.c              |   4 +-
>  arch/x86/mm/init_32.c          |   4 +-
>  arch/x86/mm/init_64.c          |   8 +--
>  include/linux/memory_hotplug.h |   9 ++-
>  kernel/memremap.c              |  14 ++--
>  kernel/resource.c              |  16 +++++
>  mm/hmm.c                       |  32 ++++-----
>  mm/memory_hotplug.c            | 143 +++++++++++++++++++++++++++--------------
>  mm/sparse.c                    |   4 +-
>  11 files changed, 145 insertions(+), 103 deletions(-)

Hi Oscar,

I have been studying this patch, and do not see anything bad about it
except that it begs to be split into smaller patches. I think you can
send this work as a series without RFC if this patch is split into 3 or
so patches. I will review that series.

Thank you,
Pavel
Oscar Salvador Aug. 31, 2018, 8:50 p.m. UTC | #5
On Wed, Aug 29, 2018 at 11:09:01PM +0000, Pasha Tatashin wrote:
> Hi Oscar,
> 
> I have been studying this patch, and do not see anything bad about it
> except that it begs to be split into smaller patches. I think you can
> send this work as a series without RFC if this patch is split into 3 or
> so patches. I will review that series.

Thanks Pavel for having taken a look at this.
I will split up the patch and re-send it without RFC.

Thanks!
diff mbox series

Patch

diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index b54d0ee74b53..30ed58e5e86c 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -666,11 +666,9 @@  int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 	int ret;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
+	ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
 	if (ret)
 		pr_warn("%s: Problem encountered in __remove_pages() as"
 			" ret=%d\n", __func__,  ret);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 9e17d57a9948..f3b47d02b879 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -146,15 +146,7 @@  int __meminit arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altma
 	struct page *page;
 	int ret;
 
-	/*
-	 * If we have an altmap then we need to skip over any reserved PFNs
-	 * when querying the zone.
-	 */
-	page = pfn_to_page(start_pfn);
-	if (altmap)
-		page += vmem_altmap_offset(altmap);
-
-	ret = __remove_pages(page_zone(page), start_pfn, nr_pages, altmap);
+	ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
 	if (ret)
 		return ret;
 
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 5c91bb6abc35..6f6e7d321b1f 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -448,11 +448,9 @@  int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 	int ret;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
+	ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
 	if (unlikely(ret))
 		pr_warn("%s: Failed, __remove_pages() == %d\n", __func__,
 			ret);
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 9fa503f2f56c..88909e88c873 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -865,10 +865,8 @@  int arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct zone *zone;
 
-	zone = page_zone(pfn_to_page(start_pfn));
-	return __remove_pages(zone, start_pfn, nr_pages, altmap);
+	return __remove_pages(nid, start_pfn, nr_pages, altmap);
 }
 #endif
 #endif
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index f90fa077b0c4..d04338ffabec 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1152,15 +1152,9 @@  int __ref arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *a
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
-	struct page *page = pfn_to_page(start_pfn);
-	struct zone *zone;
 	int ret;
 
-	/* With altmap the first mapped page is offset from @start */
-	if (altmap)
-		page += vmem_altmap_offset(altmap);
-	zone = page_zone(page);
-	ret = __remove_pages(zone, start_pfn, nr_pages, altmap);
+	ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
 	WARN_ON_ONCE(ret);
 	kernel_physical_mapping_remove(start, start + size);
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index c68b03fd87bd..e1446dbb37dc 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -109,14 +109,19 @@  static inline bool movable_node_is_enabled(void)
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern int arch_remove_memory(int nid, u64 start, u64 size,
 		struct vmem_altmap *altmap);
-extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
+extern int __remove_pages(int nid, unsigned long start_pfn,
 	unsigned long nr_pages, struct vmem_altmap *altmap);
+extern int del_device_memory(int nid, unsigned long start, unsigned long size,
+					struct vmem_altmap *altmap, bool mapping);
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 /* reasonably generic interface to expand the physical pages */
 extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 		struct vmem_altmap *altmap, bool want_memblock);
 
+extern int add_device_memory(int nid, unsigned long start, unsigned long size,
+					struct vmem_altmap *altmap, bool mapping);
+
 #ifndef CONFIG_ARCH_HAS_ADD_PAGES
 static inline int add_pages(int nid, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap,
@@ -333,7 +338,7 @@  extern bool is_memblock_offlined(struct memory_block *mem);
 extern void remove_memory(int nid, u64 start, u64 size);
 extern int sparse_add_one_section(struct pglist_data *pgdat,
 		unsigned long start_pfn, struct vmem_altmap *altmap);
-extern void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+extern void sparse_remove_one_section(int nid, struct mem_section *ms,
 		unsigned long map_offset, struct vmem_altmap *altmap);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 7a832b844f24..95df37686196 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -119,6 +119,8 @@  static void devm_memremap_pages_release(void *data)
 	struct dev_pagemap *pgmap = data;
 	struct device *dev = pgmap->dev;
 	struct resource *res = &pgmap->res;
+	struct vmem_altmap *altmap = pgmap->altmap_valid ?
+					&pgmap->altmap : NULL;
 	resource_size_t align_start, align_size;
 	unsigned long pfn;
 	int nid;
@@ -138,8 +140,7 @@  static void devm_memremap_pages_release(void *data)
 	nid = dev_to_node(dev);
 
 	mem_hotplug_begin();
-	arch_remove_memory(nid, align_start, align_size, pgmap->altmap_valid ?
-			&pgmap->altmap : NULL);
+	del_device_memory(nid, align_start, align_size, altmap, true);
 	kasan_remove_zero_shadow(__va(align_start), align_size);
 	mem_hotplug_done();
 
@@ -175,7 +176,7 @@  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 {
 	resource_size_t align_start, align_size, align_end;
 	struct vmem_altmap *altmap = pgmap->altmap_valid ?
-			&pgmap->altmap : NULL;
+					&pgmap->altmap : NULL;
 	struct resource *res = &pgmap->res;
 	unsigned long pfn, pgoff, order;
 	pgprot_t pgprot = PAGE_KERNEL;
@@ -249,11 +250,8 @@  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 		goto err_kasan;
 	}
 
-	error = arch_add_memory(nid, align_start, align_size, altmap, false);
-	if (!error)
-		move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
-					align_start >> PAGE_SHIFT,
-					align_size >> PAGE_SHIFT, altmap);
+	error = add_device_memory(nid, align_start, align_size, altmap, true);
+
 	mem_hotplug_done();
 	if (error)
 		goto err_add_memory;
diff --git a/kernel/resource.c b/kernel/resource.c
index 30e1bc68503b..8e68b5ca67ca 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1262,6 +1262,22 @@  int release_mem_region_adjustable(struct resource *parent,
 			continue;
 		}
 
+		/*
+		 * All memory regions added from memory-hotplug path
+		 * have the flag IORESOURCE_SYSTEM_RAM.
+		 * IORESOURCE_SYSTEM_RAM = (IORESOURCE_MEM|IORESOURCE_SYSRAM)
+		 * If the resource does not have this flag, we know that
+		 * we are dealing with a resource coming from HMM/devm.
+		 * HMM/devm use another mechanism to add/release a resource.
+		 * This goes via devm_request_mem_region/devm_release_mem_region.
+		 * HMM/devm take care to release their resources when they want, so
+		 * if we are dealing with them, let us just back off nicely
+		 */
+		if (!(res->flags & IORESOURCE_SYSRAM)) {
+			ret = 0;
+			break;
+		}
+
 		if (!(res->flags & IORESOURCE_MEM))
 			break;
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 21787e480b4a..23ce7fbdb651 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -996,6 +996,7 @@  static void hmm_devmem_release(struct device *dev, void *data)
 	struct zone *zone;
 	struct page *page;
 	int nid;
+	bool mapping;
 
 	if (percpu_ref_tryget_live(&devmem->ref)) {
 		dev_WARN(dev, "%s: page mapping is still live!\n", __func__);
@@ -1012,12 +1013,14 @@  static void hmm_devmem_release(struct device *dev, void *data)
 
 	mem_hotplug_begin();
 	if (resource->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
-		__remove_pages(zone, start_pfn, npages, NULL);
+		mapping = false;
 	else
-		arch_remove_memory(nid, start_pfn << PAGE_SHIFT,
-				   npages << PAGE_SHIFT, NULL);
-	mem_hotplug_done();
+		mapping = true;
 
+	del_device_memory(nid, start_pfn << PAGE_SHIFT, npages << PAGE_SHIFT,
+									NULL,
+									mapping);
+	mem_hotplug_done();
 	hmm_devmem_radix_release(resource);
 }
 
@@ -1027,6 +1030,7 @@  static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	struct device *device = devmem->device;
 	int ret, nid, is_ram;
 	unsigned long pfn;
+	bool mapping;
 
 	align_start = devmem->resource->start & ~(PA_SECTION_SIZE - 1);
 	align_size = ALIGN(devmem->resource->start +
@@ -1085,7 +1089,6 @@  static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	if (nid < 0)
 		nid = numa_mem_id();
 
-	mem_hotplug_begin();
 	/*
 	 * For device private memory we call add_pages() as we only need to
 	 * allocate and initialize struct page for the device memory. More-
@@ -1096,21 +1099,18 @@  static int hmm_devmem_pages_create(struct hmm_devmem *devmem)
 	 * For device public memory, which is accesible by the CPU, we do
 	 * want the linear mapping and thus use arch_add_memory().
 	 */
+	mem_hotplug_begin();
 	if (devmem->pagemap.type == MEMORY_DEVICE_PUBLIC)
-		ret = arch_add_memory(nid, align_start, align_size, NULL,
-				false);
+		mapping = true;
 	else
-		ret = add_pages(nid, align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, NULL, false);
-	if (ret) {
-		mem_hotplug_done();
-		goto error_add_memory;
-	}
-	move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
-				align_start >> PAGE_SHIFT,
-				align_size >> PAGE_SHIFT, NULL);
+		mapping = false;
+
+	ret = add_device_memory(nid, align_start, align_size, NULL, mapping);
 	mem_hotplug_done();
 
+	if (ret)
+		goto error_add_memory;
+
 	for (pfn = devmem->pfn_first; pfn < devmem->pfn_last; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9bd629944c91..60b67f09956e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -320,12 +320,10 @@  static unsigned long find_smallest_section_pfn(int nid, struct zone *zone,
 				     unsigned long start_pfn,
 				     unsigned long end_pfn)
 {
-	struct mem_section *ms;
-
 	for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
-		ms = __pfn_to_section(start_pfn);
+		struct mem_section *ms = __pfn_to_section(start_pfn);
 
-		if (unlikely(!valid_section(ms)))
+		if (unlikely(!online_section(ms)))
 			continue;
 
 		if (unlikely(pfn_to_nid(start_pfn) != nid))
@@ -345,15 +343,14 @@  static unsigned long find_biggest_section_pfn(int nid, struct zone *zone,
 				    unsigned long start_pfn,
 				    unsigned long end_pfn)
 {
-	struct mem_section *ms;
 	unsigned long pfn;
 
 	/* pfn is the end pfn of a memory section. */
 	pfn = end_pfn - 1;
 	for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
-		ms = __pfn_to_section(pfn);
+		struct mem_section *ms = __pfn_to_section(pfn);
 
-		if (unlikely(!valid_section(ms)))
+		if (unlikely(!online_section(ms)))
 			continue;
 
 		if (unlikely(pfn_to_nid(pfn) != nid))
@@ -415,7 +412,7 @@  static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
 		ms = __pfn_to_section(pfn);
 
-		if (unlikely(!valid_section(ms)))
+		if (unlikely(!online_section(ms)))
 			continue;
 
 		if (page_zone(pfn_to_page(pfn)) != zone)
@@ -502,23 +499,33 @@  static void shrink_pgdat_span(struct pglist_data *pgdat,
 	pgdat->node_spanned_pages = 0;
 }
 
-static void __remove_zone(struct zone *zone, unsigned long start_pfn)
+static void shrink_pages(struct zone *zone, unsigned long start_pfn,
+						unsigned long end_pfn,
+						unsigned long offlined_pages)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int nr_pages = PAGES_PER_SECTION;
 	unsigned long flags;
+	unsigned long pfn;
 
-	pgdat_resize_lock(zone->zone_pgdat, &flags);
-	shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
-	shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
-	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+	zone->present_pages -= offlined_pages;
+
+	clear_zone_contiguous(zone);
+	pgdat_resize_lock(pgdat, &flags);
+
+	for(pfn = start_pfn; pfn < end_pfn; pfn += nr_pages) {
+		shrink_zone_span(zone, pfn, pfn + nr_pages);
+		shrink_pgdat_span(pgdat, pfn, pfn + nr_pages);
+	}
+	pgdat->node_present_pages -= offlined_pages;
+
+	pgdat_resize_unlock(pgdat, &flags);
+	set_zone_contiguous(zone);
 }
 
-static int __remove_section(struct zone *zone, struct mem_section *ms,
+static int __remove_section(int nid, struct mem_section *ms,
 		unsigned long map_offset, struct vmem_altmap *altmap)
 {
-	unsigned long start_pfn;
-	int scn_nr;
 	int ret = -EINVAL;
 
 	if (!valid_section(ms))
@@ -528,17 +535,13 @@  static int __remove_section(struct zone *zone, struct mem_section *ms,
 	if (ret)
 		return ret;
 
-	scn_nr = __section_nr(ms);
-	start_pfn = section_nr_to_pfn((unsigned long)scn_nr);
-	__remove_zone(zone, start_pfn);
-
-	sparse_remove_one_section(zone, ms, map_offset, altmap);
+	sparse_remove_one_section(nid, ms, map_offset, altmap);
 	return 0;
 }
 
 /**
  * __remove_pages() - remove sections of pages from a zone
- * @zone: zone from which pages need to be removed
+ * @nid: nid from which pages belong to
  * @phys_start_pfn: starting pageframe (must be aligned to start of a section)
  * @nr_pages: number of pages to remove (must be multiple of section size)
  * @altmap: alternative device page map or %NULL if default memmap is used
@@ -548,35 +551,27 @@  static int __remove_section(struct zone *zone, struct mem_section *ms,
  * sure that pages are marked reserved and zones are adjust properly by
  * calling offline_pages().
  */
-int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
+int __remove_pages(int nid, unsigned long phys_start_pfn,
 		 unsigned long nr_pages, struct vmem_altmap *altmap)
 {
 	unsigned long i;
 	unsigned long map_offset = 0;
 	int sections_to_remove, ret = 0;
+	resource_size_t start, size;
 
-	/* In the ZONE_DEVICE case device driver owns the memory region */
-	if (is_dev_zone(zone)) {
-		if (altmap)
-			map_offset = vmem_altmap_offset(altmap);
-	} else {
-		resource_size_t start, size;
-
-		start = phys_start_pfn << PAGE_SHIFT;
-		size = nr_pages * PAGE_SIZE;
+	start = phys_start_pfn << PAGE_SHIFT;
+	size = nr_pages * PAGE_SIZE;
 
-		ret = release_mem_region_adjustable(&iomem_resource, start,
-					size);
-		if (ret) {
-			resource_size_t endres = start + size - 1;
+	if (altmap)
+		map_offset = vmem_altmap_offset(altmap);
 
-			pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
-					&start, &endres, ret);
-		}
+	ret = release_mem_region_adjustable(&iomem_resource, start, size);
+	if (ret) {
+		resource_size_t endres = start + size - 1;
+		pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
+						&start, &endres, ret);
 	}
 
-	clear_zone_contiguous(zone);
-
 	/*
 	 * We can only remove entire sections
 	 */
@@ -587,15 +582,13 @@  int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
 	for (i = 0; i < sections_to_remove; i++) {
 		unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
 
-		ret = __remove_section(zone, __pfn_to_section(pfn), map_offset,
-				altmap);
+		ret = __remove_section(nid, __pfn_to_section(pfn), map_offset,
+									altmap);
 		map_offset = 0;
 		if (ret)
 			break;
 	}
 
-	set_zone_contiguous(zone);
-
 	return ret;
 }
 #endif /* CONFIG_MEMORY_HOTREMOVE */
@@ -1595,7 +1588,6 @@  static int __ref __offline_pages(unsigned long start_pfn,
 	unsigned long pfn, nr_pages;
 	long offlined_pages;
 	int ret, node;
-	unsigned long flags;
 	unsigned long valid_start, valid_end;
 	struct zone *zone;
 	struct memory_notify arg;
@@ -1665,11 +1657,9 @@  static int __ref __offline_pages(unsigned long start_pfn,
 	undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
 	/* removal success */
 	adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
-	zone->present_pages -= offlined_pages;
 
-	pgdat_resize_lock(zone->zone_pgdat, &flags);
-	zone->zone_pgdat->node_present_pages -= offlined_pages;
-	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+	/* Here we will shrink zone/node's spanned/present_pages */
+	shrink_pages(zone, valid_start, valid_end, offlined_pages);
 
 	init_per_zone_wmark_min();
 
@@ -1902,4 +1892,57 @@  void __ref remove_memory(int nid, u64 start, u64 size)
 	mem_hotplug_done();
 }
 EXPORT_SYMBOL_GPL(remove_memory);
+
+static int __del_device_memory(int nid, unsigned long start, unsigned long size,
+				struct vmem_altmap *altmap, bool mapping)
+{
+	int ret;
+	unsigned long start_pfn = PHYS_PFN(start);
+	unsigned long nr_pages = size >> PAGE_SHIFT;
+	struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
+
+	offline_mem_sections(start_pfn, start_pfn + nr_pages);
+	shrink_pages(zone, start_pfn, start_pfn + nr_pages, 0);
+
+	if (mapping)
+		ret = arch_remove_memory(nid, start, size, altmap);
+	else
+		ret = __remove_pages(nid, start_pfn, nr_pages, altmap);
+
+	return ret;
+}
+
+int del_device_memory(int nid, unsigned long start, unsigned long size,
+			struct vmem_altmap *altmap, bool mapping)
+{
+	return __del_device_memory(nid, start, size, altmap, mapping);
+}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
+
+static int __add_device_memory(int nid, unsigned long start, unsigned long size,
+				struct vmem_altmap *altmap, bool mapping)
+{
+	int ret;
+        unsigned long start_pfn = PHYS_PFN(start);
+        unsigned long nr_pages = size >> PAGE_SHIFT;
+
+	if (mapping)
+		ret = arch_add_memory(nid, start, size, altmap, false);
+        else
+		ret = add_pages(nid, start_pfn, nr_pages, altmap, false);
+
+	if (!ret) {
+		struct zone *zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
+
+		online_mem_sections(start_pfn, start_pfn + nr_pages);
+		move_pfn_range_to_zone(zone, start_pfn, nr_pages, altmap);
+	}
+
+	return ret;
+}
+
+int add_device_memory(int nid, unsigned long start, unsigned long size,
+				struct vmem_altmap *altmap, bool mapping)
+{
+	return __add_device_memory(nid, start, size, altmap, mapping);
+}
diff --git a/mm/sparse.c b/mm/sparse.c
index 10b07eea9a6e..016020bd20b5 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -766,12 +766,12 @@  static void free_section_usemap(struct page *memmap, unsigned long *usemap,
 		free_map_bootmem(memmap);
 }
 
-void sparse_remove_one_section(struct zone *zone, struct mem_section *ms,
+void sparse_remove_one_section(int nid, struct mem_section *ms,
 		unsigned long map_offset, struct vmem_altmap *altmap)
 {
 	struct page *memmap = NULL;
 	unsigned long *usemap = NULL, flags;
-	struct pglist_data *pgdat = zone->zone_pgdat;
+	struct pglist_data *pgdat = NODE_DATA(nid);
 
 	pgdat_resize_lock(pgdat, &flags);
 	if (ms->section_mem_map) {