diff mbox series

[v3] resource: Merge resources on a node when hot-adding memory

Message ID 20180809025409.31552-1-rashmica.g@gmail.com (mailing list archive)
State New, archived
Headers show
Series [v3] resource: Merge resources on a node when hot-adding memory | expand

Commit Message

Rashmica Gupta Aug. 9, 2018, 2:54 a.m. UTC
When hot-removing memory release_mem_region_adjustable() splits
iomem resources if they are not the exact size of the memory being
hot-deleted. Adding this memory back to the kernel adds a new
resource.

Eg a node has memory 0x0 - 0xfffffffff. Offlining and hot-removing
1GB from 0xf40000000 results in the single resource 0x0-0xfffffffff being
split into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff.

When we hot-add the memory back we now have three resources:
0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff.

Now if we try to remove some memory that overlaps these resources,
like 2GB from 0xf40000000, release_mem_region_adjustable() fails as it
expects the chunk of memory to be within the boundaries of a single
resource.

This patch adds a function request_resource_and_merge(). This is called
instead of request_resource_conflict() when registering a resource in
add_memory(). It calls request_resource_conflict() and if hot-removing is
enabled (if it isn't we won't get resource fragmentation) we attempt to
merge contiguous resources on the node.

Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>
---
v2->v3: Update Xen balloon, make the commit msg and a comment clearer,
and changed '>' to '>=' when comparing the end of a resource and the
end of a node.

v1->v2: Only attempt to merge resources if hot-remove is enabled.

 drivers/xen/balloon.c          |   3 +-
 include/linux/ioport.h         |   2 +
 include/linux/memory_hotplug.h |   2 +-
 kernel/resource.c              | 120 +++++++++++++++++++++++++++++++++++++++++
 mm/memory_hotplug.c            |  22 ++++----
 5 files changed, 136 insertions(+), 13 deletions(-)

Comments

Vlastimil Babka Aug. 9, 2018, 2:30 p.m. UTC | #1
On 08/09/2018 04:54 AM, Rashmica Gupta wrote:
> When hot-removing memory release_mem_region_adjustable() splits
> iomem resources if they are not the exact size of the memory being
> hot-deleted. Adding this memory back to the kernel adds a new
> resource.
> 
> Eg a node has memory 0x0 - 0xfffffffff. Offlining and hot-removing
> 1GB from 0xf40000000 results in the single resource 0x0-0xfffffffff being
> split into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff.
> 
> When we hot-add the memory back we now have three resources:
> 0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff.
> 
> Now if we try to remove some memory that overlaps these resources,
> like 2GB from 0xf40000000, release_mem_region_adjustable() fails as it
> expects the chunk of memory to be within the boundaries of a single
> resource.
> 
> This patch adds a function request_resource_and_merge(). This is called
> instead of request_resource_conflict() when registering a resource in
> add_memory(). It calls request_resource_conflict() and if hot-removing is
> enabled (if it isn't we won't get resource fragmentation) we attempt to
> merge contiguous resources on the node.
> 
> Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Mike Rapoport Aug. 9, 2018, 3:03 p.m. UTC | #2
On Thu, Aug 09, 2018 at 12:54:09PM +1000, Rashmica Gupta wrote:
> When hot-removing memory release_mem_region_adjustable() splits
> iomem resources if they are not the exact size of the memory being
> hot-deleted. Adding this memory back to the kernel adds a new
> resource.
> 
> Eg a node has memory 0x0 - 0xfffffffff. Offlining and hot-removing
> 1GB from 0xf40000000 results in the single resource 0x0-0xfffffffff being
> split into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff.
> 
> When we hot-add the memory back we now have three resources:
> 0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff.
> 
> Now if we try to remove some memory that overlaps these resources,
> like 2GB from 0xf40000000, release_mem_region_adjustable() fails as it
> expects the chunk of memory to be within the boundaries of a single
> resource.
> 
> This patch adds a function request_resource_and_merge(). This is called
> instead of request_resource_conflict() when registering a resource in
> add_memory(). It calls request_resource_conflict() and if hot-removing is
> enabled (if it isn't we won't get resource fragmentation) we attempt to
> merge contiguous resources on the node.
> 
> Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>

Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>

> ---
> v2->v3: Update Xen balloon, make the commit msg and a comment clearer,
> and changed '>' to '>=' when comparing the end of a resource and the
> end of a node.
> 
> v1->v2: Only attempt to merge resources if hot-remove is enabled.
> 
>  drivers/xen/balloon.c          |   3 +-
>  include/linux/ioport.h         |   2 +
>  include/linux/memory_hotplug.h |   2 +-
>  kernel/resource.c              | 120 +++++++++++++++++++++++++++++++++++++++++
>  mm/memory_hotplug.c            |  22 ++++----
>  5 files changed, 136 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index 065f0b607373..9b972b37b0da 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -401,7 +401,8 @@ static enum bp_state reserve_additional_memory(void)
>  	 * callers drop the mutex before trying again.
>  	 */
>  	mutex_unlock(&balloon_mutex);
> -	rc = add_memory_resource(nid, resource, memhp_auto_online);
> +	rc = add_memory_resource(nid, resource->start, resource_size(resource),
> +				 memhp_auto_online);
>  	mutex_lock(&balloon_mutex);
> 
>  	if (rc) {
> diff --git a/include/linux/ioport.h b/include/linux/ioport.h
> index da0ebaec25f0..f5b93a711e86 100644
> --- a/include/linux/ioport.h
> +++ b/include/linux/ioport.h
> @@ -189,6 +189,8 @@ extern int allocate_resource(struct resource *root, struct resource *new,
>  						       resource_size_t,
>  						       resource_size_t),
>  			     void *alignf_data);
> +extern struct resource *request_resource_and_merge(struct resource *parent,
> +						   struct resource *new, int nid);
>  struct resource *lookup_resource(struct resource *root, resource_size_t start);
>  int adjust_resource(struct resource *res, resource_size_t start,
>  		    resource_size_t size);
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 4e9828cda7a2..9c00f97c8cc6 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -322,7 +322,7 @@ static inline void remove_memory(int nid, u64 start, u64 size) {}
>  extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
>  		void *arg, int (*func)(struct memory_block *, void *));
>  extern int add_memory(int nid, u64 start, u64 size);
> -extern int add_memory_resource(int nid, struct resource *resource, bool online);
> +extern int add_memory_resource(int nid, u64 start, u64 size, bool online);
>  extern int arch_add_memory(int nid, u64 start, u64 size,
>  		struct vmem_altmap *altmap, bool want_memblock);
>  extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
> diff --git a/kernel/resource.c b/kernel/resource.c
> index 30e1bc68503b..a31d3f5bccb7 100644
> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -1621,3 +1621,123 @@ static int __init strict_iomem(char *str)
>  }
> 
>  __setup("iomem=", strict_iomem);
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +/*
> + * Attempt to merge resource and it's sibling
> + */
> +static int merge_resources(struct resource *res)
> +{
> +	struct resource *next;
> +	struct resource *tmp;
> +	uint64_t size;
> +	int ret = -EINVAL;
> +
> +	next = res->sibling;
> +
> +	/*
> +	 * Not sure how to handle two different children. So only attempt
> +	 * to merge two resources if neither have children, only one has a
> +	 * child or if both have the same child.
> +	 */
> +	if ((res->child && next->child) && (res->child != next->child))
> +		return ret;
> +
> +	if (res->end + 1 != next->start)
> +		return ret;
> +
> +	if (res->flags != next->flags)
> +		return ret;
> +
> +	/* Update sibling and child of resource */
> +	res->sibling = next->sibling;
> +	tmp = res->child;
> +	if (!res->child)
> +		res->child = next->child;
> +
> +	size = next->end - res->start + 1;
> +	ret = __adjust_resource(res, res->start, size);
> +	if (ret) {
> +		/* Failed so restore resource to original state */
> +		res->sibling = next;
> +		res->child = tmp;
> +		return ret;
> +	}
> +
> +	free_resource(next);
> +
> +	return ret;
> +}
> +
> +/*
> + * Attempt to merge resources on the node
> + */
> +static void merge_node_resources(int nid, struct resource *parent)
> +{
> +	struct resource *res;
> +	uint64_t start_addr;
> +	uint64_t end_addr;
> +	int ret;
> +
> +	start_addr = node_start_pfn(nid) << PAGE_SHIFT;
> +	end_addr = node_end_pfn(nid) << PAGE_SHIFT;
> +
> +	write_lock(&resource_lock);
> +
> +	/* Get the first resource */
> +	res = parent->child;
> +
> +	while (res) {
> +		/* Check that the resource is within the node */
> +		if (res->start < start_addr) {
> +			res = res->sibling;
> +			continue;
> +		}
> +		/* Exit if sibling resource is past end of node */
> +		if (res->sibling->end >= end_addr)
> +			break;
> +
> +		ret = merge_resources(res);
> +		if (!ret)
> +			continue;
> +		res = res->sibling;
> +	}
> +	write_unlock(&resource_lock);
> +}
> +#endif /* CONFIG_MEMORY_HOTREMOVE */
> +
> +/**
> + * request_resource_and_merge() - request an I/O or memory resource for hot-add
> + * @parent: parent resource descriptor
> + * @new: resource descriptor desired by caller
> + * @nid: node id of the node we want the resource on
> + *
> + * If no conflict resource then attempt to merge resources on the node.
> + *
> + * This is intended to cleanup the fragmentation of resources that occurs when
> + * hot-removing memory (see release_mem_region_adjustable). If hot-removing is
> + * not enabled then there is no point trying to merge resources.
> + *
> + * Note that the inability to merge resources is not an error.
> + *
> + * Return: NULL for successful request of resource and conflict resource if
> + * there was a conflict.
> + */
> +struct resource *request_resource_and_merge(struct resource *parent,
> +					    struct resource *new, int nid)
> +{
> +	struct resource *conflict;
> +
> +	conflict = request_resource_conflict(parent, new);
> +
> +	if (conflict)
> +		return conflict;
> +
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +	merge_node_resources(nid, parent);
> +#endif /* CONFIG_MEMORY_HOTREMOVE */
> +
> +	return NULL;
> +}
> +#endif /* CONFIG_MEMORY_HOTPLUG */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 7deb49f69e27..2e342f5ce322 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -97,7 +97,7 @@ void mem_hotplug_done(void)
>  }
> 
>  /* add this memory to iomem resource */
> -static struct resource *register_memory_resource(u64 start, u64 size)
> +static struct resource *register_memory_resource(int nid, u64 start, u64 size)
>  {
>  	struct resource *res, *conflict;
>  	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
> @@ -108,7 +108,7 @@ static struct resource *register_memory_resource(u64 start, u64 size)
>  	res->start = start;
>  	res->end = start + size - 1;
>  	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> -	conflict =  request_resource_conflict(&iomem_resource, res);
> +	conflict =  request_resource_and_merge(&iomem_resource, res, nid);
>  	if (conflict) {
>  		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
>  			pr_debug("Device unaddressable memory block "
> @@ -122,11 +122,15 @@ static struct resource *register_memory_resource(u64 start, u64 size)
>  	return res;
>  }
> 
> -static void release_memory_resource(struct resource *res)
> +static void release_memory_resource(struct resource *res, u64 start, u64 size)
>  {
>  	if (!res)
>  		return;
> +#ifdef CONFIG_MEMORY_HOTREMOVE
> +	release_mem_region_adjustable(&iomem_resource, start, size);
> +#else
>  	release_resource(res);
> +#endif
>  	kfree(res);
>  	return;
>  }
> @@ -1096,17 +1100,13 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>  }
> 
>  /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
> -int __ref add_memory_resource(int nid, struct resource *res, bool online)
> +int __ref add_memory_resource(int nid, u64 start, u64 size, bool online)
>  {
> -	u64 start, size;
>  	pg_data_t *pgdat = NULL;
>  	bool new_pgdat;
>  	bool new_node;
>  	int ret;
> 
> -	start = res->start;
> -	size = resource_size(res);
> -
>  	ret = check_hotplug_memory_range(start, size);
>  	if (ret)
>  		return ret;
> @@ -1195,13 +1195,13 @@ int __ref add_memory(int nid, u64 start, u64 size)
>  	struct resource *res;
>  	int ret;
> 
> -	res = register_memory_resource(start, size);
> +	res = register_memory_resource(nid, start, size);
>  	if (IS_ERR(res))
>  		return PTR_ERR(res);
> 
> -	ret = add_memory_resource(nid, res, memhp_auto_online);
> +	ret = add_memory_resource(nid, start, size, memhp_auto_online);
>  	if (ret < 0)
> -		release_memory_resource(res);
> +		release_memory_resource(res, start, size);
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(add_memory);
> -- 
> 2.14.4
>
Andrew Morton Aug. 10, 2018, 1:12 a.m. UTC | #3
On Thu,  9 Aug 2018 12:54:09 +1000 Rashmica Gupta <rashmica.g@gmail.com> wrote:

> When hot-removing memory release_mem_region_adjustable() splits
> iomem resources if they are not the exact size of the memory being
> hot-deleted. Adding this memory back to the kernel adds a new
> resource.
> 
> Eg a node has memory 0x0 - 0xfffffffff. Offlining and hot-removing
> 1GB from 0xf40000000 results in the single resource 0x0-0xfffffffff being
> split into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff.
> 
> When we hot-add the memory back we now have three resources:
> 0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff.
> 
> Now if we try to remove some memory that overlaps these resources,
> like 2GB from 0xf40000000, release_mem_region_adjustable() fails as it
> expects the chunk of memory to be within the boundaries of a single
> resource.
> 
> This patch adds a function request_resource_and_merge(). This is called
> instead of request_resource_conflict() when registering a resource in
> add_memory(). It calls request_resource_conflict() and if hot-removing is
> enabled (if it isn't we won't get resource fragmentation) we attempt to
> merge contiguous resources on the node.

What is the end-user impact of this patch?

Do you believe the fix should be merged into 4.18?  Backporting into
-stable kernels?  If so, why?

Thanks.
Rashmica Gupta Aug. 10, 2018, 6:55 a.m. UTC | #4
On Fri, Aug 10, 2018 at 11:12 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Thu,  9 Aug 2018 12:54:09 +1000 Rashmica Gupta <rashmica.g@gmail.com> wrote:
>
>> When hot-removing memory release_mem_region_adjustable() splits
>> iomem resources if they are not the exact size of the memory being
>> hot-deleted. Adding this memory back to the kernel adds a new
>> resource.
>>
>> Eg a node has memory 0x0 - 0xfffffffff. Offlining and hot-removing
>> 1GB from 0xf40000000 results in the single resource 0x0-0xfffffffff being
>> split into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff.
>>
>> When we hot-add the memory back we now have three resources:
>> 0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff.
>>
>> Now if we try to remove some memory that overlaps these resources,
>> like 2GB from 0xf40000000, release_mem_region_adjustable() fails as it
>> expects the chunk of memory to be within the boundaries of a single
>> resource.
>>
>> This patch adds a function request_resource_and_merge(). This is called
>> instead of request_resource_conflict() when registering a resource in
>> add_memory(). It calls request_resource_conflict() and if hot-removing is
>> enabled (if it isn't we won't get resource fragmentation) we attempt to
>> merge contiguous resources on the node.
>
> What is the end-user impact of this patch?
>

Only architectures/setups that allow the user to remove and add memory of
different sizes or different start addresses from the kernel at runtime will
potentially encounter the resource fragmentation.

Trying to remove memory that overlaps iomem resources the first time
gives us this warning: "Unable to release resource <%pa-%pa>".

Attempting a second time results in a kernel oops (on ppc at least).

With this patch the user will not be restricted, by resource fragmentation
caused by previous hotremove/hotplug attempts, to what chunks of memory
they can remove.



> Do you believe the fix should be merged into 4.18?  Backporting into
> -stable kernels?  If so, why?


I hit this when adding hot-add code to memtrace on ppc.

Most memory hotplug/hotremove seems to be block or section based, and
always adds and removes memory at the same place.

Memtrace on ppc is different in that given a size (aligned to a block size),
it scans each node and finds a chunk of memory of that size that we can offline
and then removes it.

As this is possibly only as issue for memtrace on ppc with a patch that is not
in 4.18, I don't think this code needs to go in 4.18.


>
> Thanks.
Vlastimil Babka Aug. 10, 2018, 8:28 a.m. UTC | #5
On 08/10/2018 08:55 AM, Rashmica Gupta wrote:
> On Fri, Aug 10, 2018 at 11:12 AM, Andrew Morton
> <akpm@linux-foundation.org> wrote:
>>
>> What is the end-user impact of this patch?
>>
> 
> Only architectures/setups that allow the user to remove and add memory of
> different sizes or different start addresses from the kernel at runtime will
> potentially encounter the resource fragmentation.
> 
> Trying to remove memory that overlaps iomem resources the first time
> gives us this warning: "Unable to release resource <%pa-%pa>".
> 
> Attempting a second time results in a kernel oops (on ppc at least).

An oops? I think that should be investigated and fixed, even if resource
merging prevents it. Do you have the details?

Thanks,
Vlastimil
Oscar Salvador Aug. 10, 2018, 12:26 p.m. UTC | #6
On Thu, Aug 09, 2018 at 12:54:09PM +1000, Rashmica Gupta wrote:
> When hot-removing memory release_mem_region_adjustable() splits
> iomem resources if they are not the exact size of the memory being
> hot-deleted. Adding this memory back to the kernel adds a new
> resource.
> 
> Eg a node has memory 0x0 - 0xfffffffff. Offlining and hot-removing
> 1GB from 0xf40000000 results in the single resource 0x0-0xfffffffff being
> split into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff.
> 
> When we hot-add the memory back we now have three resources:
> 0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff.
> 
> Now if we try to remove some memory that overlaps these resources,
> like 2GB from 0xf40000000, release_mem_region_adjustable() fails as it
> expects the chunk of memory to be within the boundaries of a single
> resource.
> 
> This patch adds a function request_resource_and_merge(). This is called
> instead of request_resource_conflict() when registering a resource in
> add_memory(). It calls request_resource_conflict() and if hot-removing is
> enabled (if it isn't we won't get resource fragmentation) we attempt to
> merge contiguous resources on the node.
> 
> Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>

Hi Rashmica,

Unfortunately this patch breaks memory-hotplug.

It makes my kernel go boom when hot-adding memory via qemu:

Way to reproduce it:

# connect to a qemu console
# add hot memory:

(qemu) object_add memory-backend-ram,id=ram0,size=1G
(qemu) device_add pc-dimm,id=dimm2,memdev=ram0,node=1

and...

kernel: BUG: unable to handle kernel paging request at 0000000000029ce8
kernel: PGD 0 P4D 0 
kernel: Oops: 0000 [#1] SMP PTI
kernel: CPU: 1 PID: 7 Comm: kworker/u4:0 Tainted: G            E     4.18.0-rc8-next-20180810-1-default+ #292
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
kernel: Workqueue: kacpi_hotplug acpi_hotplug_work_fn
kernel: RIP: 0010:request_resource_and_merge+0x51/0x120
kernel: Code: df e8 13 e6 ff ff c6 05 bc eb 50 01 00 48 85 c0 74 09 5b 5d 41 5c 41 5d 41 5e c3 4a 8b 04 e5 40 e6 00 82 48 c7 c7 d0 70 58 82 <4c> 8b a8 e8 9c 02 00 4d 89 ec 4c 03 a8 f8 9c 02 00 e8 89 aa 57 00
kernel: RSP: 0018:ffffc90000367d48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffffffff81e4e060 RCX: 000000013fffffff
kernel: RDX: 0000000100000000 RSI: ffff880077467580 RDI: ffffffff825870d0
kernel: RBP: ffff880077467580 R08: ffff88007ffabcf0 R09: ffff880077467580
kernel: R10: 0000000000000000 R11: ffff8800376eec09 R12: 0000000000000001
kernel: R13: 0000000040000000 R14: 0000000000000001 R15: 0000000000000001
kernel: FS:  0000000000000000(0000) GS:ffff88007db00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000029ce8 CR3: 00000000783ac000 CR4: 00000000000006a0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
kernel: Call Trace:
kernel:  add_memory+0x68/0x120
kernel:  acpi_memory_device_add+0x134/0x2e0
kernel:  acpi_bus_attach+0xd9/0x190
kernel:  acpi_bus_scan+0x37/0x70
kernel:  acpi_device_hotplug+0x389/0x4e0
kernel:  acpi_hotplug_work_fn+0x1a/0x30
kernel:  process_one_work+0x15f/0x350
kernel:  worker_thread+0x49/0x3e0
kernel:  kthread+0xf5/0x130
kernel:  ? max_active_store+0x60/0x60
kernel:  ? kthread_bind+0x10/0x10
kernel:  ret_from_fork+0x35/0x40
kernel: Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) bochs_drm(E) ttm(E) drm_kms_helper(E) drm(E) virtio_net(E) net_failover(E) i2c_piix4(E) parport_pc(E) parport(E) failover(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) nfit(E) libnvdimm(E) button(E) pcspkr(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) virtio_pci(E) virtio_ring(E) virtio(E) serio_raw(E) libata(E) sg(E) scsi_mod(E) autofs4(E)
kernel: CR2: 0000000000029ce8
kernel: ---[ end trace be1a8c4d1824ebf4 ]---
kernel: RIP: 0010:request_resource_and_merge+0x51/0x120
kernel: Code: df e8 13 e6 ff ff c6 05 bc eb 50 01 00 48 85 c0 74 09 5b 5d 41 5c 41 5d 41 5e c3 4a 8b 04 e5 40 e6 00 82 48 c7 c7 d0 70 58 82 <4c> 8b a8 e8 9c 02 00 4d 89 ec 4c 03 a8 f8 9c 02 00 e8 89 aa 57 00
kernel: RSP: 0018:ffffc90000367d48 EFLAGS: 00010246
kernel: RAX: 0000000000000000 RBX: ffffffff81e4e060 RCX: 000000013fffffff
kernel: RDX: 0000000100000000 RSI: ffff880077467580 RDI: ffffffff825870d0
kernel: RBP: ffff880077467580 R08: ffff88007ffabcf0 R09: ffff880077467580
kernel: R10: 0000000000000000 R11: ffff8800376eec09 R12: 0000000000000001
kernel: R13: 0000000040000000 R14: 0000000000000001 R15: 0000000000000001
kernel: FS:  0000000000000000(0000) GS:ffff88007db00000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000029ce8 CR3: 00000000783ac000 CR4: 00000000000006a0
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


The problem is in this function you added:

+static void merge_node_resources(int nid, struct resource *parent)
+{
+       struct resource *res;
+       uint64_t start_addr;
+       uint64_t end_addr;
+       int ret;
+
+       start_addr = node_start_pfn(nid) << PAGE_SHIFT;
+       end_addr = node_end_pfn(nid) << PAGE_SHIFT;


node_start_pfn() calls NODE_DATA(nid), which then tries to get the node_data[] structure,
and then try to dereference a value in there.
This will only work for node's that are already online, but if you are adding memory
to a new node, this will blow up.

In the case we are adding memory from a node which is not onlined yet, we online it later on
in add_memory_resource:

add_memore_resource
 __try_online_node
  hotadd_new_pgdat


static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
{
        struct pglist_data *pgdat;
        unsigned long start_pfn = PFN_DOWN(start);

        pgdat = NODE_DATA(nid);
        if (!pgdat) {
                pgdat = arch_alloc_nodedata(nid);
                if (!pgdat)
                        return NULL;

                arch_refresh_nodedata(nid, pgdat);
        }
	...
	...

I did not have time to think about a fix for that, so unless we come up with something,
this will have to be reverted for 4.18.

Thanks
Rashmica Gupta Aug. 10, 2018, 12:35 p.m. UTC | #7
On Fri, Aug 10, 2018, 10:26 PM Oscar Salvador <osalvador@techadventures.net>
wrote:

> On Thu, Aug 09, 2018 at 12:54:09PM +1000, Rashmica Gupta wrote:
> > When hot-removing memory release_mem_region_adjustable() splits
> > iomem resources if they are not the exact size of the memory being
> > hot-deleted. Adding this memory back to the kernel adds a new
> > resource.
> >
> > Eg a node has memory 0x0 - 0xfffffffff. Offlining and hot-removing
> > 1GB from 0xf40000000 results in the single resource 0x0-0xfffffffff being
> > split into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff.
> >
> > When we hot-add the memory back we now have three resources:
> > 0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff.
> >
> > Now if we try to remove some memory that overlaps these resources,
> > like 2GB from 0xf40000000, release_mem_region_adjustable() fails as it
> > expects the chunk of memory to be within the boundaries of a single
> > resource.
> >
> > This patch adds a function request_resource_and_merge(). This is called
> > instead of request_resource_conflict() when registering a resource in
> > add_memory(). It calls request_resource_conflict() and if hot-removing is
> > enabled (if it isn't we won't get resource fragmentation) we attempt to
> > merge contiguous resources on the node.
> >
> > Signed-off-by: Rashmica Gupta <rashmica.g@gmail.com>
>
> Hi Rashmica,
>
> Unfortunately this patch breaks memory-hotplug.
>
> It makes my kernel go boom when hot-adding memory via qemu:
>
> Way to reproduce it:
>
> # connect to a qemu console
> # add hot memory:
>
> (qemu) object_add memory-backend-ram,id=ram0,size=1G
> (qemu) device_add pc-dimm,id=dimm2,memdev=ram0,node=1
>
>
>
>
> and...
>
> kernel: BUG: unable to handle kernel paging request at 0000000000029ce8
> kernel: PGD 0 P4D 0
> kernel: Oops: 0000 [#1] SMP PTI
> kernel: CPU: 1 PID: 7 Comm: kworker/u4:0 Tainted: G            E
>  4.18.0-rc8-next-20180810-1-default+ #292
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.0.0-prebuilt.qemu-project.org 04/01/2014
> kernel: Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> kernel: RIP: 0010:request_resource_and_merge+0x51/0x120
> kernel: Code: df e8 13 e6 ff ff c6 05 bc eb 50 01 00 48 85 c0 74 09 5b 5d
> 41 5c 41 5d 41 5e c3 4a 8b 04 e5 40 e6 00 82 48 c7 c7 d0 70 58 82 <4c> 8b
> a8 e8 9c 02 00 4d 89 ec 4c 03 a8 f8 9c 02 00 e8 89 aa 57 00
> kernel: RSP: 0018:ffffc90000367d48 EFLAGS: 00010246
> kernel: RAX: 0000000000000000 RBX: ffffffff81e4e060 RCX: 000000013fffffff
> kernel: RDX: 0000000100000000 RSI: ffff880077467580 RDI: ffffffff825870d0
> kernel: RBP: ffff880077467580 R08: ffff88007ffabcf0 R09: ffff880077467580
> kernel: R10: 0000000000000000 R11: ffff8800376eec09 R12: 0000000000000001
> kernel: R13: 0000000040000000 R14: 0000000000000001 R15: 0000000000000001
> kernel: FS:  0000000000000000(0000) GS:ffff88007db00000(0000)
> knlGS:0000000000000000
> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 0000000000029ce8 CR3: 00000000783ac000 CR4: 00000000000006a0
> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> kernel: Call Trace:
> kernel:  add_memory+0x68/0x120
> kernel:  acpi_memory_device_add+0x134/0x2e0
> kernel:  acpi_bus_attach+0xd9/0x190
> kernel:  acpi_bus_scan+0x37/0x70
> kernel:  acpi_device_hotplug+0x389/0x4e0
> kernel:  acpi_hotplug_work_fn+0x1a/0x30
> kernel:  process_one_work+0x15f/0x350
> kernel:  worker_thread+0x49/0x3e0
> kernel:  kthread+0xf5/0x130
> kernel:  ? max_active_store+0x60/0x60
> kernel:  ? kthread_bind+0x10/0x10
> kernel:  ret_from_fork+0x35/0x40
> kernel: Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E)
> xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E)
> ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E)
> iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E)
> iptable_filter(E) ip_tables(E) x_tables(E) bochs_drm(E) ttm(E)
> drm_kms_helper(E) drm(E) virtio_net(E) net_failover(E) i2c_piix4(E)
> parport_pc(E) parport(E) failover(E) syscopyarea(E) sysfillrect(E)
> sysimgblt(E) fb_sys_fops(E) nfit(E) libnvdimm(E) button(E) pcspkr(E)
> btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E)
> raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E)
> virtio_pci(E) virtio_ring(E) virtio(E) serio_raw(E) libata(E) sg(E)
> scsi_mod(E) autofs4(E)
> kernel: CR2: 0000000000029ce8
> kernel: ---[ end trace be1a8c4d1824ebf4 ]---
> kernel: RIP: 0010:request_resource_and_merge+0x51/0x120
> kernel: Code: df e8 13 e6 ff ff c6 05 bc eb 50 01 00 48 85 c0 74 09 5b 5d
> 41 5c 41 5d 41 5e c3 4a 8b 04 e5 40 e6 00 82 48 c7 c7 d0 70 58 82 <4c> 8b
> a8 e8 9c 02 00 4d 89 ec 4c 03 a8 f8 9c 02 00 e8 89 aa 57 00
> kernel: RSP: 0018:ffffc90000367d48 EFLAGS: 00010246
> kernel: RAX: 0000000000000000 RBX: ffffffff81e4e060 RCX: 000000013fffffff
> kernel: RDX: 0000000100000000 RSI: ffff880077467580 RDI: ffffffff825870d0
> kernel: RBP: ffff880077467580 R08: ffff88007ffabcf0 R09: ffff880077467580
> kernel: R10: 0000000000000000 R11: ffff8800376eec09 R12: 0000000000000001
> kernel: R13: 0000000040000000 R14: 0000000000000001 R15: 0000000000000001
> kernel: FS:  0000000000000000(0000) GS:ffff88007db00000(0000)
> knlGS:0000000000000000
> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> kernel: CR2: 0000000000029ce8 CR3: 00000000783ac000 CR4: 00000000000006a0
> kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>
>
> The problem is in this function you added:
>
> +static void merge_node_resources(int nid, struct resource *parent)
> +{
> +       struct resource *res;
> +       uint64_t start_addr;
> +       uint64_t end_addr;
> +       int ret;
> +
> +       start_addr = node_start_pfn(nid) << PAGE_SHIFT;
> +       end_addr = node_end_pfn(nid) << PAGE_SHIFT;
>
>
> node_start_pfn() calls NODE_DATA(nid), which then tries to get the
> node_data[] structure,
> and then try to dereference a value in there.
> This will only work for node's that are already online, but if you are
> adding memory
> to a new node, this will blow up.
>
> In the case we are adding memory from a node which is not onlined yet, we
> online it later on
> in add_memory_resource:
>
> add_memore_resource
>  __try_online_node
>   hotadd_new_pgdat
>
>
> static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
> {
>         struct pglist_data *pgdat;
>         unsigned long start_pfn = PFN_DOWN(start);
>
>         pgdat = NODE_DATA(nid);
>         if (!pgdat) {
>                 pgdat = arch_alloc_nodedata(nid);
>                 if (!pgdat)
>                         return NULL;
>
>                 arch_refresh_nodedata(nid, pgdat);
>         }
>         ...
>         ...
>
> I did not have time to think about a fix for that, so unless we come up
> with something,
> this will have to be reverted for 4.18.
>
> Thanks
> --
> Oscar Salvador
> SUSE L3
>
<div dir="auto"><div><br><br><div class="gmail_quote"><div dir="ltr">On Fri, Aug 10, 2018, 10:26 PM Oscar Salvador &lt;<a href="mailto:osalvador@techadventures.net">osalvador@techadventures.net</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Thu, Aug 09, 2018 at 12:54:09PM +1000, Rashmica Gupta wrote:<br>
&gt; When hot-removing memory release_mem_region_adjustable() splits<br>
&gt; iomem resources if they are not the exact size of the memory being<br>
&gt; hot-deleted. Adding this memory back to the kernel adds a new<br>
&gt; resource.<br>
&gt; <br>
&gt; Eg a node has memory 0x0 - 0xfffffffff. Offlining and hot-removing<br>
&gt; 1GB from 0xf40000000 results in the single resource 0x0-0xfffffffff being<br>
&gt; split into two resources: 0x0-0xf3fffffff and 0xf80000000-0xfffffffff.<br>
&gt; <br>
&gt; When we hot-add the memory back we now have three resources:<br>
&gt; 0x0-0xf3fffffff, 0xf40000000-0xf7fffffff, and 0xf80000000-0xfffffffff.<br>
&gt; <br>
&gt; Now if we try to remove some memory that overlaps these resources,<br>
&gt; like 2GB from 0xf40000000, release_mem_region_adjustable() fails as it<br>
&gt; expects the chunk of memory to be within the boundaries of a single<br>
&gt; resource.<br>
&gt; <br>
&gt; This patch adds a function request_resource_and_merge(). This is called<br>
&gt; instead of request_resource_conflict() when registering a resource in<br>
&gt; add_memory(). It calls request_resource_conflict() and if hot-removing is<br>
&gt; enabled (if it isn&#39;t we won&#39;t get resource fragmentation) we attempt to<br>
&gt; merge contiguous resources on the node.<br>
&gt; <br>
&gt; Signed-off-by: Rashmica Gupta &lt;<a href="mailto:rashmica.g@gmail.com" target="_blank" rel="noreferrer">rashmica.g@gmail.com</a>&gt;<br>
<br>
Hi Rashmica,<br>
<br>
Unfortunately this patch breaks memory-hotplug.<br>
<br>
It makes my kernel go boom when hot-adding memory via qemu:<br>
<br>
Way to reproduce it:<br>
<br>
# connect to a qemu console<br>
# add hot memory:<br>
<br>
(qemu) object_add memory-backend-ram,id=ram0,size=1G<br>
(qemu) device_add pc-dimm,id=dimm2,memdev=ram0,node=1<br>
<br><br><br><br>
and...<br>
<br>
kernel: BUG: unable to handle kernel paging request at 0000000000029ce8<br>
kernel: PGD 0 P4D 0 <br>
kernel: Oops: 0000 [#1] SMP PTI<br>
kernel: CPU: 1 PID: 7 Comm: kworker/u4:0 Tainted: G            E     4.18.0-rc8-next-20180810-1-default+ #292<br>
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS <a href="http://1.0.0-prebuilt.qemu-project.org" rel="noreferrer noreferrer" target="_blank">1.0.0-prebuilt.qemu-project.org</a> 04/01/2014<br>
kernel: Workqueue: kacpi_hotplug acpi_hotplug_work_fn<br>
kernel: RIP: 0010:request_resource_and_merge+0x51/0x120<br>
kernel: Code: df e8 13 e6 ff ff c6 05 bc eb 50 01 00 48 85 c0 74 09 5b 5d 41 5c 41 5d 41 5e c3 4a 8b 04 e5 40 e6 00 82 48 c7 c7 d0 70 58 82 &lt;4c&gt; 8b a8 e8 9c 02 00 4d 89 ec 4c 03 a8 f8 9c 02 00 e8 89 aa 57 00<br>
kernel: RSP: 0018:ffffc90000367d48 EFLAGS: 00010246<br>
kernel: RAX: 0000000000000000 RBX: ffffffff81e4e060 RCX: 000000013fffffff<br>
kernel: RDX: 0000000100000000 RSI: ffff880077467580 RDI: ffffffff825870d0<br>
kernel: RBP: ffff880077467580 R08: ffff88007ffabcf0 R09: ffff880077467580<br>
kernel: R10: 0000000000000000 R11: ffff8800376eec09 R12: 0000000000000001<br>
kernel: R13: 0000000040000000 R14: 0000000000000001 R15: 0000000000000001<br>
kernel: FS:  0000000000000000(0000) GS:ffff88007db00000(0000) knlGS:0000000000000000<br>
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033<br>
kernel: CR2: 0000000000029ce8 CR3: 00000000783ac000 CR4: 00000000000006a0<br>
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000<br>
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400<br>
kernel: Call Trace:<br>
kernel:  add_memory+0x68/0x120<br>
kernel:  acpi_memory_device_add+0x134/0x2e0<br>
kernel:  acpi_bus_attach+0xd9/0x190<br>
kernel:  acpi_bus_scan+0x37/0x70<br>
kernel:  acpi_device_hotplug+0x389/0x4e0<br>
kernel:  acpi_hotplug_work_fn+0x1a/0x30<br>
kernel:  process_one_work+0x15f/0x350<br>
kernel:  worker_thread+0x49/0x3e0<br>
kernel:  kthread+0xf5/0x130<br>
kernel:  ? max_active_store+0x60/0x60<br>
kernel:  ? kthread_bind+0x10/0x10<br>
kernel:  ret_from_fork+0x35/0x40<br>
kernel: Modules linked in: af_packet(E) xt_tcpudp(E) ipt_REJECT(E) xt_conntrack(E) nf_conntrack(E) nf_defrag_ipv4(E) ip_set(E) nfnetlink(E) ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) ebtable_filter(E) ebtables(E) iptable_filter(E) ip_tables(E) x_tables(E) bochs_drm(E) ttm(E) drm_kms_helper(E) drm(E) virtio_net(E) net_failover(E) i2c_piix4(E) parport_pc(E) parport(E) failover(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) nfit(E) libnvdimm(E) button(E) pcspkr(E) btrfs(E) libcrc32c(E) xor(E) zstd_decompress(E) zstd_compress(E) xxhash(E) raid6_pq(E) sd_mod(E) ata_generic(E) ata_piix(E) ahci(E) libahci(E) virtio_pci(E) virtio_ring(E) virtio(E) serio_raw(E) libata(E) sg(E) scsi_mod(E) autofs4(E)<br>
kernel: CR2: 0000000000029ce8<br>
kernel: ---[ end trace be1a8c4d1824ebf4 ]---<br>
kernel: RIP: 0010:request_resource_and_merge+0x51/0x120<br>
kernel: Code: df e8 13 e6 ff ff c6 05 bc eb 50 01 00 48 85 c0 74 09 5b 5d 41 5c 41 5d 41 5e c3 4a 8b 04 e5 40 e6 00 82 48 c7 c7 d0 70 58 82 &lt;4c&gt; 8b a8 e8 9c 02 00 4d 89 ec 4c 03 a8 f8 9c 02 00 e8 89 aa 57 00<br>
kernel: RSP: 0018:ffffc90000367d48 EFLAGS: 00010246<br>
kernel: RAX: 0000000000000000 RBX: ffffffff81e4e060 RCX: 000000013fffffff<br>
kernel: RDX: 0000000100000000 RSI: ffff880077467580 RDI: ffffffff825870d0<br>
kernel: RBP: ffff880077467580 R08: ffff88007ffabcf0 R09: ffff880077467580<br>
kernel: R10: 0000000000000000 R11: ffff8800376eec09 R12: 0000000000000001<br>
kernel: R13: 0000000040000000 R14: 0000000000000001 R15: 0000000000000001<br>
kernel: FS:  0000000000000000(0000) GS:ffff88007db00000(0000) knlGS:0000000000000000<br>
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033<br>
kernel: CR2: 0000000000029ce8 CR3: 00000000783ac000 CR4: 00000000000006a0<br>
kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000<br>
kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400<br>
<br>
<br>
The problem is in this function you added:<br>
<br>
+static void merge_node_resources(int nid, struct resource *parent)<br>
+{<br>
+       struct resource *res;<br>
+       uint64_t start_addr;<br>
+       uint64_t end_addr;<br>
+       int ret;<br>
+<br>
+       start_addr = node_start_pfn(nid) &lt;&lt; PAGE_SHIFT;<br>
+       end_addr = node_end_pfn(nid) &lt;&lt; PAGE_SHIFT;<br>
<br>
<br>
node_start_pfn() calls NODE_DATA(nid), which then tries to get the node_data[] structure,<br>
and then try to dereference a value in there.<br>
This will only work for node&#39;s that are already online, but if you are adding memory<br>
to a new node, this will blow up.<br>
<br>
In the case we are adding memory from a node which is not onlined yet, we online it later on<br>
in add_memory_resource:<br>
<br>
add_memore_resource<br>
 __try_online_node<br>
  hotadd_new_pgdat<br>
<br>
<br>
static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)<br>
{<br>
        struct pglist_data *pgdat;<br>
        unsigned long start_pfn = PFN_DOWN(start);<br>
<br>
        pgdat = NODE_DATA(nid);<br>
        if (!pgdat) {<br>
                pgdat = arch_alloc_nodedata(nid);<br>
                if (!pgdat)<br>
                        return NULL;<br>
<br>
                arch_refresh_nodedata(nid, pgdat);<br>
        }<br>
        ...<br>
        ...<br>
<br>
I did not have time to think about a fix for that, so unless we come up with something,<br>
this will have to be reverted for 4.18.<br>
<br>
Thanks<br>
-- <br>
Oscar Salvador<br>
SUSE L3<br>
</blockquote></div></div></div>
Michal Hocko Aug. 10, 2018, 1 p.m. UTC | #8
On Fri 10-08-18 16:55:40, Rashmica Gupta wrote:
[...]
> Most memory hotplug/hotremove seems to be block or section based, and
> always adds and removes memory at the same place.

Yes and that is hard wired to the memory hotplug code. It is not easy to
make it work outside of section units restriction. So whatever your
memtrace is doing and if it relies on subsection hotplug it cannot
possibly work with the current code.

I didn't get to review your patch but if it is only needed for an
unmerged code I would rather incline to not merge it unless it is a
clear win to the resource subsystem. A report from Oscar shows that this
is not the case though.
Rashmica Gupta Aug. 10, 2018, 2:25 p.m. UTC | #9
On Fri, Aug 10, 2018 at 11:00 PM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 10-08-18 16:55:40, Rashmica Gupta wrote:
> [...]
>> Most memory hotplug/hotremove seems to be block or section based, and
>> always adds and removes memory at the same place.
>
> Yes and that is hard wired to the memory hotplug code. It is not easy to
> make it work outside of section units restriction. So whatever your
> memtrace is doing and if it relies on subsection hotplug it cannot
> possibly work with the current code.
>
> I didn't get to review your patch but if it is only needed for an
> unmerged code I would rather incline to not merge it unless it is a
> clear win to the resource subsystem. A report from Oscar shows that this
> is not the case though.
>

Yup, makes sense. I'll work on it and see if I can not break things.

> --
> Michal Hocko
> SUSE Labs
Oscar Salvador Aug. 10, 2018, 7:32 p.m. UTC | #10
On Sat, Aug 11, 2018 at 12:25:39AM +1000, Rashmica Gupta wrote:
> On Fri, Aug 10, 2018 at 11:00 PM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Fri 10-08-18 16:55:40, Rashmica Gupta wrote:
> > [...]
> >> Most memory hotplug/hotremove seems to be block or section based, and
> >> always adds and removes memory at the same place.
> >
> > Yes and that is hard wired to the memory hotplug code. It is not easy to
> > make it work outside of section units restriction. So whatever your
> > memtrace is doing and if it relies on subsection hotplug it cannot
> > possibly work with the current code.
> >
> > I didn't get to review your patch but if it is only needed for an
> > unmerged code I would rather incline to not merge it unless it is a
> > clear win to the resource subsystem. A report from Oscar shows that this
> > is not the case though.
> >
> 
> Yup, makes sense. I'll work on it and see if I can not break things.

In __case__ we really need this patch, I think that one way to fix this is
to only call merge_node_resources() in case the node is already online.
Something like this (completely untested):

+struct resource *request_resource_and_merge(struct resource *parent,
+                                           struct resource *new, int nid)
+{
+       struct resource *conflict;
+
+       conflict = request_resource_conflict(parent, new);
+
+       if (conflict)
+               return conflict;
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+       /* We do not need to merge any resources on a node that is being
+        * hot-added together with its memory.
+	 * The node will be allocated later.
+	 */
+       if (node_online(nid))
+       	merge_node_resources(nid, parent);
+#endif /* CONFIG_MEMORY_HOTREMOVE */

Although as Michal said, all memory-hotplug code is section-oriented, so
whatever it is that interacts with it should expect that.
Otherwise it can fail soon or later.
diff mbox series

Patch

diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 065f0b607373..9b972b37b0da 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -401,7 +401,8 @@  static enum bp_state reserve_additional_memory(void)
 	 * callers drop the mutex before trying again.
 	 */
 	mutex_unlock(&balloon_mutex);
-	rc = add_memory_resource(nid, resource, memhp_auto_online);
+	rc = add_memory_resource(nid, resource->start, resource_size(resource),
+				 memhp_auto_online);
 	mutex_lock(&balloon_mutex);
 
 	if (rc) {
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index da0ebaec25f0..f5b93a711e86 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -189,6 +189,8 @@  extern int allocate_resource(struct resource *root, struct resource *new,
 						       resource_size_t,
 						       resource_size_t),
 			     void *alignf_data);
+extern struct resource *request_resource_and_merge(struct resource *parent,
+						   struct resource *new, int nid);
 struct resource *lookup_resource(struct resource *root, resource_size_t start);
 int adjust_resource(struct resource *res, resource_size_t start,
 		    resource_size_t size);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 4e9828cda7a2..9c00f97c8cc6 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -322,7 +322,7 @@  static inline void remove_memory(int nid, u64 start, u64 size) {}
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
 		void *arg, int (*func)(struct memory_block *, void *));
 extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource, bool online);
+extern int add_memory_resource(int nid, u64 start, u64 size, bool online);
 extern int arch_add_memory(int nid, u64 start, u64 size,
 		struct vmem_altmap *altmap, bool want_memblock);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
diff --git a/kernel/resource.c b/kernel/resource.c
index 30e1bc68503b..a31d3f5bccb7 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1621,3 +1621,123 @@  static int __init strict_iomem(char *str)
 }
 
 __setup("iomem=", strict_iomem);
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+#ifdef CONFIG_MEMORY_HOTREMOVE
+/*
+ * Attempt to merge resource and it's sibling
+ */
+static int merge_resources(struct resource *res)
+{
+	struct resource *next;
+	struct resource *tmp;
+	uint64_t size;
+	int ret = -EINVAL;
+
+	next = res->sibling;
+
+	/*
+	 * Not sure how to handle two different children. So only attempt
+	 * to merge two resources if neither have children, only one has a
+	 * child or if both have the same child.
+	 */
+	if ((res->child && next->child) && (res->child != next->child))
+		return ret;
+
+	if (res->end + 1 != next->start)
+		return ret;
+
+	if (res->flags != next->flags)
+		return ret;
+
+	/* Update sibling and child of resource */
+	res->sibling = next->sibling;
+	tmp = res->child;
+	if (!res->child)
+		res->child = next->child;
+
+	size = next->end - res->start + 1;
+	ret = __adjust_resource(res, res->start, size);
+	if (ret) {
+		/* Failed so restore resource to original state */
+		res->sibling = next;
+		res->child = tmp;
+		return ret;
+	}
+
+	free_resource(next);
+
+	return ret;
+}
+
+/*
+ * Attempt to merge resources on the node
+ */
+static void merge_node_resources(int nid, struct resource *parent)
+{
+	struct resource *res;
+	uint64_t start_addr;
+	uint64_t end_addr;
+	int ret;
+
+	start_addr = node_start_pfn(nid) << PAGE_SHIFT;
+	end_addr = node_end_pfn(nid) << PAGE_SHIFT;
+
+	write_lock(&resource_lock);
+
+	/* Get the first resource */
+	res = parent->child;
+
+	while (res) {
+		/* Check that the resource is within the node */
+		if (res->start < start_addr) {
+			res = res->sibling;
+			continue;
+		}
+		/* Exit if sibling resource is past end of node */
+		if (res->sibling->end >= end_addr)
+			break;
+
+		ret = merge_resources(res);
+		if (!ret)
+			continue;
+		res = res->sibling;
+	}
+	write_unlock(&resource_lock);
+}
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+
+/**
+ * request_resource_and_merge() - request an I/O or memory resource for hot-add
+ * @parent: parent resource descriptor
+ * @new: resource descriptor desired by caller
+ * @nid: node id of the node we want the resource on
+ *
+ * If no conflict resource then attempt to merge resources on the node.
+ *
+ * This is intended to cleanup the fragmentation of resources that occurs when
+ * hot-removing memory (see release_mem_region_adjustable). If hot-removing is
+ * not enabled then there is no point trying to merge resources.
+ *
+ * Note that the inability to merge resources is not an error.
+ *
+ * Return: NULL for successful request of resource and conflict resource if
+ * there was a conflict.
+ */
+struct resource *request_resource_and_merge(struct resource *parent,
+					    struct resource *new, int nid)
+{
+	struct resource *conflict;
+
+	conflict = request_resource_conflict(parent, new);
+
+	if (conflict)
+		return conflict;
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	merge_node_resources(nid, parent);
+#endif /* CONFIG_MEMORY_HOTREMOVE */
+
+	return NULL;
+}
+#endif /* CONFIG_MEMORY_HOTPLUG */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7deb49f69e27..2e342f5ce322 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -97,7 +97,7 @@  void mem_hotplug_done(void)
 }
 
 /* add this memory to iomem resource */
-static struct resource *register_memory_resource(u64 start, u64 size)
+static struct resource *register_memory_resource(int nid, u64 start, u64 size)
 {
 	struct resource *res, *conflict;
 	res = kzalloc(sizeof(struct resource), GFP_KERNEL);
@@ -108,7 +108,7 @@  static struct resource *register_memory_resource(u64 start, u64 size)
 	res->start = start;
 	res->end = start + size - 1;
 	res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
-	conflict =  request_resource_conflict(&iomem_resource, res);
+	conflict =  request_resource_and_merge(&iomem_resource, res, nid);
 	if (conflict) {
 		if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY) {
 			pr_debug("Device unaddressable memory block "
@@ -122,11 +122,15 @@  static struct resource *register_memory_resource(u64 start, u64 size)
 	return res;
 }
 
-static void release_memory_resource(struct resource *res)
+static void release_memory_resource(struct resource *res, u64 start, u64 size)
 {
 	if (!res)
 		return;
+#ifdef CONFIG_MEMORY_HOTREMOVE
+	release_mem_region_adjustable(&iomem_resource, start, size);
+#else
 	release_resource(res);
+#endif
 	kfree(res);
 	return;
 }
@@ -1096,17 +1100,13 @@  static int online_memory_block(struct memory_block *mem, void *arg)
 }
 
 /* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
-int __ref add_memory_resource(int nid, struct resource *res, bool online)
+int __ref add_memory_resource(int nid, u64 start, u64 size, bool online)
 {
-	u64 start, size;
 	pg_data_t *pgdat = NULL;
 	bool new_pgdat;
 	bool new_node;
 	int ret;
 
-	start = res->start;
-	size = resource_size(res);
-
 	ret = check_hotplug_memory_range(start, size);
 	if (ret)
 		return ret;
@@ -1195,13 +1195,13 @@  int __ref add_memory(int nid, u64 start, u64 size)
 	struct resource *res;
 	int ret;
 
-	res = register_memory_resource(start, size);
+	res = register_memory_resource(nid, start, size);
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
-	ret = add_memory_resource(nid, res, memhp_auto_online);
+	ret = add_memory_resource(nid, start, size, memhp_auto_online);
 	if (ret < 0)
-		release_memory_resource(res);
+		release_memory_resource(res, start, size);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(add_memory);