diff mbox series

[v2,2/5] mm,memory_hotplug: Introduce MHP_VMEMMAP_FLAGS

Message ID 20190625075227.15193-3-osalvador@suse.de (mailing list archive)
State New, archived
Headers show
Series Allocate memmap from hotadded memory | expand

Commit Message

Oscar Salvador June 25, 2019, 7:52 a.m. UTC
This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
and prepares the callers that add memory to take a "flags" parameter.
This "flags" parameter will be evaluated later on in Patch#3
to init mhp_restrictions struct.

The callers are:

add_memory
__add_memory
add_memory_resource

Unfortunately, we do not have a single entry point to add memory, as depending
on the requisites of the caller, they want to hook up in different places,
(e.g: Xen reserve_additional_memory()), so we have to spread the parameter
in the three callers.

The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
in the way they allocate vmemmap pages within the memory blocks.

MHP_MEMMAP_MEMBLOCK:
	- With this flag, we will allocate vmemmap pages in each memory block.
	  This means that if we hot-add a range that spans multiple memory blocks,
	  we will use the beginning of each memory block for the vmemmap pages.
	  This strategy is good for cases where the caller wants the flexiblity
	  to hot-remove memory in a different granularity than when it was added.

	  E.g:
		We allocate a range (x,y], that spans 3 memory blocks, and given
		memory block size = 128MB.
		[memblock#0  ]
		[0 - 511 pfns      ] - vmemmaps for section#0
		[512 - 32767 pfns  ] - normal memory

		[memblock#1 ]
		[32768 - 33279 pfns] - vmemmaps for section#1
		[33280 - 65535 pfns] - normal memory

		[memblock#2 ]
		[65536 - 66047 pfns] - vmemmap for section#2
		[66048 - 98304 pfns] - normal memory

MHP_MEMMAP_DEVICE:
	- With this flag, we will store all vmemmap pages at the beginning of
	  hot-added memory.

	  E.g:
		We allocate a range (x,y], that spans 3 memory blocks, and given
		memory block size = 128MB.
		[memblock #0 ]
		[0 - 1533 pfns    ] - vmemmap for section#{0-2}
		[1534 - 98304 pfns] - normal memory

When using larger memory blocks (1GB or 2GB), the principle is the same.

Of course, MHP_MEMMAP_DEVICE is nicer when it comes to have a large contigous
area, while MHP_MEMMAP_MEMBLOCK allows us to have flexibility when removing the
memory.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 drivers/acpi/acpi_memhotplug.c |  2 +-
 drivers/base/memory.c          |  2 +-
 drivers/dax/kmem.c             |  2 +-
 drivers/hv/hv_balloon.c        |  2 +-
 drivers/s390/char/sclp_cmd.c   |  2 +-
 drivers/xen/balloon.c          |  2 +-
 include/linux/memory_hotplug.h | 22 +++++++++++++++++++---
 mm/memory_hotplug.c            | 10 +++++-----
 8 files changed, 30 insertions(+), 14 deletions(-)

Comments

David Hildenbrand June 25, 2019, 8:31 a.m. UTC | #1
On 25.06.19 09:52, Oscar Salvador wrote:
> This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
> and prepares the callers that add memory to take a "flags" parameter.
> This "flags" parameter will be evaluated later on in Patch#3
> to init mhp_restrictions struct.
> 
> The callers are:
> 
> add_memory
> __add_memory
> add_memory_resource
> 
> Unfortunately, we do not have a single entry point to add memory, as depending
> on the requisites of the caller, they want to hook up in different places,
> (e.g: Xen reserve_additional_memory()), so we have to spread the parameter
> in the three callers.
> 
> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
> in the way they allocate vmemmap pages within the memory blocks.
> 
> MHP_MEMMAP_MEMBLOCK:
> 	- With this flag, we will allocate vmemmap pages in each memory block.
> 	  This means that if we hot-add a range that spans multiple memory blocks,
> 	  we will use the beginning of each memory block for the vmemmap pages.
> 	  This strategy is good for cases where the caller wants the flexiblity
> 	  to hot-remove memory in a different granularity than when it was added.
> 
> 	  E.g:
> 		We allocate a range (x,y], that spans 3 memory blocks, and given
> 		memory block size = 128MB.
> 		[memblock#0  ]
> 		[0 - 511 pfns      ] - vmemmaps for section#0
> 		[512 - 32767 pfns  ] - normal memory
> 
> 		[memblock#1 ]
> 		[32768 - 33279 pfns] - vmemmaps for section#1
> 		[33280 - 65535 pfns] - normal memory
> 
> 		[memblock#2 ]
> 		[65536 - 66047 pfns] - vmemmap for section#2
> 		[66048 - 98304 pfns] - normal memory
> 
> MHP_MEMMAP_DEVICE:
> 	- With this flag, we will store all vmemmap pages at the beginning of
> 	  hot-added memory.
> 
> 	  E.g:
> 		We allocate a range (x,y], that spans 3 memory blocks, and given
> 		memory block size = 128MB.
> 		[memblock #0 ]
> 		[0 - 1533 pfns    ] - vmemmap for section#{0-2}
> 		[1534 - 98304 pfns] - normal memory
> 
> When using larger memory blocks (1GB or 2GB), the principle is the same.
> 
> Of course, MHP_MEMMAP_DEVICE is nicer when it comes to have a large contigous
> area, while MHP_MEMMAP_MEMBLOCK allows us to have flexibility when removing the
> memory.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  drivers/acpi/acpi_memhotplug.c |  2 +-
>  drivers/base/memory.c          |  2 +-
>  drivers/dax/kmem.c             |  2 +-
>  drivers/hv/hv_balloon.c        |  2 +-
>  drivers/s390/char/sclp_cmd.c   |  2 +-
>  drivers/xen/balloon.c          |  2 +-
>  include/linux/memory_hotplug.h | 22 +++++++++++++++++++---
>  mm/memory_hotplug.c            | 10 +++++-----
>  8 files changed, 30 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index db013dc21c02..860f84e82dd0 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -218,7 +218,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>  		if (node < 0)
>  			node = memory_add_physaddr_to_nid(info->start_addr);
>  
> -		result = __add_memory(node, info->start_addr, info->length);
> +		result = __add_memory(node, info->start_addr, info->length, 0);
>  
>  		/*
>  		 * If the memory block has been used by the kernel, add_memory()
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 07ba731beb42..ad9834b8b7f7 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -516,7 +516,7 @@ static ssize_t probe_store(struct device *dev, struct device_attribute *attr,
>  
>  	nid = memory_add_physaddr_to_nid(phys_addr);
>  	ret = __add_memory(nid, phys_addr,
> -			   MIN_MEMORY_BLOCK_SIZE * sections_per_block);
> +			   MIN_MEMORY_BLOCK_SIZE * sections_per_block, 0);
>  
>  	if (ret)
>  		goto out;
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 3d0a7e702c94..e159184e0ba0 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -65,7 +65,7 @@ int dev_dax_kmem_probe(struct device *dev)
>  	new_res->flags = IORESOURCE_SYSTEM_RAM;
>  	new_res->name = dev_name(dev);
>  
> -	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
> +	rc = add_memory(numa_node, new_res->start, resource_size(new_res), 0);
>  	if (rc) {
>  		release_resource(new_res);
>  		kfree(new_res);
> diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
> index 6fb4ea5f0304..beb92bc56186 100644
> --- a/drivers/hv/hv_balloon.c
> +++ b/drivers/hv/hv_balloon.c
> @@ -731,7 +731,7 @@ static void hv_mem_hot_add(unsigned long start, unsigned long size,
>  
>  		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
>  		ret = add_memory(nid, PFN_PHYS((start_pfn)),
> -				(HA_CHUNK << PAGE_SHIFT));
> +				(HA_CHUNK << PAGE_SHIFT), 0);
>  
>  		if (ret) {
>  			pr_err("hot_add memory failed error is %d\n", ret);
> diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
> index 37d42de06079..f61026c7db7e 100644
> --- a/drivers/s390/char/sclp_cmd.c
> +++ b/drivers/s390/char/sclp_cmd.c
> @@ -406,7 +406,7 @@ static void __init add_memory_merged(u16 rn)
>  	if (!size)
>  		goto skip_add;
>  	for (addr = start; addr < start + size; addr += block_size)
> -		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size);
> +		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size, 0);
>  skip_add:
>  	first_rn = rn;
>  	num = 1;
> diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
> index 37a36c6b9f93..33814b3513ca 100644
> --- a/drivers/xen/balloon.c
> +++ b/drivers/xen/balloon.c
> @@ -349,7 +349,7 @@ static enum bp_state reserve_additional_memory(void)
>  	mutex_unlock(&balloon_mutex);
>  	/* add_memory_resource() requires the device_hotplug lock */
>  	lock_device_hotplug();
> -	rc = add_memory_resource(nid, resource);
> +	rc = add_memory_resource(nid, resource, 0);
>  	unlock_device_hotplug();
>  	mutex_lock(&balloon_mutex);
>  
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 0b8a5e5ef2da..6fdbce9d04f9 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -54,6 +54,22 @@ enum {
>  };
>  
>  /*
> + * We want memmap (struct page array) to be allocated from the hotadded range.
> + * To do so, there are two possible ways depending on what the caller wants.
> + * 1) Allocate memmap pages per device (whole hot-added range)
> + * 2) Allocate memmap pages per memblock
> + * The former implies that we wil use the beginning of the hot-added range

s/wil/will/

> + * to store the memmap pages of the whole range, while the latter implies
> + * that we will use the beginning of each memblock to store its own memmap
> + * pages.
> + * Please note that only SPARSE_VMEMMAP implements this feature and some
> + * architectures might not support it even for that memory model (e.g. s390)

Probably rephrase to "This is only a hint, not a guarantee. Only
selected architectures support it with SPARSE_VMEMMAP."

> + */
> +#define MHP_MEMMAP_DEVICE	(1UL<<0)
> +#define MHP_MEMMAP_MEMBLOCK	(1UL<<1)
> +#define MHP_VMEMMAP_FLAGS	(MHP_MEMMAP_DEVICE|MHP_MEMMAP_MEMBLOCK)
> +
> +/*
>   * Restrictions for the memory hotplug:
>   * flags:  MHP_ flags
>   * altmap: alternative allocator for memmap array
> @@ -342,9 +358,9 @@ static inline void __remove_memory(int nid, u64 start, u64 size) {}
>  extern void __ref free_area_init_core_hotplug(int nid);
>  extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
>  		void *arg, int (*func)(struct memory_block *, void *));
> -extern int __add_memory(int nid, u64 start, u64 size);
> -extern int add_memory(int nid, u64 start, u64 size);
> -extern int add_memory_resource(int nid, struct resource *resource);
> +extern int __add_memory(int nid, u64 start, u64 size, unsigned long flags);
> +extern int add_memory(int nid, u64 start, u64 size, unsigned long flags);
> +extern int add_memory_resource(int nid, struct resource *resource, unsigned long flags);
>  extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
>  		unsigned long nr_pages, struct vmem_altmap *altmap);
>  extern bool is_memblock_offlined(struct memory_block *mem);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 4e8e65954f31..e4e3baa6eaa7 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1057,7 +1057,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>   *
>   * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
>   */
> -int __ref add_memory_resource(int nid, struct resource *res)
> +int __ref add_memory_resource(int nid, struct resource *res, unsigned long flags)
>  {
>  	struct mhp_restrictions restrictions = {};
>  	u64 start, size;
> @@ -1135,7 +1135,7 @@ int __ref add_memory_resource(int nid, struct resource *res)
>  }
>  
>  /* requires device_hotplug_lock, see add_memory_resource() */
> -int __ref __add_memory(int nid, u64 start, u64 size)
> +int __ref __add_memory(int nid, u64 start, u64 size, unsigned long flags)
>  {
>  	struct resource *res;
>  	int ret;
> @@ -1144,18 +1144,18 @@ int __ref __add_memory(int nid, u64 start, u64 size)
>  	if (IS_ERR(res))
>  		return PTR_ERR(res);
>  
> -	ret = add_memory_resource(nid, res);
> +	ret = add_memory_resource(nid, res, flags);
>  	if (ret < 0)
>  		release_memory_resource(res);
>  	return ret;
>  }
>  
> -int add_memory(int nid, u64 start, u64 size)
> +int add_memory(int nid, u64 start, u64 size, unsigned long flags)
>  {
>  	int rc;
>  
>  	lock_device_hotplug();
> -	rc = __add_memory(nid, start, size);
> +	rc = __add_memory(nid, start, size, flags);
>  	unlock_device_hotplug();
>  
>  	return rc;
> 

Apart from that, looks good to me.

Reviewed-by: David Hildenbrand <david@redhat.com>
Dan Williams July 24, 2019, 8:11 p.m. UTC | #2
On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
> and prepares the callers that add memory to take a "flags" parameter.
> This "flags" parameter will be evaluated later on in Patch#3
> to init mhp_restrictions struct.
>
> The callers are:
>
> add_memory
> __add_memory
> add_memory_resource
>
> Unfortunately, we do not have a single entry point to add memory, as depending
> on the requisites of the caller, they want to hook up in different places,
> (e.g: Xen reserve_additional_memory()), so we have to spread the parameter
> in the three callers.
>
> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
> in the way they allocate vmemmap pages within the memory blocks.
>
> MHP_MEMMAP_MEMBLOCK:
>         - With this flag, we will allocate vmemmap pages in each memory block.
>           This means that if we hot-add a range that spans multiple memory blocks,
>           we will use the beginning of each memory block for the vmemmap pages.
>           This strategy is good for cases where the caller wants the flexiblity
>           to hot-remove memory in a different granularity than when it was added.
>
>           E.g:
>                 We allocate a range (x,y], that spans 3 memory blocks, and given
>                 memory block size = 128MB.
>                 [memblock#0  ]
>                 [0 - 511 pfns      ] - vmemmaps for section#0
>                 [512 - 32767 pfns  ] - normal memory
>
>                 [memblock#1 ]
>                 [32768 - 33279 pfns] - vmemmaps for section#1
>                 [33280 - 65535 pfns] - normal memory
>
>                 [memblock#2 ]
>                 [65536 - 66047 pfns] - vmemmap for section#2
>                 [66048 - 98304 pfns] - normal memory
>
> MHP_MEMMAP_DEVICE:
>         - With this flag, we will store all vmemmap pages at the beginning of
>           hot-added memory.
>
>           E.g:
>                 We allocate a range (x,y], that spans 3 memory blocks, and given
>                 memory block size = 128MB.
>                 [memblock #0 ]
>                 [0 - 1533 pfns    ] - vmemmap for section#{0-2}
>                 [1534 - 98304 pfns] - normal memory
>
> When using larger memory blocks (1GB or 2GB), the principle is the same.
>
> Of course, MHP_MEMMAP_DEVICE is nicer when it comes to have a large contigous
> area, while MHP_MEMMAP_MEMBLOCK allows us to have flexibility when removing the
> memory.

Concept and patch looks good to me, but I don't quite like the
proliferation of the _DEVICE naming, in theory it need not necessarily
be ZONE_DEVICE that is the only user of that flag. I also think it
might be useful to assign a flag for the default 'allocate from RAM'
case, just so the code is explicit. So, how about:

MHP_MEMMAP_PAGE_ALLOC
MHP_MEMMAP_MEMBLOCK
MHP_MEMMAP_RESERVED

...for the 3 cases?

Other than that, feel free to add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Oscar Salvador July 24, 2019, 9:36 p.m. UTC | #3
On 2019-07-24 22:11, Dan Williams wrote:
> On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <osalvador@suse.de> 
> wrote:
>> 
>> This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
>> and prepares the callers that add memory to take a "flags" parameter.
>> This "flags" parameter will be evaluated later on in Patch#3
>> to init mhp_restrictions struct.
>> 
>> The callers are:
>> 
>> add_memory
>> __add_memory
>> add_memory_resource
>> 
>> Unfortunately, we do not have a single entry point to add memory, as 
>> depending
>> on the requisites of the caller, they want to hook up in different 
>> places,
>> (e.g: Xen reserve_additional_memory()), so we have to spread the 
>> parameter
>> in the three callers.
>> 
>> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and 
>> only differ
>> in the way they allocate vmemmap pages within the memory blocks.
>> 
>> MHP_MEMMAP_MEMBLOCK:
>>         - With this flag, we will allocate vmemmap pages in each 
>> memory block.
>>           This means that if we hot-add a range that spans multiple 
>> memory blocks,
>>           we will use the beginning of each memory block for the 
>> vmemmap pages.
>>           This strategy is good for cases where the caller wants the 
>> flexiblity
>>           to hot-remove memory in a different granularity than when it 
>> was added.
>> 
>>           E.g:
>>                 We allocate a range (x,y], that spans 3 memory blocks, 
>> and given
>>                 memory block size = 128MB.
>>                 [memblock#0  ]
>>                 [0 - 511 pfns      ] - vmemmaps for section#0
>>                 [512 - 32767 pfns  ] - normal memory
>> 
>>                 [memblock#1 ]
>>                 [32768 - 33279 pfns] - vmemmaps for section#1
>>                 [33280 - 65535 pfns] - normal memory
>> 
>>                 [memblock#2 ]
>>                 [65536 - 66047 pfns] - vmemmap for section#2
>>                 [66048 - 98304 pfns] - normal memory
>> 
>> MHP_MEMMAP_DEVICE:
>>         - With this flag, we will store all vmemmap pages at the 
>> beginning of
>>           hot-added memory.
>> 
>>           E.g:
>>                 We allocate a range (x,y], that spans 3 memory blocks, 
>> and given
>>                 memory block size = 128MB.
>>                 [memblock #0 ]
>>                 [0 - 1533 pfns    ] - vmemmap for section#{0-2}
>>                 [1534 - 98304 pfns] - normal memory
>> 
>> When using larger memory blocks (1GB or 2GB), the principle is the 
>> same.
>> 
>> Of course, MHP_MEMMAP_DEVICE is nicer when it comes to have a large 
>> contigous
>> area, while MHP_MEMMAP_MEMBLOCK allows us to have flexibility when 
>> removing the
>> memory.
> 
> Concept and patch looks good to me, but I don't quite like the
> proliferation of the _DEVICE naming, in theory it need not necessarily
> be ZONE_DEVICE that is the only user of that flag. I also think it
> might be useful to assign a flag for the default 'allocate from RAM'
> case, just so the code is explicit. So, how about:
> 
> MHP_MEMMAP_PAGE_ALLOC
> MHP_MEMMAP_MEMBLOCK
> MHP_MEMMAP_RESERVED
> 
> ...for the 3 cases?
> 
> Other than that, feel free to add:
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  HI Dan,

I'll be sending V3 tomorrow, with some major rewrites (more simplified).

Thanks
Oscar Salvador July 25, 2019, 9:27 a.m. UTC | #4
On Wed, Jul 24, 2019 at 01:11:52PM -0700, Dan Williams wrote:
> On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <osalvador@suse.de> wrote:
> >
> > This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
> > and prepares the callers that add memory to take a "flags" parameter.
> > This "flags" parameter will be evaluated later on in Patch#3
> > to init mhp_restrictions struct.
> >
> > The callers are:
> >
> > add_memory
> > __add_memory
> > add_memory_resource
> >
> > Unfortunately, we do not have a single entry point to add memory, as depending
> > on the requisites of the caller, they want to hook up in different places,
> > (e.g: Xen reserve_additional_memory()), so we have to spread the parameter
> > in the three callers.
> >
> > The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
> > in the way they allocate vmemmap pages within the memory blocks.
> >
> > MHP_MEMMAP_MEMBLOCK:
> >         - With this flag, we will allocate vmemmap pages in each memory block.
> >           This means that if we hot-add a range that spans multiple memory blocks,
> >           we will use the beginning of each memory block for the vmemmap pages.
> >           This strategy is good for cases where the caller wants the flexiblity
> >           to hot-remove memory in a different granularity than when it was added.
> >
> >           E.g:
> >                 We allocate a range (x,y], that spans 3 memory blocks, and given
> >                 memory block size = 128MB.
> >                 [memblock#0  ]
> >                 [0 - 511 pfns      ] - vmemmaps for section#0
> >                 [512 - 32767 pfns  ] - normal memory
> >
> >                 [memblock#1 ]
> >                 [32768 - 33279 pfns] - vmemmaps for section#1
> >                 [33280 - 65535 pfns] - normal memory
> >
> >                 [memblock#2 ]
> >                 [65536 - 66047 pfns] - vmemmap for section#2
> >                 [66048 - 98304 pfns] - normal memory
> >
> > MHP_MEMMAP_DEVICE:
> >         - With this flag, we will store all vmemmap pages at the beginning of
> >           hot-added memory.
> >
> >           E.g:
> >                 We allocate a range (x,y], that spans 3 memory blocks, and given
> >                 memory block size = 128MB.
> >                 [memblock #0 ]
> >                 [0 - 1533 pfns    ] - vmemmap for section#{0-2}
> >                 [1534 - 98304 pfns] - normal memory
> >
> > When using larger memory blocks (1GB or 2GB), the principle is the same.
> >
> > Of course, MHP_MEMMAP_DEVICE is nicer when it comes to have a large contigous
> > area, while MHP_MEMMAP_MEMBLOCK allows us to have flexibility when removing the
> > memory.
> 
> Concept and patch looks good to me, but I don't quite like the
> proliferation of the _DEVICE naming, in theory it need not necessarily
> be ZONE_DEVICE that is the only user of that flag. I also think it
> might be useful to assign a flag for the default 'allocate from RAM'
> case, just so the code is explicit. So, how about:

Well, MHP_MEMMAP_DEVICE is not tied to ZONE_DEVICE.
MHP_MEMMAP_DEVICE was chosen to make a difference between:

 * allocate memmap pages for the whole memory-device
 * allocate memmap pages on each memoryblock that this memory-device spans

> 
> MHP_MEMMAP_PAGE_ALLOC
> MHP_MEMMAP_MEMBLOCK
> MHP_MEMMAP_RESERVED
> 
> ...for the 3 cases?
> 
> Other than that, feel free to add:
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
David Hildenbrand July 25, 2019, 9:30 a.m. UTC | #5
On 25.07.19 11:27, Oscar Salvador wrote:
> On Wed, Jul 24, 2019 at 01:11:52PM -0700, Dan Williams wrote:
>> On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <osalvador@suse.de> wrote:
>>>
>>> This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
>>> and prepares the callers that add memory to take a "flags" parameter.
>>> This "flags" parameter will be evaluated later on in Patch#3
>>> to init mhp_restrictions struct.
>>>
>>> The callers are:
>>>
>>> add_memory
>>> __add_memory
>>> add_memory_resource
>>>
>>> Unfortunately, we do not have a single entry point to add memory, as depending
>>> on the requisites of the caller, they want to hook up in different places,
>>> (e.g: Xen reserve_additional_memory()), so we have to spread the parameter
>>> in the three callers.
>>>
>>> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
>>> in the way they allocate vmemmap pages within the memory blocks.
>>>
>>> MHP_MEMMAP_MEMBLOCK:
>>>         - With this flag, we will allocate vmemmap pages in each memory block.
>>>           This means that if we hot-add a range that spans multiple memory blocks,
>>>           we will use the beginning of each memory block for the vmemmap pages.
>>>           This strategy is good for cases where the caller wants the flexiblity
>>>           to hot-remove memory in a different granularity than when it was added.
>>>
>>>           E.g:
>>>                 We allocate a range (x,y], that spans 3 memory blocks, and given
>>>                 memory block size = 128MB.
>>>                 [memblock#0  ]
>>>                 [0 - 511 pfns      ] - vmemmaps for section#0
>>>                 [512 - 32767 pfns  ] - normal memory
>>>
>>>                 [memblock#1 ]
>>>                 [32768 - 33279 pfns] - vmemmaps for section#1
>>>                 [33280 - 65535 pfns] - normal memory
>>>
>>>                 [memblock#2 ]
>>>                 [65536 - 66047 pfns] - vmemmap for section#2
>>>                 [66048 - 98304 pfns] - normal memory
>>>
>>> MHP_MEMMAP_DEVICE:
>>>         - With this flag, we will store all vmemmap pages at the beginning of
>>>           hot-added memory.
>>>
>>>           E.g:
>>>                 We allocate a range (x,y], that spans 3 memory blocks, and given
>>>                 memory block size = 128MB.
>>>                 [memblock #0 ]
>>>                 [0 - 1533 pfns    ] - vmemmap for section#{0-2}
>>>                 [1534 - 98304 pfns] - normal memory
>>>
>>> When using larger memory blocks (1GB or 2GB), the principle is the same.
>>>
>>> Of course, MHP_MEMMAP_DEVICE is nicer when it comes to have a large contigous
>>> area, while MHP_MEMMAP_MEMBLOCK allows us to have flexibility when removing the
>>> memory.
>>
>> Concept and patch looks good to me, but I don't quite like the
>> proliferation of the _DEVICE naming, in theory it need not necessarily
>> be ZONE_DEVICE that is the only user of that flag. I also think it
>> might be useful to assign a flag for the default 'allocate from RAM'
>> case, just so the code is explicit. So, how about:
> 
> Well, MHP_MEMMAP_DEVICE is not tied to ZONE_DEVICE.
> MHP_MEMMAP_DEVICE was chosen to make a difference between:
> 
>  * allocate memmap pages for the whole memory-device
>  * allocate memmap pages on each memoryblock that this memory-device spans

I agree that DEVICE is misleading here, you are assuming a one-to-one
mapping between a device and add_memory(). You are actually taliing
about "allocate a single chunk of mmap pages for the whole memory range
that is added - which could consist of multiple memory blocks".
Oscar Salvador July 25, 2019, 9:40 a.m. UTC | #6
On Thu, Jul 25, 2019 at 11:30:23AM +0200, David Hildenbrand wrote:
> On 25.07.19 11:27, Oscar Salvador wrote:
> > On Wed, Jul 24, 2019 at 01:11:52PM -0700, Dan Williams wrote:
> >> On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <osalvador@suse.de> wrote:
> >>>
> >>> This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
> >>> and prepares the callers that add memory to take a "flags" parameter.
> >>> This "flags" parameter will be evaluated later on in Patch#3
> >>> to init mhp_restrictions struct.
> >>>
> >>> The callers are:
> >>>
> >>> add_memory
> >>> __add_memory
> >>> add_memory_resource
> >>>
> >>> Unfortunately, we do not have a single entry point to add memory, as depending
> >>> on the requisites of the caller, they want to hook up in different places,
> >>> (e.g: Xen reserve_additional_memory()), so we have to spread the parameter
> >>> in the three callers.
> >>>
> >>> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
> >>> in the way they allocate vmemmap pages within the memory blocks.
> >>>
> >>> MHP_MEMMAP_MEMBLOCK:
> >>>         - With this flag, we will allocate vmemmap pages in each memory block.
> >>>           This means that if we hot-add a range that spans multiple memory blocks,
> >>>           we will use the beginning of each memory block for the vmemmap pages.
> >>>           This strategy is good for cases where the caller wants the flexiblity
> >>>           to hot-remove memory in a different granularity than when it was added.
> >>>
> >>>           E.g:
> >>>                 We allocate a range (x,y], that spans 3 memory blocks, and given
> >>>                 memory block size = 128MB.
> >>>                 [memblock#0  ]
> >>>                 [0 - 511 pfns      ] - vmemmaps for section#0
> >>>                 [512 - 32767 pfns  ] - normal memory
> >>>
> >>>                 [memblock#1 ]
> >>>                 [32768 - 33279 pfns] - vmemmaps for section#1
> >>>                 [33280 - 65535 pfns] - normal memory
> >>>
> >>>                 [memblock#2 ]
> >>>                 [65536 - 66047 pfns] - vmemmap for section#2
> >>>                 [66048 - 98304 pfns] - normal memory
> >>>
> >>> MHP_MEMMAP_DEVICE:
> >>>         - With this flag, we will store all vmemmap pages at the beginning of
> >>>           hot-added memory.
> >>>
> >>>           E.g:
> >>>                 We allocate a range (x,y], that spans 3 memory blocks, and given
> >>>                 memory block size = 128MB.
> >>>                 [memblock #0 ]
> >>>                 [0 - 1533 pfns    ] - vmemmap for section#{0-2}
> >>>                 [1534 - 98304 pfns] - normal memory
> >>>
> >>> When using larger memory blocks (1GB or 2GB), the principle is the same.
> >>>
> >>> Of course, MHP_MEMMAP_DEVICE is nicer when it comes to have a large contigous
> >>> area, while MHP_MEMMAP_MEMBLOCK allows us to have flexibility when removing the
> >>> memory.
> >>
> >> Concept and patch looks good to me, but I don't quite like the
> >> proliferation of the _DEVICE naming, in theory it need not necessarily
> >> be ZONE_DEVICE that is the only user of that flag. I also think it
> >> might be useful to assign a flag for the default 'allocate from RAM'
> >> case, just so the code is explicit. So, how about:
> > 
> > Well, MHP_MEMMAP_DEVICE is not tied to ZONE_DEVICE.
> > MHP_MEMMAP_DEVICE was chosen to make a difference between:
> > 
> >  * allocate memmap pages for the whole memory-device
> >  * allocate memmap pages on each memoryblock that this memory-device spans
> 
> I agree that DEVICE is misleading here, you are assuming a one-to-one
> mapping between a device and add_memory(). You are actually taliing
> about "allocate a single chunk of mmap pages for the whole memory range
> that is added - which could consist of multiple memory blocks".

Well, I could not come up with a better name.

MHP_MEMMAP_ALL?
MHP_MEMMAP_WHOLE?

I will send v3 shortly and then we can think of a better name.

> 
> -- 
> 
> Thanks,
> 
> David / dhildenb
>
David Hildenbrand July 25, 2019, 10:04 a.m. UTC | #7
On 25.07.19 11:40, Oscar Salvador wrote:
> On Thu, Jul 25, 2019 at 11:30:23AM +0200, David Hildenbrand wrote:
>> On 25.07.19 11:27, Oscar Salvador wrote:
>>> On Wed, Jul 24, 2019 at 01:11:52PM -0700, Dan Williams wrote:
>>>> On Tue, Jun 25, 2019 at 12:53 AM Oscar Salvador <osalvador@suse.de> wrote:
>>>>>
>>>>> This patch introduces MHP_MEMMAP_DEVICE and MHP_MEMMAP_MEMBLOCK flags,
>>>>> and prepares the callers that add memory to take a "flags" parameter.
>>>>> This "flags" parameter will be evaluated later on in Patch#3
>>>>> to init mhp_restrictions struct.
>>>>>
>>>>> The callers are:
>>>>>
>>>>> add_memory
>>>>> __add_memory
>>>>> add_memory_resource
>>>>>
>>>>> Unfortunately, we do not have a single entry point to add memory, as depending
>>>>> on the requisites of the caller, they want to hook up in different places,
>>>>> (e.g: Xen reserve_additional_memory()), so we have to spread the parameter
>>>>> in the three callers.
>>>>>
>>>>> The flags are either MHP_MEMMAP_DEVICE or MHP_MEMMAP_MEMBLOCK, and only differ
>>>>> in the way they allocate vmemmap pages within the memory blocks.
>>>>>
>>>>> MHP_MEMMAP_MEMBLOCK:
>>>>>         - With this flag, we will allocate vmemmap pages in each memory block.
>>>>>           This means that if we hot-add a range that spans multiple memory blocks,
>>>>>           we will use the beginning of each memory block for the vmemmap pages.
>>>>>           This strategy is good for cases where the caller wants the flexiblity
>>>>>           to hot-remove memory in a different granularity than when it was added.
>>>>>
>>>>>           E.g:
>>>>>                 We allocate a range (x,y], that spans 3 memory blocks, and given
>>>>>                 memory block size = 128MB.
>>>>>                 [memblock#0  ]
>>>>>                 [0 - 511 pfns      ] - vmemmaps for section#0
>>>>>                 [512 - 32767 pfns  ] - normal memory
>>>>>
>>>>>                 [memblock#1 ]
>>>>>                 [32768 - 33279 pfns] - vmemmaps for section#1
>>>>>                 [33280 - 65535 pfns] - normal memory
>>>>>
>>>>>                 [memblock#2 ]
>>>>>                 [65536 - 66047 pfns] - vmemmap for section#2
>>>>>                 [66048 - 98304 pfns] - normal memory
>>>>>
>>>>> MHP_MEMMAP_DEVICE:
>>>>>         - With this flag, we will store all vmemmap pages at the beginning of
>>>>>           hot-added memory.
>>>>>
>>>>>           E.g:
>>>>>                 We allocate a range (x,y], that spans 3 memory blocks, and given
>>>>>                 memory block size = 128MB.
>>>>>                 [memblock #0 ]
>>>>>                 [0 - 1533 pfns    ] - vmemmap for section#{0-2}
>>>>>                 [1534 - 98304 pfns] - normal memory
>>>>>
>>>>> When using larger memory blocks (1GB or 2GB), the principle is the same.
>>>>>
>>>>> Of course, MHP_MEMMAP_DEVICE is nicer when it comes to have a large contigous
>>>>> area, while MHP_MEMMAP_MEMBLOCK allows us to have flexibility when removing the
>>>>> memory.
>>>>
>>>> Concept and patch looks good to me, but I don't quite like the
>>>> proliferation of the _DEVICE naming, in theory it need not necessarily
>>>> be ZONE_DEVICE that is the only user of that flag. I also think it
>>>> might be useful to assign a flag for the default 'allocate from RAM'
>>>> case, just so the code is explicit. So, how about:
>>>
>>> Well, MHP_MEMMAP_DEVICE is not tied to ZONE_DEVICE.
>>> MHP_MEMMAP_DEVICE was chosen to make a difference between:
>>>
>>>  * allocate memmap pages for the whole memory-device
>>>  * allocate memmap pages on each memoryblock that this memory-device spans
>>
>> I agree that DEVICE is misleading here, you are assuming a one-to-one
>> mapping between a device and add_memory(). You are actually taliing
>> about "allocate a single chunk of mmap pages for the whole memory range
>> that is added - which could consist of multiple memory blocks".
> 
> Well, I could not come up with a better name.
> 
> MHP_MEMMAP_ALL?
> MHP_MEMMAP_WHOLE?

As I said somewhere already (as far as I recall), one mode would be
sufficient. If you want per memblock, add the memory in memblock
granularity.

So having a MHP_MEMMAP_ON_MEMORY that allocates it in one chunk would be
sufficient for the current use cases (DIMMs, Hyper-V).

MHP_MEMMAP_ON_MEMORY: Allocate the memmap for the added memory in one
chunk from the beginning of the added memory. This piece of memory will
be accessed and used even before the memory is onlined.

Of course, if we want to make it configurable (e.g., for ACPI) it would
be a different story. But for now this isn't really needed as far as I
can tell.
Oscar Salvador July 25, 2019, 10:13 a.m. UTC | #8
On Thu, Jul 25, 2019 at 12:04:08PM +0200, David Hildenbrand wrote:
> As I said somewhere already (as far as I recall), one mode would be
> sufficient. If you want per memblock, add the memory in memblock
> granularity.
> 
> So having a MHP_MEMMAP_ON_MEMORY that allocates it in one chunk would be
> sufficient for the current use cases (DIMMs, Hyper-V).
> 
> MHP_MEMMAP_ON_MEMORY: Allocate the memmap for the added memory in one
> chunk from the beginning of the added memory. This piece of memory will
> be accessed and used even before the memory is onlined.

This is what I had in my early versions of the patchset, but I do remember
that Michal suggested to let the caller specify if it wants the memmaps
to be allocated per memblock, or per whole-range.

I still think it makes somse sense, you can just pass a large chunk
(spanning multiple memory-blocks) at once and yet specify to allocate
it per memory-blocks.

Of course, I also agree that having only one mode would ease things
(not that much as v3 does not suppose that difference wrt. range vs
memory-block).
David Hildenbrand July 25, 2019, 10:15 a.m. UTC | #9
On 25.07.19 12:13, Oscar Salvador wrote:
> On Thu, Jul 25, 2019 at 12:04:08PM +0200, David Hildenbrand wrote:
>> As I said somewhere already (as far as I recall), one mode would be
>> sufficient. If you want per memblock, add the memory in memblock
>> granularity.
>>
>> So having a MHP_MEMMAP_ON_MEMORY that allocates it in one chunk would be
>> sufficient for the current use cases (DIMMs, Hyper-V).
>>
>> MHP_MEMMAP_ON_MEMORY: Allocate the memmap for the added memory in one
>> chunk from the beginning of the added memory. This piece of memory will
>> be accessed and used even before the memory is onlined.
> 
> This is what I had in my early versions of the patchset, but I do remember
> that Michal suggested to let the caller specify if it wants the memmaps
> to be allocated per memblock, or per whole-range.
> 
> I still think it makes somse sense, you can just pass a large chunk
> (spanning multiple memory-blocks) at once and yet specify to allocate
> it per memory-blocks.
> 
> Of course, I also agree that having only one mode would ease things
> (not that much as v3 does not suppose that difference wrt. range vs
> memory-block).

I prefer simplicity. No user, no implementation. We can easily add this
later on if there is a good reason/user.
diff mbox series

Patch

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index db013dc21c02..860f84e82dd0 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -218,7 +218,7 @@  static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (node < 0)
 			node = memory_add_physaddr_to_nid(info->start_addr);
 
-		result = __add_memory(node, info->start_addr, info->length);
+		result = __add_memory(node, info->start_addr, info->length, 0);
 
 		/*
 		 * If the memory block has been used by the kernel, add_memory()
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 07ba731beb42..ad9834b8b7f7 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -516,7 +516,7 @@  static ssize_t probe_store(struct device *dev, struct device_attribute *attr,
 
 	nid = memory_add_physaddr_to_nid(phys_addr);
 	ret = __add_memory(nid, phys_addr,
-			   MIN_MEMORY_BLOCK_SIZE * sections_per_block);
+			   MIN_MEMORY_BLOCK_SIZE * sections_per_block, 0);
 
 	if (ret)
 		goto out;
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 3d0a7e702c94..e159184e0ba0 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -65,7 +65,7 @@  int dev_dax_kmem_probe(struct device *dev)
 	new_res->flags = IORESOURCE_SYSTEM_RAM;
 	new_res->name = dev_name(dev);
 
-	rc = add_memory(numa_node, new_res->start, resource_size(new_res));
+	rc = add_memory(numa_node, new_res->start, resource_size(new_res), 0);
 	if (rc) {
 		release_resource(new_res);
 		kfree(new_res);
diff --git a/drivers/hv/hv_balloon.c b/drivers/hv/hv_balloon.c
index 6fb4ea5f0304..beb92bc56186 100644
--- a/drivers/hv/hv_balloon.c
+++ b/drivers/hv/hv_balloon.c
@@ -731,7 +731,7 @@  static void hv_mem_hot_add(unsigned long start, unsigned long size,
 
 		nid = memory_add_physaddr_to_nid(PFN_PHYS(start_pfn));
 		ret = add_memory(nid, PFN_PHYS((start_pfn)),
-				(HA_CHUNK << PAGE_SHIFT));
+				(HA_CHUNK << PAGE_SHIFT), 0);
 
 		if (ret) {
 			pr_err("hot_add memory failed error is %d\n", ret);
diff --git a/drivers/s390/char/sclp_cmd.c b/drivers/s390/char/sclp_cmd.c
index 37d42de06079..f61026c7db7e 100644
--- a/drivers/s390/char/sclp_cmd.c
+++ b/drivers/s390/char/sclp_cmd.c
@@ -406,7 +406,7 @@  static void __init add_memory_merged(u16 rn)
 	if (!size)
 		goto skip_add;
 	for (addr = start; addr < start + size; addr += block_size)
-		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size);
+		add_memory(numa_pfn_to_nid(PFN_DOWN(addr)), addr, block_size, 0);
 skip_add:
 	first_rn = rn;
 	num = 1;
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 37a36c6b9f93..33814b3513ca 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -349,7 +349,7 @@  static enum bp_state reserve_additional_memory(void)
 	mutex_unlock(&balloon_mutex);
 	/* add_memory_resource() requires the device_hotplug lock */
 	lock_device_hotplug();
-	rc = add_memory_resource(nid, resource);
+	rc = add_memory_resource(nid, resource, 0);
 	unlock_device_hotplug();
 	mutex_lock(&balloon_mutex);
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 0b8a5e5ef2da..6fdbce9d04f9 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -54,6 +54,22 @@  enum {
 };
 
 /*
+ * We want memmap (struct page array) to be allocated from the hotadded range.
+ * To do so, there are two possible ways depending on what the caller wants.
+ * 1) Allocate memmap pages per device (whole hot-added range)
+ * 2) Allocate memmap pages per memblock
+ * The former implies that we wil use the beginning of the hot-added range
+ * to store the memmap pages of the whole range, while the latter implies
+ * that we will use the beginning of each memblock to store its own memmap
+ * pages.
+ * Please note that only SPARSE_VMEMMAP implements this feature and some
+ * architectures might not support it even for that memory model (e.g. s390)
+ */
+#define MHP_MEMMAP_DEVICE	(1UL<<0)
+#define MHP_MEMMAP_MEMBLOCK	(1UL<<1)
+#define MHP_VMEMMAP_FLAGS	(MHP_MEMMAP_DEVICE|MHP_MEMMAP_MEMBLOCK)
+
+/*
  * Restrictions for the memory hotplug:
  * flags:  MHP_ flags
  * altmap: alternative allocator for memmap array
@@ -342,9 +358,9 @@  static inline void __remove_memory(int nid, u64 start, u64 size) {}
 extern void __ref free_area_init_core_hotplug(int nid);
 extern int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
 		void *arg, int (*func)(struct memory_block *, void *));
-extern int __add_memory(int nid, u64 start, u64 size);
-extern int add_memory(int nid, u64 start, u64 size);
-extern int add_memory_resource(int nid, struct resource *resource);
+extern int __add_memory(int nid, u64 start, u64 size, unsigned long flags);
+extern int add_memory(int nid, u64 start, u64 size, unsigned long flags);
+extern int add_memory_resource(int nid, struct resource *resource, unsigned long flags);
 extern void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
 		unsigned long nr_pages, struct vmem_altmap *altmap);
 extern bool is_memblock_offlined(struct memory_block *mem);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 4e8e65954f31..e4e3baa6eaa7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1057,7 +1057,7 @@  static int online_memory_block(struct memory_block *mem, void *arg)
  *
  * we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG
  */
-int __ref add_memory_resource(int nid, struct resource *res)
+int __ref add_memory_resource(int nid, struct resource *res, unsigned long flags)
 {
 	struct mhp_restrictions restrictions = {};
 	u64 start, size;
@@ -1135,7 +1135,7 @@  int __ref add_memory_resource(int nid, struct resource *res)
 }
 
 /* requires device_hotplug_lock, see add_memory_resource() */
-int __ref __add_memory(int nid, u64 start, u64 size)
+int __ref __add_memory(int nid, u64 start, u64 size, unsigned long flags)
 {
 	struct resource *res;
 	int ret;
@@ -1144,18 +1144,18 @@  int __ref __add_memory(int nid, u64 start, u64 size)
 	if (IS_ERR(res))
 		return PTR_ERR(res);
 
-	ret = add_memory_resource(nid, res);
+	ret = add_memory_resource(nid, res, flags);
 	if (ret < 0)
 		release_memory_resource(res);
 	return ret;
 }
 
-int add_memory(int nid, u64 start, u64 size)
+int add_memory(int nid, u64 start, u64 size, unsigned long flags)
 {
 	int rc;
 
 	lock_device_hotplug();
-	rc = __add_memory(nid, start, size);
+	rc = __add_memory(nid, start, size, flags);
 	unlock_device_hotplug();
 
 	return rc;