diff mbox

[v3,1/2] mm/sparse: add sparse_init_nid()

Message ID 20180702020417.21281-2-pasha.tatashin@oracle.com (mailing list archive)
State New, archived
Headers show

Commit Message

Pavel Tatashin July 2, 2018, 2:04 a.m. UTC
sparse_init() requires the temporary allocation of two large buffers:
usemap_map and map_map. Baoquan He identified that these buffers are so
large that Linux is not bootable on small-memory machines, such as a kdump
boot. The buffers are especially large when CONFIG_X86_5LEVEL is set, as
they are scaled to the maximum physical memory size.

Baoquan provided a fix that reduces the sizes of these buffers, but it
is much better to get rid of them entirely.

Add a new way to initialize sparse memory: sparse_init_nid(), which
operates within a single memory node, and thus allocates memory either in
one large contiguous block or section by section. This eliminates the need
for temporary buffers.
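A minimal sketch of the strategy described above, with hypothetical names and malloc() standing in for the kernel's memblock allocator (SECTION_MAP_SIZE, node_populate(), and section_populate() are toy stand-ins, not the patch's actual code):

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-section memory map size; in the kernel this is roughly
 * sizeof(struct page) * PAGES_PER_SECTION, and the allocations come from
 * memblock rather than malloc(). */
#define SECTION_MAP_SIZE 4096UL

/* Try to grab the whole node's memory map in one contiguous block. */
static void *node_populate(unsigned long map_count)
{
	return malloc(SECTION_MAP_SIZE * map_count);
}

/* Return the map for one section: slice the node-wide block if we got
 * one, otherwise fall back to allocating this section on its own. */
static void *section_populate(void *map_base, unsigned long map_index)
{
	if (map_base)
		return (char *)map_base + SECTION_MAP_SIZE * map_index;
	return malloc(SECTION_MAP_SIZE);
}
```

Because each section's map is either a slice of the node-wide block or an individual allocation, no node-spanning temporary arrays like usemap_map/map_map are needed.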

For simplified bisecting and review, the new interface is enabled, and
the old code removed, in the next patch.

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/mm.h  |  8 ++++
 mm/sparse-vmemmap.c | 49 ++++++++++++++++++++++++
 mm/sparse.c         | 91 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 148 insertions(+)

Comments

Baoquan He July 2, 2018, 2:11 a.m. UTC | #1
Hi Pavel,

Thanks for your quick fix. You might have missed another comment on the
v2 patch 1/2, which is at the bottom.

On 07/01/18 at 10:04pm, Pavel Tatashin wrote:
> +/*
> + * Initialize sparse on a specific node. The node spans [pnum_begin, pnum_end)
> + * And number of present sections in this node is map_count.
> + */
> +void __init sparse_init_nid(int nid, unsigned long pnum_begin,
> +				   unsigned long pnum_end,
> +				   unsigned long map_count)
> +{
> +	unsigned long pnum, usemap_longs, *usemap, map_index;
> +	struct page *map, *map_base;
> +
> +	usemap_longs = BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS);
> +	usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid),
> +							  usemap_size() *
> +							  map_count);
> +	if (!usemap) {
> +		pr_err("%s: usemap allocation failed", __func__);
> +		goto failed;
> +	}
> +	map_base = sparse_populate_node(pnum_begin, pnum_end,
> +					map_count, nid);
> +	map_index = 0;
> +	for_each_present_section_nr(pnum_begin, pnum) {
> +		if (pnum >= pnum_end)
> +			break;
> +
> +		BUG_ON(map_index == map_count);
> +		map = sparse_populate_node_section(map_base, map_index,
> +						   pnum, nid);
> +		if (!map) {

Here, I think it might be not right to jump to 'failed' directly if one
section of the node failed to populate memmap. I think the original code
is only skipping the section which memmap failed to populate by marking
it as not present with "ms->section_mem_map = 0".

> +			pr_err("%s: memory map backing failed. Some memory will not be available.",
> +			       __func__);
> +			pnum_begin = pnum;
> +			goto failed;
> +		}
> +		check_usemap_section_nr(nid, usemap);
> +		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
> +					usemap);
> +		map_index++;
> +		usemap += usemap_longs;
> +	}
> +	return;
> +failed:
> +	/* We failed to allocate, mark all the following pnums as not present */
> +	for_each_present_section_nr(pnum_begin, pnum) {
> +		struct mem_section *ms;
> +
> +		if (pnum >= pnum_end)
> +			break;
> +		ms = __nr_to_section(pnum);
> +		ms->section_mem_map = 0;
> +	}
> +}
> +
>  /*
>   * Allocate the accumulated non-linear sections, allocate a mem_map
>   * for each and record the physical to section mapping.
> -- 
> 2.18.0
>
Pavel Tatashin July 2, 2018, 2:18 a.m. UTC | #2
> Here, I think it might be not right to jump to 'failed' directly if one
> section of the node failed to populate memmap. I think the original code
> is only skipping the section which memmap failed to populate by marking
> it as not present with "ms->section_mem_map = 0".
>

Hi Baoquan,

Thank you for a careful review. This is an intended change compared to
the original code. Because we operate per-node now, if we fail to
allocate a single section, in this node, it means we also will fail to
allocate all the consequent sections in the same node and no need to
check them anymore. In the original code we could not simply bailout,
because we still might have valid entries in the following nodes.
Similarly, sparse_init() will call sparse_init_nid() for the next node
even if previous node failed to setup all the memory.

Pavel
Baoquan He July 2, 2018, 2:31 a.m. UTC | #3
On 07/01/18 at 10:18pm, Pavel Tatashin wrote:
> > Here, I think it might be not right to jump to 'failed' directly if one
> > section of the node failed to populate memmap. I think the original code
> > is only skipping the section which memmap failed to populate by marking
> > it as not present with "ms->section_mem_map = 0".
> >
> 
> Hi Baoquan,
> 
> Thank you for a careful review. This is an intended change compared to
> the original code. Because we operate per-node now, if we fail to
> allocate a single section, in this node, it means we also will fail to
> allocate all the consequent sections in the same node and no need to
> check them anymore. In the original code we could not simply bailout,
> because we still might have valid entries in the following nodes.
> Similarly, sparse_init() will call sparse_init_nid() for the next node
> even if previous node failed to setup all the memory.

Hmm, say the node we are handling is node5, and there are 100 sections.
If you allocate memmap for section at one time, you have succeeded to
handle for the first 99 sections, now the 100th failed, so you will mark
all sections on node5 as not present. And the allocation failure is only
for single section memmap allocation case.

Please think about the vmemmap case: it maps the struct page pages into
vmemmap and populates the page tables needed to map them. That is a long
walk, not only memory allocation but also page table checking and
populating; one section failing to populate its memmap doesn't mean all
the subsequent sections will fail too. I think the original code is
reasonable.

Thanks
Baoquan
Pavel Tatashin July 2, 2018, 2:43 a.m. UTC | #4
On Sun, Jul 1, 2018 at 10:31 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 07/01/18 at 10:18pm, Pavel Tatashin wrote:
> > > Here, I think it might be not right to jump to 'failed' directly if one
> > > section of the node failed to populate memmap. I think the original code
> > > is only skipping the section which memmap failed to populate by marking
> > > it as not present with "ms->section_mem_map = 0".
> > >
> >
> > Hi Baoquan,
> >
> > Thank you for a careful review. This is an intended change compared to
> > the original code. Because we operate per-node now, if we fail to
> > allocate a single section, in this node, it means we also will fail to
> > allocate all the consequent sections in the same node and no need to
> > check them anymore. In the original code we could not simply bailout,
> > because we still might have valid entries in the following nodes.
> > Similarly, sparse_init() will call sparse_init_nid() for the next node
> > even if previous node failed to setup all the memory.
>
> Hmm, say the node we are handling is node5, and there are 100 sections.
> If you allocate memmap for section at one time, you have succeeded to
> handle for the first 99 sections, now the 100th failed, so you will mark
> all sections on node5 as not present. And the allocation failure is only
> for single section memmap allocation case.

No, unless I am missing something, that's not how code works:

463                 if (!map) {
464                         pr_err("%s: memory map backing failed. Some memory will not be available.",
465                                __func__);
466                         pnum_begin = pnum;
467                         goto failed;
468                 }

476 failed:
477         /* We failed to allocate, mark all the following pnums as not present */
478         for_each_present_section_nr(pnum_begin, pnum) {

We continue from the pnum that failed as we set pnum_begin to pnum,
and mark all the consequent sections as not-present.

The only change compared to the original code is that once we found an
empty pnum we stop checking the consequent pnums in this node, as we
know they are empty as well, because there is no more memory in this
node to allocate from.
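The 'failed:' path described above can be sketched as a toy model (hypothetical arrays standing in for the kernel's mem_section table and its section_mem_map field): from the pnum that failed onward, every still-present section is cleared, while sections initialized before the failure keep their maps.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SECTIONS 8UL

/* Toy section table: present[] marks which sections exist, mem_map[] is
 * nonzero once a section has a memory map. These are stand-ins for the
 * kernel's mem_section / section_mem_map, not the real structures. */
static bool present[NR_SECTIONS];
static unsigned long mem_map[NR_SECTIONS];

/* Mirror of the 'failed:' loop: starting at the pnum that failed to
 * populate, mark every remaining present section in [pnum_begin,
 * pnum_end) as not present. */
static void mark_absent_from(unsigned long pnum_begin, unsigned long pnum_end)
{
	for (unsigned long pnum = pnum_begin; pnum < pnum_end; pnum++) {
		if (!present[pnum])
			continue;
		mem_map[pnum] = 0;
	}
}
```

Setting pnum_begin to the failing pnum before jumping to the label is what makes only the tail of the node get cleared, not the whole node.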

Pavel
Baoquan He July 2, 2018, 2:53 a.m. UTC | #5
On 07/01/18 at 10:43pm, Pavel Tatashin wrote:
> On Sun, Jul 1, 2018 at 10:31 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 07/01/18 at 10:18pm, Pavel Tatashin wrote:
> > > > Here, I think it might be not right to jump to 'failed' directly if one
> > > > section of the node failed to populate memmap. I think the original code
> > > > is only skipping the section which memmap failed to populate by marking
> > > > it as not present with "ms->section_mem_map = 0".
> > > >
> > >
> > > Hi Baoquan,
> > >
> > > Thank you for a careful review. This is an intended change compared to
> > > the original code. Because we operate per-node now, if we fail to
> > > allocate a single section, in this node, it means we also will fail to
> > > allocate all the consequent sections in the same node and no need to
> > > check them anymore. In the original code we could not simply bailout,
> > > because we still might have valid entries in the following nodes.
> > > Similarly, sparse_init() will call sparse_init_nid() for the next node
> > > even if previous node failed to setup all the memory.
> >
> > Hmm, say the node we are handling is node5, and there are 100 sections.
> > If you allocate memmap for section at one time, you have succeeded to
> > handle for the first 99 sections, now the 100th failed, so you will mark
> > all sections on node5 as not present. And the allocation failure is only
> > for single section memmap allocation case.
> 
> No, unless I am missing something, that's not how code works:
> 
> 463                 if (!map) {
> 464                         pr_err("%s: memory map backing failed. Some memory will not be available.",
> 465                                __func__);
> 466                         pnum_begin = pnum;
> 467                         goto failed;
> 468                 }
> 
> 476 failed:
> 477         /* We failed to allocate, mark all the following pnums as not present */
> 478         for_each_present_section_nr(pnum_begin, pnum) {
> 
> We continue from the pnum that failed as we set pnum_begin to pnum,
> and mark all the consequent sections as not-present.

Ah, yes, I misunderstood it, sorry for that.

Then I have only one concern, for vmemmap case, if one section doesn't
succeed to populate its memmap, do we need to skip all the remaining
sections in that node?

> 
> The only change compared to the original code is that once we found an
> empty pnum we stop checking the consequent pnums in this node, as we
> know they are empty as well, because there is no more memory in this
> node to allocate from.
>
Baoquan He July 2, 2018, 2:56 a.m. UTC | #6
On 07/01/18 at 10:04pm, Pavel Tatashin wrote:
> +/*
> + * Initialize sparse on a specific node. The node spans [pnum_begin, pnum_end)
> + * And number of present sections in this node is map_count.
> + */
> +void __init sparse_init_nid(int nid, unsigned long pnum_begin,
> +				   unsigned long pnum_end,
> +				   unsigned long map_count)
> +{
> +	unsigned long pnum, usemap_longs, *usemap, map_index;
> +	struct page *map, *map_base;
> +
> +	usemap_longs = BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS);
> +	usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid),
> +							  usemap_size() *
> +							  map_count);
> +	if (!usemap) {
> +		pr_err("%s: usemap allocation failed", __func__);

Wondering if we can provide more useful information for better debugging
if failed. E.g here tell on what nid the usemap allocation failed.

> +		goto failed;
> +	}
> +	map_base = sparse_populate_node(pnum_begin, pnum_end,
> +					map_count, nid);
> +	map_index = 0;
> +	for_each_present_section_nr(pnum_begin, pnum) {
> +		if (pnum >= pnum_end)
> +			break;
> +
> +		BUG_ON(map_index == map_count);
> +		map = sparse_populate_node_section(map_base, map_index,
> +						   pnum, nid);
> +		if (!map) {
> +			pr_err("%s: memory map backing failed. Some memory will not be available.",
> +			       __func__);
And here tell nid and the memory section nr failed.

> +			pnum_begin = pnum;
> +			goto failed;
> +		}
> +		check_usemap_section_nr(nid, usemap);
> +		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
> +					usemap);
> +		map_index++;
> +		usemap += usemap_longs;
> +	}
> +	return;
> +failed:
> +	/* We failed to allocate, mark all the following pnums as not present */
> +	for_each_present_section_nr(pnum_begin, pnum) {
> +		struct mem_section *ms;
> +
> +		if (pnum >= pnum_end)
> +			break;
> +		ms = __nr_to_section(pnum);
> +		ms->section_mem_map = 0;
> +	}
> +}
> +
>  /*
>   * Allocate the accumulated non-linear sections, allocate a mem_map
>   * for each and record the physical to section mapping.
> -- 
> 2.18.0
>
Pavel Tatashin July 2, 2018, 3:03 a.m. UTC | #7
> Ah, yes, I misunderstood it, sorry for that.
>
> Then I have only one concern, for vmemmap case, if one section doesn't
> succeed to populate its memmap, do we need to skip all the remaining
> sections in that node?

Yes, in sparse_populate_node() we have the following:

294         for (pnum = pnum_begin; map_index < map_count; pnum++) {
295                 if (!present_section_nr(pnum))
296                         continue;
297                 if (!sparse_mem_map_populate(pnum, nid, NULL))
298                         break;

So, on the first failure, we even stop trying to populate other
sections. No more memory to do so.

Pavel
Pavel Tatashin July 2, 2018, 3:05 a.m. UTC | #8
> > +     if (!usemap) {
> > +             pr_err("%s: usemap allocation failed", __func__);
>
> Wondering if we can provide more useful information for better debugging
> if failed. E.g here tell on what nid the usemap allocation failed.
>
> > +                                                pnum, nid);
> > +             if (!map) {
> > +                     pr_err("%s: memory map backing failed. Some memory will not be available.",
> > +                            __func__);
> And here tell nid and the memory section nr failed.

Sure, I will wait for more comments, if any, and add more info to the
error messages in the next revision.

Thank you,
Pavel
Baoquan He July 2, 2018, 3:14 a.m. UTC | #9
On 07/01/18 at 11:03pm, Pavel Tatashin wrote:
> > Ah, yes, I misunderstood it, sorry for that.
> >
> > Then I have only one concern, for vmemmap case, if one section doesn't
> > succeed to populate its memmap, do we need to skip all the remaining
> > sections in that node?
> 
> Yes, in sparse_populate_node() we have the following:
> 
> 294         for (pnum = pnum_begin; map_index < map_count; pnum++) {
> 295                 if (!present_section_nr(pnum))
> 296                         continue;
> 297                 if (!sparse_mem_map_populate(pnum, nid, NULL))
> 298                         break;
> 
> So, on the first failure, we even stop trying to populate other
> sections. No more memory to do so.

This is the thing I worry about. In old sparse_mem_maps_populate_node()
you can see, when not present or failed to populate, just continue. This
is the main difference between yours and the old code. The key logic is
changed here.

void __init sparse_mem_maps_populate_node(struct page **map_map,
                                          unsigned long pnum_begin,
                                          unsigned long pnum_end,
                                          unsigned long map_count, int nodeid)
{
	...
	for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
                struct mem_section *ms;

                if (!present_section_nr(pnum))
                        continue;

                map_map[pnum] = sparse_mem_map_populate(pnum, nodeid, NULL);
                if (map_map[pnum])                                                                                                                
                        continue;
                ms = __nr_to_section(pnum);
                pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
                       __func__);
                ms->section_mem_map = 0;
        }
	...
}
Baoquan He July 2, 2018, 3:17 a.m. UTC | #10
On 07/02/18 at 11:14am, Baoquan He wrote:
> On 07/01/18 at 11:03pm, Pavel Tatashin wrote:
> > > Ah, yes, I misunderstood it, sorry for that.
> > >
> > > Then I have only one concern, for vmemmap case, if one section doesn't
> > > succeed to populate its memmap, do we need to skip all the remaining
> > > sections in that node?
> > 
> > Yes, in sparse_populate_node() we have the following:
> > 
> > 294         for (pnum = pnum_begin; map_index < map_count; pnum++) {
> > 295                 if (!present_section_nr(pnum))
> > 296                         continue;
> > 297                 if (!sparse_mem_map_populate(pnum, nid, NULL))
> > 298                         break;
> > 
> > So, on the first failure, we even stop trying to populate other
> > sections. No more memory to do so.
> 
> This is the thing I worry about. In old sparse_mem_maps_populate_node()
> you can see, when not present or failed to populate, just continue. This
> is the main difference between yours and the old code. The key logic is
> changed here.
> 
Forgot mentioning it's the vervion in mm/sparse-vmemmap.c

> void __init sparse_mem_maps_populate_node(struct page **map_map,
>                                           unsigned long pnum_begin,
>                                           unsigned long pnum_end,
>                                           unsigned long map_count, int nodeid)
> {
> 	...
> 	for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
>                 struct mem_section *ms;
> 
>                 if (!present_section_nr(pnum))
>                         continue;
> 
>                 map_map[pnum] = sparse_mem_map_populate(pnum, nodeid, NULL);
>                 if (map_map[pnum])                                                                                                                
>                         continue;
>                 ms = __nr_to_section(pnum);
>                 pr_err("%s: sparsemem memory map backing failed some memory will not be available\n",
>                        __func__);
>                 ms->section_mem_map = 0;
>         }
> 	...
> }
>
Pavel Tatashin July 2, 2018, 3:28 a.m. UTC | #11
> > So, on the first failure, we even stop trying to populate other
> > sections. No more memory to do so.
>
> This is the thing I worry about. In old sparse_mem_maps_populate_node()
> you can see, when not present or failed to populate, just continue. This
> is the main difference between yours and the old code. The key logic is
> changed here.
>

I do not see how  we can succeed after the first failure. We still
allocate from the same node:

sparse_mem_map_populate() may fail only if we could not allocate large
enough buffer vmemmap_buf_start earlier.

This means that in:
sparse_mem_map_populate()
  vmemmap_populate()
    vmemmap_populate_hugepages()
      vmemmap_alloc_block_buf() (no buffer, so call allocator)
        vmemmap_alloc_block(size, node);
            __earlyonly_bootmem_alloc(node, size, size, __pa(MAX_DMA_ADDRESS));
              memblock_virt_alloc_try_nid_raw() -> Nothing changes for
this call to succeed. So, all consequent calls to
sparse_mem_map_populate() in this node will fail as well.

> >
> Forgot mentioning it's the vervion in mm/sparse-vmemmap.c

Sorry, I do not understand what is vervion.

Thank you,
Pavel
Baoquan He July 2, 2018, 3:42 a.m. UTC | #12
On 07/01/18 at 11:28pm, Pavel Tatashin wrote:
> > > So, on the first failure, we even stop trying to populate other
> > > sections. No more memory to do so.
> >
> > This is the thing I worry about. In old sparse_mem_maps_populate_node()
> > you can see, when not present or failed to populate, just continue. This
> > is the main difference between yours and the old code. The key logic is
> > changed here.
> >
> 
> I do not see how  we can succeed after the first failure. We still
> allocate from the same node:
> 
> sparse_mem_map_populate() may fail only if we could not allocate large
> enough buffer vmemmap_buf_start earlier.
> 
> This means that in:
> sparse_mem_map_populate()
>   vmemmap_populate()
>     vmemmap_populate_hugepages()
>       vmemmap_alloc_block_buf() (no buffer, so call allocator)
>         vmemmap_alloc_block(size, node);
>             __earlyonly_bootmem_alloc(node, size, size, __pa(MAX_DMA_ADDRESS));
>               memblock_virt_alloc_try_nid_raw() -> Nothing changes for
> this call to succeed. So, all consequent calls to
> sparse_mem_map_populate() in this node will fail as well.

Yes, you are right, it's an improvement. Thanks.

> 
> > >
> > Forgot mentioning it's the vervion in mm/sparse-vmemmap.c
> 
> Sorry, I do not understand what is vervion.

Typo, 'version', should be. Sorry for that.
Dave Hansen July 2, 2018, 7:59 p.m. UTC | #13
> @@ -2651,6 +2651,14 @@ void sparse_mem_maps_populate_node(struct page **map_map,
>  				   unsigned long pnum_end,
>  				   unsigned long map_count,
>  				   int nodeid);
> +struct page * sparse_populate_node(unsigned long pnum_begin,

CodingStyle: put the "*" next to the function name, no space, please.

> +				   unsigned long pnum_end,
> +				   unsigned long map_count,
> +				   int nid);
> +struct page * sparse_populate_node_section(struct page *map_base,
> +				   unsigned long map_index,
> +				   unsigned long pnum,
> +				   int nid);

These two functions are named in very similar ways.  Do they do similar
things?

>  struct page *sparse_mem_map_populate(unsigned long pnum, int nid,
>  		struct vmem_altmap *altmap);
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index e1a54ba411ec..b3e325962306 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -311,3 +311,52 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>  		vmemmap_buf_end = NULL;
>  	}
>  }
> +
> +struct page * __init sparse_populate_node(unsigned long pnum_begin,
> +					  unsigned long pnum_end,
> +					  unsigned long map_count,
> +					  int nid)
> +{

Could you comment what the function is doing, please?

> +	unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;
> +	unsigned long pnum, map_index = 0;
> +	void *vmemmap_buf_start;
> +
> +	size = ALIGN(size, PMD_SIZE) * map_count;
> +	vmemmap_buf_start = __earlyonly_bootmem_alloc(nid, size,
> +						      PMD_SIZE,
> +						      __pa(MAX_DMA_ADDRESS));

Let's not repeat the mistakes of the previous version of the code.
Please explain why we are aligning this.  Also,
__earlyonly_bootmem_alloc()->memblock_virt_alloc_try_nid_raw() claims to
be aligning the size.  Do we also need to do it here?

Yes, I know the old code did this, but this is the cost of doing a
rewrite. :)

> +	if (vmemmap_buf_start) {
> +		vmemmap_buf = vmemmap_buf_start;
> +		vmemmap_buf_end = vmemmap_buf_start + size;
> +	}

It would be nice to call out that these are globals that other code
picks up.

> +	for (pnum = pnum_begin; map_index < map_count; pnum++) {
> +		if (!present_section_nr(pnum))
> +			continue;
> +		if (!sparse_mem_map_populate(pnum, nid, NULL))
> +			break;

^ This consumes "vmemmap_buf", right?  That seems like a really nice
thing to point out here if so.

> +		map_index++;
> +		BUG_ON(pnum >= pnum_end);
> +	}
> +
> +	if (vmemmap_buf_start) {
> +		/* need to free left buf */
> +		memblock_free_early(__pa(vmemmap_buf),
> +				    vmemmap_buf_end - vmemmap_buf);
> +		vmemmap_buf = NULL;
> +		vmemmap_buf_end = NULL;
> +	}
> +	return pfn_to_page(section_nr_to_pfn(pnum_begin));
> +}
> +
> +/*
> + * Return map for pnum section. sparse_populate_node() has populated memory map
> + * in this node, we simply do pnum to struct page conversion.
> + */
> +struct page * __init sparse_populate_node_section(struct page *map_base,
> +						  unsigned long map_index,
> +						  unsigned long pnum,
> +						  int nid)
> +{
> +	return pfn_to_page(section_nr_to_pfn(pnum));
> +}

What is up with all of the unused arguments to this function?

> diff --git a/mm/sparse.c b/mm/sparse.c
> index d18e2697a781..c18d92b8ab9b 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -456,6 +456,43 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>  		       __func__);
>  	}
>  }
> +
> +static unsigned long section_map_size(void)
> +{
> +	return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
> +}

Seems like if we have this, we should use it wherever possible, like
sparse_populate_node().


> +/*
> + * Try to allocate all struct pages for this node, if this fails, we will
> + * be allocating one section at a time in sparse_populate_node_section().
> + */
> +struct page * __init sparse_populate_node(unsigned long pnum_begin,
> +					  unsigned long pnum_end,
> +					  unsigned long map_count,
> +					  int nid)
> +{
> +	return memblock_virt_alloc_try_nid_raw(section_map_size() * map_count,
> +					       PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> +					       BOOTMEM_ALLOC_ACCESSIBLE, nid);
> +}
> +
> +/*
> + * Return map for pnum section. map_base is not NULL if we could allocate map
> + * for this node together. Otherwise we allocate one section at a time.
> + * map_index is the index of pnum in this node counting only present sections.
> + */
> +struct page * __init sparse_populate_node_section(struct page *map_base,
> +						  unsigned long map_index,
> +						  unsigned long pnum,
> +						  int nid)
> +{
> +	if (map_base) {
> +		unsigned long offset = section_map_size() * map_index;
> +
> +		return (struct page *)((char *)map_base + offset);
> +	}
> +	return sparse_mem_map_populate(pnum, nid, NULL);

Oh, you have a vmemmap and non-vmemmap version.

BTW, can't the whole map base calculation just be replaced with:

	return &map_base[PAGES_PER_SECTION * map_index];

?
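As a side note, the two forms Dave compares agree only when sizeof(struct page) * PAGES_PER_SECTION is already a PAGE_SIZE multiple, so that section_map_size()'s PAGE_ALIGN() is a no-op. A toy check with hypothetical sizes (toy_page and the constants below are illustrative stand-ins, not the kernel's values):

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

struct toy_page { unsigned long flags; void *priv; }; /* stand-in for struct page */
#define PAGES_PER_SECTION 32768UL /* hypothetical; arch-dependent in reality */

static unsigned long section_map_size(void)
{
	return PAGE_ALIGN(sizeof(struct toy_page) * PAGES_PER_SECTION);
}

/* Byte-offset form, as in the patch. */
static struct toy_page *by_offset(struct toy_page *base, unsigned long idx)
{
	return (struct toy_page *)((char *)base + section_map_size() * idx);
}

/* Array-index form, as Dave suggests. */
static struct toy_page *by_index(struct toy_page *base, unsigned long idx)
{
	return &base[PAGES_PER_SECTION * idx];
}
```

With these sizes the product is page-aligned, so both forms return the same pointer; if it were not, the byte-offset form would skip over the per-section padding while the index form would not.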
Pavel Tatashin July 2, 2018, 8:29 p.m. UTC | #14
On Mon, Jul 2, 2018 at 4:00 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> > @@ -2651,6 +2651,14 @@ void sparse_mem_maps_populate_node(struct page **map_map,
> >                                  unsigned long pnum_end,
> >                                  unsigned long map_count,
> >                                  int nodeid);
> > +struct page * sparse_populate_node(unsigned long pnum_begin,
>
> CodingStyle: put the "*" next to the function name, no space, please.

OK

>
> > +                                unsigned long pnum_end,
> > +                                unsigned long map_count,
> > +                                int nid);
> > +struct page * sparse_populate_node_section(struct page *map_base,
> > +                                unsigned long map_index,
> > +                                unsigned long pnum,
> > +                                int nid);
>
> These two functions are named in very similar ways.  Do they do similar
> things?

Yes, they do in the non-vmemmap case:

sparse_populate_node() -> populates the whole node, if we can, using a
single allocation
sparse_populate_node_section() -> populates only one section in the
given node if the whole node is not already populated.

However, the vmemmap variant is a little different: sparse_populate_node()
populates in a single allocation if it can, and if it cannot it still
populates the whole node, just in smaller chunks, so
sparse_populate_node_section() has nothing left to do.

>
> >  struct page *sparse_mem_map_populate(unsigned long pnum, int nid,
> >               struct vmem_altmap *altmap);
> > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > index e1a54ba411ec..b3e325962306 100644
> > --- a/mm/sparse-vmemmap.c
> > +++ b/mm/sparse-vmemmap.c
> > @@ -311,3 +311,52 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
> >               vmemmap_buf_end = NULL;
> >       }
> >  }
> > +
> > +struct page * __init sparse_populate_node(unsigned long pnum_begin,
> > +                                       unsigned long pnum_end,
> > +                                       unsigned long map_count,
> > +                                       int nid)
> > +{
>
> Could you comment what the function is doing, please?

Sure, I will add comments.

>
> > +     unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;
> > +     unsigned long pnum, map_index = 0;
> > +     void *vmemmap_buf_start;
> > +
> > +     size = ALIGN(size, PMD_SIZE) * map_count;
> > +     vmemmap_buf_start = __earlyonly_bootmem_alloc(nid, size,
> > +                                                   PMD_SIZE,
> > +                                                   __pa(MAX_DMA_ADDRESS));
>
> Let's not repeat the mistakes of the previous version of the code.
> Please explain why we are aligning this.  Also,
> __earlyonly_bootmem_alloc()->memblock_virt_alloc_try_nid_raw() claims to
> be aligning the size.  Do we also need to do it here?
>
> Yes, I know the old code did this, but this is the cost of doing a
> rewrite. :)

Actually, I was thinking about this particular case when I was
rewriting this code. Here we align the size before multiplying by
map_count; memblock_virt_alloc_try_nid_raw() aligns after the
multiplication. So we must have both, as they are different.
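The point that the two orders of alignment differ can be checked with toy numbers (the macros below are simple stand-ins for the kernel's ALIGN() and x86_64's 2 MiB PMD_SIZE):

```c
#include <assert.h>

#define PMD_SIZE 0x200000UL /* 2 MiB, as on x86_64 */
#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Size reserved by the patch: each section's map is padded up to
 * PMD_SIZE first, then multiplied by the section count. */
static unsigned long per_section_aligned(unsigned long size, unsigned long count)
{
	return ALIGN(size, PMD_SIZE) * count;
}

/* Aligning only the final total: smaller whenever size is not already
 * a PMD_SIZE multiple, so it would not leave room for the per-section
 * padding the populate path expects. */
static unsigned long total_aligned(unsigned long size, unsigned long count)
{
	return ALIGN(size * count, PMD_SIZE);
}
```

With a 1.5 MiB per-section map and 4 sections, the per-section form reserves 8 MiB while aligning only the total gives 6 MiB, so the pre-multiplication ALIGN() in the patch cannot be dropped in favor of the allocator's own size alignment.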

>
> > +     if (vmemmap_buf_start) {
> > +             vmemmap_buf = vmemmap_buf_start;
> > +             vmemmap_buf_end = vmemmap_buf_start + size;
> > +     }
>
> It would be nice to call out that these are globals that other code
> picks up.

I do not like these globals; they should have specific functions that
access them only, something like:

static struct {
  void *buffer;
  void *buffer_end;
} vmemmap_buffer;

vmemmap_buffer_init()  - allocate the buffer
vmemmap_buffer_alloc() - return NULL if the buffer is empty
vmemmap_buffer_fini()

Call vmemmap_buffer_init() and vmemmap_buffer_fini() from
sparse_populate_node(), and vmemmap_buffer_alloc() from
vmemmap_alloc_block_buf().

But, it should be a separate patch. If you would like I can add it to
this series, or submit separately.

>
> > +     for (pnum = pnum_begin; map_index < map_count; pnum++) {
> > +             if (!present_section_nr(pnum))
> > +                     continue;
> > +             if (!sparse_mem_map_populate(pnum, nid, NULL))
> > +                     break;
>
> ^ This consumes "vmemmap_buf", right?  That seems like a really nice
> thing to point out here if so.

It consumes vmemmap_buf if __earlyonly_bootmem_alloc() was successful,
otherwise it allocates struct pages a section at a time.

>
> > +             map_index++;
> > +             BUG_ON(pnum >= pnum_end);
> > +     }
> > +
> > +     if (vmemmap_buf_start) {
> > +             /* need to free left buf */
> > +             memblock_free_early(__pa(vmemmap_buf),
> > +                                 vmemmap_buf_end - vmemmap_buf);
> > +             vmemmap_buf = NULL;
> > +             vmemmap_buf_end = NULL;
> > +     }
> > +     return pfn_to_page(section_nr_to_pfn(pnum_begin));
> > +}
> > +
> > +/*
> > + * Return map for pnum section. sparse_populate_node() has populated memory map
> > + * in this node, we simply do pnum to struct page conversion.
> > + */
> > +struct page * __init sparse_populate_node_section(struct page *map_base,
> > +                                               unsigned long map_index,
> > +                                               unsigned long pnum,
> > +                                               int nid)
> > +{
> > +     return pfn_to_page(section_nr_to_pfn(pnum));
> > +}
>
> What is up with all of the unused arguments to this function?

Because the same function is called from non-vmemmap sparse code.

>
> > diff --git a/mm/sparse.c b/mm/sparse.c
> > index d18e2697a781..c18d92b8ab9b 100644
> > --- a/mm/sparse.c
> > +++ b/mm/sparse.c
> > @@ -456,6 +456,43 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
> >                      __func__);
> >       }
> >  }
> > +
> > +static unsigned long section_map_size(void)
> > +{
> > +     return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
> > +}
>
> Seems like if we have this, we should use it wherever possible, like
> sparse_populate_node().

It is used in sparse_populate_node():

401 struct page * __init sparse_populate_node(unsigned long pnum_begin,
406         return memblock_virt_alloc_try_nid_raw(section_map_size() * map_count,
407                                                PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
408                                                BOOTMEM_ALLOC_ACCESSIBLE, nid);


>
>
> > +/*
> > + * Try to allocate all struct pages for this node, if this fails, we will
> > + * be allocating one section at a time in sparse_populate_node_section().
> > + */
> > +struct page * __init sparse_populate_node(unsigned long pnum_begin,
> > +                                       unsigned long pnum_end,
> > +                                       unsigned long map_count,
> > +                                       int nid)
> > +{
> > +     return memblock_virt_alloc_try_nid_raw(section_map_size() * map_count,
> > +                                            PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> > +                                            BOOTMEM_ALLOC_ACCESSIBLE, nid);
> > +}
> > +
> > +/*
> > + * Return map for pnum section. map_base is not NULL if we could allocate map
> > + * for this node together. Otherwise we allocate one section at a time.
> > + * map_index is the index of pnum in this node counting only present sections.
> > + */
> > +struct page * __init sparse_populate_node_section(struct page *map_base,
> > +                                               unsigned long map_index,
> > +                                               unsigned long pnum,
> > +                                               int nid)
> > +{
> > +     if (map_base) {
> > +             unsigned long offset = section_map_size() * map_index;
> > +
> > +             return (struct page *)((char *)map_base + offset);
> > +     }
> > +     return sparse_mem_map_populate(pnum, nid, NULL);
>
> Oh, you have a vmemmap and non-vmemmap version.
>
> BTW, can't the whole map base calculation just be replaced with:
>
>         return &map_base[PAGES_PER_SECTION * map_index];

Unfortunately no, because map_base might be allocated in chunks
larger than PAGES_PER_SECTION * sizeof(struct page). See PAGE_ALIGN()
in section_map_size().

Thank you,
Pavel
Dave Hansen July 5, 2018, 1:39 p.m. UTC | #15
On 07/02/2018 01:29 PM, Pavel Tatashin wrote:
> On Mon, Jul 2, 2018 at 4:00 PM Dave Hansen <dave.hansen@intel.com> wrote:
>>> +     unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;
>>> +     unsigned long pnum, map_index = 0;
>>> +     void *vmemmap_buf_start;
>>> +
>>> +     size = ALIGN(size, PMD_SIZE) * map_count;
>>> +     vmemmap_buf_start = __earlyonly_bootmem_alloc(nid, size,
>>> +                                                   PMD_SIZE,
>>> +                                                   __pa(MAX_DMA_ADDRESS));
>>
>> Let's not repeat the mistakes of the previous version of the code.
>> Please explain why we are aligning this.  Also,
>> __earlyonly_bootmem_alloc()->memblock_virt_alloc_try_nid_raw() claims to
>> be aligning the size.  Do we also need to do it here?
>>
>> Yes, I know the old code did this, but this is the cost of doing a
>> rewrite. :)
> 
Actually, I was thinking about this particular case when I was
rewriting this code. Here we align size before multiplying by
map_count, while memblock_virt_alloc_try_nid_raw() aligns the total
afterwards. So we must have both, as they do different things.

That's a good point that they do different things.

But, which behavior of the two different things is the one we _want_?

>>> +     if (vmemmap_buf_start) {
>>> +             vmemmap_buf = vmemmap_buf_start;
>>> +             vmemmap_buf_end = vmemmap_buf_start + size;
>>> +     }
>>
>> It would be nice to call out that these are globals that other code
>> picks up.
> 
> I do not like these globals, they should have specific functions that
> access them only, something:
> static struct {
>   buffer;
>   buffer_end;
> } vmemmap_buffer;
> vmemmap_buffer_init() allocate buffer
> vmemmap_buffer_alloc()  return NULL if buffer is empty
> vmemmap_buffer_fini()
> 
> Call vmemmap_buffer_init()  and vmemmap_buffer_fini()  from
> sparse_populate_node() and
> vmemmap_buffer_alloc() from vmemmap_alloc_block_buf().
> 
> But, it should be a separate patch. If you would like I can add it to
> this series, or submit separately.

Seems like a nice cleanup, but I don't think it needs to be done here.

>>> + * Return map for pnum section. sparse_populate_node() has populated memory map
>>> + * in this node, we simply do pnum to struct page conversion.
>>> + */
>>> +struct page * __init sparse_populate_node_section(struct page *map_base,
>>> +                                               unsigned long map_index,
>>> +                                               unsigned long pnum,
>>> +                                               int nid)
>>> +{
>>> +     return pfn_to_page(section_nr_to_pfn(pnum));
>>> +}
>>
>> What is up with all of the unused arguments to this function?
> 
> Because the same function is called from non-vmemmap sparse code.

That's probably good to call out in the patch description if not there
already.

>>> diff --git a/mm/sparse.c b/mm/sparse.c
>>> index d18e2697a781..c18d92b8ab9b 100644
>>> --- a/mm/sparse.c
>>> +++ b/mm/sparse.c
>>> @@ -456,6 +456,43 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
>>>                      __func__);
>>>       }
>>>  }
>>> +
>>> +static unsigned long section_map_size(void)
>>> +{
>>> +     return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
>>> +}
>>
>> Seems like if we have this, we should use it wherever possible, like
>> sparse_populate_node().
> 
> It is used in sparse_populate_node():
> 
> > 401 struct page * __init sparse_populate_node(unsigned long pnum_begin,
> > 406         return memblock_virt_alloc_try_nid_raw(section_map_size() * map_count,
> > 407                                                PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> > 408                                                BOOTMEM_ALLOC_ACCESSIBLE, nid);

I missed the PAGE_ALIGN() until now.  That really needs a comment
calling out how it's not really the map size but the *allocation* size
of a single section's map.

It probably also needs a name like section_memmap_allocation_size() or
something to differentiate it from the *used* size.

>>> +/*
>>> + * Try to allocate all struct pages for this node, if this fails, we will
>>> + * be allocating one section at a time in sparse_populate_node_section().
>>> + */
>>> +struct page * __init sparse_populate_node(unsigned long pnum_begin,
>>> +                                       unsigned long pnum_end,
>>> +                                       unsigned long map_count,
>>> +                                       int nid)
>>> +{
>>> +     return memblock_virt_alloc_try_nid_raw(section_map_size() * map_count,
>>> +                                            PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
>>> +                                            BOOTMEM_ALLOC_ACCESSIBLE, nid);
>>> +}
>>> +
>>> +/*
>>> + * Return map for pnum section. map_base is not NULL if we could allocate map
>>> + * for this node together. Otherwise we allocate one section at a time.
>>> + * map_index is the index of pnum in this node counting only present sections.
>>> + */
>>> +struct page * __init sparse_populate_node_section(struct page *map_base,
>>> +                                               unsigned long map_index,
>>> +                                               unsigned long pnum,
>>> +                                               int nid)
>>> +{
>>> +     if (map_base) {
>>> +             unsigned long offset = section_map_size() * map_index;
>>> +
>>> +             return (struct page *)((char *)map_base + offset);
>>> +     }
>>> +     return sparse_mem_map_populate(pnum, nid, NULL);
>>
>> Oh, you have a vmemmap and non-vmemmap version.
>>
>> BTW, can't the whole map base calculation just be replaced with:
>>
>>         return &map_base[PAGES_PER_SECTION * map_index];
> 
> > Unfortunately no, because map_base might be allocated in chunks
> > larger than PAGES_PER_SECTION * sizeof(struct page). See PAGE_ALIGN()
> > in section_map_size().

Good point.

Oh, well, can you at least get rid of the superfluous "(char *)" cast?
That should make the whole thing a bit less onerous.
Pavel Tatashin July 9, 2018, 2:31 p.m. UTC | #16
On Thu, Jul 5, 2018 at 9:39 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 07/02/2018 01:29 PM, Pavel Tatashin wrote:
> > On Mon, Jul 2, 2018 at 4:00 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >>> +     unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;
> >>> +     unsigned long pnum, map_index = 0;
> >>> +     void *vmemmap_buf_start;
> >>> +
> >>> +     size = ALIGN(size, PMD_SIZE) * map_count;
> >>> +     vmemmap_buf_start = __earlyonly_bootmem_alloc(nid, size,
> >>> +                                                   PMD_SIZE,
> >>> +                                                   __pa(MAX_DMA_ADDRESS));
> >>
> >> Let's not repeat the mistakes of the previous version of the code.
> >> Please explain why we are aligning this.  Also,
> >> __earlyonly_bootmem_alloc()->memblock_virt_alloc_try_nid_raw() claims to
> >> be aligning the size.  Do we also need to do it here?
> >>
> >> Yes, I know the old code did this, but this is the cost of doing a
> >> rewrite. :)
> >
> > Actually, I was thinking about this particular case when I was
> > rewriting this code. Here we align size before multiplying by
> > map_count, while memblock_virt_alloc_try_nid_raw() aligns the total
> > afterwards. So we must have both, as they do different things.
>
> That's a good point that they do different things.
>
> But, which behavior of the two different things is the one we _want_?

We definitely want the first one:
size = ALIGN(size, PMD_SIZE) * map_count;

The alignment in memblock is not strictly needed for this case, but it
comes with the memblock allocator anyway.

>
> >>> +     if (vmemmap_buf_start) {
> >>> +             vmemmap_buf = vmemmap_buf_start;
> >>> +             vmemmap_buf_end = vmemmap_buf_start + size;
> >>> +     }
> >>
> >> It would be nice to call out that these are globals that other code
> >> picks up.
> >
> > I do not like these globals, they should have specific functions that
> > access them only, something:
> > static struct {
> >   buffer;
> >   buffer_end;
> > } vmemmap_buffer;
> > vmemmap_buffer_init() allocate buffer
> > vmemmap_buffer_alloc()  return NULL if buffer is empty
> > vmemmap_buffer_fini()
> >
> > Call vmemmap_buffer_init()  and vmemmap_buffer_fini()  from
> > sparse_populate_node() and
> > vmemmap_buffer_alloc() from vmemmap_alloc_block_buf().
> >
> > But, it should be a separate patch. If you would like I can add it to
> > this series, or submit separately.
>
> Seems like a nice cleanup, but I don't think it needs to be done here.
>
> >>> + * Return map for pnum section. sparse_populate_node() has populated memory map
> >>> + * in this node, we simply do pnum to struct page conversion.
> >>> + */
> >>> +struct page * __init sparse_populate_node_section(struct page *map_base,
> >>> +                                               unsigned long map_index,
> >>> +                                               unsigned long pnum,
> >>> +                                               int nid)
> >>> +{
> >>> +     return pfn_to_page(section_nr_to_pfn(pnum));
> >>> +}
> >>
> >> What is up with all of the unused arguments to this function?
> >
> > Because the same function is called from non-vmemmap sparse code.
>
> That's probably good to call out in the patch description if not there
> already.
>
> >>> diff --git a/mm/sparse.c b/mm/sparse.c
> >>> index d18e2697a781..c18d92b8ab9b 100644
> >>> --- a/mm/sparse.c
> >>> +++ b/mm/sparse.c
> >>> @@ -456,6 +456,43 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
> >>>                      __func__);
> >>>       }
> >>>  }
> >>> +
> >>> +static unsigned long section_map_size(void)
> >>> +{
> >>> +     return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
> >>> +}
> >>
> >> Seems like if we have this, we should use it wherever possible, like
> >> sparse_populate_node().
> >
> > It is used in sparse_populate_node():
> >
> > 401 struct page * __init sparse_populate_node(unsigned long pnum_begin,
> > 406         return memblock_virt_alloc_try_nid_raw(section_map_size() * map_count,
> > 407                                                PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> > 408                                                BOOTMEM_ALLOC_ACCESSIBLE, nid);
>
> I missed the PAGE_ALIGN() until now.  That really needs a comment
> calling out how it's not really the map size but the *allocation* size
> of a single section's map.
>
> It probably also needs a name like section_memmap_allocation_size() or
> something to differentiate it from the *used* size.
>
> >>> +/*
> >>> + * Try to allocate all struct pages for this node, if this fails, we will
> >>> + * be allocating one section at a time in sparse_populate_node_section().
> >>> + */
> >>> +struct page * __init sparse_populate_node(unsigned long pnum_begin,
> >>> +                                       unsigned long pnum_end,
> >>> +                                       unsigned long map_count,
> >>> +                                       int nid)
> >>> +{
> >>> +     return memblock_virt_alloc_try_nid_raw(section_map_size() * map_count,
> >>> +                                            PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
> >>> +                                            BOOTMEM_ALLOC_ACCESSIBLE, nid);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Return map for pnum section. map_base is not NULL if we could allocate map
> >>> + * for this node together. Otherwise we allocate one section at a time.
> >>> + * map_index is the index of pnum in this node counting only present sections.
> >>> + */
> >>> +struct page * __init sparse_populate_node_section(struct page *map_base,
> >>> +                                               unsigned long map_index,
> >>> +                                               unsigned long pnum,
> >>> +                                               int nid)
> >>> +{
> >>> +     if (map_base) {
> >>> +             unsigned long offset = section_map_size() * map_index;
> >>> +
> >>> +             return (struct page *)((char *)map_base + offset);
> >>> +     }
> >>> +     return sparse_mem_map_populate(pnum, nid, NULL);
> >>
> >> Oh, you have a vmemmap and non-vmemmap version.
> >>
> >> BTW, can't the whole map base calculation just be replaced with:
> >>
> >>         return &map_base[PAGES_PER_SECTION * map_index];
> >
> > Unfortunately no, because map_base might be allocated in chunks
> > larger than PAGES_PER_SECTION * sizeof(struct page). See PAGE_ALIGN()
> > in section_map_size().
>
> Good point.
>
> Oh, well, can you at least get rid of the superfluous "(char *)" cast?
> That should make the whole thing a bit less onerous.

I will see what can be done; if it is not going to be cleaner, I will
keep the cast.

Thank you,
Pavel

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a0fbb9ffe380..85530fdfb1f2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2651,6 +2651,14 @@  void sparse_mem_maps_populate_node(struct page **map_map,
 				   unsigned long pnum_end,
 				   unsigned long map_count,
 				   int nodeid);
+struct page * sparse_populate_node(unsigned long pnum_begin,
+				   unsigned long pnum_end,
+				   unsigned long map_count,
+				   int nid);
+struct page * sparse_populate_node_section(struct page *map_base,
+				   unsigned long map_index,
+				   unsigned long pnum,
+				   int nid);
 
 struct page *sparse_mem_map_populate(unsigned long pnum, int nid,
 		struct vmem_altmap *altmap);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index e1a54ba411ec..b3e325962306 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -311,3 +311,52 @@  void __init sparse_mem_maps_populate_node(struct page **map_map,
 		vmemmap_buf_end = NULL;
 	}
 }
+
+struct page * __init sparse_populate_node(unsigned long pnum_begin,
+					  unsigned long pnum_end,
+					  unsigned long map_count,
+					  int nid)
+{
+	unsigned long size = sizeof(struct page) * PAGES_PER_SECTION;
+	unsigned long pnum, map_index = 0;
+	void *vmemmap_buf_start;
+
+	size = ALIGN(size, PMD_SIZE) * map_count;
+	vmemmap_buf_start = __earlyonly_bootmem_alloc(nid, size,
+						      PMD_SIZE,
+						      __pa(MAX_DMA_ADDRESS));
+	if (vmemmap_buf_start) {
+		vmemmap_buf = vmemmap_buf_start;
+		vmemmap_buf_end = vmemmap_buf_start + size;
+	}
+
+	for (pnum = pnum_begin; map_index < map_count; pnum++) {
+		if (!present_section_nr(pnum))
+			continue;
+		if (!sparse_mem_map_populate(pnum, nid, NULL))
+			break;
+		map_index++;
+		BUG_ON(pnum >= pnum_end);
+	}
+
+	if (vmemmap_buf_start) {
+		/* need to free left buf */
+		memblock_free_early(__pa(vmemmap_buf),
+				    vmemmap_buf_end - vmemmap_buf);
+		vmemmap_buf = NULL;
+		vmemmap_buf_end = NULL;
+	}
+	return pfn_to_page(section_nr_to_pfn(pnum_begin));
+}
+
+/*
+ * Return the map for the pnum section. sparse_populate_node() has populated
+ * the memory map in this node; we simply do pnum to struct page conversion.
+ */
+struct page * __init sparse_populate_node_section(struct page *map_base,
+						  unsigned long map_index,
+						  unsigned long pnum,
+						  int nid)
+{
+	return pfn_to_page(section_nr_to_pfn(pnum));
+}
diff --git a/mm/sparse.c b/mm/sparse.c
index d18e2697a781..c18d92b8ab9b 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -456,6 +456,43 @@  void __init sparse_mem_maps_populate_node(struct page **map_map,
 		       __func__);
 	}
 }
+
+static unsigned long section_map_size(void)
+{
+	return PAGE_ALIGN(sizeof(struct page) * PAGES_PER_SECTION);
+}
+
+/*
+ * Try to allocate all struct pages for this node; if this fails, we will
+ * be allocating one section at a time in sparse_populate_node_section().
+ */
+struct page * __init sparse_populate_node(unsigned long pnum_begin,
+					  unsigned long pnum_end,
+					  unsigned long map_count,
+					  int nid)
+{
+	return memblock_virt_alloc_try_nid_raw(section_map_size() * map_count,
+					       PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+					       BOOTMEM_ALLOC_ACCESSIBLE, nid);
+}
+
+/*
+ * Return map for pnum section. map_base is not NULL if we could allocate map
+ * for this node together. Otherwise we allocate one section at a time.
+ * map_index is the index of pnum in this node counting only present sections.
+ */
+struct page * __init sparse_populate_node_section(struct page *map_base,
+						  unsigned long map_index,
+						  unsigned long pnum,
+						  int nid)
+{
+	if (map_base) {
+		unsigned long offset = section_map_size() * map_index;
+
+		return (struct page *)((char *)map_base + offset);
+	}
+	return sparse_mem_map_populate(pnum, nid, NULL);
+}
 #endif /* !CONFIG_SPARSEMEM_VMEMMAP */
 
 static void __init sparse_early_mem_maps_alloc_node(void *data,
@@ -520,6 +557,60 @@  static void __init alloc_usemap_and_memmap(void (*alloc_func)
 						map_count, nodeid_begin);
 }
 
+/*
+ * Initialize sparse on a specific node. The node spans [pnum_begin, pnum_end),
+ * and the number of present sections in this node is map_count.
+ */
+void __init sparse_init_nid(int nid, unsigned long pnum_begin,
+				   unsigned long pnum_end,
+				   unsigned long map_count)
+{
+	unsigned long pnum, usemap_longs, *usemap, map_index;
+	struct page *map, *map_base;
+
+	usemap_longs = BITS_TO_LONGS(SECTION_BLOCKFLAGS_BITS);
+	usemap = sparse_early_usemaps_alloc_pgdat_section(NODE_DATA(nid),
+							  usemap_size() *
+							  map_count);
+	if (!usemap) {
+		pr_err("%s: usemap allocation failed", __func__);
+		goto failed;
+	}
+	map_base = sparse_populate_node(pnum_begin, pnum_end,
+					map_count, nid);
+	map_index = 0;
+	for_each_present_section_nr(pnum_begin, pnum) {
+		if (pnum >= pnum_end)
+			break;
+
+		BUG_ON(map_index == map_count);
+		map = sparse_populate_node_section(map_base, map_index,
+						   pnum, nid);
+		if (!map) {
+			pr_err("%s: memory map backing failed. Some memory will not be available.",
+			       __func__);
+			pnum_begin = pnum;
+			goto failed;
+		}
+		check_usemap_section_nr(nid, usemap);
+		sparse_init_one_section(__nr_to_section(pnum), pnum, map,
+					usemap);
+		map_index++;
+		usemap += usemap_longs;
+	}
+	return;
+failed:
+	/* We failed to allocate, mark all the following pnums as not present */
+	for_each_present_section_nr(pnum_begin, pnum) {
+		struct mem_section *ms;
+
+		if (pnum >= pnum_end)
+			break;
+		ms = __nr_to_section(pnum);
+		ms->section_mem_map = 0;
+	}
+}
+
 /*
  * Allocate the accumulated non-linear sections, allocate a mem_map
  * for each and record the physical to section mapping.