[v2,3/7] x86, gfp: Cache best near node for memory allocation.

Message ID 1441859269-25831-4-git-send-email-tangchen@cn.fujitsu.com (mailing list archive)
State Not Applicable, archived

Commit Message

tangchen Sept. 10, 2015, 4:27 a.m. UTC
From: Gu Zheng <guz.fnst@cn.fujitsu.com>

In the current kernel, init_cpu_to_node() maps every possible cpu that
resides on a memory-less node to its best near online node.

init_cpu_to_node()
{
	......
	for_each_possible_cpu(cpu) {
		......
		if (!node_online(node))
			node = find_near_online_node(node);
		numa_set_node(cpu, node);
	}
}

The reason for doing this is to prevent memory allocation failure if the
cpu is online but there is no memory on that node.

But since the cpuid <-> nodeid mapping is planned to be made static, doing
so in the initialization phase no longer makes sense.

The best near online node for each cpu has been cached in an array by the
previous patch, precisely to avoid remapping CPUs on memory-less nodes to
other nodes.

So in this patch, we look up the best near online node for CPUs on
memory-less nodes inside alloc_pages_node() and alloc_pages_exact_node(),
to avoid memory allocation failures.
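
The lookup itself is just a read of the per-node cache. A rough sketch of
the helper added by the previous patch (the actual implementation may
differ in detail):

	/* Per-node cache maintained by set_near_online_node(). */
	extern int node_to_near_node_map[MAX_NUMNODES];

	static inline int get_near_online_node(int node)
	{
		/* Best near online node cached for @node. */
		return node_to_near_node_map[node];
	}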

Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
---
 arch/x86/mm/numa.c  | 3 +--
 include/linux/gfp.h | 8 +++++++-
 2 files changed, 8 insertions(+), 3 deletions(-)

Comments

Tejun Heo Sept. 10, 2015, 7:29 p.m. UTC | #1
Hello,

On Thu, Sep 10, 2015 at 12:27:45PM +0800, Tang Chen wrote:
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index ad35f30..1a1324f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -307,13 +307,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>  	if (nid < 0)
>  		nid = numa_node_id();
>  
> +	if (!node_online(nid))
> +		nid = get_near_online_node(nid);
> +
>  	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>  }

Why not just update node_data[]->node_zonelist in the first place?
Also, what's the synchronization rule here?  How are allocators
synchronized against node hot [un]plugs?

Thanks.
Tejun Heo Sept. 10, 2015, 7:38 p.m. UTC | #2
(cc'ing Christoph Lameter)

On Thu, Sep 10, 2015 at 03:29:35PM -0400, Tejun Heo wrote:
> Hello,
> 
> On Thu, Sep 10, 2015 at 12:27:45PM +0800, Tang Chen wrote:
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index ad35f30..1a1324f 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -307,13 +307,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> >  	if (nid < 0)
> >  		nid = numa_node_id();
> >  
> > +	if (!node_online(nid))
> > +		nid = get_near_online_node(nid);
> > +
> >  	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> >  }
> 
> Why not just update node_data[]->node_zonelist in the first place?
> Also, what's the synchronization rule here?  How are allocators
> synchronized against node hot [un]plugs?

Also, shouldn't kmalloc_node() or any public allocator fall back
automatically to a near node w/o GFP_THISNODE?  Why is this failing at
all?  I get that cpu id -> node id mapping changing messes up the
locality but allocations shouldn't fail, right?

Thanks.
Christoph Lameter (Ampere) Sept. 10, 2015, 10:02 p.m. UTC | #3
On Thu, 10 Sep 2015, Tejun Heo wrote:

> > Why not just update node_data[]->node_zonelist in the first place?
> > Also, what's the synchronization rule here?  How are allocators
> > synchronized against node hot [un]plugs?
>
> Also, shouldn't kmalloc_node() or any public allocator fall back
> automatically to a near node w/o GFP_THISNODE?  Why is this failing at
> all?  I get that cpu id -> node id mapping changing messes up the
> locality but allocations shouldn't fail, right?

Without a node specification allocations are subject to various
constraints and memory policies. It is not simply going to the next node.
The memory load may require spreading out the allocations over multiple
nodes, the app may have specified which nodes are to be used etc etc.
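
For illustration (a sketch, not code from this series): with
__GFP_THISNODE the allocation is pinned to the requested node and fails
rather than falling back, while without it the zonelist fallback can
kick in:

	#include <linux/gfp.h>

	/* May fall back to other nodes' zones via nid's zonelist,
	 * subject to mempolicies, cpusets etc. */
	struct page *p = alloc_pages_node(nid, GFP_KERNEL, 0);

	/* Pinned to nid; fails if nid has no free memory. */
	struct page *q = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);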


Tejun Heo Sept. 10, 2015, 10:08 p.m. UTC | #4
Hello,

On Thu, Sep 10, 2015 at 05:02:31PM -0500, Christoph Lameter wrote:
> > Also, shouldn't kmalloc_node() or any public allocator fall back
> > automatically to a near node w/o GFP_THISNODE?  Why is this failing at
> > all?  I get that cpu id -> node id mapping changing messes up the
> > locality but allocations shouldn't fail, right?
> 
> Without a node specification allocations are subject to various
> constraints and memory policies. It is not simply going to the next node.
> The memory load may require spreading out the allocations over multiple
> nodes, the app may have specified which nodes are to be used etc etc.

Yeah, sure, but even w/ node specified, it shouldn't fail unless
THISNODE, right?

Thanks.
Christoph Lameter (Ampere) Sept. 11, 2015, 12:14 a.m. UTC | #5
On Thu, 10 Sep 2015, Tejun Heo wrote:

> > Why not just update node_data[]->node_zonelist in the first place?
> > Also, what's the synchronization rule here?  How are allocators
> > synchronized against node hot [un]plugs?
>
> Also, shouldn't kmalloc_node() or any public allocator fall back
> automatically to a near node w/o GFP_THISNODE?  Why is this failing at
> all?  I get that cpu id -> node id mapping changing messes up the
> locality but allocations shouldn't fail, right?

Yes that should occur in the absence of other constraints (mempolicies,
cpusets, cgroups, allocation type). If the constraints do not allow an
allocation then the allocation will fail.

Also: Are the zonelists setup the right way?


tangchen Sept. 26, 2015, 9:31 a.m. UTC | #6
Hi, tj

On 09/11/2015 03:29 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Sep 10, 2015 at 12:27:45PM +0800, Tang Chen wrote:
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index ad35f30..1a1324f 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -307,13 +307,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>>   	if (nid < 0)
>>   		nid = numa_node_id();
>>   
>> +	if (!node_online(nid))
>> +		nid = get_near_online_node(nid);
>> +
>>   	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>>   }
> Why not just update node_data[]->node_zonelist in the first place?

The zonelist will be rebuilt in __offline_pages() when the zone is not
populated any more.

Here, getting the best near online node is for those cpus on memory-less
nodes.

In the original code, if nid is NUMA_NO_NODE, the node the current cpu
resides in will be chosen. And if that node is memory-less, the cpu will
be mapped to its best near online node.

But this patch-set maps the cpu to its original node, so numa_node_id()
may return a memory-less node to the allocator, and then memory
allocation may fail.

> Also, what's the synchronization rule here?  How are allocators
> synchronized against node hot [un]plugs?

The rule is: the node_to_near_node_map[] array is updated each time a
node hot[un]plug happens.

For now it is not protected by a lock, because I think acquiring a lock
could cause a performance regression in the memory allocator.

When the zonelists are rebuilt, stop_machine is used. So I think updating
the node_to_near_node_map[] array at the same time the zonelists are
rebuilt could be a good idea.
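
A rough sketch of that idea (names are illustrative, not actual code):

	/* Refresh the cache from the same stop_machine context that
	 * rebuilds the zonelists, so allocators never observe a
	 * half-updated array. */
	static void update_near_node_map(void)
	{
		int nid;

		for_each_node(nid)
			node_to_near_node_map[nid] = node_online(nid) ?
				nid : find_near_online_node(nid);
	}

	/* Called from __build_all_zonelists(), which runs under
	 * stop_machine() during memory hot[un]plug. */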

Thanks.



tangchen Sept. 26, 2015, 9:35 a.m. UTC | #7
Hi, Christoph, tj,

On 09/11/2015 08:14 AM, Christoph Lameter wrote:
> On Thu, 10 Sep 2015, Tejun Heo wrote:
>
>>> Why not just update node_data[]->node_zonelist in the first place?
>>> Also, what's the synchronization rule here?  How are allocators
>>> synchronized against node hot [un]plugs?
>> Also, shouldn't kmalloc_node() or any public allocator fall back
>> automatically to a near node w/o GFP_THISNODE?  Why is this failing at
>> all?  I get that cpu id -> node id mapping changing messes up the
>> locality but allocations shouldn't fail, right?

Yes. That is the reason we are getting the near online node here.

> Yes that should occur in the absence of other constraints (mempolicies,
> cpusets, cgroups, allocation type). If the constraints do not allow an
> allocation then the allocation will fail.
>
> Also: Are the zonelists setup the right way?

The zonelist will be rebuilt in __offline_pages() when the zone is not
populated any more.

Thanks.
Tejun Heo Sept. 26, 2015, 5:53 p.m. UTC | #8
Hello, Tang.

On Sat, Sep 26, 2015 at 05:31:07PM +0800, Tang Chen wrote:
> >>@@ -307,13 +307,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> >>  	if (nid < 0)
> >>  		nid = numa_node_id();
> >>+	if (!node_online(nid))
> >>+		nid = get_near_online_node(nid);
> >>+
> >>  	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> >>  }
> >Why not just update node_data[]->node_zonelist in the first place?
> 
> zonelist will be rebuilt in __offline_pages() when the zone is not populated
> any more.
> 
> Here, getting the best near online node is for those cpus on memory-less
> nodes.
> 
> In the original code, if nid is NUMA_NO_NODE, the node the current cpu
> resides in
> will be chosen. And if the node is memory-less node, the cpu will be mapped
> to its
> best near online node.
> 
> But this patch-set will map the cpu to its original node, so numa_node_id()
> may return
> a memory-less node to allocator. And then memory allocation may fail.

Correct me if I'm wrong but the zonelist dictates which memory areas
the page allocator is gonna try to allocate from, right?  What I'm wondering is
why we aren't handling memory-less nodes by simply updating their
zonelists.  I mean, if, say, node 2 is memory-less, its zonelist can
simply point to zones from other nodes, right?  What am I missing
here?
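
Concretely, the idea would be something like this (a sketch, not actual
zonelist contents):

	/* If node 2 has no memory, build its zonelist entirely from
	 * other nodes' zones, e.g.:
	 *
	 *   node_zonelist(2, GFP_KERNEL) ->
	 *     node 1/ZONE_NORMAL, node 0/ZONE_NORMAL, node 0/ZONE_DMA32
	 *
	 * __alloc_pages() would then succeed for nid 2 without any
	 * explicit near-node redirection in alloc_pages_node(). */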

Thanks.
tangchen Sept. 28, 2015, 1:50 a.m. UTC | #9
Hi, tj,

On 09/27/2015 01:53 AM, Tejun Heo wrote:
> Hello, Tang.
>
> On Sat, Sep 26, 2015 at 05:31:07PM +0800, Tang Chen wrote:
>>>> @@ -307,13 +307,19 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>>>>   	if (nid < 0)
>>>>   		nid = numa_node_id();
>>>> +	if (!node_online(nid))
>>>> +		nid = get_near_online_node(nid);
>>>> +
>>>>   	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>>>>   }
>>> Why not just update node_data[]->node_zonelist in the first place?
>> zonelist will be rebuilt in __offline_pages() when the zone is not populated
>> any more.
>>
>> Here, getting the best near online node is for those cpus on memory-less
>> nodes.
>>
>> In the original code, if nid is NUMA_NO_NODE, the node the current cpu
>> resides in
>> will be chosen. And if the node is memory-less node, the cpu will be mapped
>> to its
>> best near online node.
>>
>> But this patch-set will map the cpu to its original node, so numa_node_id()
>> may return
>> a memory-less node to allocator. And then memory allocation may fail.
> Correct me if I'm wrong but the zonelist dictates which memory areas
> the page allocator is gonna try to from, right?  What I'm wondering is
> why we aren't handling memory-less nodes by simply updating their
> zonelists.  I mean, if, say, node 2 is memory-less, its zonelist can
> simply point to zones from other nodes, right?  What am I missing
> here?

Oh, yes, you are right. But I remember that some time ago, Liu Jiang
handled, or was going to handle, memory-less nodes like this in his
patch:

https://lkml.org/lkml/2015/8/16/130

BTW, to Liu Jiang, how are your patches going?

Thanks.

>
> Thanks.
>

Patch

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bd7661..e89b9fb 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -151,6 +151,7 @@  void numa_set_node(int cpu, int node)
 		return;
 	}
 #endif
+
 	per_cpu(x86_cpu_to_node_map, cpu) = node;
 
 	set_near_online_node(node);
@@ -787,8 +788,6 @@  void __init init_cpu_to_node(void)
 
 		if (node == NUMA_NO_NODE)
 			continue;
-		if (!node_online(node))
-			node = find_near_online_node(node);
 		numa_set_node(cpu, node);
 	}
 }
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ad35f30..1a1324f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -307,13 +307,19 @@  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 	if (nid < 0)
 		nid = numa_node_id();
 
+	if (!node_online(nid))
+		nid = get_near_online_node(nid);
+
 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
 static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
-	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+
+	if (!node_online(nid))
+		nid = get_near_online_node(nid);
 
 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }