Message ID | 1436261425-29881-2-git-send-email-tangchen@cn.fujitsu.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Hello,

On Tue, Jul 07, 2015 at 05:30:21PM +0800, Tang Chen wrote:
...
> Why doing this is to prevent memory allocation failure if the cpu is

"The reason for doing this ..."

> online but there is no memory on that node.
>
> But since cpuid <-> nodeid mapping will fix after this patch-set, doing

"But since cpuid <-> nodeid mapping is planned to be made static, ..."

> so in initialization phase makes no sense any more. The best near online
> node for each cpu should be cached somewhere.

I'm not really following.  Is this because the now offline node can
later come online and we'd have to break the constant mapping
invariant if we update the mapping later?  If so, it'd be nice to
spell that out.

>  void numa_set_node(int cpu, int node)
>  {
>  	int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);
> @@ -95,7 +121,11 @@ void numa_set_node(int cpu, int node)
>  		return;
>  	}
>  #endif
> +
> +	per_cpu(x86_cpu_to_near_online_node, cpu) =
> +			find_near_online_node(numa_cpu_node(cpu));
>  	per_cpu(x86_cpu_to_node_map, cpu) = node;
> +	cpumask_set_cpu(cpu, &node_to_cpuid_mask_map[numa_cpu_node(cpu)]);
>
>  	set_cpu_numa_node(cpu, node);
>  }
> @@ -105,6 +135,13 @@ void numa_clear_node(int cpu)
>  	numa_set_node(cpu, NUMA_NO_NODE);
>  }
>
> +int get_near_online_node(int node)
> +{
> +	return per_cpu(x86_cpu_to_near_online_node,
> +		       cpumask_first(&node_to_cpuid_mask_map[node]));
> +}
> +EXPORT_SYMBOL(get_near_online_node);

Umm... this function is sitting on a fairly hot path and scanning a
cpumask each time.  Why not just build a numa node -> numa node array?

> @@ -702,24 +739,6 @@ void __init x86_numa_init(void)
>  	numa_init(dummy_numa_init);
>  }
>
> -static __init int find_near_online_node(int node)
> -{
> -	int n, val;
> -	int min_val = INT_MAX;
> -	int best_node = -1;
> -
> -	for_each_online_node(n) {
> -		val = node_distance(node, n);
> -
> -		if (val < min_val) {
> -			min_val = val;
> -			best_node = n;
> -		}
> -	}
> -
> -	return best_node;
> -}

It's usually better to not mix code movement with actual changes.

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 6ba7cf2..4a18b21 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -307,13 +307,23 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>  	if (nid < 0)
>  		nid = numa_node_id();
>
> +#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
> +	if (!node_online(nid))
> +		nid = get_near_online_node(nid);
> +#endif

Can you please introduce a wrapper function to do the above so that we
don't open code ifdefs?

> +
>  	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>  }
>
>  static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
>  						unsigned int order)
>  {
> -	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
> +	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
> +
> +#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
> +	if (!node_online(nid))
> +		nid = get_near_online_node(nid);
> +#endif
>
>  	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>  }

Ditto.  Also, what are the synchronization rules for NUMA node
on/offlining?  If you end up updating the mapping later, how would
that be synchronized against the above usages?

Thanks.
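The node -> node array suggested in the review above could look roughly like the following userspace sketch. It is not code from the patch set: `MAX_NODES`, `NO_NODE`, `node_online_flag`, `distance`, `near_online_node` and `rebuild_near_online_nodes()` are hypothetical stand-ins (in the kernel, distances would come from `node_distance()` and online state from `node_online()`). The idea is only that the linear search runs once per node, off the hot path, and the allocator does a single array load.

```c
#include <limits.h>
#include <stdbool.h>

#define MAX_NODES 4	/* hypothetical stand-in for MAX_NUMNODES */
#define NO_NODE   (-1)	/* hypothetical stand-in for NUMA_NO_NODE */

/* Hypothetical state: which nodes are online, and a distance table. */
static bool node_online_flag[MAX_NODES] = { true, false, true, false };
static const int distance[MAX_NODES][MAX_NODES] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 15, 30 },
	{ 30, 15, 10, 20 },
	{ 40, 30, 20, 10 },
};

/* node -> nearest online node, filled by rebuild_near_online_nodes(). */
static int near_online_node[MAX_NODES];

/* Same linear search as find_near_online_node() in the patch. */
static int find_near_online_node(int node)
{
	int best_node = NO_NODE;
	int min_val = INT_MAX;

	for (int n = 0; n < MAX_NODES; n++) {
		if (!node_online_flag[n])
			continue;
		if (distance[node][n] < min_val) {
			min_val = distance[node][n];
			best_node = n;
		}
	}
	return best_node;
}

/* Run the search once per node, off the allocation path. */
static void rebuild_near_online_nodes(void)
{
	for (int node = 0; node < MAX_NODES; node++)
		near_online_node[node] = find_near_online_node(node);
}

/* Hot-path lookup: one array load, no cpumask scan. */
static int get_near_online_node(int node)
{
	return near_online_node[node];
}
```

With this layout the per-cpu variable and the extra cpumask array from the patch would no longer be needed for the lookup; whether the table must be rebuilt on hotplug is the separate question raised at the end of the review.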
Hi TJ,

Sorry for the late reply.

On 07/16/2015 05:48 AM, Tejun Heo wrote:
> ......
>> so in initialization phase makes no sense any more. The best near online
>> node for each cpu should be cached somewhere.
> I'm not really following.  Is this because the now offline node can
> later come online and we'd have to break the constant mapping
> invariant if we update the mapping later?  If so, it'd be nice to
> spell that out.

Yes. Will document this in the next version.

>> ......
>>
>> +int get_near_online_node(int node)
>> +{
>> +	return per_cpu(x86_cpu_to_near_online_node,
>> +		       cpumask_first(&node_to_cpuid_mask_map[node]));
>> +}
>> +EXPORT_SYMBOL(get_near_online_node);
> Umm... this function is sitting on a fairly hot path and scanning a
> cpumask each time.  Why not just build a numa node -> numa node array?

Indeed. Will avoid scanning a cpumask.

> ......
>
>>
>>  static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
>>  						unsigned int order)
>>  {
>> -	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
>> +	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
>> +
>> +#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
>> +	if (!node_online(nid))
>> +		nid = get_near_online_node(nid);
>> +#endif
>>
>>  	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
>>  }
> Ditto.  Also, what's the synchronization rules for NUMA node
> on/offlining.  If you end up updating the mapping later, how would
> that be synchronized against the above usages?

I think the near online node map should be updated when node online/offline
happens. But about this, I think the current numa code has a little problem.

As you know, firmware info binds a set of CPUs and memory to a node. But
at boot time, if the node has no memory (a memory-less node), it won't
be online. But the CPUs on that node are available, and are bound to the
near online node. (Here, I mean numa_set_node(cpu, node).)

Why does the kernel do this? I think it is done to ensure that we can
allocate memory successfully by calling functions like alloc_pages_node()
and alloc_pages_exact_node(). For these two functions, any CPU should be
bound to a node that has memory so that memory allocation can succeed.

That means, for a memory-less node at boot time, the CPUs on the node are
online, but the node is not online.

That also means "the node is online" is equivalent to "the node has memory".
Actually, a lot of code in the kernel relies on this rule.

But,
1) in cpu_up(), it will try to online a node, and it doesn't check if
   the node has memory.
2) in try_offline_node(), it offlines CPUs first, and then the memory.

This behavior looks a little weird, or let's say it is ambiguous. It seems
that a NUMA node consists of CPUs and memory. So if the CPUs are online,
the node should be online.

And also, the main purpose of this patch-set is to make the cpuid <-> nodeid
mapping persistent. After this patch-set, alloc_pages_node() and
alloc_pages_exact_node() won't depend on the cpuid <-> nodeid mapping any
more. So the node should be online if the CPUs on it are online. Otherwise,
we cannot set up the interfaces of CPUs under /sys.

Unfortunately, since I don't have a machine with a memory-less node, I
cannot reproduce the problem right now.

How do you think the node online behavior should be changed?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
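The idea of updating the near online node map whenever a node goes online or offline can be illustrated with a small userspace sketch. This is only an assumption about one possible scheme, not the kernel's actual synchronization rule (which is the open question in this thread): each table entry is a C11 atomic, the hotplug path recomputes and republishes the entries, and readers do a single atomic load. The names `set_node_online()`, `near_online`, `online` and the distance table are hypothetical.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_NODES 3	/* hypothetical stand-in for MAX_NUMNODES */

static bool online[MAX_NODES] = { true, false, false };
static const int distance[MAX_NODES][MAX_NODES] = {
	{ 10, 20, 30 },
	{ 20, 10, 20 },
	{ 30, 20, 10 },
};

/* node -> nearest online node; -1 means no node is online. */
static _Atomic int near_online[MAX_NODES];

/* Linear search over online nodes; ties go to the lowest node id. */
static int nearest_online(int node)
{
	int best = -1, min_val = INT_MAX;

	for (int n = 0; n < MAX_NODES; n++) {
		if (online[n] && distance[node][n] < min_val) {
			min_val = distance[node][n];
			best = n;
		}
	}
	return best;
}

/*
 * Hotplug path (assumed to be serialized by the caller): flip the node's
 * state, then recompute and republish every table entry.
 */
static void set_node_online(int node, bool state)
{
	online[node] = state;
	for (int n = 0; n < MAX_NODES; n++)
		atomic_store_explicit(&near_online[n], nearest_online(n),
				      memory_order_release);
}

/*
 * Allocation path: a single atomic load, so a reader sees either the old
 * or the new entry, never a torn value.
 */
static int get_near_online_node(int node)
{
	return atomic_load_explicit(&near_online[node], memory_order_acquire);
}
```

Note that a reader can still obtain a node id just before that node goes offline; the sketch does not address that window, which is exactly what the synchronization question in the review is about.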
On 2015/8/4 11:36, Tang Chen wrote:
> ......
>
> But,
> 1) in cpu_up(), it will try to online a node, and it doesn't check if
>    the node has memory.
> 2) in try_offline_node(), it offlines CPUs first, and then the memory.
>
> This behavior looks a little weird, or let's say it is ambiguous. It
> seems that a NUMA node consists of CPUs and memory. So if the CPUs are
> online, the node should be online.

Hi Chen,
	I have posted a patch set to enable memoryless nodes on x86,
and will repost it for review :) Hope it helps to solve this issue.
Thanks!
Gerry

> ......
On 08/04/2015 04:05 PM, Jiang Liu wrote:
> ......
>> But,
>> 1) in cpu_up(), it will try to online a node, and it doesn't check if
>>    the node has memory.
>> 2) in try_offline_node(), it offlines CPUs first, and then the memory.
>>
>> This behavior looks a little weird, or let's say it is ambiguous. It
>> seems that a NUMA node consists of CPUs and memory. So if the CPUs are
>> online, the node should be online.
> Hi Chen,
> 	I have posted a patch set to enable memoryless node on x86,
> will repost it for review:) Hope it help to solve this issue.
> Thanks!
> Gerry

Sure. Looking forward to the latest patches. :)

Thanks.
Hi Liu,

Have you posted your new patches?
(I mean memory-less node support patches.)

If you are going to post them, please cc me.

And BTW, how did you reproduce the memory-less node problem?
Do you have a real memory-less node on your machine?

Thanks. :)

On 08/04/2015 04:05 PM, Jiang Liu wrote:
> On 2015/8/4 11:36, Tang Chen wrote:
>> ......
> Hi Chen,
> 	I have posted a patch set to enable memoryless node on x86,
> will repost it for review:) Hope it help to solve this issue.
> Thanks!
> Gerry
>
>> ......
On 2015/8/9 14:15, Tang Chen wrote:
> Hi Liu,
>
> Have you posted your new patches ?
> (I mean memory-less node support patches.)

Hi Chen,
	I have rebased my patches onto v4.2-rc4, but unfortunately they
break. It seems there have been some changes in x86 NUMA support since
3.17. I need some time to figure it out.

> If you are going to post them, please cc me.

Sure.

> And BTW, how did you reproduce the memory-less node problem ?
> Do you have a real memory-less node on your machine ?

Yes, we have a system with memoryless nodes.
Thanks!
Gerry
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 0fb4648..e3e22b2 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -82,6 +82,8 @@ static inline const struct cpumask *cpumask_of_node(int node)
 }
 #endif

+extern int get_near_online_node(int node);
+
 extern void setup_node_to_cpumask_map(void);

 /*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 4053bb5..13bd0d7 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -69,6 +69,7 @@ int numa_cpu_node(int cpu)
 	return NUMA_NO_NODE;
 }

+cpumask_t node_to_cpuid_mask_map[MAX_NUMNODES];
 cpumask_var_t node_to_cpumask_map[MAX_NUMNODES];
 EXPORT_SYMBOL(node_to_cpumask_map);

@@ -78,6 +79,31 @@ EXPORT_SYMBOL(node_to_cpumask_map);
 DEFINE_EARLY_PER_CPU(int, x86_cpu_to_node_map, NUMA_NO_NODE);
 EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_node_map);

+/*
+ * Map cpu index to the best near online node. The best near online node
+ * is the backup node for memory allocation on offline node.
+ */
+DEFINE_PER_CPU(int, x86_cpu_to_near_online_node);
+EXPORT_PER_CPU_SYMBOL(x86_cpu_to_near_online_node);
+
+static int find_near_online_node(int node)
+{
+	int n, val;
+	int min_val = INT_MAX;
+	int best_node = -1;
+
+	for_each_online_node(n) {
+		val = node_distance(node, n);
+
+		if (val < min_val) {
+			min_val = val;
+			best_node = n;
+		}
+	}
+
+	return best_node;
+}
+
 void numa_set_node(int cpu, int node)
 {
 	int *cpu_to_node_map = early_per_cpu_ptr(x86_cpu_to_node_map);
@@ -95,7 +121,11 @@ void numa_set_node(int cpu, int node)
 		return;
 	}
 #endif
+
+	per_cpu(x86_cpu_to_near_online_node, cpu) =
+			find_near_online_node(numa_cpu_node(cpu));
 	per_cpu(x86_cpu_to_node_map, cpu) = node;
+	cpumask_set_cpu(cpu, &node_to_cpuid_mask_map[numa_cpu_node(cpu)]);

 	set_cpu_numa_node(cpu, node);
 }
@@ -105,6 +135,13 @@ void numa_clear_node(int cpu)
 	numa_set_node(cpu, NUMA_NO_NODE);
 }

+int get_near_online_node(int node)
+{
+	return per_cpu(x86_cpu_to_near_online_node,
+		       cpumask_first(&node_to_cpuid_mask_map[node]));
+}
+EXPORT_SYMBOL(get_near_online_node);
+
 /*
  * Allocate node_to_cpumask_map based on number of available nodes
  * Requires node_possible_map to be valid.
@@ -702,24 +739,6 @@ void __init x86_numa_init(void)
 	numa_init(dummy_numa_init);
 }

-static __init int find_near_online_node(int node)
-{
-	int n, val;
-	int min_val = INT_MAX;
-	int best_node = -1;
-
-	for_each_online_node(n) {
-		val = node_distance(node, n);
-
-		if (val < min_val) {
-			min_val = val;
-			best_node = n;
-		}
-	}
-
-	return best_node;
-}
-
 /*
  * Setup early cpu_to_node.
  *
@@ -746,8 +765,6 @@ void __init init_cpu_to_node(void)

 		if (node == NUMA_NO_NODE)
 			continue;
-		if (!node_online(node))
-			node = find_near_online_node(node);
 		numa_set_node(cpu, node);
 	}
 }
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6ba7cf2..4a18b21 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -307,13 +307,23 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 	if (nid < 0)
 		nid = numa_node_id();

+#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
+	if (!node_online(nid))
+		nid = get_near_online_node(nid);
+#endif
+
 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }

 static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
-	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid));
+	VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);
+
+#if IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA)
+	if (!node_online(nid))
+		nid = get_near_online_node(nid);
+#endif

 	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
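The two identical `#if` blocks in the gfp.h hunks are what the review asks to fold into a wrapper. A rough userspace sketch of that shape follows; it is not part of the patch. `fallback_nid()`, `pick_alloc_node()` and `HAVE_NEAR_ONLINE_NODE` are hypothetical names, and `node_online()` / `get_near_online_node()` are stubbed here (even-numbered nodes online, odd nodes falling back to the node below) purely so the example compiles outside the kernel.

```c
#include <stdbool.h>

/* Hypothetical stand-in for IS_ENABLED(CONFIG_X86) && IS_ENABLED(CONFIG_NUMA). */
#define HAVE_NEAR_ONLINE_NODE 1

/* Stubs for illustration only: even-numbered nodes are online. */
static bool node_online(int nid)
{
	return nid % 2 == 0;
}

/* Stub: an odd (offline) node falls back to the even node just below it. */
static int get_near_online_node(int nid)
{
	return nid - 1;
}

#if HAVE_NEAR_ONLINE_NODE
/* Redirect an offline node to its cached nearest online node. */
static inline int fallback_nid(int nid)
{
	if (!node_online(nid))
		nid = get_near_online_node(nid);
	return nid;
}
#else
/* Other configurations: the helper leaves the node id unchanged. */
static inline int fallback_nid(int nid)
{
	return nid;
}
#endif

/* Mimics the nid handling of alloc_pages_node() without any #ifdef at the call site. */
static int pick_alloc_node(int nid, int current_node)
{
	if (nid < 0)
		nid = current_node;
	return fallback_nid(nid);
}
```

Keeping the conditional in one helper means both allocators call `fallback_nid()` unconditionally, and a configuration without the feature compiles the helper down to returning its argument.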