Message ID | 1437486951-19898-1-git-send-email-vbabka@suse.cz (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Tue, 21 Jul 2015, Vlastimil Babka wrote: > The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page > allocator: do not check NUMA node ID when the caller knows the node is valid") > as an optimized variant of alloc_pages_node(), that doesn't allow the node id > to be -1. Unfortunately the name of the function can easily suggest that the > allocation is restricted to the given node. In truth, the node is only > preferred, unless __GFP_THISNODE is among the gfp flags. Yup. I complained about this when this was introduced. Glad to see this fixed. Initially this was alloc_pages_node() which just means that a node is specified. The exact behavior of the allocation is determined by flags such as GFP_THISNODE. I'd rather have that restored because otherwise we get into weird code like the one below. And such an arrangement also leaves the way open to add more flags in the future that may change the allocation behavior. > area->nid = nid; > area->order = order; > - area->pages = alloc_pages_exact_node(area->nid, > + area->pages = alloc_pages_prefer_node(area->nid, > GFP_KERNEL|__GFP_THISNODE, > area->order); This is not preferring a node but requiring alloction on that node. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 21 Jul 2015, Vlastimil Babka wrote: > The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page > allocator: do not check NUMA node ID when the caller knows the node is valid") > as an optimized variant of alloc_pages_node(), that doesn't allow the node id > to be -1. Unfortunately the name of the function can easily suggest that the > allocation is restricted to the given node. In truth, the node is only > preferred, unless __GFP_THISNODE is among the gfp flags. > > The misleading name has lead to mistakes in the past, see 5265047ac301 ("mm, > thp: really limit transparent hugepage allocation to local node") and > b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). > > To prevent further mistakes, this patch renames the function to > alloc_pages_prefer_node() and documents it together with alloc_pages_node(). > alloc_pages_exact_node(), as you said, connotates that the allocation will take place on that node or will fail. So why not go beyond this patch and actually make alloc_pages_exact_node() set __GFP_THISNODE and then call into a new alloc_pages_prefer_node(), which would be the current alloc_pages_exact_node() implementation, and then fix up the callers? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2015-07-21 at 15:55 +0200, Vlastimil Babka wrote: > The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page > allocator: do not check NUMA node ID when the caller knows the node is valid") > as an optimized variant of alloc_pages_node(), that doesn't allow the node id > to be -1. Unfortunately the name of the function can easily suggest that the > allocation is restricted to the given node. In truth, the node is only > preferred, unless __GFP_THISNODE is among the gfp flags. > > The misleading name has lead to mistakes in the past, see 5265047ac301 ("mm, > thp: really limit transparent hugepage allocation to local node") and > b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). > > To prevent further mistakes, this patch renames the function to > alloc_pages_prefer_node() and documents it together with alloc_pages_node(). > > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> > I'm CC'ing also maintainers of the callsites so they can verify that the > callsites that don't pass __GFP_THISNODE are really not intended to restrict > allocation to the given node. I went through them myself and each looked like > it's better off if it can successfully allocate on a fallback node rather > than fail. DavidR checked them also I think, but it's better if maintainers > can verify that. I'm not completely sure about all the usages in sl*b due to > multiple layers through which gfp flags are being passed. > diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c > index e865d74..646a310 100644 > --- a/arch/powerpc/platforms/cell/ras.c > +++ b/arch/powerpc/platforms/cell/ras.c > @@ -123,7 +123,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order) > > area->nid = nid; > area->order = order; > - area->pages = alloc_pages_exact_node(area->nid, > + area->pages = alloc_pages_prefer_node(area->nid, > GFP_KERNEL|__GFP_THISNODE, > area->order); Yeah that looks right to me. This code enables some firmware memory calibration so I think it really does want to get memory on the specified node, or else fail. Acked-by: Michael Ellerman <mpe@ellerman.id.au> cheers -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 21/07/2015 15:55, Vlastimil Babka wrote: > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index 2d73807..a8723a8 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu) > struct page *pages; > struct vmcs *vmcs; > > - pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order); > + pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order); > if (!pages) > return NULL; > vmcs = page_address(pages); Even though there's a pretty strong preference for the "right" node, things can work if the node is the wrong one. The order is always zero in practice, so the allocation should succeed. Paolo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/21/2015 11:31 PM, David Rientjes wrote: > On Tue, 21 Jul 2015, Vlastimil Babka wrote: > >> The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page >> allocator: do not check NUMA node ID when the caller knows the node is valid") >> as an optimized variant of alloc_pages_node(), that doesn't allow the node id >> to be -1. Unfortunately the name of the function can easily suggest that the >> allocation is restricted to the given node. In truth, the node is only >> preferred, unless __GFP_THISNODE is among the gfp flags. >> >> The misleading name has lead to mistakes in the past, see 5265047ac301 ("mm, >> thp: really limit transparent hugepage allocation to local node") and >> b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). >> >> To prevent further mistakes, this patch renames the function to >> alloc_pages_prefer_node() and documents it together with alloc_pages_node(). >> > > alloc_pages_exact_node(), as you said, connotates that the allocation will > take place on that node or will fail. So why not go beyond this patch and > actually make alloc_pages_exact_node() set __GFP_THISNODE and then call > into a new alloc_pages_prefer_node(), which would be the current > alloc_pages_exact_node() implementation, and then fix up the callers? OK, but then we have alloc_pages_node(), alloc_pages_prefer_node() and alloc_pages_exact_node(). Isn't that a bit too much? The first two differ only in tiny bit: static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) { /* Unknown node is current node */ if (nid < 0) nid = numa_node_id(); return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); } static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask, unsigned int order) { VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)); return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); } So _prefer_node is just a tiny optimization over the other one. It should be maybe called __alloc_pages_node() then? This would perhaps discourage users outside of mm/arch code (where it may matter). The savings of a skipped branch is likely dubious anyway... It would be also nice if alloc_pages_node() could use __alloc_pages_node() internally, but I'm not sure if all callers are safe wrt the VM_BUG_ON(!node_online(nid)) part. So when the alloc_pages_prefer_node is diminished as __alloc_pages_node or outright removed, then maybe alloc_pages_exact_node() which adds __GFP_THISNODE on its own, might be a useful wrapper. But I agree with Christoph it's a duplication of the gfp_flags functionality and I don't think there would be many users left anyway. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 22 Jul 2015, Paolo Bonzini wrote: > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > > index 2d73807..a8723a8 100644 > > --- a/arch/x86/kvm/vmx.c > > +++ b/arch/x86/kvm/vmx.c > > @@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu) > > struct page *pages; > > struct vmcs *vmcs; > > > > - pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order); > > + pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order); > > if (!pages) > > return NULL; > > vmcs = page_address(pages); > > Even though there's a pretty strong preference for the "right" node, > things can work if the node is the wrong one. The order is always zero > in practice, so the allocation should succeed. > You're code is fine both before and after the patch since __GFP_THISNODE isn't set. The allocation will eventually succeed but, as you said, may be from remote memory (and the success of allocating on node may be influenced by the global setting of zone_reclaim_mode). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 22 Jul 2015, Vlastimil Babka wrote: > > alloc_pages_exact_node(), as you said, connotates that the allocation will > > take place on that node or will fail. So why not go beyond this patch and > > actually make alloc_pages_exact_node() set __GFP_THISNODE and then call > > into a new alloc_pages_prefer_node(), which would be the current > > alloc_pages_exact_node() implementation, and then fix up the callers? > > OK, but then we have alloc_pages_node(), alloc_pages_prefer_node() and > alloc_pages_exact_node(). Isn't that a bit too much? The first two > differ only in tiny bit: > > static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, > unsigned int order) > { > /* Unknown node is current node */ > if (nid < 0) > nid = numa_node_id(); > > return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); > } > > static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask, > unsigned int order) > { > VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)); > > return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); > } > Eek, yeah, that does look bad. I'm not even sure the if (nid < 0) nid = numa_node_id(); is correct; I think this should be comparing to NUMA_NO_NODE rather than all negative numbers, otherwise we silently ignore overflow and nobody ever knows. > So _prefer_node is just a tiny optimization over the other one. It > should be maybe called __alloc_pages_node() then? This would perhaps > discourage users outside of mm/arch code (where it may matter). The > savings of a skipped branch is likely dubious anyway... It would be also > nice if alloc_pages_node() could use __alloc_pages_node() internally, but > I'm not sure if all callers are safe wrt the > VM_BUG_ON(!node_online(nid)) part. > I'm not sure how large you want to make your patch :) In a perfect world I would think that we wouldn't have an alloc_pages_prefer_node() at all and everything would be converted to alloc_pages_node() which would do if (nid == NUMA_NO_NODE) nid = numa_mem_id(); VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)); return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); and then alloc_pages_exact_node() would do return alloc_pages_node(nid, gfp_mask | __GFP_THISNODE, order); and existing alloc_pages_exact_node() callers fixed up depending on whether they set the bit or not. The only possible downside would be existing users of alloc_pages_node() that are calling it with an offline node. Since it's a VM_BUG_ON() that would catch that, I think it should be changed to a VM_WARN_ON() and eventually fixed up because it's nonsensical. VM_BUG_ON() here should be avoided. Or just go with a single alloc_pages_node() and rename __GFP_THISNODE to __GFP_EXACT_NODE which may accomplish the same thing :) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 22 Jul 2015, David Rientjes wrote: > Eek, yeah, that does look bad. I'm not even sure the > > if (nid < 0) > nid = numa_node_id(); > > is correct; I think this should be comparing to NUMA_NO_NODE rather than > all negative numbers, otherwise we silently ignore overflow and nobody > ever knows. Comparing to NUMA_NO_NODE would be better. Also use numa_mem_id() instead to support memoryless nodes better? > The only possible downside would be existing users of > alloc_pages_node() that are calling it with an offline node. Since it's a > VM_BUG_ON() that would catch that, I think it should be changed to a > VM_WARN_ON() and eventually fixed up because it's nonsensical. > VM_BUG_ON() here should be avoided. The offline node thing could be addresses by using numa_mem_id()? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 23 Jul 2015, Christoph Lameter wrote: > > The only possible downside would be existing users of > > alloc_pages_node() that are calling it with an offline node. Since it's a > > VM_BUG_ON() that would catch that, I think it should be changed to a > > VM_WARN_ON() and eventually fixed up because it's nonsensical. > > VM_BUG_ON() here should be avoided. > > The offline node thing could be addresses by using numa_mem_id()? > I was concerned about any callers that were passing an offline node, not NUMA_NO_NODE, today. One of the alloc-node functions has a VM_BUG_ON() for it, the other silently calls node_zonelist() on it. I suppose the final alloc_pages_node() implementation could be if (nid == NUMA_NO_NODE || VM_WARN_ON(!node_online(nid))) nid = numa_mem_id(); VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); though. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 07/23/2015 10:27 PM, David Rientjes wrote: > On Thu, 23 Jul 2015, Christoph Lameter wrote: > >>> The only possible downside would be existing users of >>> alloc_pages_node() that are calling it with an offline node. Since it's a >>> VM_BUG_ON() that would catch that, I think it should be changed to a >>> VM_WARN_ON() and eventually fixed up because it's nonsensical. >>> VM_BUG_ON() here should be avoided. >> >> The offline node thing could be addresses by using numa_mem_id()? >> > > I was concerned about any callers that were passing an offline node, not > NUMA_NO_NODE, today. One of the alloc-node functions has a VM_BUG_ON() > for it, the other silently calls node_zonelist() on it. > > I suppose the final alloc_pages_node() implementation could be > > if (nid == NUMA_NO_NODE || VM_WARN_ON(!node_online(nid))) > nid = numa_mem_id(); > > VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES); > return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); > > though. I've posted v2 based on David's and Christoph's suggestions (thanks) but to avoid spamming everyone until we agree on the final interface, it's marked as RFC and excludes the arch people from CC: http://marc.info/?l=linux-kernel&m=143774920902608&w=2 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c index 344387a..17762af 100644 --- a/arch/ia64/hp/common/sba_iommu.c +++ b/arch/ia64/hp/common/sba_iommu.c @@ -1146,7 +1146,7 @@ sba_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle, if (node == NUMA_NO_NODE) node = numa_node_id(); - page = alloc_pages_exact_node(node, flags, get_order(size)); + page = alloc_pages_prefer_node(node, flags, get_order(size)); if (unlikely(!page)) return NULL; diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c index 20e8a9b..1451961 100644 --- a/arch/ia64/kernel/uncached.c +++ b/arch/ia64/kernel/uncached.c @@ -97,7 +97,7 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid) /* attempt to allocate a granule's worth of cached memory pages */ - page = alloc_pages_exact_node(nid, + page = alloc_pages_prefer_node(nid, GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE, IA64_GRANULE_SHIFT-PAGE_SHIFT); if (!page) { diff --git a/arch/ia64/sn/pci/pci_dma.c b/arch/ia64/sn/pci/pci_dma.c index d0853e8..dcab82f 100644 --- a/arch/ia64/sn/pci/pci_dma.c +++ b/arch/ia64/sn/pci/pci_dma.c @@ -92,7 +92,7 @@ static void *sn_dma_alloc_coherent(struct device *dev, size_t size, */ node = pcibus_to_node(pdev->bus); if (likely(node >=0)) { - struct page *p = alloc_pages_exact_node(node, + struct page *p = alloc_pages_prefer_node(node, flags, get_order(size)); if (likely(p)) diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c index e865d74..646a310 100644 --- a/arch/powerpc/platforms/cell/ras.c +++ b/arch/powerpc/platforms/cell/ras.c @@ -123,7 +123,7 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order) area->nid = nid; area->order = order; - area->pages = alloc_pages_exact_node(area->nid, + area->pages = alloc_pages_prefer_node(area->nid, GFP_KERNEL|__GFP_THISNODE, area->order); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 2d73807..a8723a8 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3158,7 +3158,7 @@ static struct vmcs *alloc_vmcs_cpu(int cpu) struct page *pages; struct vmcs *vmcs; - pages = alloc_pages_exact_node(node, GFP_KERNEL, vmcs_config.order); + pages = alloc_pages_prefer_node(node, GFP_KERNEL, vmcs_config.order); if (!pages) return NULL; vmcs = page_address(pages); diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c index 95c8944..3ff0a24 100644 --- a/drivers/misc/sgi-xp/xpc_uv.c +++ b/drivers/misc/sgi-xp/xpc_uv.c @@ -239,7 +239,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name, mq->mmr_blade = uv_cpu_to_blade_id(cpu); nid = cpu_to_node(cpu); - page = alloc_pages_exact_node(nid, + page = alloc_pages_prefer_node(nid, GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE, pg_order); if (page == NULL) { diff --git a/include/linux/gfp.h b/include/linux/gfp.h index 15928f0..ff892d7 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -300,6 +300,10 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL); } +/* + * Allocate pages, preferring the node given as nid. When nid equals -1, + * prefer the current CPU's node. + */ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) { @@ -310,7 +314,14 @@ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask)); } -static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask, +/* + * Allocate pages, preferring the node given as nid. Unlike alloc_pages_node(), + * nid must be a valid online node, there is no fallback to current CPU's node. + * + * In order to completely restrict allocation to the preferred node, include + * __GFP_THISNODE in the gfp_mask. + */ +static inline struct page *alloc_pages_prefer_node(int nid, gfp_t gfp_mask, unsigned int order) { VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES || !node_online(nid)); @@ -354,7 +365,6 @@ extern unsigned long get_zeroed_page(gfp_t gfp_mask); void *alloc_pages_exact(size_t size, gfp_t gfp_mask); void free_pages_exact(void *virt, size_t size); -/* This is different from alloc_pages_exact_node !!! */ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask); #define __get_free_page(gfp_mask) \ diff --git a/kernel/profile.c b/kernel/profile.c index a7bcd28..6ee695e 100644 --- a/kernel/profile.c +++ b/kernel/profile.c @@ -339,7 +339,7 @@ static int profile_cpu_callback(struct notifier_block *info, node = cpu_to_mem(cpu); per_cpu(cpu_profile_flip, cpu) = 0; if (!per_cpu(cpu_profile_hits, cpu)[1]) { - page = alloc_pages_exact_node(node, + page = alloc_pages_prefer_node(node, GFP_KERNEL | __GFP_ZERO, 0); if (!page) @@ -347,7 +347,7 @@ static int profile_cpu_callback(struct notifier_block *info, per_cpu(cpu_profile_hits, cpu)[1] = page_address(page); } if (!per_cpu(cpu_profile_hits, cpu)[0]) { - page = alloc_pages_exact_node(node, + page = alloc_pages_prefer_node(node, GFP_KERNEL | __GFP_ZERO, 0); if (!page) @@ -543,14 +543,14 @@ static int create_hash_tables(void) int node = cpu_to_mem(cpu); struct page *page; - page = alloc_pages_exact_node(node, + page = alloc_pages_prefer_node(node, GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE, 0); if (!page) goto out_cleanup; per_cpu(cpu_profile_hits, cpu)[1] = (struct profile_hit *)page_address(page); - page = alloc_pages_exact_node(node, + page = alloc_pages_prefer_node(node, GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE, 0); if (!page) diff --git a/mm/filemap.c b/mm/filemap.c index 6bf5e42..52e3b55 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -648,7 +648,7 @@ struct page *__page_cache_alloc(gfp_t gfp) do { cpuset_mems_cookie = read_mems_allowed_begin(); n = cpuset_mem_spread_node(); - page = alloc_pages_exact_node(n, gfp, 0); + page = alloc_pages_prefer_node(n, gfp, 0); } while (!page && read_mems_allowed_retry(cpuset_mems_cookie)); return page; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 078832c..7130e97 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2350,7 +2350,7 @@ khugepaged_alloc_page(struct page **hpage, gfp_t gfp, struct mm_struct *mm, */ up_read(&mm->mmap_sem); - *hpage = alloc_pages_exact_node(node, gfp, HPAGE_PMD_ORDER); + *hpage = alloc_pages_prefer_node(node, gfp, HPAGE_PMD_ORDER); if (unlikely(!*hpage)) { count_vm_event(THP_COLLAPSE_ALLOC_FAILED); *hpage = ERR_PTR(-ENOMEM); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 271e443..4c6c0c1 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1088,7 +1088,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid) { struct page *page; - page = alloc_pages_exact_node(nid, + page = alloc_pages_prefer_node(nid, htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE| __GFP_REPEAT|__GFP_NOWARN, huge_page_order(h)); @@ -1250,7 +1250,7 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, int nid) __GFP_REPEAT|__GFP_NOWARN, huge_page_order(h)); else - page = alloc_pages_exact_node(nid, + page = alloc_pages_prefer_node(nid, htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE| __GFP_REPEAT|__GFP_NOWARN, huge_page_order(h)); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 501820c..132f043 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1503,7 +1503,7 @@ static struct page *new_page(struct page *p, unsigned long private, int **x) return alloc_huge_page_node(page_hstate(compound_head(p)), nid); else - return alloc_pages_exact_node(nid, GFP_HIGHUSER_MOVABLE, 0); + return alloc_pages_prefer_node(nid, GFP_HIGHUSER_MOVABLE, 0); } /* diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 7477432..2e2a876 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -945,7 +945,7 @@ static struct page *new_node_page(struct page *page, unsigned long node, int **x return alloc_huge_page_node(page_hstate(compound_head(page)), node); else - return alloc_pages_exact_node(node, GFP_HIGHUSER_MOVABLE | + return alloc_pages_prefer_node(node, GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0); } @@ -1986,7 +1986,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma, nmask = policy_nodemask(gfp, pol); if (!nmask || node_isset(node, *nmask)) { mpol_cond_put(pol); - page = alloc_pages_exact_node(node, + page = alloc_pages_prefer_node(node, gfp | __GFP_THISNODE, order); goto out; } diff --git a/mm/migrate.c b/mm/migrate.c index f53838f..b51e106 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1187,7 +1187,7 @@ static struct page *new_page_node(struct page *p, unsigned long private, return alloc_huge_page_node(page_hstate(compound_head(p)), pm->node); else - return alloc_pages_exact_node(pm->node, + return alloc_pages_prefer_node(pm->node, GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0); } @@ -1553,7 +1553,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page, int nid = (int) data; struct page *newpage; - newpage = alloc_pages_exact_node(nid, + newpage = alloc_pages_prefer_node(nid, (GFP_HIGHUSER_MOVABLE | __GFP_THISNODE | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ebffa0e..e6eef56 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3062,8 +3062,6 @@ EXPORT_SYMBOL(alloc_pages_exact); * * Like alloc_pages_exact(), but try to allocate on node nid first before falling * back. - * Note this is not alloc_pages_exact_node() which allocates on a specific node, - * but is not exact. */ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) { diff --git a/mm/slab.c b/mm/slab.c index 7eb38dd..471fde8 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -1594,7 +1594,7 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, if (memcg_charge_slab(cachep, flags, cachep->gfporder)) return NULL; - page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder); + page = alloc_pages_prefer_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder); if (!page) { memcg_uncharge_slab(cachep, cachep->gfporder); slab_out_of_memory(cachep, flags, nodeid); diff --git a/mm/slob.c b/mm/slob.c index 4765f65..cfda5e2 100644 --- a/mm/slob.c +++ b/mm/slob.c @@ -45,7 +45,7 @@ * NUMA support in SLOB is fairly simplistic, pushing most of the real * logic down to the page allocator, and simply doing the node accounting * on the upper levels. In the event that a node id is explicitly - * provided, alloc_pages_exact_node() with the specified node id is used + * provided, alloc_pages_prefer_node() with the specified node id is used * instead. The common case (or when the node id isn't explicitly provided) * will default to the current node, as per numa_node_id(). * @@ -193,7 +193,7 @@ static void *slob_new_pages(gfp_t gfp, int order, int node) #ifdef CONFIG_NUMA if (node != NUMA_NO_NODE) - page = alloc_pages_exact_node(node, gfp, order); + page = alloc_pages_prefer_node(node, gfp, order); else #endif page = alloc_pages(gfp, order); diff --git a/mm/slub.c b/mm/slub.c index 54c0876..8b10404 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1323,7 +1323,7 @@ static inline struct page *alloc_slab_page(struct kmem_cache *s, if (node == NUMA_NO_NODE) page = alloc_pages(flags, order); else - page = alloc_pages_exact_node(node, flags, order); + page = alloc_pages_prefer_node(node, flags, order); if (!page) memcg_uncharge_slab(s, order);
The function alloc_pages_exact_node() was introduced in 6484eb3e2a81 ("page allocator: do not check NUMA node ID when the caller knows the node is valid") as an optimized variant of alloc_pages_node(), that doesn't allow the node id to be -1. Unfortunately the name of the function can easily suggest that the allocation is restricted to the given node. In truth, the node is only preferred, unless __GFP_THISNODE is among the gfp flags. The misleading name has lead to mistakes in the past, see 5265047ac301 ("mm, thp: really limit transparent hugepage allocation to local node") and b360edb43f8e ("mm, mempolicy: migrate_to_node should only migrate to node"). To prevent further mistakes, this patch renames the function to alloc_pages_prefer_node() and documents it together with alloc_pages_node(). Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Gleb Natapov <gleb@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Cliff Whickman <cpw@sgi.com> Cc: Robin Holt <robinmholt@gmail.com> --- I hope the new name will be OK. Although it doesn't fully convey the difference from alloc_pages_node(), renaming the latter would be a much larger patch (some 60 callsites) and hopefully the new comments are enough to prevent further confusion. I'm CC'ing also maintainers of the callsites so they can verify that the callsites that don't pass __GFP_THISNODE are really not intended to restrict allocation to the given node. I went through them myself and each looked like it's better off if it can successfully allocate on a fallback node rather than fail. DavidR checked them also I think, but it's better if maintainers can verify that. I'm not completely sure about all the usages in sl*b due to multiple layers through which gfp flags are being passed. arch/ia64/hp/common/sba_iommu.c | 2 +- arch/ia64/kernel/uncached.c | 2 +- arch/ia64/sn/pci/pci_dma.c | 2 +- arch/powerpc/platforms/cell/ras.c | 2 +- arch/x86/kvm/vmx.c | 2 +- drivers/misc/sgi-xp/xpc_uv.c | 2 +- include/linux/gfp.h | 14 ++++++++++++-- kernel/profile.c | 8 ++++---- mm/filemap.c | 2 +- mm/huge_memory.c | 2 +- mm/hugetlb.c | 4 ++-- mm/memory-failure.c | 2 +- mm/mempolicy.c | 4 ++-- mm/migrate.c | 4 ++-- mm/page_alloc.c | 2 -- mm/slab.c | 2 +- mm/slob.c | 4 ++-- mm/slub.c | 2 +- 18 files changed, 35 insertions(+), 27 deletions(-)