Message ID | 20211018123710.1540996-2-chenwandun@huawei.com (mailing list archive)
---|---
State | New
Series | fix numa spreading for large hash tables
On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
> Eric Dumazet reported a strange numa spreading info in [1], and found
> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> this issue [2].

I think the root problem here is that we have two meanings for
NUMA_NO_NODE.  I tend to read it as "The memory can be allocated from
any node", but here it's used to mean "The memory should be spread over
every node".  Should we split those out as -1 and -2?
On 2021/10/18 20:39, Matthew Wilcox wrote:
> On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
>> Eric Dumazet reported a strange numa spreading info in [1], and found
>> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
>> this issue [2].
>
> I think the root problem here is that we have two meanings for
> NUMA_NO_NODE.  I tend to read it as "The memory can be allocated from
> any node", but here it's used to mean "The memory should be spread over
> every node".  Should we split those out as -1 and -2?

Yes, the intent of NUMA_NO_NODE is sometimes confusing. Besides, I think
NUMA_NO_NODE should consider mempolicy in most cases in the kernel,
unless it is pointed out explicitly that memory can be allocated without
considering mempolicy.
> Eric Dumazet reported a strange numa spreading info in [1], and found
> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> this issue [2].
>
> Digging into the difference before and after this patch, page allocation
> differs as follows:
>
> before:
> alloc_large_system_hash
>     __vmalloc
>         __vmalloc_node(..., NUMA_NO_NODE, ...)
>             __vmalloc_node_range
>                 __vmalloc_area_node
>                     alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
>                         alloc_pages_current
>                             alloc_page_interleave /* can be proved by printing policy mode */
>
> after:
> alloc_large_system_hash
>     __vmalloc
>         __vmalloc_node(..., NUMA_NO_NODE, ...)
>             __vmalloc_node_range
>                 __vmalloc_area_node
>                     alloc_pages_node /* choose nid by numa_mem_id() */
>                         __alloc_pages_node(nid, ....)
>
> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
> memory is allocated on the current node instead of being interleaved
> across nodes.
>
> [1] https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>
> [2] https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>
> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> Reported-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Chen Wandun <chenwandun@huawei.com>
> ---
>  mm/vmalloc.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d77830ff604c..87552a4018aa 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2816,6 +2816,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  		unsigned int order, unsigned int nr_pages, struct page **pages)
>  {
>  	unsigned int nr_allocated = 0;
> +	struct page *page;
> +	int i;
>
>  	/*
>  	 * For order-0 pages we make use of bulk allocator, if
> @@ -2823,7 +2825,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  	 * to fails, fallback to a single page allocator that is
>  	 * more permissive.
>  	 */
> -	if (!order) {
> +	if (!order && nid != NUMA_NO_NODE) {
>  		while (nr_allocated < nr_pages) {
>  			unsigned int nr, nr_pages_request;
>
> @@ -2848,7 +2850,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  			if (nr != nr_pages_request)
>  				break;
>  		}
> -	} else
> +	} else if (order)
>  		/*
>  		 * Compound pages required for remap_vmalloc_page if
>  		 * high-order pages.
> @@ -2856,11 +2858,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  		gfp |= __GFP_COMP;
>
>  	/* High-order pages or fallback path if "bulk" fails. */
> -	while (nr_allocated < nr_pages) {
> -		struct page *page;
> -		int i;
>
> -		page = alloc_pages_node(nid, gfp, order);
> +	page = NULL;

Why do you need to set page to NULL here?
On Mon, Oct 18, 2021 at 5:41 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
> > Eric Dumazet reported a strange numa spreading info in [1], and found
> > commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> > this issue [2].
>
> I think the root problem here is that we have two meanings for
> NUMA_NO_NODE.  I tend to read it as "The memory can be allocated from
> any node", but here it's used to mean "The memory should be spread over
> every node".  Should we split those out as -1 and -2?

I agree with Willy's suggestion to make it more explicit, but as
followup work. This patch needs a backport, so keep this one simple.
On Mon, Oct 18, 2021 at 5:23 AM Chen Wandun <chenwandun@huawei.com> wrote:
>
> [...]
>
>  	/* High-order pages or fallback path if "bulk" fails. */
> -	while (nr_allocated < nr_pages) {
> -		struct page *page;
> -		int i;
>
> -		page = alloc_pages_node(nid, gfp, order);
> +	page = NULL;

No need for the above NULL assignment. After removing this, you can add:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> +	while (nr_allocated < nr_pages) {
> +		if (nid == NUMA_NO_NODE)
> +			page = alloc_pages(gfp, order);
> +		else
> +			page = alloc_pages_node(nid, gfp, order);
>  		if (unlikely(!page))
>  			break;
>
> --
> 2.25.1
>
On Tue, Oct 19, 2021 at 8:54 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, Oct 18, 2021 at 5:41 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
> > > Eric Dumazet reported a strange numa spreading info in [1], and found
> > > commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> > > this issue [2].
> >
> > I think the root problem here is that we have two meanings for
> > NUMA_NO_NODE.  I tend to read it as "The memory can be allocated from
> > any node", but here it's used to mean "The memory should be spread over
> > every node".  Should we split those out as -1 and -2?
>
> I agree with Willy's suggestion to make it more explicit but as a
> followup work. This patch needs a backport, so keep this simple.

NUMA_NO_NODE in process context also meant: please follow current
thread NUMA policies.

One could hope, for instance, that whenever large BPF maps are
allocated, the current thread could set non-default NUMA policies.
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d77830ff604c..87552a4018aa 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2816,6 +2816,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		unsigned int order, unsigned int nr_pages, struct page **pages)
 {
 	unsigned int nr_allocated = 0;
+	struct page *page;
+	int i;
 
 	/*
 	 * For order-0 pages we make use of bulk allocator, if
@@ -2823,7 +2825,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	 * to fails, fallback to a single page allocator that is
 	 * more permissive.
 	 */
-	if (!order) {
+	if (!order && nid != NUMA_NO_NODE) {
 		while (nr_allocated < nr_pages) {
 			unsigned int nr, nr_pages_request;
 
@@ -2848,7 +2850,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			if (nr != nr_pages_request)
 				break;
 		}
-	} else
+	} else if (order)
 		/*
 		 * Compound pages required for remap_vmalloc_page if
 		 * high-order pages.
@@ -2856,11 +2858,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		gfp |= __GFP_COMP;
 
 	/* High-order pages or fallback path if "bulk" fails. */
-	while (nr_allocated < nr_pages) {
-		struct page *page;
-		int i;
-
-		page = alloc_pages_node(nid, gfp, order);
+	page = NULL;
+	while (nr_allocated < nr_pages) {
+		if (nid == NUMA_NO_NODE)
+			page = alloc_pages(gfp, order);
+		else
+			page = alloc_pages_node(nid, gfp, order);
 		if (unlikely(!page))
 			break;
Eric Dumazet reported a strange numa spreading info in [1], and found
commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
this issue [2].

Digging into the difference before and after this patch, page allocation
differs as follows:

before:
alloc_large_system_hash
    __vmalloc
        __vmalloc_node(..., NUMA_NO_NODE, ...)
            __vmalloc_node_range
                __vmalloc_area_node
                    alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
                        alloc_pages_current
                            alloc_page_interleave /* can be proved by printing policy mode */

after:
alloc_large_system_hash
    __vmalloc
        __vmalloc_node(..., NUMA_NO_NODE, ...)
            __vmalloc_node_range
                __vmalloc_area_node
                    alloc_pages_node /* choose nid by numa_mem_id() */
                        __alloc_pages_node(nid, ....)

So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
memory is allocated on the current node instead of being interleaved
across nodes.

[1] https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/

[2] https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/

Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
---
 mm/vmalloc.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)