Message ID | 20210928121040.2547407-1-chenwandun@huawei.com (mailing list archive)
---|---
State | New
Series | mm/vmalloc: fix numa spreading for large hash tables
On Tue, 28 Sep 2021 20:10:40 +0800 Chen Wandun <chenwandun@huawei.com> wrote:

> Eric Dumazet reported a strange numa spreading info in [1], and found
> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> this issue [2].
>
> Digging into the difference before and after this patch, the page
> allocation paths differ:
>
> before:
> alloc_large_system_hash
>     __vmalloc
>         __vmalloc_node(..., NUMA_NO_NODE, ...)
>             __vmalloc_node_range
>                 __vmalloc_area_node
>                     alloc_page /* because NUMA_NO_NODE, so choose the alloc_page branch */
>                         alloc_pages_current
>                             alloc_page_interleave /* can be proved by printing the policy mode */
>
> after:
> alloc_large_system_hash
>     __vmalloc
>         __vmalloc_node(..., NUMA_NO_NODE, ...)
>             __vmalloc_node_range
>                 __vmalloc_area_node
>                     alloc_pages_node /* nid chosen by numa_mem_id() */
>                         __alloc_pages_node(nid, ....)
>
> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
> memory is allocated on the current node instead of being interleaved
> across nodes.
>
> [1] https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>
> [2] https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>
> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> Reported-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Chen Wandun <chenwandun@huawei.com>

This seems like it could cause significant performance regressions in
some situations?

If "yes" then wouldn't a cc:stable be appropriate? And some (perhaps
handwavy) quantification of the slowdown would help people understand
why we're recommending a backport.

If "no" then why the heck do we have that feature in there anyway ;)
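The mechanism behind the report: with NUMA_NO_NODE, the page allocator
consults the calling task's mempolicy, which is interleave during early
boot and is what spreads large system hash tables across nodes; an
explicit node id pins every page to that node. A simplified sketch of
the distinction the thread keeps returning to (illustrative only, not
the actual vm_area_alloc_pages() code):

    #include <linux/gfp.h>
    #include <linux/numa.h>

    /*
     * Simplified sketch: the node argument decides whether the task
     * mempolicy (e.g. boot-time interleave) is honoured, or whether
     * every page is forced onto one node.
     */
    static struct page *pick_page(gfp_t gfp, int nid, unsigned int order)
    {
            if (nid == NUMA_NO_NODE)
                    return alloc_pages(gfp, order);   /* obeys current->mempolicy */

            return alloc_pages_node(nid, gfp, order); /* pinned to node nid */
    }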
On Tue, Sep 28, 2021 at 5:03 AM Chen Wandun <chenwandun@huawei.com> wrote:
>
> Eric Dumazet reported a strange numa spreading info in [1], and found
> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> this issue [2].
>
> [...]
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index f884706c5280..48e717626e94 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2823,6 +2823,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>                 unsigned int order, unsigned int nr_pages, struct page **pages)
>  {
>         unsigned int nr_allocated = 0;
> +       struct page *page;
> +       int i;
>
>         /*
>          * For order-0 pages we make use of bulk allocator, if
> @@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>         if (!order) {

Can you please replace the above with if (!order && nid != NUMA_NO_NODE)?

>                 while (nr_allocated < nr_pages) {
>                         unsigned int nr, nr_pages_request;
> +                       page = NULL;
>
>                         /*
>                          * A maximum allowed request is hard-coded and is 100
> @@ -2842,9 +2845,23 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>                          */
>                         nr_pages_request = min(100U, nr_pages - nr_allocated);
>

Undo the following change in this if block.

> -                       nr = alloc_pages_bulk_array_node(gfp, nid,
> -                               nr_pages_request, pages + nr_allocated);
> -
> +                       if (nid == NUMA_NO_NODE) {
> +                               for (i = 0; i < nr_pages_request; i++) {
> +                                       page = alloc_page(gfp);
> +                                       if (page)
> +                                               pages[nr_allocated + i] = page;
> +                                       else {
> +                                               nr = i;
> +                                               break;
> +                                       }
> +                               }
> +                               if (i >= nr_pages_request)
> +                                       nr = nr_pages_request;
> +                       } else {
> +                               nr = alloc_pages_bulk_array_node(gfp, nid,
> +                                       nr_pages_request,
> +                                       pages + nr_allocated);
> +                       }
>                         nr_allocated += nr;
>                         cond_resched();
>
> @@ -2863,11 +2880,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,

Put the following line under "else if (order)"

>         gfp |= __GFP_COMP;
>
>         /* High-order pages or fallback path if "bulk" fails. */
> -       while (nr_allocated < nr_pages) {

Keep the following declarations inside the while loop.

> -               struct page *page;
> -               int i;
>
> -               page = alloc_pages_node(nid, gfp, order);
> +               page = NULL;
> +       while (nr_allocated < nr_pages) {
> +               if (nid == NUMA_NO_NODE)
> +                       page = alloc_pages(gfp, order);
> +               else
> +                       page = alloc_pages_node(nid, gfp, order);
>                 if (unlikely(!page))
>                         break;
>
> --
> 2.25.1
>
On 2021/9/29 6:33, Andrew Morton wrote:
> On Tue, 28 Sep 2021 20:10:40 +0800 Chen Wandun <chenwandun@huawei.com> wrote:
>
>> Eric Dumazet reported a strange numa spreading info in [1], and found
>> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
>> this issue [2].
>>
>> [...]
>
> This seems like it could cause significant performance regressions in
> some situations?

Yes, it indeed causes some performance regression; I will send an
optimization patch based on this one.

> If "yes" then wouldn't a cc:stable be appropriate? And some (perhaps
> handwavy) quantification of the slowdown would help people understand
> why we're recommending a backport.
>
> If "no" then why the heck do we have that feature in there anyway ;)
On 2021/10/14 5:46, Shakeel Butt wrote:
> On Tue, Sep 28, 2021 at 5:03 AM Chen Wandun <chenwandun@huawei.com> wrote:
>>
>> Eric Dumazet reported a strange numa spreading info in [1], and found
>> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
>> this issue [2].
>>
>> [...]
>>
>> @@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>>         if (!order) {
>
> Can you please replace the above with if (!order && nid != NUMA_NO_NODE)?
>
>> [...]
>>
>>                         nr_pages_request = min(100U, nr_pages - nr_allocated);
>
> Undo the following change in this if block.

Yes, it seems simpler as you suggested, but it still has a performance
regression. I plan to change the following to consider both mempolicy
and alloc_pages_bulk.

Thanks,
Wandun

> [...]
On Tue, Sep 28, 2021 at 08:10:40PM +0800, Chen Wandun wrote:
> Eric Dumazet reported a strange numa spreading info in [1], and found
> commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
> this issue [2].
>
> [...]
>
> -                       nr = alloc_pages_bulk_array_node(gfp, nid,
> -                               nr_pages_request, pages + nr_allocated);
> -
> +                       if (nid == NUMA_NO_NODE) {
>
<snip>
void *vmalloc(unsigned long size)
{
	return __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE,
				__builtin_return_address(0));
}
EXPORT_SYMBOL(vmalloc);
<snip>

vmalloc() uses NUMA_NO_NODE, so all vmalloc calls will fall back to the
single-page allocator on both NUMA and non-NUMA systems. Is it
intentional to bypass the optimized bulk allocator for non-NUMA systems?

Thanks!

--
Vlad Rezki
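One way to read Vlad's question: on a !CONFIG_NUMA kernel there is a
single node and no mempolicy that bulk allocation could violate, so
NUMA_NO_NODE could keep the optimized bulk path. A hypothetical guard
illustrating that point (not code from the patch under review):

    #include <linux/gfp.h>
    #include <linux/numa.h>
    #include <linux/topology.h>

    /*
     * Hypothetical helper: with CONFIG_NUMA disabled, bulk allocation
     * cannot mis-place pages, so NUMA_NO_NODE need not fall back to
     * one-page-at-a-time alloc_page().
     */
    static unsigned long bulk_alloc_order0(gfp_t gfp, int nid,
                    unsigned long nr_pages_request, struct page **pages)
    {
            if (nid == NUMA_NO_NODE) {
                    if (IS_ENABLED(CONFIG_NUMA))
                            return 0;       /* caller falls back to alloc_page() */
                    nid = numa_mem_id();    /* the only node there is */
            }

            return alloc_pages_bulk_array_node(gfp, nid,
                            nr_pages_request, pages);
    }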
Excerpts from Chen Wandun's message of October 14, 2021 6:59 pm:
> On 2021/10/14 5:46, Shakeel Butt wrote:
>> [...]
>>
>> Undo the following change in this if block.
>
> Yes, it seems simpler as you suggested, but it still has a performance
> regression. I plan to change the following to consider both mempolicy
> and alloc_pages_bulk.

Thanks for finding and debugging this. These APIs are a maze of twisty
little passages, all alike, so I could be as confused as I was when I
wrote that patch, but doesn't a minimal fix look something like this?

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d77830ff604c..75ee9679f521 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2860,7 +2860,10 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		struct page *page;
 		int i;
 
-		page = alloc_pages_node(nid, gfp, order);
+		if (nid == NUMA_NO_NODE)
+			page = alloc_pages(gfp, order);
+		else
+			page = alloc_pages_node(nid, gfp, order);
 		if (unlikely(!page))
 			break;

Thanks,
Nick
On 2021/10/14 18:01, Uladzislau Rezki wrote:
> [...]
>
> vmalloc() uses NUMA_NO_NODE, so all vmalloc calls will fall back to the
> single-page allocator on both NUMA and non-NUMA systems. Is it
> intentional to bypass the optimized bulk allocator for non-NUMA systems?

I sent a patch that should help solve this:

[PATCH] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to
accelerate memory allocation

Thanks,
Wandun
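The follow-up patch itself is not shown in this thread, so the exact
code is an assumption, but given its title the order-0 path presumably
ends up shaped roughly like the sketch below: batching is kept for
NUMA_NO_NODE by routing through the task mempolicy instead of a fixed
node.

    /* Approximate sketch of the direction Chen describes; the
     * NUMA_NO_NODE case keeps the bulk allocator but honours
     * current->mempolicy, so interleave is preserved.
     */
    while (nr_allocated < nr_pages) {
            unsigned int nr, nr_pages_request;

            nr_pages_request = min(100U, nr_pages - nr_allocated);

            if (nid == NUMA_NO_NODE)
                    nr = alloc_pages_bulk_array_mempolicy(gfp,
                                    nr_pages_request, pages + nr_allocated);
            else
                    nr = alloc_pages_bulk_array_node(gfp, nid,
                                    nr_pages_request, pages + nr_allocated);

            nr_allocated += nr;
            cond_resched();

            if (nr != nr_pages_request)
                    break;
    }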
On 2021/10/15 9:34, Nicholas Piggin wrote:
> Excerpts from Chen Wandun's message of October 14, 2021 6:59 pm:
>> [...]
>
> Thanks for finding and debugging this. These APIs are a maze of twisty
> little passages, all alike, so I could be as confused as I was when I
> wrote that patch, but doesn't a minimal fix look something like this?

Yes, I sent a patch that looks like what you show, and it also contains
some performance optimization:

[PATCH] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to
accelerate memory allocation

Thanks,
Wandun

> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d77830ff604c..75ee9679f521 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2860,7 +2860,10 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  		struct page *page;
>  		int i;
>  
> -		page = alloc_pages_node(nid, gfp, order);
> +		if (nid == NUMA_NO_NODE)
> +			page = alloc_pages(gfp, order);
> +		else
> +			page = alloc_pages_node(nid, gfp, order);
>  		if (unlikely(!page))
>  			break;
Excerpts from Chen Wandun's message of October 15, 2021 12:31 pm:
> On 2021/10/15 9:34, Nicholas Piggin wrote:
>> [...]
>>
>> Thanks for finding and debugging this. These APIs are a maze of twisty
>> little passages, all alike, so I could be as confused as I was when I
>> wrote that patch, but doesn't a minimal fix look something like this?
>
> Yes, I sent a patch that looks like what you show, and it also contains
> some performance optimization:
>
> [PATCH] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to
> accelerate memory allocation

Okay. It would be better to do it as two patches. First the minimal fix,
so it can be backported easily and have the Fixes: tag pointed at my
commit. Then the performance optimization.

Thanks,
Nick
On Fri, Oct 15, 2021 at 12:11 AM Nicholas Piggin <npiggin@gmail.com> wrote:

> Okay. It would be better to do it as two patches. First the minimal fix,
> so it can be backported easily and have the Fixes: tag pointed at my
> commit. Then the performance optimization.

+2, we need a small fix for stable branches.

Thanks
On Fri, Oct 15, 2021 at 05:11:25PM +1000, Nicholas Piggin wrote:
> Excerpts from Chen Wandun's message of October 15, 2021 12:31 pm:
>> [...]
>>
>> Yes, I sent a patch that looks like what you show, and it also contains
>> some performance optimization:
>>
>> [PATCH] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to
>> accelerate memory allocation
>
> Okay. It would be better to do it as two patches. First the minimal fix,
> so it can be backported easily and have the Fixes: tag pointed at my
> commit. Then the performance optimization.

It is not only your commit; it also fixes mine :)

<snip>
commit 5c1f4e690eecc795b2e4d4408e87302040fceca4
Author: Uladzislau Rezki (Sony) <urezki@gmail.com>
Date:   Mon Jun 28 19:40:14 2021 -0700

    mm/vmalloc: switch to bulk allocator in __vmalloc_area_node()
<snip>

I agree there should be two separate patches that fix the NUMA balancing
issue, both tagged with the "Fixes" flag. One is located here:

https://lkml.org/lkml/2021/10/15/1172

The second one should fix the other place where "big" pages are
allocated, which is basically your patch.

--
Vlad Rezki
On 2021/10/15 19:51, Eric Dumazet wrote:
> On Fri, Oct 15, 2021 at 12:11 AM Nicholas Piggin <npiggin@gmail.com> wrote:
>
>> Okay. It would be better to do it as two patches. First the minimal fix,
>> so it can be backported easily and have the Fixes: tag pointed at my
>> commit. Then the performance optimization.
>
> +2, we need a small fix for stable branches.

OK, I will send patch v2.

Thanks,
Wandun
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f884706c5280..48e717626e94 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2823,6 +2823,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		unsigned int order, unsigned int nr_pages, struct page **pages)
 {
 	unsigned int nr_allocated = 0;
+	struct page *page;
+	int i;
 
 	/*
 	 * For order-0 pages we make use of bulk allocator, if
@@ -2833,6 +2835,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	if (!order) {
 		while (nr_allocated < nr_pages) {
 			unsigned int nr, nr_pages_request;
+			page = NULL;
 
 			/*
 			 * A maximum allowed request is hard-coded and is 100
@@ -2842,9 +2845,23 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			 */
 			nr_pages_request = min(100U, nr_pages - nr_allocated);
 
-			nr = alloc_pages_bulk_array_node(gfp, nid,
-				nr_pages_request, pages + nr_allocated);
-
+			if (nid == NUMA_NO_NODE) {
+				for (i = 0; i < nr_pages_request; i++) {
+					page = alloc_page(gfp);
+					if (page)
+						pages[nr_allocated + i] = page;
+					else {
+						nr = i;
+						break;
+					}
+				}
+				if (i >= nr_pages_request)
+					nr = nr_pages_request;
+			} else {
+				nr = alloc_pages_bulk_array_node(gfp, nid,
+					nr_pages_request,
+					pages + nr_allocated);
+			}
 			nr_allocated += nr;
 			cond_resched();
 
@@ -2863,11 +2880,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		gfp |= __GFP_COMP;
 
 	/* High-order pages or fallback path if "bulk" fails. */
-	while (nr_allocated < nr_pages) {
-		struct page *page;
-		int i;
-
-		page = alloc_pages_node(nid, gfp, order);
+	page = NULL;
+	while (nr_allocated < nr_pages) {
+		if (nid == NUMA_NO_NODE)
+			page = alloc_pages(gfp, order);
+		else
+			page = alloc_pages_node(nid, gfp, order);
 		if (unlikely(!page))
 			break;
Eric Dumazet reported a strange numa spreading info in [1], and found
commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings") introduced
this issue [2].

Digging into the difference before and after this patch, the page
allocation paths differ:

before:
alloc_large_system_hash
    __vmalloc
        __vmalloc_node(..., NUMA_NO_NODE, ...)
            __vmalloc_node_range
                __vmalloc_area_node
                    alloc_page /* because NUMA_NO_NODE, so choose the alloc_page branch */
                        alloc_pages_current
                            alloc_page_interleave /* can be proved by printing the policy mode */

after:
alloc_large_system_hash
    __vmalloc
        __vmalloc_node(..., NUMA_NO_NODE, ...)
            __vmalloc_node_range
                __vmalloc_area_node
                    alloc_pages_node /* nid chosen by numa_mem_id() */
                        __alloc_pages_node(nid, ....)

So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
memory is allocated on the current node instead of being interleaved
across nodes.

[1] https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/

[2] https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/

Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
---
 mm/vmalloc.c | 33 ++++++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 7 deletions(-)
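The commit message notes that the interleave behaviour "can be proved
by print policy mode". A more direct check is to look at which node
each allocated page actually landed on; the helper below is a
hypothetical debug aid (not part of the patch) that could be called on
the pages array filled by vm_area_alloc_pages(). With interleaving
restored the per-node counts come out roughly equal; with the
regression, one node holds almost everything.

    #include <linux/mm.h>
    #include <linux/nodemask.h>
    #include <linux/printk.h>

    /* Hypothetical debug-only helper: report how the allocated pages
     * are spread across the online NUMA nodes.
     */
    static void report_numa_spread(struct page **pages, unsigned int nr_pages)
    {
            int nid;

            for_each_online_node(nid) {
                    unsigned int i, on_node = 0;

                    for (i = 0; i < nr_pages; i++)
                            if (page_to_nid(pages[i]) == nid)
                                    on_node++;

                    pr_info("numa spread: node %d holds %u of %u pages\n",
                            nid, on_node, nr_pages);
            }
    }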