diff mbox series

[v2,1/2] mm/vmalloc: fix numa spreading for large hash tables

Message ID 20211018123710.1540996-2-chenwandun@huawei.com (mailing list archive)
State New
Series fix numa spreading for large hash tables

Commit Message

Chen Wandun Oct. 18, 2021, 12:37 p.m. UTC
Eric Dumazet reported strange NUMA spreading behavior in [1], and found
that commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
introduced the issue [2].

Digging into the difference before and after that commit, the page
allocation paths differ:

before:
alloc_large_system_hash
    __vmalloc
        __vmalloc_node(..., NUMA_NO_NODE, ...)
            __vmalloc_node_range
                __vmalloc_area_node
                    alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
                        alloc_pages_current
                            alloc_page_interleave /* can be proved by print policy mode */

after:
alloc_large_system_hash
    __vmalloc
        __vmalloc_node(..., NUMA_NO_NODE, ...)
            __vmalloc_node_range
                __vmalloc_area_node
                    alloc_pages_node /* choose nid by numa_mem_id() */
                        __alloc_pages_node(nid, ....)

So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
memory is allocated on the current node instead of being interleaved
across all nodes.

[1]
https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/

[2]
https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/

Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
---
 mm/vmalloc.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

Comments

Matthew Wilcox Oct. 18, 2021, 12:39 p.m. UTC | #1
On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
> Eric Dumazet reported strange NUMA spreading behavior in [1], and found
> that commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> introduced the issue [2].

I think the root problem here is that we have two meanings for
NUMA_NO_NODE.  I tend to read it as "The memory can be allocated from
any node", but here it's used to mean "The memory should be spread over
every node".  Should we split those out as -1 and -2?
Chen Wandun Oct. 18, 2021, 1:01 p.m. UTC | #2
On 2021/10/18 20:39, Matthew Wilcox wrote:
> On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
>> Eric Dumazet reported strange NUMA spreading behavior in [1], and found
>> that commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
>> introduced the issue [2].
> 
> I think the root problem here is that we have two meanings for
> NUMA_NO_NODE.  I tend to read it as "The memory can be allocated from
> any node", but here it's used to mean "The memory should be spread over
> every node".  Should we split those out as -1 and -2?
Yes, the intent of NUMA_NO_NODE is sometimes confusing.

Besides, I think NUMA_NO_NODE should take mempolicy into account in
most cases in the kernel, unless it is stated explicitly that memory
can be allocated without considering mempolicy.

Uladzislau Rezki Oct. 18, 2021, 2:03 p.m. UTC | #3
> Eric Dumazet reported strange NUMA spreading behavior in [1], and found
> that commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> introduced the issue [2].
>
> Digging into the difference before and after that commit, the page
> allocation paths differ:
>
> before:
> alloc_large_system_hash
>     __vmalloc
>         __vmalloc_node(..., NUMA_NO_NODE, ...)
>             __vmalloc_node_range
>                 __vmalloc_area_node
>                     alloc_page /* because NUMA_NO_NODE, so choose alloc_page branch */
>                         alloc_pages_current
>                             alloc_page_interleave /* can be proved by print policy mode */
>
> after:
> alloc_large_system_hash
>     __vmalloc
>         __vmalloc_node(..., NUMA_NO_NODE, ...)
>             __vmalloc_node_range
>                 __vmalloc_area_node
>                     alloc_pages_node /* choose nid by numa_mem_id() */
>                         __alloc_pages_node(nid, ....)
>
> So after commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings"),
> memory is allocated on the current node instead of being interleaved
> across all nodes.
>
> [1]
> https://lore.kernel.org/linux-mm/CANn89iL6AAyWhfxdHO+jaT075iOa3XcYn9k6JJc7JR2XYn6k_Q@mail.gmail.com/
>
> [2]
> https://lore.kernel.org/linux-mm/CANn89iLofTR=AK-QOZY87RdUZENCZUT4O6a0hvhu3_EwRMerOg@mail.gmail.com/
>
> Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> Reported-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Chen Wandun <chenwandun@huawei.com>
> ---
>  mm/vmalloc.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d77830ff604c..87552a4018aa 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2816,6 +2816,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>                 unsigned int order, unsigned int nr_pages, struct page **pages)
>  {
>         unsigned int nr_allocated = 0;
> +       struct page *page;
> +       int i;
>
>         /*
>          * For order-0 pages we make use of bulk allocator, if
> @@ -2823,7 +2825,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>          * to fails, fallback to a single page allocator that is
>          * more permissive.
>          */
> -       if (!order) {
> +       if (!order && nid != NUMA_NO_NODE) {
>                 while (nr_allocated < nr_pages) {
>                         unsigned int nr, nr_pages_request;
>
> @@ -2848,7 +2850,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>                         if (nr != nr_pages_request)
>                                 break;
>                 }
> -       } else
> +       } else if (order)
>                 /*
>                  * Compound pages required for remap_vmalloc_page if
>                  * high-order pages.
> @@ -2856,11 +2858,13 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>                 gfp |= __GFP_COMP;
>
>         /* High-order pages or fallback path if "bulk" fails. */
> -       while (nr_allocated < nr_pages) {
> -               struct page *page;
> -               int i;
>
> -               page = alloc_pages_node(nid, gfp, order);
> +       page = NULL;
>
Why do you need to set page to NULL here?
Shakeel Butt Oct. 19, 2021, 3:53 p.m. UTC | #4
On Mon, Oct 18, 2021 at 5:41 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
> > Eric Dumazet reported strange NUMA spreading behavior in [1], and found
> > that commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> > introduced the issue [2].
>
> I think the root problem here is that we have two meanings for
> NUMA_NO_NODE.  I tend to read it as "The memory can be allocated from
> any node", but here it's used to mean "The memory should be spread over
> every node".  Should we split those out as -1 and -2?

I agree with Willy's suggestion to make it more explicit, but as
follow-up work. This patch needs a backport, so let's keep it simple.
Shakeel Butt Oct. 19, 2021, 3:56 p.m. UTC | #5
On Mon, Oct 18, 2021 at 5:23 AM Chen Wandun <chenwandun@huawei.com> wrote:
>
[...]
>
>         /* High-order pages or fallback path if "bulk" fails. */
> -       while (nr_allocated < nr_pages) {
> -               struct page *page;
> -               int i;
>
> -               page = alloc_pages_node(nid, gfp, order);
> +       page = NULL;

No need for the above NULL assignment.

After removing this, you can add:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> +       while (nr_allocated < nr_pages) {
> +               if (nid == NUMA_NO_NODE)
> +                       page = alloc_pages(gfp, order);
> +               else
> +                       page = alloc_pages_node(nid, gfp, order);
>                 if (unlikely(!page))
>                         break;
>
> --
> 2.25.1
>
Eric Dumazet Oct. 19, 2021, 3:58 p.m. UTC | #6
On Tue, Oct 19, 2021 at 8:54 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, Oct 18, 2021 at 5:41 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Oct 18, 2021 at 08:37:09PM +0800, Chen Wandun wrote:
> > > Eric Dumazet reported strange NUMA spreading behavior in [1], and found
> > > that commit 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
> > > introduced the issue [2].
> >
> > I think the root problem here is that we have two meanings for
> > NUMA_NO_NODE.  I tend to read it as "The memory can be allocated from
> > any node", but here it's used to mean "The memory should be spread over
> > every node".  Should we split those out as -1 and -2?
>
> I agree with Willy's suggestion to make it more explicit, but as
> follow-up work. This patch needs a backport, so let's keep it simple.

NUMA_NO_NODE in process context also meant: please follow the current
thread's NUMA policies.

One could hope, for instance, that whenever large BPF maps are
allocated, the current thread could set non-default NUMA policies.

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d77830ff604c..87552a4018aa 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2816,6 +2816,8 @@  vm_area_alloc_pages(gfp_t gfp, int nid,
 		unsigned int order, unsigned int nr_pages, struct page **pages)
 {
 	unsigned int nr_allocated = 0;
+	struct page *page;
+	int i;
 
 	/*
 	 * For order-0 pages we make use of bulk allocator, if
@@ -2823,7 +2825,7 @@  vm_area_alloc_pages(gfp_t gfp, int nid,
 	 * to fails, fallback to a single page allocator that is
 	 * more permissive.
 	 */
-	if (!order) {
+	if (!order && nid != NUMA_NO_NODE) {
 		while (nr_allocated < nr_pages) {
 			unsigned int nr, nr_pages_request;
 
@@ -2848,7 +2850,7 @@  vm_area_alloc_pages(gfp_t gfp, int nid,
 			if (nr != nr_pages_request)
 				break;
 		}
-	} else
+	} else if (order)
 		/*
 		 * Compound pages required for remap_vmalloc_page if
 		 * high-order pages.
@@ -2856,11 +2858,13 @@  vm_area_alloc_pages(gfp_t gfp, int nid,
 		gfp |= __GFP_COMP;
 
 	/* High-order pages or fallback path if "bulk" fails. */
-	while (nr_allocated < nr_pages) {
-		struct page *page;
-		int i;
 
-		page = alloc_pages_node(nid, gfp, order);
+	page = NULL;
+	while (nr_allocated < nr_pages) {
+		if (nid == NUMA_NO_NODE)
+			page = alloc_pages(gfp, order);
+		else
+			page = alloc_pages_node(nid, gfp, order);
 		if (unlikely(!page))
 			break;