
[RFC,v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0

Message ID 20240725035318.471-1-hailong.liu@oppo.com (mailing list archive)
State New
Series [RFC,v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0

Commit Message

Hailong Liu July 25, 2024, 3:53 a.m. UTC
From: "Hailong.Liu" <hailong.liu@oppo.com>

The scenario where the issue occurs is as follows:
CONFIG: vmap_allow_huge = true && 2M is for PMD_SIZE
kvmalloc(2M, __GFP_NOFAIL|GFP_XXX)
    __vmalloc_node_range(vm_flags=VM_ALLOW_HUGE_VMAP)
        vm_area_alloc_pages(order=9) --->allocs order9 failed and fallback to order0
                                        and phys_addr is aligned with PMD_SIZE
            vmap_pages_range
                vmap_pages_range_noflush
                    __vmap_pages_range_noflush(page_shift = 21) ----> incorrect vmap *huge* here

In fact, as long as page_shift is not equal to PAGE_SHIFT, there
might be issues in __vmap_pages_range_noflush().
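
To illustrate the corruption, below is a minimal userspace model (not
kernel code; the pfn[] values and the arithmetic are made up purely to
mimic what a single PMD entry assumes) of mapping order-0 fallback
pages while page_shift stays at PMD_SHIFT:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define NR_PAGES	(1 << (PMD_SHIFT - PAGE_SHIFT))	/* 512 */

int main(void)
{
	/* frames as returned by the order-0 fallback: not contiguous */
	unsigned long pfn[NR_PAGES];
	int i, bad = 0;

	for (i = 0; i < NR_PAGES; i++)
		pfn[i] = 1000 + 2 * i;

	/*
	 * A PMD entry built from pages[0] implicitly maps pfn[0] + i as
	 * the i-th page; compare that with the frame actually allocated.
	 */
	for (i = 0; i < NR_PAGES; i++)
		if (pfn[0] + i != pfn[i])
			bad++;

	printf("%d of %d pages map the wrong physical frame\n", bad, NR_PAGES);
	return 0;
}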

The patch also removes VM_ALLOW_HUGE_VMAP in kvmalloc_node(). There
are several reasons for this:
- It increases the memory footprint because of the alignment requirement.
- It increases the likelihood of kvmalloc allocation failures.
- Removing it fixes the original issue of kvmalloc with __GFP_NOFAIL
  possibly returning NULL.
Besides, if drivers want huge vmap mappings, they should use
vmalloc_huge() instead.

Fix it by disabling the fallback and removing VM_ALLOW_HUGE_VMAP in
kvmalloc_node().
Fixes: e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")

CC: Barry Song <21cnbao@gmail.com>
CC: Baoquan He <bhe@redhat.com>
CC: Matthew Wilcox <willy@infradead.org>
Reported-by: Tangquan.Zheng <zhengtangquan@oppo.com>
Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>
---
 mm/util.c    | 2 +-
 mm/vmalloc.c | 9 ---------
 2 files changed, 1 insertion(+), 10 deletions(-)

--
After [1] I checked the code and couldn't find a reasonable band-aid to
fix this, so the v2 patch works but is ugly. Glad to hear a better
solution :)

[1] https://lore.kernel.org/lkml/20240724182827.nlgdckimtg2gwns5@oppo.com/
2.34.1

Comments

Barry Song July 25, 2024, 6:21 a.m. UTC | #1
On Thu, Jul 25, 2024 at 3:53 PM <hailong.liu@oppo.com> wrote:
>
> From: "Hailong.Liu" <hailong.liu@oppo.com>
>
> The scenario where the issue occurs is as follows:
> CONFIG: vmap_allow_huge = true && 2M is for PMD_SIZE
> kvmalloc(2M, __GFP_NOFAIL|GFP_XXX)
>     __vmalloc_node_range(vm_flags=VM_ALLOW_HUGE_VMAP)
>         vm_area_alloc_pages(order=9) --->allocs order9 failed and fallback to order0
>                                         and phys_addr is aligned with PMD_SIZE
>             vmap_pages_range
>                 vmap_pages_range_noflush
>                     __vmap_pages_range_noflush(page_shift = 21) ----> incorrect vmap *huge* here
>
> In fact, as long as page_shift is not equal to PAGE_SHIFT, there
> might be issues with the __vmap_pages_range_noflush().
>
> The patch also remove VM_ALLOW_HUGE_VMAP in kvmalloc_node(), There
> are several reasons for this:
> - This increases memory footprint because ALIGNMENT.
> - This increases the likelihood of kvmalloc allocation failures.
> - Without this it fixes the origin issue of kvmalloc with __GFP_NOFAIL may return NULL.
> Besides if drivers want to vmap huge, user vmalloc_huge instead.
>
> Fix it by disabling fallback and remove VM_ALLOW_HUGE_VMAP in
> kvmalloc_node().
> Fixes: e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
>
> CC: Barry Song <21cnbao@gmail.com>
> CC: Baoquan He <bhe@redhat.com>
> CC: Matthew Wilcox <willy@infradead.org>
> Reported-by: Tangquan.Zheng <zhengtangquan@oppo.com>
> Signed-off-by: Hailong.Liu <hailong.liu@oppo.com>

The implementation of HUGE_VMAP appears to be quite disorganized.
A major change is needed.

1. When allocating 2.1MB via kvmalloc, we shouldn't allocate 4MB of
memory, which is what HUGE_VMAP currently does. This is even worse than
PMD-mapped THP for userspace; we don't even do this for THP. The vmap
could be done with one PMD mapping + 0.1MB of PTE mappings instead (see
the sketch after this list).

2. We need to allow fallback to order-0 pages if we're unable to allocate 2MB.
In this case, we should perform PMD/PTE mapping based on how the pages
were acquired, rather than assuming they always form contiguous 2MB blocks.

3. Memory is entirely corrupted after Michal's "mm, vmalloc: fix high order
__GFP_NOFAIL allocations", but without it, forcing the 2MB allocation
was causing OOMs.
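
To make point 1 concrete, here is a minimal sketch of the size
arithmetic (ALIGN is re-defined locally only so the snippet runs in
userspace):

#include <stdio.h>

#define SZ_1M		(1UL << 20)
#define PMD_SIZE	(1UL << 21)
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned long req = 2 * SZ_1M + SZ_1M / 10;	/* ~2.1MB request */
	unsigned long got = ALIGN(req, PMD_SIZE);	/* huge-map rounding */

	/* prints: requested 2202009, allocated 4194304 (~1.9MB wasted) */
	printf("requested %lu, allocated %lu, wasted %lu bytes\n",
	       req, got, got - req);
	return 0;
}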

> ---
>  mm/util.c    | 2 +-
>  mm/vmalloc.c | 9 ---------
>  2 files changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/mm/util.c b/mm/util.c
> index 669397235787..b23133b738cf 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -657,7 +657,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
>          * protection games.
>          */
>         return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> -                       flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
> +                       flags, PAGE_KERNEL, 0,
>                         node, __builtin_return_address(0));

I'd vote +1 for this. We don't want to waste memory, for example wasting
1.9MB while allocating 2.1MB with kvmalloc. But this should be a separate
patch.

>  }
>  EXPORT_SYMBOL(kvmalloc_node);
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 03c78fae06f3..1914768f473e 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3577,15 +3577,6 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>                         page = alloc_pages(alloc_gfp, order);
>                 else
>                         page = alloc_pages_node(nid, alloc_gfp, order);
> -               if (unlikely(!page)) {
> -                       if (!nofail)
> -                               break;
> -
> -                       /* fall back to the zero order allocations */
> -                       alloc_gfp |= __GFP_NOFAIL;
> -                       order = 0;
> -                       continue;
> -               }
>
>                 /*
>                  * Higher order allocations must be able to be treated as
> --
> After 1) I check the code and I can't find a resonable band-aid to fix
> this. so the v2 patch works but ugly. Glad to hear a better solution :)

This is still incorrect because it undoes Michal's work. We also need to break
the loop if (!nofail), which you're currently omitting.

To avoid reverting Michal's work, the simplest "fix" would be,

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index caf032f0bd69..0011ca30df1c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3775,7 +3775,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
                return NULL;
        }

-       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
+       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP) && !(gfp_mask & __GFP_NOFAIL)) {
                unsigned long size_per_node;

                /*
>
> [1] https://lore.kernel.org/lkml/20240724182827.nlgdckimtg2gwns5@oppo.com/
> 2.34.1

Thanks
Barry
Hailong Liu July 25, 2024, 9:17 a.m. UTC | #2
On Thu, 25. Jul 18:21, Barry Song wrote:
> On Thu, Jul 25, 2024 at 3:53 PM <hailong.liu@oppo.com> wrote:
[snip]
>
> This is still incorrect because it undoes Michal's work. We also need to break
> the loop if (!nofail), which you're currently omitting.

IIUC, the original issue was to fix kvcalloc with __GFP_NOFAIL returning NULL:
https://lore.kernel.org/all/ZAXynvdNqcI0f6Us@dhcp22.suse.cz/T/#u
If we disable the huge flag in kvmalloc_node(), the issue will be fixed.
>
> To avoid reverting Michal's work, the simplest "fix" would be,
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index caf032f0bd69..0011ca30df1c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3775,7 +3775,7 @@ void *__vmalloc_node_range_noprof(unsigned long
> size, unsigned long align,
>                 return NULL;
>         }
>
> -       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
> +       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP) && !(gfp_mask & __GFP_NOFAIL)) {
>                 unsigned long size_per_node;
>
>                 /*
> >
> > [1] https://lore.kernel.org/lkml/20240724182827.nlgdckimtg2gwns5@oppo.com/
> > 2.34.1
>
> Thanks
> Barry

--
help you, help me,
Hailong.
Barry Song July 25, 2024, 9:34 a.m. UTC | #3
On Thu, Jul 25, 2024 at 9:17 PM Hailong Liu <hailong.liu@oppo.com> wrote:
>
> On Thu, 25. Jul 18:21, Barry Song wrote:
> > On Thu, Jul 25, 2024 at 3:53 PM <hailong.liu@oppo.com> wrote:
> [snip]
> >
> > This is still incorrect because it undoes Michal's work. We also need to break
> > the loop if (!nofail), which you're currently omitting.
>
> IIUC, the origin issue is to fix kvcalloc with __GFP_NOFAIL return NULL.
> https://lore.kernel.org/all/ZAXynvdNqcI0f6Us@dhcp22.suse.cz/T/#u
> if we disable huge flag in kmalloc_node, the issue will be fixed.

No, this just bypasses kvmalloc and doesn't solve the underlying issue.
Problems can still be triggered by vmalloc_huge() even after the bypass.
Once we reorganize vmap_huge to support the combination of PMD and PTE
mapping, we should re-enable HUGE_VMAP for kvmalloc.

I would consider dropping VM_ALLOW_HUGE_VMAP for kvmalloc a short-term
"optimization" to save memory rather than a long-term fix. This
'optimization' is only valid until we reorganize HUGE_VMAP in a way
similar to THP. I mean, for a 2.1MB kvmalloc, we can map 2MB as PMD
and 0.1MB as PTE.

> >
> > To avoid reverting Michal's work, the simplest "fix" would be,
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index caf032f0bd69..0011ca30df1c 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3775,7 +3775,7 @@ void *__vmalloc_node_range_noprof(unsigned long
> > size, unsigned long align,
> >                 return NULL;
> >         }
> >
> > -       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
> > +       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP) && !(gfp_mask & __GFP_NOFAIL)) {
> >                 unsigned long size_per_node;
> >
> >                 /*
> > >
> > > [1] https://lore.kernel.org/lkml/20240724182827.nlgdckimtg2gwns5@oppo.com/
> > > 2.34.1
> >
> > Thanks
> > Barry
>
> --
> help you, help me,
> Hailong.
Hailong Liu July 25, 2024, 9:58 a.m. UTC | #4
On Thu, 25. Jul 21:34, Barry Song wrote:
> On Thu, Jul 25, 2024 at 9:17 PM Hailong Liu <hailong.liu@oppo.com> wrote:
> >
> > On Thu, 25. Jul 18:21, Barry Song wrote:
> > > On Thu, Jul 25, 2024 at 3:53 PM <hailong.liu@oppo.com> wrote:
> > [snip]
> > >
> > > This is still incorrect because it undoes Michal's work. We also need to break
> > > the loop if (!nofail), which you're currently omitting.
> >
> > IIUC, the origin issue is to fix kvcalloc with __GFP_NOFAIL return NULL.
> > https://lore.kernel.org/all/ZAXynvdNqcI0f6Us@dhcp22.suse.cz/T/#u
> > if we disable huge flag in kmalloc_node, the issue will be fixed.
>
> No, this just bypasses kvmalloc and doesn't solve the underlying issue. Problems
> can still be triggered by vmalloc_huge() even after the bypass. Once we
> reorganize vmap_huge to support the combination of PMD and PTE
> mapping, we should re-enable HUGE_VMAP for kvmalloc.
Totally agree. This will take some time to support. As in [1], I plan to
fix it with an offset in page_private to indicate the location of the
fallback.
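
A rough sketch of that idea (vm_area_mark_fallback() and fallback_idx
are hypothetical names, nothing like this exists in the tree yet):

/*
 * Sketch only: remember the pages[] index where the order-0 fallback
 * started, so that the mapping side can use PMDs for pages before this
 * index and PTEs from it onwards.
 */
static void vm_area_mark_fallback(struct vm_struct *area,
				  unsigned int fallback_idx)
{
	set_page_private(area->pages[0], fallback_idx);
}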

>
> I would consider dropping VM_ALLOW_HUGE_VMAP() for kvmalloc as
> an short-term "optimization" to save memory rather than a long-term fix. This
> 'optimization' is only valid until we reorganize HUGE_VMAP in a way
> similar to THP. I mean, for a 2.1MB kvmalloc, we can map 2MB as PMD
> and 0.1 as PTE.
However, that just fixes kvmalloc_node(); for other callers of
vmalloc_huge(), the issue still exists. So I removed Michal's code, sorry
for this.

>
> > >
> > > To avoid reverting Michal's work, the simplest "fix" would be,
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index caf032f0bd69..0011ca30df1c 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -3775,7 +3775,7 @@ void *__vmalloc_node_range_noprof(unsigned long
> > > size, unsigned long align,
> > >                 return NULL;
> > >         }
> > >
> > > -       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
> > > +       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP) && !(gfp_mask & __GFP_NOFAIL)) {
> > >                 unsigned long size_per_node;
> > >
> > >                 /*
> > > >
> > > > [1] https://lore.kernel.org/lkml/20240724182827.nlgdckimtg2gwns5@oppo.com/
> > > > 2.34.1
> > >
> > > Thanks
> > > Barry
> >
> > --
> > help you, help me,
> > Hailong.

--
help you, help me,
Hailong.
Barry Song July 25, 2024, 10:22 a.m. UTC | #5
On Thu, Jul 25, 2024 at 5:58 PM Hailong Liu <hailong.liu@oppo.com> wrote:
>
> On Thu, 25. Jul 21:34, Barry Song wrote:
> > On Thu, Jul 25, 2024 at 9:17 PM Hailong Liu <hailong.liu@oppo.com> wrote:
> > >
> > > On Thu, 25. Jul 18:21, Barry Song wrote:
> > > > On Thu, Jul 25, 2024 at 3:53 PM <hailong.liu@oppo.com> wrote:
> > > [snip]
> > > >
> > > > This is still incorrect because it undoes Michal's work. We also need to break
> > > > the loop if (!nofail), which you're currently omitting.
> > >
> > > IIUC, the origin issue is to fix kvcalloc with __GFP_NOFAIL return NULL.
> > > https://lore.kernel.org/all/ZAXynvdNqcI0f6Us@dhcp22.suse.cz/T/#u
> > > if we disable huge flag in kmalloc_node, the issue will be fixed.
> >
> > No, this just bypasses kvmalloc and doesn't solve the underlying issue. Problems
> > can still be triggered by vmalloc_huge() even after the bypass. Once we
> > reorganize vmap_huge to support the combination of PMD and PTE
> > mapping, we should re-enable HUGE_VMAP for kvmalloc.
> Totally agree, This will take some time to support. As in [1] I prepare to fix
> with a offset in page_private to indicate the location of fallback.
>
> >
> > I would consider dropping VM_ALLOW_HUGE_VMAP() for kvmalloc as
> > an short-term "optimization" to save memory rather than a long-term fix. This
> > 'optimization' is only valid until we reorganize HUGE_VMAP in a way
> > similar to THP. I mean, for a 2.1MB kvmalloc, we can map 2MB as PMD
> > and 0.1 as PTE.
> However this just fixed the kvmalloc_node, but for others who call
> vmalloc_huge(), the issue exits. so I remove the Michal's code. sorry for this.

My proposal was to fall back to order-0 for __GFP_NOFAIL even before
vm_area_alloc_pages(), as a short-term quick "fix".

We need to meet three conditions to do HUGE_VMAP:
1. vmap_allow_huge
2. vm_flags & VM_ALLOW_HUGE_VMAP
3. !(gfp_mask & __GFP_NOFAIL)

This is because if we fall back within vm_area_alloc_pages(), the caller
still expects vm_area_alloc_pages() to return contiguous 2MB memory. By
moving the check before the call, the caller knows vm_area_alloc_pages()
is returning small pages; that means the vm_area gets 0 as page_order
from the very beginning if we have __GFP_NOFAIL in gfp_flags.

Other fixes appear to require significant changes to the source code and
can't be done quickly.
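
In code, the gate would look roughly like this (a sketch of the check
in __vmalloc_node_range(), not a final patch):

	if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP) &&
	    !(gfp_mask & __GFP_NOFAIL)) {
		/* pick the PMD-based shift/align as today */
	} else {
		/* shift stays at PAGE_SHIFT, order-0 pages only */
	}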


>
> >
> > > >
> > > > To avoid reverting Michal's work, the simplest "fix" would be,
> > > >
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index caf032f0bd69..0011ca30df1c 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -3775,7 +3775,7 @@ void *__vmalloc_node_range_noprof(unsigned long
> > > > size, unsigned long align,
> > > >                 return NULL;
> > > >         }
> > > >
> > > > -       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
> > > > +       if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP) && !(gfp_mask & __GFP_NOFAIL)) {
> > > >                 unsigned long size_per_node;
> > > >
> > > >                 /*
> > > > >
> > > > > [1] https://lore.kernel.org/lkml/20240724182827.nlgdckimtg2gwns5@oppo.com/
> > > > > 2.34.1
> > > >
> > > > Thanks
> > > > Barry
> > >
> > > --
> > > help you, help me,
> > > Hailong.
>
> --
> help you, help me,
> Hailong.
Baoquan He July 25, 2024, 11:39 a.m. UTC | #6
On 07/25/24 at 11:53am, hailong.liu@oppo.com wrote:
> From: "Hailong.Liu" <hailong.liu@oppo.com>
> 
> The scenario where the issue occurs is as follows:
> CONFIG: vmap_allow_huge = true && 2M is for PMD_SIZE
> kvmalloc(2M, __GFP_NOFAIL|GFP_XXX)
>     __vmalloc_node_range(vm_flags=VM_ALLOW_HUGE_VMAP)
>         vm_area_alloc_pages(order=9) --->allocs order9 failed and fallback to order0
>                                         and phys_addr is aligned with PMD_SIZE
>             vmap_pages_range
>                 vmap_pages_range_noflush
>                     __vmap_pages_range_noflush(page_shift = 21) ----> incorrect vmap *huge* here
> 
> In fact, as long as page_shift is not equal to PAGE_SHIFT, there
> might be issues with the __vmap_pages_range_noflush().
> 
> The patch also remove VM_ALLOW_HUGE_VMAP in kvmalloc_node(), There
> are several reasons for this:
> - This increases memory footprint because ALIGNMENT.
> - This increases the likelihood of kvmalloc allocation failures.
> - Without this it fixes the origin issue of kvmalloc with __GFP_NOFAIL may return NULL.
> Besides if drivers want to vmap huge, user vmalloc_huge instead.

Seems there are two issues you are folding into one patch:

one is the wrong information passed into __vmap_pages_range_noflush();
the other is that you want to take VM_ALLOW_HUGE_VMAP off kvmalloc().

About the 1st one, is the draft below OK with you?

Pass out the fallback order and adjust the order and shift for later
usage, mainly for vmap_pages_range().

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 260897b21b11..5ee9ae518f3d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3508,9 +3508,9 @@ EXPORT_SYMBOL_GPL(vmap_pfn);
 
 static inline unsigned int
 vm_area_alloc_pages(gfp_t gfp, int nid,
-		unsigned int order, unsigned int nr_pages, struct page **pages)
+		unsigned int *page_order, unsigned int nr_pages, struct page **pages)
 {
-	unsigned int nr_allocated = 0;
+	unsigned int nr_allocated = 0, order = *page_order;
 	gfp_t alloc_gfp = gfp;
 	bool nofail = gfp & __GFP_NOFAIL;
 	struct page *page;
@@ -3611,6 +3611,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		cond_resched();
 		nr_allocated += 1U << order;
 	}
+	*page_order = order;
 
 	return nr_allocated;
 }
@@ -3654,7 +3655,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	page_order = vm_area_page_order(area);
 
 	area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN,
-		node, page_order, nr_small_pages, area->pages);
+		node, &page_order, nr_small_pages, area->pages);
 
 	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
 	if (gfp_mask & __GFP_ACCOUNT) {
@@ -3686,6 +3687,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 		goto fail;
 	}
 
+
+	set_vm_area_page_order(area, page_order);
+	page_shift = page_order + PAGE_SHIFT;
+
 	/*
 	 * page tables allocations ignore external gfp mask, enforce it
 	 * by the scope API
Hailong Liu July 25, 2024, 4:40 p.m. UTC | #7
On Thu, 25. Jul 19:39, Baoquan He wrote:
> On 07/25/24 at 11:53am, hailong.liu@oppo.com wrote:
> > From: "Hailong.Liu" <hailong.liu@oppo.com>
> >
> > The scenario where the issue occurs is as follows:
> > CONFIG: vmap_allow_huge = true && 2M is for PMD_SIZE
> > kvmalloc(2M, __GFP_NOFAIL|GFP_XXX)
> >     __vmalloc_node_range(vm_flags=VM_ALLOW_HUGE_VMAP)
> >         vm_area_alloc_pages(order=9) --->allocs order9 failed and fallback to order0
> >                                         and phys_addr is aligned with PMD_SIZE
> >             vmap_pages_range
> >                 vmap_pages_range_noflush
> >                     __vmap_pages_range_noflush(page_shift = 21) ----> incorrect vmap *huge* here
> >
> > In fact, as long as page_shift is not equal to PAGE_SHIFT, there
> > might be issues with the __vmap_pages_range_noflush().
> >
> > The patch also remove VM_ALLOW_HUGE_VMAP in kvmalloc_node(), There
> > are several reasons for this:
> > - This increases memory footprint because ALIGNMENT.
> > - This increases the likelihood of kvmalloc allocation failures.
> > - Without this it fixes the origin issue of kvmalloc with __GFP_NOFAIL may return NULL.
> > Besides if drivers want to vmap huge, user vmalloc_huge instead.
>
> Seem there are two issues you are folding into one patch:
Got it. I will separate them in the next version.

>
> one is the wrong informatin passed into __vmap_pages_range_noflush();
> the other is you want to take off VM_ALLOW_HUGE_VMAP on kvmalloc().
>
> About the 1st one, do you think below draft is OK to you?
>
> Pass out the fall back order and adjust the order and shift for later
> usage, mainly for vmap_pages_range().
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 260897b21b11..5ee9ae518f3d 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3508,9 +3508,9 @@ EXPORT_SYMBOL_GPL(vmap_pfn);
>
>  static inline unsigned int
>  vm_area_alloc_pages(gfp_t gfp, int nid,
> -		unsigned int order, unsigned int nr_pages, struct page **pages)
> +		unsigned int *page_order, unsigned int nr_pages, struct page **pages)
>  {
> -	unsigned int nr_allocated = 0;
> +	unsigned int nr_allocated = 0, order = *page_order;
>  	gfp_t alloc_gfp = gfp;
>  	bool nofail = gfp & __GFP_NOFAIL;
>  	struct page *page;
> @@ -3611,6 +3611,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  		cond_resched();
>  		nr_allocated += 1U << order;
>  	}
> +	*page_order = order;
>
>  	return nr_allocated;
>  }
> @@ -3654,7 +3655,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	page_order = vm_area_page_order(area);
>
>  	area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN,
> -		node, page_order, nr_small_pages, area->pages);
> +		node, &page_order, nr_small_pages, area->pages);
>
>  	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
>  	if (gfp_mask & __GFP_ACCOUNT) {
> @@ -3686,6 +3687,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  		goto fail;
>  	}
>
> +
> +	set_vm_area_page_order(area, page_order);
> +	page_shift = page_order + PAGE_SHIFT;
> +
>  	/*
>  	 * page tables allocations ignore external gfp mask, enforce it
>  	 * by the scope API
>
The logic of this patch is somewhat similar to my first one: if the
high-order allocation fails, it falls back to normal mapping.

However, I also save the fallback position. The pages before this
position are used for huge mapping, and the ones at or after it for
normal mapping, to "support the combination of PMD and PTE mapping" as
Barry said. This will take some time, as it needs to address the corner
cases and get some testing.

IMO, the draft can fix the current issue, and it does not have
significant side effects. Barry, what do you think about this patch? If
you think it's okay, I will split this patch into two: one to remove
VM_ALLOW_HUGE_VMAP and the other to address the current mapping issue.

--
help you, help me,
Hailong.
Barry Song July 26, 2024, 1:31 a.m. UTC | #8
On Fri, Jul 26, 2024 at 4:40 AM Hailong Liu <hailong.liu@oppo.com> wrote:
>
> On Thu, 25. Jul 19:39, Baoquan He wrote:
> > On 07/25/24 at 11:53am, hailong.liu@oppo.com wrote:
> > > From: "Hailong.Liu" <hailong.liu@oppo.com>
> > >
> > > The scenario where the issue occurs is as follows:
> > > CONFIG: vmap_allow_huge = true && 2M is for PMD_SIZE
> > > kvmalloc(2M, __GFP_NOFAIL|GFP_XXX)
> > >     __vmalloc_node_range(vm_flags=VM_ALLOW_HUGE_VMAP)
> > >         vm_area_alloc_pages(order=9) --->allocs order9 failed and fallback to order0
> > >                                         and phys_addr is aligned with PMD_SIZE
> > >             vmap_pages_range
> > >                 vmap_pages_range_noflush
> > >                     __vmap_pages_range_noflush(page_shift = 21) ----> incorrect vmap *huge* here
> > >
> > > In fact, as long as page_shift is not equal to PAGE_SHIFT, there
> > > might be issues with the __vmap_pages_range_noflush().
> > >
> > > The patch also remove VM_ALLOW_HUGE_VMAP in kvmalloc_node(), There
> > > are several reasons for this:
> > > - This increases memory footprint because ALIGNMENT.
> > > - This increases the likelihood of kvmalloc allocation failures.
> > > - Without this it fixes the origin issue of kvmalloc with __GFP_NOFAIL may return NULL.
> > > Besides if drivers want to vmap huge, user vmalloc_huge instead.
> >
> > Seem there are two issues you are folding into one patch:
> Got it. I will separate in the next version.
>
> >
> > one is the wrong informatin passed into __vmap_pages_range_noflush();
> > the other is you want to take off VM_ALLOW_HUGE_VMAP on kvmalloc().
> >
> > About the 1st one, do you think below draft is OK to you?
> >
> > Pass out the fall back order and adjust the order and shift for later
> > usage, mainly for vmap_pages_range().
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 260897b21b11..5ee9ae518f3d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3508,9 +3508,9 @@ EXPORT_SYMBOL_GPL(vmap_pfn);
> >
> >  static inline unsigned int
> >  vm_area_alloc_pages(gfp_t gfp, int nid,
> > -             unsigned int order, unsigned int nr_pages, struct page **pages)
> > +             unsigned int *page_order, unsigned int nr_pages, struct page **pages)
> >  {
> > -     unsigned int nr_allocated = 0;
> > +     unsigned int nr_allocated = 0, order = *page_order;
> >       gfp_t alloc_gfp = gfp;
> >       bool nofail = gfp & __GFP_NOFAIL;
> >       struct page *page;
> > @@ -3611,6 +3611,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >               cond_resched();
> >               nr_allocated += 1U << order;
> >       }
> > +     *page_order = order;
> >
> >       return nr_allocated;
> >  }
> > @@ -3654,7 +3655,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >       page_order = vm_area_page_order(area);
> >
> >       area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN,
> > -             node, page_order, nr_small_pages, area->pages);
> > +             node, &page_order, nr_small_pages, area->pages);
> >
> >       atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> >       if (gfp_mask & __GFP_ACCOUNT) {
> > @@ -3686,6 +3687,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >               goto fail;
> >       }
> >
> > +
> > +     set_vm_area_page_order(area, page_order);
> > +     page_shift = page_order + PAGE_SHIFT;
> > +
> >       /*
> >        * page tables allocations ignore external gfp mask, enforce it
> >        * by the scope API
> >
> The logic of this patch is somewhat similar to my first one. If high order
> allocation fails, it will go normal mapping.
>
> However I also save the fallback position. The ones before this position are
> used for huge mapping, the ones >= position for normal mapping as Barry said.
> "support the combination of PMD and PTE mapping". this  will take some
> times as it needs to address the corner cases and do some tests.
>
> IMO, the draft can fix the current issue, it also does not have significant side
> effects. Barry, what do you think about this patch? If you think it's okay,
> I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> other to address the current mapping issue.

Yes, it's acceptable, even though it's not perfect. However, addressing
the mapping issues is an urgent requirement. Memory corruption is
currently occurring, and we need to ensure the fix reaches the stable
kernel and mainline as soon as possible.

Removing VM_ALLOW_HUGE_VMAP in kvmalloc is not as urgent. With Baoquan's
patch, we can even extend the fallback to the non-nofail case; then maybe
we can keep VM_ALLOW_HUGE_VMAP for kvmalloc. This at least fixes one
problem of kvmalloc: we are likely to fail because it is really difficult
to get 2MB of contiguous memory from the buddy allocator while memory is
fragmented.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index caf032f0bd69..6f47b01cbe2e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3585,11 +3585,11 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
                else
                        page = alloc_pages_node_noprof(nid, alloc_gfp, order);
                if (unlikely(!page)) {
-                       if (!nofail)
+                       if (!nofail && order == 0)
                                break;
-
+                       if (nofail)
+                               alloc_gfp |= __GFP_NOFAIL;
                        /* fall back to the zero order allocations */
-                       alloc_gfp |= __GFP_NOFAIL;
                        order = 0;
                        continue;
                }

The other two optimizations can be deferred tasks (a sketch of the
second follows the list):
1. to save memory, such as avoiding allocating 4MB when users request
2.1MB with kvmalloc.

2. to do mixed and adaptive mapping; for example, for the 2.1MB kvmalloc:
 * if the first 2MB is contiguous, we map the 0~2MB as PMD and 2MB~2.1MB as PTE;
 * if the first 2MB is not contiguous, we map the whole 0~2.1MB as PTE
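
A rough sketch of optimization 2, assuming some extra bookkeeping
(nr_huge below, the number of small pages that came from successful
high-order allocations, is hypothetical):

	/* map the physically contiguous head with PMDs, the tail with PTEs */
	unsigned long huge_end = addr + ((unsigned long)nr_huge << PAGE_SHIFT);
	int err;

	err = __vmap_pages_range_noflush(addr, huge_end, prot,
					 pages, PMD_SHIFT);
	if (!err)
		err = __vmap_pages_range_noflush(huge_end, end, prot,
						 pages + nr_huge, PAGE_SHIFT);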

>
> --
> help you, help me,
> Hailong.

Thanks
Barry
Baoquan He July 26, 2024, 2:31 a.m. UTC | #9
On 07/26/24 at 12:40am, Hailong Liu wrote:
> On Thu, 25. Jul 19:39, Baoquan He wrote:
> > On 07/25/24 at 11:53am, hailong.liu@oppo.com wrote:
> > > From: "Hailong.Liu" <hailong.liu@oppo.com>
> > >
> > > The scenario where the issue occurs is as follows:
> > > CONFIG: vmap_allow_huge = true && 2M is for PMD_SIZE
> > > kvmalloc(2M, __GFP_NOFAIL|GFP_XXX)
> > >     __vmalloc_node_range(vm_flags=VM_ALLOW_HUGE_VMAP)
> > >         vm_area_alloc_pages(order=9) --->allocs order9 failed and fallback to order0
> > >                                         and phys_addr is aligned with PMD_SIZE
> > >             vmap_pages_range
> > >                 vmap_pages_range_noflush
> > >                     __vmap_pages_range_noflush(page_shift = 21) ----> incorrect vmap *huge* here
> > >
> > > In fact, as long as page_shift is not equal to PAGE_SHIFT, there
> > > might be issues with the __vmap_pages_range_noflush().
> > >
> > > The patch also remove VM_ALLOW_HUGE_VMAP in kvmalloc_node(), There
> > > are several reasons for this:
> > > - This increases memory footprint because ALIGNMENT.
> > > - This increases the likelihood of kvmalloc allocation failures.
> > > - Without this it fixes the origin issue of kvmalloc with __GFP_NOFAIL may return NULL.
> > > Besides if drivers want to vmap huge, user vmalloc_huge instead.
> >
> > Seem there are two issues you are folding into one patch:
> Got it. I will separate in the next version.
> 
> >
> > one is the wrong informatin passed into __vmap_pages_range_noflush();
> > the other is you want to take off VM_ALLOW_HUGE_VMAP on kvmalloc().
> >
> > About the 1st one, do you think below draft is OK to you?
> >
> > Pass out the fall back order and adjust the order and shift for later
> > usage, mainly for vmap_pages_range().
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 260897b21b11..5ee9ae518f3d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3508,9 +3508,9 @@ EXPORT_SYMBOL_GPL(vmap_pfn);
> >
> >  static inline unsigned int
> >  vm_area_alloc_pages(gfp_t gfp, int nid,
> > -		unsigned int order, unsigned int nr_pages, struct page **pages)
> > +		unsigned int *page_order, unsigned int nr_pages, struct page **pages)
> >  {
> > -	unsigned int nr_allocated = 0;
> > +	unsigned int nr_allocated = 0, order = *page_order;
> >  	gfp_t alloc_gfp = gfp;
> >  	bool nofail = gfp & __GFP_NOFAIL;
> >  	struct page *page;
> > @@ -3611,6 +3611,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >  		cond_resched();
> >  		nr_allocated += 1U << order;
> >  	}
> > +	*page_order = order;
> >
> >  	return nr_allocated;
> >  }
> > @@ -3654,7 +3655,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >  	page_order = vm_area_page_order(area);
> >
> >  	area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN,
> > -		node, page_order, nr_small_pages, area->pages);
> > +		node, &page_order, nr_small_pages, area->pages);
> >
> >  	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> >  	if (gfp_mask & __GFP_ACCOUNT) {
> > @@ -3686,6 +3687,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >  		goto fail;
> >  	}
> >
> > +
> > +	set_vm_area_page_order(area, page_order);
> > +	page_shift = page_order + PAGE_SHIFT;
> > +
> >  	/*
> >  	 * page tables allocations ignore external gfp mask, enforce it
> >  	 * by the scope API
> >
> The logic of this patch is somewhat similar to my first one. If high order
> allocation fails, it will go normal mapping.
> 
> However I also save the fallback position. The ones before this position are
> used for huge mapping, the ones >= position for normal mapping as Barry said.
> "support the combination of PMD and PTE mapping". this  will take some
> times as it needs to address the corner cases and do some tests.

Hmm, we may not need to worry about the imperfect mapping. Currently
there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
and vmalloc_huge().

For vmalloc_huge(), it's called from the three interfaces below, which
are all invoked during boot, so basically they can succeed in getting the
required contiguous physical memory. I guess that's why Tangquan only
spotted this issue on kvmalloc invocations where the required size
exceeds e.g. 2M. For kvmalloc_node(), we have noted in the code comment
above __kvmalloc_node_noprof() that it's a best-effort behaviour.

 mm/mm_init.c <<alloc_large_system_hash>>
 table = vmalloc_huge(size, gfp_flags);
 net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
 new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
 net/ipv4/udp.c <<udp_pernet_table_alloc>>
 udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)

Maybe we should add a code comment or documentation to notify people
that contiguous physical pages are not guaranteed for vmalloc_huge() if
it is used after boot.
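
Something along these lines, wording only a suggestion:

/*
 * NOTE: the huge mapping is only opportunistic. Physically contiguous
 * pages are not guaranteed: once memory gets fragmented (typically
 * after boot), the allocation can fall back to order-0 pages and the
 * area will be mapped with regular PTEs.
 */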

> 
> IMO, the draft can fix the current issue, it also does not have significant side
> effects. Barry, what do you think about this patch? If you think it's okay,
> I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> other to address the current mapping issue.
> 
> --
> help you, help me,
> Hailong.
>
Barry Song July 26, 2024, 3:53 a.m. UTC | #10
On Fri, Jul 26, 2024 at 2:31 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 07/26/24 at 12:40am, Hailong Liu wrote:
> > On Thu, 25. Jul 19:39, Baoquan He wrote:
> > > On 07/25/24 at 11:53am, hailong.liu@oppo.com wrote:
> > > > From: "Hailong.Liu" <hailong.liu@oppo.com>
> > > >
> > > > The scenario where the issue occurs is as follows:
> > > > CONFIG: vmap_allow_huge = true && 2M is for PMD_SIZE
> > > > kvmalloc(2M, __GFP_NOFAIL|GFP_XXX)
> > > >     __vmalloc_node_range(vm_flags=VM_ALLOW_HUGE_VMAP)
> > > >         vm_area_alloc_pages(order=9) --->allocs order9 failed and fallback to order0
> > > >                                         and phys_addr is aligned with PMD_SIZE
> > > >             vmap_pages_range
> > > >                 vmap_pages_range_noflush
> > > >                     __vmap_pages_range_noflush(page_shift = 21) ----> incorrect vmap *huge* here
> > > >
> > > > In fact, as long as page_shift is not equal to PAGE_SHIFT, there
> > > > might be issues with the __vmap_pages_range_noflush().
> > > >
> > > > The patch also remove VM_ALLOW_HUGE_VMAP in kvmalloc_node(), There
> > > > are several reasons for this:
> > > > - This increases memory footprint because ALIGNMENT.
> > > > - This increases the likelihood of kvmalloc allocation failures.
> > > > - Without this it fixes the origin issue of kvmalloc with __GFP_NOFAIL may return NULL.
> > > > Besides if drivers want to vmap huge, user vmalloc_huge instead.
> > >
> > > Seem there are two issues you are folding into one patch:
> > Got it. I will separate in the next version.
> >
> > >
> > > one is the wrong informatin passed into __vmap_pages_range_noflush();
> > > the other is you want to take off VM_ALLOW_HUGE_VMAP on kvmalloc().
> > >
> > > About the 1st one, do you think below draft is OK to you?
> > >
> > > Pass out the fall back order and adjust the order and shift for later
> > > usage, mainly for vmap_pages_range().
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 260897b21b11..5ee9ae518f3d 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -3508,9 +3508,9 @@ EXPORT_SYMBOL_GPL(vmap_pfn);
> > >
> > >  static inline unsigned int
> > >  vm_area_alloc_pages(gfp_t gfp, int nid,
> > > -           unsigned int order, unsigned int nr_pages, struct page **pages)
> > > +           unsigned int *page_order, unsigned int nr_pages, struct page **pages)
> > >  {
> > > -   unsigned int nr_allocated = 0;
> > > +   unsigned int nr_allocated = 0, order = *page_order;
> > >     gfp_t alloc_gfp = gfp;
> > >     bool nofail = gfp & __GFP_NOFAIL;
> > >     struct page *page;
> > > @@ -3611,6 +3611,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> > >             cond_resched();
> > >             nr_allocated += 1U << order;
> > >     }
> > > +   *page_order = order;
> > >
> > >     return nr_allocated;
> > >  }
> > > @@ -3654,7 +3655,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > >     page_order = vm_area_page_order(area);
> > >
> > >     area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN,
> > > -           node, page_order, nr_small_pages, area->pages);
> > > +           node, &page_order, nr_small_pages, area->pages);
> > >
> > >     atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> > >     if (gfp_mask & __GFP_ACCOUNT) {
> > > @@ -3686,6 +3687,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > >             goto fail;
> > >     }
> > >
> > > +
> > > +   set_vm_area_page_order(area, page_order);
> > > +   page_shift = page_order + PAGE_SHIFT;
> > > +
> > >     /*
> > >      * page tables allocations ignore external gfp mask, enforce it
> > >      * by the scope API
> > >
> > The logic of this patch is somewhat similar to my first one. If high order
> > allocation fails, it will go normal mapping.
> >
> > However I also save the fallback position. The ones before this position are
> > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > "support the combination of PMD and PTE mapping". this  will take some
> > times as it needs to address the corner cases and do some tests.
>
> Hmm, we may not need to worry about the imperfect mapping. Currently
> there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> and vmalloc_huge().
>
> For vmalloc_huge(), it's called in below three interfaces which are all
> invoked during boot. Basically they can succeed to get required contiguous
> physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> invocation when the required size exceeds e.g 2M. For kvmalloc_node(),
> we have told that in the code comment above __kvmalloc_node_noprof(),
> it's a best effort behaviour.
>
>  mm/mm_init.c <<alloc_large_system_hash>>
>  table = vmalloc_huge(size, gfp_flags);
>  net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
>  new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
>  net/ipv4/udp.c <<udp_pernet_table_alloc>>
>  udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
>
> Maybe we should add code comment or document to notice people that the
> contiguous physical pages are not guaranteed for vmalloc_huge() if you
> use it after boot.

Currently, the issue goes beyond just 'contiguous physical pages are not
guaranteed.' The problem includes the likelihood of failure when trying
to allocate 2MB of contiguous memory. That's why I suggest we also allow
the fallback to order-0 for non-nofail allocations, on top of your
proposed changes.

The only difference is that for non-nofail allocations, if we fall back
to order-0 and still fail, the process will break; in the nofail case,
we always succeed on the final allocation.

>
> >
> > IMO, the draft can fix the current issue, it also does not have significant side
> > effects. Barry, what do you think about this patch? If you think it's okay,
> > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > other to address the current mapping issue.
> >
> > --
> > help you, help me,
> > Hailong.
> >

Thanks
Barry
Hailong Liu July 26, 2024, 4 a.m. UTC | #11
On Fri, 26. Jul 10:31, Baoquan He wrote:
[...]
> > The logic of this patch is somewhat similar to my first one. If high order
> > allocation fails, it will go normal mapping.
> >
> > However I also save the fallback position. The ones before this position are
> > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > "support the combination of PMD and PTE mapping". this  will take some
> > times as it needs to address the corner cases and do some tests.
>
> Hmm, we may not need to worry about the imperfect mapping. Currently
> there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> and vmalloc_huge().
>
> For vmalloc_huge(), it's called in below three interfaces which are all
> invoked during boot. Basically they can succeed to get required contiguous
> physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> invocation when the required size exceeds e.g 2M. For kvmalloc_node(),
> we have told that in the code comment above __kvmalloc_node_noprof(),
> it's a best effort behaviour.
>
Take __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as an example.
Because of the alignment requirement of huge mappings, the real size is
4M. If the first order-9 allocation succeeds and the next one fails,
then because of the fallback the pages end up laid out as one order-9
block followed by 512 order-0 pages. The order-9 block supports huge
mapping, but the order-0 pages do not. With the patch above, it would
call vmap_small_pages_range_noflush() and do normal mapping, so the
huge mapping would no longer exist.
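
The resulting pages[] layout for the 4M area would be:

  pages[0]   .. pages[511]  : split from one order-9 block, physically contiguous
  pages[512] .. pages[1023] : independent order-0 pages, not contiguous

so only the first half is actually eligible for a PMD mapping.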

>  mm/mm_init.c <<alloc_large_system_hash>>
>  table = vmalloc_huge(size, gfp_flags);
>  net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
>  new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
>  net/ipv4/udp.c <<udp_pernet_table_alloc>>
>  udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
>
> Maybe we should add code comment or document to notice people that the
> contiguous physical pages are not guaranteed for vmalloc_huge() if you
> use it after boot.
>
> >
> > IMO, the draft can fix the current issue, it also does not have significant side
> > effects. Barry, what do you think about this patch? If you think it's okay,
> > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > other to address the current mapping issue.
> >
> > --
> > help you, help me,
> > Hailong.
> >
>
>

--
help you, help me,
Hailong.
Hailong Liu July 26, 2024, 5:03 a.m. UTC | #12
On Fri, 26. Jul 12:00, Hailong Liu wrote:
> On Fri, 26. Jul 10:31, Baoquan He wrote:
> [...]
> > > The logic of this patch is somewhat similar to my first one. If high order
> > > allocation fails, it will go normal mapping.
> > >
> > > However I also save the fallback position. The ones before this position are
> > > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > > "support the combination of PMD and PTE mapping". this  will take some
> > > times as it needs to address the corner cases and do some tests.
> >
> > Hmm, we may not need to worry about the imperfect mapping. Currently
> > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > and vmalloc_huge().
> >
> > For vmalloc_huge(), it's called in below three interfaces which are all
> > invoked during boot. Basically they can succeed to get required contiguous
> > physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> > invocation when the required size exceeds e.g 2M. For kvmalloc_node(),
> > we have told that in the code comment above __kvmalloc_node_noprof(),
> > it's a best effort behaviour.
> >
> Take a __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as a example.
> because the align requirement of huge. the real size is 4M.
> if allocation first order-9 successfully and the next failed. becuase the
> fallback, the layout out pages would be like order9 - 512 * order0
> order9 support huge mapping, but order0 not.
> with the patch above, would call vmap_small_pages_range_noflush() and do normal
> mapping, the huge mapping would not exist.
>
> >  mm/mm_init.c <<alloc_large_system_hash>>
> >  table = vmalloc_huge(size, gfp_flags);
> >  net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> >  new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> >  net/ipv4/udp.c <<udp_pernet_table_alloc>>
> >  udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> >
> > Maybe we should add code comment or document to notice people that the
> > contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > use it after boot.
> >
> > >
> > > IMO, the draft can fix the current issue, it also does not have significant side
> > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > other to address the current mapping issue.
> > >
> > > --
> > > help you, help me,
> > > Hailong.
> > >
> >
> >
I checked the code; the issue only happens when gfp_mask contains
__GFP_NOFAIL and the allocation falls back to order 0. Actually, without
commit e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL
allocations"), if the __vmalloc_area_node() allocation failed, it would
go to the fail label and retry with order-0:

fail:
	if (shift > PAGE_SHIFT) {
		shift = PAGE_SHIFT;
		align = real_align;
		size = real_size;
		goto again;
	}

So do we really need the fallback to order-0 inside
vm_area_alloc_pages() for nofail?
>
> --
> help you, help me,
> Hailong.
>

--
help you, help me,
Hailong.
Barry Song July 26, 2024, 5:29 a.m. UTC | #13
On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu <hailong.liu@oppo.com> wrote:
>
> On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > [...]
> > > > The logic of this patch is somewhat similar to my first one. If high order
> > > > allocation fails, it will go normal mapping.
> > > >
> > > > However I also save the fallback position. The ones before this position are
> > > > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > > > "support the combination of PMD and PTE mapping". this  will take some
> > > > times as it needs to address the corner cases and do some tests.
> > >
> > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > and vmalloc_huge().
> > >
> > > For vmalloc_huge(), it's called in below three interfaces which are all
> > > invoked during boot. Basically they can succeed to get required contiguous
> > > physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> > > invocation when the required size exceeds e.g 2M. For kvmalloc_node(),
> > > we have told that in the code comment above __kvmalloc_node_noprof(),
> > > it's a best effort behaviour.
> > >
> > Take a __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as a example.
> > because the align requirement of huge. the real size is 4M.
> > if allocation first order-9 successfully and the next failed. becuase the
> > fallback, the layout out pages would be like order9 - 512 * order0
> > order9 support huge mapping, but order0 not.
> > with the patch above, would call vmap_small_pages_range_noflush() and do normal
> > mapping, the huge mapping would not exist.
> >
> > >  mm/mm_init.c <<alloc_large_system_hash>>
> > >  table = vmalloc_huge(size, gfp_flags);
> > >  net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> > >  new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > >  net/ipv4/udp.c <<udp_pernet_table_alloc>>
> > >  udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > >
> > > Maybe we should add code comment or document to notice people that the
> > > contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > > use it after boot.
> > >
> > > >
> > > > IMO, the draft can fix the current issue, it also does not have significant side
> > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > other to address the current mapping issue.
> > > >
> > > > --
> > > > help you, help me,
> > > > Hailong.
> > > >
> > >
> > >
> I check the code, the issue only happen in gfp_mask with __GFP_NOFAIL and
> fallback to order 0, actuaally without this commit
> e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
> if __vmalloc_area_node allocation failed, it will goto fail and try order-0.
>
> fail:
>         if (shift > PAGE_SHIFT) {
>                 shift = PAGE_SHIFT;
>                 align = real_align;
>                 size = real_size;
>                 goto again;
>         }
>
> So do we really need fallback to order-0 if nofail?

Good catch, this is what I missed. I feel we can revert Michal's fix and
just remove the __GFP_NOFAIL bit while we are still allocating with a
high order. When "goto again" happens, we will allocate with order-0,
and in that case we keep __GFP_NOFAIL.
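
i.e. roughly (untested sketch, in __vmalloc_area_node() before calling
vm_area_alloc_pages()):

	/*
	 * Strip __GFP_NOFAIL from the high-order attempt only; the
	 * existing "goto again" retry with shift == PAGE_SHIFT keeps
	 * the nofail semantics for the order-0 pass.
	 */
	if (page_shift > PAGE_SHIFT)
		gfp_mask &= ~__GFP_NOFAIL;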

> >
> > --
> > help you, help me,
> > Hailong.

Thanks
Barry
Baoquan He July 26, 2024, 8:37 a.m. UTC | #14
On 07/26/24 at 05:29pm, Barry Song wrote:
> On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu <hailong.liu@oppo.com> wrote:
> >
> > On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > > [...]
> > > > > The logic of this patch is somewhat similar to my first one. If high order
> > > > > allocation fails, it will go normal mapping.
> > > > >
> > > > > However I also save the fallback position. The ones before this position are
> > > > > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > > > > "support the combination of PMD and PTE mapping". this  will take some
> > > > > times as it needs to address the corner cases and do some tests.
> > > >
> > > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > > and vmalloc_huge().
> > > >
> > > > For vmalloc_huge(), it's called in below three interfaces which are all
> > > > invoked during boot. Basically they can succeed to get required contiguous
> > > > physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> > > > invocation when the required size exceeds e.g 2M. For kvmalloc_node(),
> > > > we have told that in the code comment above __kvmalloc_node_noprof(),
> > > > it's a best effort behaviour.
> > > >
> > > Take a __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as a example.
> > > because the align requirement of huge. the real size is 4M.
> > > if allocation first order-9 successfully and the next failed. becuase the
> > > fallback, the layout out pages would be like order9 - 512 * order0
> > > order9 support huge mapping, but order0 not.
> > > with the patch above, would call vmap_small_pages_range_noflush() and do normal
> > > mapping, the huge mapping would not exist.
> > >
> > > >  mm/mm_init.c <<alloc_large_system_hash>>
> > > >  table = vmalloc_huge(size, gfp_flags);
> > > >  net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> > > >  new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > > >  net/ipv4/udp.c <<udp_pernet_table_alloc>>
> > > >  udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > > >
> > > > Maybe we should add code comment or document to notice people that the
> > > > contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > > > use it after boot.
> > > >
> > > > >
> > > > > IMO, the draft can fix the current issue, it also does not have significant side
> > > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > > other to address the current mapping issue.
> > > > >
> > > > > --
> > > > > help you, help me,
> > > > > Hailong.
> > > > >
> > > >
> > > >
> > I check the code, the issue only happen in gfp_mask with __GFP_NOFAIL and
> > fallback to order 0, actuaally without this commit
> > e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
> > if __vmalloc_area_node allocation failed, it will goto fail and try order-0.
> >
> > fail:
> >         if (shift > PAGE_SHIFT) {
> >                 shift = PAGE_SHIFT;
> >                 align = real_align;
> >                 size = real_size;
> >                 goto again;
> >         }
> >
> > So do we really need fallback to order-0 if nofail?
> 
> Good catch, this is what I missed. I feel we can revert Michal's fix.
> And just remove __GFP_NOFAIL bit when we are still allocating
> by high-order. When "goto again" happens, we will allocate by
> order-0, in this case, we keep the __GFP_NOFAIL.

With Michal's patch, the fallback will be able to satisfy the allocation
in the nofail case because it falls back to 0-order plus __GFP_NOFAIL.
The 'if (shift > PAGE_SHIFT)' conditional checking and handling may be
problematic, since it could jump to fail because the vmap_pages_range()
invocation failed, or because it partially allocated huge pages and broke
out; then it will throw away the already allocated pages and do the whole
thing again.

The only thing the 'if (shift > PAGE_SHIFT)' checking and handling makes
sense for is falling back to the real_size and real_align. BUT we need to
handle the failures separately, e.g.:
1) __get_vm_area_node() failed;
2) vm_area_alloc_pages() failed when shift > PAGE_SHIFT and non-nofail;
3) vmap_pages_range() failed.

Honestly, I didn't see where the nofail case is mishandled; could you
point it out specifically? I may have missed it.

Thanks
Baoquan
Hailong Liu July 26, 2024, 8:48 a.m. UTC | #15
On Fri, 26. Jul 16:37, Baoquan He wrote:
> On 07/26/24 at 05:29pm, Barry Song wrote:
> > On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu <hailong.liu@oppo.com> wrote:
> > >
> > > On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > > > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > > > [...]
> > > > > > The logic of this patch is somewhat similar to my first one. If high order
> > > > > > allocation fails, it will go normal mapping.
> > > > > >
> > > > > > However I also save the fallback position. The ones before this position are
> > > > > > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > > > > > "support the combination of PMD and PTE mapping". this  will take some
> > > > > > times as it needs to address the corner cases and do some tests.
> > > > >
> > > > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > > > and vmalloc_huge().
> > > > >
> > > > > For vmalloc_huge(), it's called in below three interfaces which are all
> > > > > invoked during boot. Basically they can succeed to get required contiguous
> > > > > physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> > > > > invocation when the required size exceeds e.g 2M. For kvmalloc_node(),
> > > > > we have told that in the code comment above __kvmalloc_node_noprof(),
> > > > > it's a best effort behaviour.
> > > > >
> > > > Take a __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as a example.
> > > > because the align requirement of huge. the real size is 4M.
> > > > if allocation first order-9 successfully and the next failed. becuase the
> > > > fallback, the layout out pages would be like order9 - 512 * order0
> > > > order9 support huge mapping, but order0 not.
> > > > with the patch above, would call vmap_small_pages_range_noflush() and do normal
> > > > mapping, the huge mapping would not exist.
> > > >
> > > > >  mm/mm_init.c <<alloc_large_system_hash>>
> > > > >  table = vmalloc_huge(size, gfp_flags);
> > > > >  net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> > > > >  new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > > > >  net/ipv4/udp.c <<udp_pernet_table_alloc>>
> > > > >  udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > > > >
> > > > > Maybe we should add code comment or document to notice people that the
> > > > > contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > > > > use it after boot.
> > > > >
> > > > > >
> > > > > > IMO, the draft can fix the current issue, it also does not have significant side
> > > > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > > > other to address the current mapping issue.
> > > > > >
> > > > > > --
> > > > > > help you, help me,
> > > > > > Hailong.
> > > > > >
> > > > >
> > > > >
> > > I check the code, the issue only happen in gfp_mask with __GFP_NOFAIL and
> > > fallback to order 0, actuaally without this commit
> > > e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
> > > if __vmalloc_area_node allocation failed, it will goto fail and try order-0.
> > >
> > > fail:
> > >         if (shift > PAGE_SHIFT) {
> > >                 shift = PAGE_SHIFT;
> > >                 align = real_align;
> > >                 size = real_size;
> > >                 goto again;
> > >         }
> > >
> > > So do we really need fallback to order-0 if nofail?
> >
> > Good catch, this is what I missed. I feel we can revert Michal's fix.
> > And just remove __GFP_NOFAIL bit when we are still allocating
> > by high-order. When "goto again" happens, we will allocate by
> > order-0, in this case, we keep the __GFP_NOFAIL.
>
> With Michal's patch, the fallback will be able to satisfy the allocation
> for the nofail case because it falls back to order-0 plus __GFP_NOFAIL. The

Hi Baoquan:

int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
		pgprot_t prot, struct page **pages, unsigned int page_shift)
{
	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;

	WARN_ON(page_shift < PAGE_SHIFT);

	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
			page_shift == PAGE_SHIFT)
		return vmap_small_pages_range_noflush(addr, end, prot, pages);

	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {  ---> huge mapping
		int err;

		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
					page_to_phys(pages[i]), prot, ---------> incorrect mapping would occur here if nofail and fallback to order0
					page_shift);
		if (err)
			return err;

		addr += 1UL << page_shift;
	}

	return 0;
}
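
To spell out the annotations: the huge-mapping loop assumes that each
step of (1U << (page_shift - PAGE_SHIFT)) pages is physically
contiguous, which an order-0 fallback does not guarantee. A minimal
illustrative helper (a sketch, not kernel code) for the assumed
invariant:

static bool step_phys_contiguous(struct page **pages, unsigned int i,
				 unsigned int step)
{
	unsigned int j;

	/* true only if pages[i..i+step-1] have consecutive PFNs */
	for (j = 1; j < step; j++)
		if (page_to_pfn(pages[i + j]) != page_to_pfn(pages[i]) + j)
			return false;

	return true;
}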
> 'if (shift > PAGE_SHIFT)' conditional check and handling may be
> problematic since it could jump to fail because the vmap_pages_range()
> invocation failed, or partially allocate huge pages and break down;
> then it will ignore the already allocated pages and redo everything.
>
> The only thing the 'if (shift > PAGE_SHIFT)' check and handling makes
> sense for is the fallback to real_size and real_align. BUT we need to
> handle the failures separately, e.g.:
> 1) __get_vm_area_node() failed;
> 2) vm_area_alloc_pages() failed when shift > PAGE_SHIFT and non-nofail;
> 3) vmap_pages_range() failed;
>
> Honestly, I didn't see where nofail is mishandled; could you point
> it out specifically? I could have missed it.
>
> Thanks
> Baoquan
>

--
help you, help me,
Hailong.
Hailong Liu July 26, 2024, 9 a.m. UTC | #16
On Fri, 26. Jul 16:48, Hailong Liu wrote:
> On Fri, 26. Jul 16:37, Baoquan He wrote:
> > On 07/26/24 at 05:29pm, Barry Song wrote:
> > > On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu <hailong.liu@oppo.com> wrote:
> > > >
> > > > On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > > > > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > > > > [...]
> > > > > > > The logic of this patch is somewhat similar to my first one. If high order
> > > > > > > allocation fails, it will go normal mapping.
> > > > > > >
> > > > > > > However I also save the fallback position. The ones before this position are
> > > > > > > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > > > > > > "support the combination of PMD and PTE mapping". this  will take some
> > > > > > > times as it needs to address the corner cases and do some tests.
> > > > > >
> > > > > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > > > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > > > > and vmalloc_huge().
> > > > > >
> > > > > > For vmalloc_huge(), it's called in below three interfaces which are all
> > > > > > invoked during boot. Basically they can succeed to get required contiguous
> > > > > > physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> > > > > > invocation when the required size exceeds e.g 2M. For kvmalloc_node(),
> > > > > > we have told that in the code comment above __kvmalloc_node_noprof(),
> > > > > > it's a best effort behaviour.
> > > > > >
> > > > > Take a __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as a example.
> > > > > because the align requirement of huge. the real size is 4M.
> > > > > if allocation first order-9 successfully and the next failed. becuase the
> > > > > fallback, the layout out pages would be like order9 - 512 * order0
> > > > > order9 support huge mapping, but order0 not.
> > > > > with the patch above, would call vmap_small_pages_range_noflush() and do normal
> > > > > mapping, the huge mapping would not exist.
> > > > >
> > > > > >  mm/mm_init.c <<alloc_large_system_hash>>
> > > > > >  table = vmalloc_huge(size, gfp_flags);
> > > > > >  net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> > > > > >  new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > > > > >  net/ipv4/udp.c <<udp_pernet_table_alloc>>
> > > > > >  udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > > > > >
> > > > > > Maybe we should add code comment or document to notice people that the
> > > > > > contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > > > > > use it after boot.
> > > > > >
> > > > > > >
> > > > > > > IMO, the draft can fix the current issue, it also does not have significant side
> > > > > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > > > > other to address the current mapping issue.
> > > > > > >
> > > > > > > --
> > > > > > > help you, help me,
> > > > > > > Hailong.
> > > > > > >
> > > > > >
> > > > > >
> > > > I check the code, the issue only happen in gfp_mask with __GFP_NOFAIL and
> > > > fallback to order 0, actuaally without this commit
> > > > e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
> > > > if __vmalloc_area_node allocation failed, it will goto fail and try order-0.
> > > >
> > > > fail:
> > > >         if (shift > PAGE_SHIFT) {
> > > >                 shift = PAGE_SHIFT;
> > > >                 align = real_align;
> > > >                 size = real_size;
> > > >                 goto again;
> > > >         }
> > > >
> > > > So do we really need fallback to order-0 if nofail?
> > >
> > > Good catch, this is what I missed. I feel we can revert Michal's fix.
> > > And just remove __GFP_NOFAIL bit when we are still allocating
> > > by high-order. When "goto again" happens, we will allocate by
> > > order-0, in this case, we keep the __GFP_NOFAIL.
> >
> > With Michal's patch, the fallback will be able to satisfy the allocation
> > for the nofail case because it falls back to order-0 plus __GFP_NOFAIL. The
>
> Hi Baoquan:
>
> int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> 		pgprot_t prot, struct page **pages, unsigned int page_shift)
> {
> 	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
>
> 	WARN_ON(page_shift < PAGE_SHIFT);
>
> 	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> 			page_shift == PAGE_SHIFT)
> 		return vmap_small_pages_range_noflush(addr, end, prot, pages);
>
> 	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {  ---> huge mapping
> 		int err;
>
> 		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> 					page_to_phys(pages[i]), prot, ---------> incorrect mapping would occur here if nofail and fallback to order0
> 					page_shift);
> 		if (err)
> 			return err;
>
> 		addr += 1UL << page_shift;
> 	}
Sorry, I misunderstood.
>
> 	return 0;
> }
> > 'if (shift > PAGE_SHIFT)' conditional check and handling may be
> > problematic since it could jump to fail because the vmap_pages_range()
> > invocation failed, or partially allocate huge pages and break down;
> > then it will ignore the already allocated pages and redo everything.
> >
> > The only thing the 'if (shift > PAGE_SHIFT)' check and handling makes
> > sense for is the fallback to real_size and real_align. BUT we need to
> > handle the failures separately, e.g.:
> > 1) __get_vm_area_node() failed;
> > 2) vm_area_alloc_pages() failed when shift > PAGE_SHIFT and non-nofail;
> > 3) vmap_pages_range() failed;

Right, this might run twice. But it also has a negative effect here:
__get_vm_area_node() & vm_area_alloc_pages() may waste vmap space and pages.
Actually, if we fall back to order-0, we don't need to align to PMD_SIZE.
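
A rough illustration of that overhead (assuming 4K PAGE_SIZE and 2M
PMD_SIZE; SZ_* constants from <linux/sizes.h>):

	size_t req        = SZ_2M + SZ_128K;       /* a ~2.1M request     */
	size_t huge_size  = ALIGN(req, PMD_SIZE);  /* 4M when mapped huge */
	size_t small_size = ALIGN(req, PAGE_SIZE); /* 2176K at order-0    */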

> >
> > Honestly, I didn't see where nofail is mishandled; could you point
> > it out specifically? I could have missed it.

> >
> > Thanks
> > Baoquan
> >
>
> --
> help you, help me,
> Hailong.
>

--
help you, help me,
Hailong.
Barry Song July 26, 2024, 9:15 a.m. UTC | #17
On Fri, Jul 26, 2024 at 8:38 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 07/26/24 at 05:29pm, Barry Song wrote:
> > On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu <hailong.liu@oppo.com> wrote:
> > >
> > > On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > > > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > > > [...]
> > > > > > The logic of this patch is somewhat similar to my first one. If high order
> > > > > > allocation fails, it will go normal mapping.
> > > > > >
> > > > > > However I also save the fallback position. The ones before this position are
> > > > > > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > > > > > "support the combination of PMD and PTE mapping". this  will take some
> > > > > > times as it needs to address the corner cases and do some tests.
> > > > >
> > > > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > > > and vmalloc_huge().
> > > > >
> > > > > For vmalloc_huge(), it's called in below three interfaces which are all
> > > > > invoked during boot. Basically they can succeed to get required contiguous
> > > > > physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> > > > > invocation when the required size exceeds e.g 2M. For kvmalloc_node(),
> > > > > we have told that in the code comment above __kvmalloc_node_noprof(),
> > > > > it's a best effort behaviour.
> > > > >
> > > > Take a __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as a example.
> > > > because the align requirement of huge. the real size is 4M.
> > > > if allocation first order-9 successfully and the next failed. becuase the
> > > > fallback, the layout out pages would be like order9 - 512 * order0
> > > > order9 support huge mapping, but order0 not.
> > > > with the patch above, would call vmap_small_pages_range_noflush() and do normal
> > > > mapping, the huge mapping would not exist.
> > > >
> > > > >  mm/mm_init.c <<alloc_large_system_hash>>
> > > > >  table = vmalloc_huge(size, gfp_flags);
> > > > >  net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> > > > >  new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > > > >  net/ipv4/udp.c <<udp_pernet_table_alloc>>
> > > > >  udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > > > >
> > > > > Maybe we should add code comment or document to notice people that the
> > > > > contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > > > > use it after boot.
> > > > >
> > > > > >
> > > > > > IMO, the draft can fix the current issue, it also does not have significant side
> > > > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > > > other to address the current mapping issue.
> > > > > >
> > > > > > --
> > > > > > help you, help me,
> > > > > > Hailong.
> > > > > >
> > > > >
> > > > >
> > > I check the code, the issue only happen in gfp_mask with __GFP_NOFAIL and
> > > fallback to order 0, actuaally without this commit
> > > e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
> > > if __vmalloc_area_node allocation failed, it will goto fail and try order-0.
> > >
> > > fail:
> > >         if (shift > PAGE_SHIFT) {
> > >                 shift = PAGE_SHIFT;
> > >                 align = real_align;
> > >                 size = real_size;
> > >                 goto again;
> > >         }
> > >
> > > So do we really need fallback to order-0 if nofail?
> >
> > Good catch, this is what I missed. I feel we can revert Michal's fix.
> > And just remove __GFP_NOFAIL bit when we are still allocating
> > by high-order. When "goto again" happens, we will allocate by
> > order-0, in this case, we keep the __GFP_NOFAIL.
>
> With Michal's patch, the fallback will be able to satisfy the allocation
> for the nofail case because it falls back to order-0 plus __GFP_NOFAIL. The

My point is that vm_area_alloc_pages() is an internal function and just an
implementation detail. As long as its caller, __vmalloc_area_node(), can
support NOFAIL, it's fine. Therefore, we can skip NOFAIL support for
high-order allocations in vm_area_alloc_pages() and limit __GFP_NOFAIL
support to order-0, as sketched below.
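
A minimal sketch of that policy inside vm_area_alloc_pages() (an assumed
shape, not merged code):

		/*
		 * Never pass __GFP_NOFAIL to a high-order attempt, so a
		 * failure ends the loop and the caller's order-0 retry,
		 * which keeps __GFP_NOFAIL, kicks in via "goto again".
		 */
		gfp_t alloc_gfp = order ? (gfp & ~__GFP_NOFAIL) : gfp;
		struct page *page = alloc_pages_node(nid, alloc_gfp, order);

		if (!page)
			break;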

Good news is that __vmalloc_node_range_noprof() already has a way to fall
back to order-0:

fail:
        if (shift > PAGE_SHIFT) {
                shift = PAGE_SHIFT;
                align = real_align;
                size = real_size;
                goto again;
        }

So, we can definitely utilize this fallback instead of implementing it
within vm_area_alloc_pages(), which would alter the page_order of the
vm_area and create an inconsistency, crashing the system due to memory
corruption.

With the higher-level fallback in __vmalloc_node_range_noprof(), we won't
need an unusual fix-up for the vm_area as you're proposing. The page_order
of the vm_area will consistently stay the same.

If there is any room for improvement, we may also add:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index caf032f0bd69..03d8148d7a02 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3806,7 +3806,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
                warn_alloc(gfp_mask, NULL,
                        "vmalloc error: size %lu, vm_struct allocation failed%s",
                        real_size, (nofail) ? ". Retrying." : "");
-               if (nofail) {
+               if (nofail && shift == PAGE_SHIFT) {
                        schedule_timeout_uninterruptible(1);
                        goto again;
                }

There's no need to keep retrying for __get_vm_area_node() until success,
as we will succeed when we fall back to order-0.

> 'if (shift > PAGE_SHIFT)' conditional check and handling may be
> problematic since it could jump to fail because the vmap_pages_range()
> invocation failed, or partially allocate huge pages and break down;
> then it will ignore the already allocated pages and redo everything.
>
> The only thing the 'if (shift > PAGE_SHIFT)' check and handling makes
> sense for is the fallback to real_size and real_align. BUT we need to
> handle the failures separately, e.g.:
> 1) __get_vm_area_node() failed;
> 2) vm_area_alloc_pages() failed when shift > PAGE_SHIFT and non-nofail;
> 3) vmap_pages_range() failed;
>
> Honestly, I didn't see where nofail is mishandled; could you point
> it out specifically? I could have missed it.
>
> Thanks
> Baoquan
>

Thanks
Barry
Baoquan He July 26, 2024, 9:29 a.m. UTC | #18
On 07/26/24 at 04:48pm, Hailong Liu wrote:
> On Fri, 26. Jul 16:37, Baoquan He wrote:
> > On 07/26/24 at 05:29pm, Barry Song wrote:
> > > On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu <hailong.liu@oppo.com> wrote:
> > > >
> > > > On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > > > > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > > > > [...]
> > > > > > > The logic of this patch is somewhat similar to my first one. If high order
> > > > > > > allocation fails, it will go normal mapping.
> > > > > > >
> > > > > > > However I also save the fallback position. The ones before this position are
> > > > > > > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > > > > > > "support the combination of PMD and PTE mapping". this  will take some
> > > > > > > times as it needs to address the corner cases and do some tests.
> > > > > >
> > > > > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > > > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > > > > and vmalloc_huge().
> > > > > >
> > > > > > For vmalloc_huge(), it's called in below three interfaces which are all
> > > > > > invoked during boot. Basically they can succeed to get required contiguous
> > > > > > physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> > > > > > invocation when the required size exceeds e.g 2M. For kvmalloc_node(),
> > > > > > we have told that in the code comment above __kvmalloc_node_noprof(),
> > > > > > it's a best effort behaviour.
> > > > > >
> > > > > Take a __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as a example.
> > > > > because the align requirement of huge. the real size is 4M.
> > > > > if allocation first order-9 successfully and the next failed. becuase the
> > > > > fallback, the layout out pages would be like order9 - 512 * order0
> > > > > order9 support huge mapping, but order0 not.
> > > > > with the patch above, would call vmap_small_pages_range_noflush() and do normal
> > > > > mapping, the huge mapping would not exist.
> > > > >
> > > > > >  mm/mm_init.c <<alloc_large_system_hash>>
> > > > > >  table = vmalloc_huge(size, gfp_flags);
> > > > > >  net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> > > > > >  new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > > > > >  net/ipv4/udp.c <<udp_pernet_table_alloc>>
> > > > > >  udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > > > > >
> > > > > > Maybe we should add code comment or document to notice people that the
> > > > > > contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > > > > > use it after boot.
> > > > > >
> > > > > > >
> > > > > > > IMO, the draft can fix the current issue, it also does not have significant side
> > > > > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > > > > other to address the current mapping issue.
> > > > > > >
> > > > > > > --
> > > > > > > help you, help me,
> > > > > > > Hailong.
> > > > > > >
> > > > > >
> > > > > >
> > > > I check the code, the issue only happen in gfp_mask with __GFP_NOFAIL and
> > > > fallback to order 0, actuaally without this commit
> > > > e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
> > > > if __vmalloc_area_node allocation failed, it will goto fail and try order-0.
> > > >
> > > > fail:
> > > >         if (shift > PAGE_SHIFT) {
> > > >                 shift = PAGE_SHIFT;
> > > >                 align = real_align;
> > > >                 size = real_size;
> > > >                 goto again;
> > > >         }
> > > >
> > > > So do we really need fallback to order-0 if nofail?
> > >
> > > Good catch, this is what I missed. I feel we can revert Michal's fix.
> > > And just remove __GFP_NOFAIL bit when we are still allocating
> > > by high-order. When "goto again" happens, we will allocate by
> > > order-0, in this case, we keep the __GFP_NOFAIL.
> >
> > With Michal's patch, the fallback will be able to satisfy the allocation
> > for the nofail case because it falls back to order-0 plus __GFP_NOFAIL. The
> 
> Hi Baoquan:
> 
> int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
> 		pgprot_t prot, struct page **pages, unsigned int page_shift)
> {
> 	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
> 
> 	WARN_ON(page_shift < PAGE_SHIFT);
> 
> 	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
> 			page_shift == PAGE_SHIFT)
> 		return vmap_small_pages_range_noflush(addr, end, prot, pages);
> 
> 	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) {  ---> huge mapping
> 		int err;
> 
> 		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
> 					page_to_phys(pages[i]), prot, ---------> incorrect mapping would occur here if nofail and fallback to order0

Thanks, I have got this issue from your patch. I mean, if we have already
adjusted the page_shift and page_order after the fallback with the draft
patch I proposed, Barry still mentioned the nofail issue; that confuses me.

> 					page_shift);
> 		if (err)
> 			return err;
> 
> 		addr += 1UL << page_shift;
> 	}
> 
> 	return 0;
> }
> > 'if (shift > PAGE_SHIFT)' conditional check and handling may be
> > problematic since it could jump to fail because the vmap_pages_range()
> > invocation failed, or partially allocate huge pages and break down;
> > then it will ignore the already allocated pages and redo everything.
> >
> > The only thing the 'if (shift > PAGE_SHIFT)' check and handling makes
> > sense for is the fallback to real_size and real_align. BUT we need to
> > handle the failures separately, e.g.:
> > 1) __get_vm_area_node() failed;
> > 2) vm_area_alloc_pages() failed when shift > PAGE_SHIFT and non-nofail;
> > 3) vmap_pages_range() failed;
> >
> > Honestly, I didn't see where nofail is mishandled; could you point
> > it out specifically? I could have missed it.
> >
> > Thanks
> > Baoquan
> >
> 
> --
> help you, help me,
> Hailong.
>
Baoquan He July 29, 2024, 1:48 a.m. UTC | #19
Hi Barry,

On 07/26/24 at 03:53pm, Barry Song wrote:
> On Fri, Jul 26, 2024 at 2:31 PM Baoquan He <bhe@redhat.com> wrote:
......
> >  mm/mm_init.c <<alloc_large_system_hash>>
> >  table = vmalloc_huge(size, gfp_flags);
> >  net/ipv4/inet_hashtables.c <<inet_pernet_hashinfo_alloc>>
> >  new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> >  net/ipv4/udp.c <<udp_pernet_table_alloc>>
> >  udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> >
> > Maybe we should add code comment or document to notice people that the
> > contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > use it after boot.
> 
> Currently, the issue goes beyond just 'contiguous physical pages are
> not guaranteed.'
> The problem includes the likelihood of failure when trying to allocate
> 2MB of contiguous
> memory. That's why I suggest we allow fallback to order-0 for
> non-nofail allocations with
> your proposed changes.

I missed this part of your comment; I agree with you. I think it's doable
with the draft patch below, combined with my earlier draft change.


diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 260897b21b11..9ae85342d337 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3581,11 +3581,11 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 		else
 			page = alloc_pages_node_noprof(nid, alloc_gfp, order);
 		if (unlikely(!page)) {
-			if (!nofail)
+			if (!nofail && !order)
 				break;
 
 			/* fall back to the zero order allocations */
-			alloc_gfp |= __GFP_NOFAIL;
+			alloc_gfp = gfp;
 			order = 0;
 			continue;
 		}


Hi Hailong,

Please feel free to collect them and post a formal patch, maybe two
patches: one to allow non-nofail allocations to fall back to order-0 in
vm_area_alloc_pages(), the other to pass the fallback page_order out to
vmap_pages_range(), if that's OK.
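
For the second patch, the interface change might look like this
(hypothetical shape for illustration only; today the function just
returns the number of allocated pages):

static unsigned int vm_area_alloc_pages(gfp_t gfp, int nid,
					unsigned int *order, /* in/out */
					unsigned int nr_pages,
					struct page **pages);

	/* caller side in __vmalloc_area_node(), sketch: */
	unsigned int order = page_shift - PAGE_SHIFT;

	area->nr_pages = vm_area_alloc_pages(gfp_mask, node, &order,
					     nr_small_pages, area->pages);
	/* if the allocator fell back, map with the order it really used */
	page_shift = order + PAGE_SHIFT;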

Thanks
Baoquan
Hailong Liu July 30, 2024, 3:24 a.m. UTC | #20
On Mon, 29. Jul 09:48, Baoquan He wrote:
[...]
> Hi Hailong,
>
> Please feel free to collect them to post formal patch, maybe two
> patches, one is to allow non-nofail to fallback to order-0 in
> vm_area_alloc_pages(), the other is passing out the fallbacked
> page_order to vmap_pages_range() if it's OK.
Sorry for the late response. I personally prefer to revert part of
the problematic patch. There are several reasons:
- It saves memory if the high-order allocation fails.
- If nofail falls back to order-0 in a vmalloc huge allocation, the
allocation size is still aligned to PMD_SIZE rather than PAGE_SIZE.

You might be concerned about performance issues. But IMO:
- If we fall back to order-0, we can make use of the bulk allocator,
as sketched below.
- Maybe we can remove VM_ALLOW_HUGE_VMAP in kvmalloc().
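
A sketch of that order-0 bulk path (using the bulk allocator API as of
~v6.10; illustrative only):

	while (nr_allocated < nr_pages) {
		unsigned long got;

		got = alloc_pages_bulk_array_node(gfp, nid,
						  nr_pages - nr_allocated,
						  pages + nr_allocated);
		if (!got)
			break;	/* give up or fall back as before */

		nr_allocated += got;
		cond_resched();
	}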

I am not sure if I have misunderstood anything. If I have, please
let me know.
>
> Thanks
> Baoquan
>
>

--
help you, help me,
Hailong.
Baoquan He July 30, 2024, 4:29 a.m. UTC | #21
On 07/30/24 at 11:24am, Hailong Liu wrote:
> On Mon, 29. Jul 09:48, Baoquan He wrote:
> [...]
> > Hi Hailong,
> >
> > Please feel free to collect them to post formal patch, maybe two
> > patches, one is to allow non-nofail to fallback to order-0 in
> > vm_area_alloc_pages(), the other is passing out the fallbacked
> > page_order to vmap_pages_range() if it's OK.
> Sorry for late response. I personally prefer to revert part of
> the problematic patch. There are several reasons:
> - Save memory usage if high order allocation failed.
> - If nofail and fallback to order0 in vmalloc huge allocation,
> actually the allocation is alighed with PMD_SIZE or not PAGE_SHIFT.

It's OK; maybe you can post a patch to show what it looks like, and we
can review and discuss it there.

> 
> You might be concerned about performance issues. But IMO,
> - If we fallback to order0, we can make use of bulk allocator.
> - Maybe we can remove VM_ALLOW_HUGE_VMAP in kvmalloc.
> 
> I am not sure if I have misunderstood anything. If I have, please
> let me know.
> >
> > Thanks
> > Baoquan
> >
> >
> 
> --
> help you, help me,
> Hailong.
>
diff mbox series

Patch

diff --git a/mm/util.c b/mm/util.c
index 669397235787..b23133b738cf 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -657,7 +657,7 @@  void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	 * protection games.
 	 */
 	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-			flags, PAGE_KERNEL, VM_ALLOW_HUGE_VMAP,
+			flags, PAGE_KERNEL, 0,
 			node, __builtin_return_address(0));
 }
 EXPORT_SYMBOL(kvmalloc_node);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 03c78fae06f3..1914768f473e 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3577,15 +3577,6 @@  vm_area_alloc_pages(gfp_t gfp, int nid,
 			page = alloc_pages(alloc_gfp, order);
 		else
 			page = alloc_pages_node(nid, alloc_gfp, order);
-		if (unlikely(!page)) {
-			if (!nofail)
-				break;
-
-			/* fall back to the zero order allocations */
-			alloc_gfp |= __GFP_NOFAIL;
-			order = 0;
-			continue;
-		}

 		/*
 		 * Higher order allocations must be able to be treated as