diff mbox series

mm/mmap: fix the adjusted length error

Message ID 1558073209-79549-1-git-send-email-chenjianhong2@huawei.com (mailing list archive)
State New, archived
Headers show
Series mm/mmap: fix the adjusted length error | expand

Commit Message

chenjianhong (A) May 17, 2019, 6:06 a.m. UTC
In linux version 4.4, a 32-bit process may fail to allocate 64M hugepage
memory by function shmat even though there is a 64M memory gap in
the process.

It is the adjusted length that causes the problem, introduced from
commit db4fbfb9523c935 ("mm: vm_unmapped_area() lookup function").
Accounting for the worst case alignment overhead, function unmapped_area
and unmapped_area_topdown adjust the search length before searching
for available vma gap. This is an estimated length, sum of the desired
length and the longest alignment offset, which can cause misjudgement
if the system has very few virtual memory left. For example, if the
longest memory gap available is 64M, we can’t get it from the system
by allocating 64M hugepage memory via shmat function. The reason is
that it requires a longger length, the sum of the desired length(64M)
and the longest alignment offset.

To fix this error ,we can calculate the alignment offset of
gap_start or gap_end to get a desired gap_start or gap_end value,
before searching for the available gap. In this way, we don't
need to adjust the search length.

Problem reproduces procedure:
1. allocate a lot of virtual memory segments via shmat and malloc
2. release one of the biggest memory segment via shmdt
3. attach the biggest memory segment via shmat

e.g.
process maps:
00008000-00009000 r-xp 00000000 00:12 3385    /tmp/memory_mmap
00011000-00012000 rw-p 00001000 00:12 3385    /tmp/memory_mmap
27536000-f756a000 rw-p 00000000 00:00 0
f756a000-f7691000 r-xp 00000000 01:00 560     /lib/libc-2.11.1.so
f7691000-f7699000 ---p 00127000 01:00 560     /lib/libc-2.11.1.so
f7699000-f769b000 r--p 00127000 01:00 560     /lib/libc-2.11.1.so
f769b000-f769c000 rw-p 00129000 01:00 560     /lib/libc-2.11.1.so
f769c000-f769f000 rw-p 00000000 00:00 0
f769f000-f76c0000 r-xp 00000000 01:00 583     /lib/libgcc_s.so.1
f76c0000-f76c7000 ---p 00021000 01:00 583     /lib/libgcc_s.so.1
f76c7000-f76c8000 rw-p 00020000 01:00 583     /lib/libgcc_s.so.1
f76c8000-f76e5000 r-xp 00000000 01:00 543     /lib/ld-2.11.1.so
f76e9000-f76ea000 rw-p 00000000 00:00 0
f76ea000-f76ec000 rw-p 00000000 00:00 0
f76ec000-f76ed000 r--p 0001c000 01:00 543     /lib/ld-2.11.1.so
f76ed000-f76ee000 rw-p 0001d000 01:00 543     /lib/ld-2.11.1.so
f7800000-f7a00000 rw-s 00000000 00:0e 0       /SYSV000000ea (deleted)
fba00000-fca00000 rw-s 00000000 00:0e 65538   /SYSV000000ec (deleted)
fca00000-fce00000 rw-s 00000000 00:0e 98307   /SYSV000000ed (deleted)
fce00000-fd800000 rw-s 00000000 00:0e 131076  /SYSV000000ee (deleted)
ff913000-ff934000 rw-p 00000000 00:00 0       [stack]
ffff0000-ffff1000 r-xp 00000000 00:00 0       [vectors]

from 0xf7a00000 to fba00000, it has 64M memory gap, but we can't get
it from kernel.

Signed-off-by: jianhong chen <chenjianhong2@huawei.com>
Cc: stable@vger.kernel.org
---
 mm/mmap.c | 43 +++++++++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 14 deletions(-)

Comments

Michel Lespinasse May 18, 2019, 12:12 a.m. UTC | #1
I worry that the proposed change turns the search from an O(log N)
worst case into a O(N) one.

To see why the current search is O(log N), it is easiest to start by
imagining a simplified search algorithm that wouldn't include the low
and high address limits. In that algorithm, backtracking through the
vma tree is never necessary - the tree walk can always know, prior to
going left or right, if a suitable gap will be found in the
corresponding subtree.

The code we have today does have to respect the low and high address
limits, so it does need to implement backtracking - but this
backtracking only occurs to back out of subtrees that include the low
address limit (the search went 'left' into a subtree that has a large
enough gap, but the gap turns out to be below the limit so it can't be
used and the search needs to go 'right' instead). Because of this, the
amount of backtracking that can occur is very limited, and the search
is still O(log N) in the worst case.

With your proposed change, backtracking could occur not only around
the low address limit, but also at any node within the search tree,
when it turns out that a gap that seemed large enough actually isn't
due to alignment constraints. So, the code should still work, but it
could backtrack more in the worst case, turning the worst case search
into an O(N) thing.

I am not sure what to do about this. First I would want to understand
more about your test case; is this something that you stumbled upon
without expecting it or was it an artificially constructed case to
show the limitations of the current search algorithm ? Also, if your
process does something unusual and expects to be able to map (close
to) the entirety of its address space, would it be reasonable for it
to manually manage the address space and pass explicit addresses to
mmap / shmat ?

On Thu, May 16, 2019 at 11:02 PM jianhong chen <chenjianhong2@huawei.com> wrote:
> In linux version 4.4, a 32-bit process may fail to allocate 64M hugepage
> memory by function shmat even though there is a 64M memory gap in
> the process.
>
> It is the adjusted length that causes the problem, introduced from
> commit db4fbfb9523c935 ("mm: vm_unmapped_area() lookup function").
> Accounting for the worst case alignment overhead, function unmapped_area
> and unmapped_area_topdown adjust the search length before searching
> for available vma gap. This is an estimated length, sum of the desired
> length and the longest alignment offset, which can cause misjudgement
> if the system has very few virtual memory left. For example, if the
> longest memory gap available is 64M, we can’t get it from the system
> by allocating 64M hugepage memory via shmat function. The reason is
> that it requires a longger length, the sum of the desired length(64M)
> and the longest alignment offset.
>
> To fix this error ,we can calculate the alignment offset of
> gap_start or gap_end to get a desired gap_start or gap_end value,
> before searching for the available gap. In this way, we don't
> need to adjust the search length.
>
> Problem reproduces procedure:
> 1. allocate a lot of virtual memory segments via shmat and malloc
> 2. release one of the biggest memory segment via shmdt
> 3. attach the biggest memory segment via shmat
>
> e.g.
> process maps:
> 00008000-00009000 r-xp 00000000 00:12 3385    /tmp/memory_mmap
> 00011000-00012000 rw-p 00001000 00:12 3385    /tmp/memory_mmap
> 27536000-f756a000 rw-p 00000000 00:00 0
> f756a000-f7691000 r-xp 00000000 01:00 560     /lib/libc-2.11.1.so
> f7691000-f7699000 ---p 00127000 01:00 560     /lib/libc-2.11.1.so
> f7699000-f769b000 r--p 00127000 01:00 560     /lib/libc-2.11.1.so
> f769b000-f769c000 rw-p 00129000 01:00 560     /lib/libc-2.11.1.so
> f769c000-f769f000 rw-p 00000000 00:00 0
> f769f000-f76c0000 r-xp 00000000 01:00 583     /lib/libgcc_s.so.1
> f76c0000-f76c7000 ---p 00021000 01:00 583     /lib/libgcc_s.so.1
> f76c7000-f76c8000 rw-p 00020000 01:00 583     /lib/libgcc_s.so.1
> f76c8000-f76e5000 r-xp 00000000 01:00 543     /lib/ld-2.11.1.so
> f76e9000-f76ea000 rw-p 00000000 00:00 0
> f76ea000-f76ec000 rw-p 00000000 00:00 0
> f76ec000-f76ed000 r--p 0001c000 01:00 543     /lib/ld-2.11.1.so
> f76ed000-f76ee000 rw-p 0001d000 01:00 543     /lib/ld-2.11.1.so
> f7800000-f7a00000 rw-s 00000000 00:0e 0       /SYSV000000ea (deleted)
> fba00000-fca00000 rw-s 00000000 00:0e 65538   /SYSV000000ec (deleted)
> fca00000-fce00000 rw-s 00000000 00:0e 98307   /SYSV000000ed (deleted)
> fce00000-fd800000 rw-s 00000000 00:0e 131076  /SYSV000000ee (deleted)
> ff913000-ff934000 rw-p 00000000 00:00 0       [stack]
> ffff0000-ffff1000 r-xp 00000000 00:00 0       [vectors]
>
> from 0xf7a00000 to fba00000, it has 64M memory gap, but we can't get
> it from kernel.
chenjianhong (A) May 18, 2019, 7:05 a.m. UTC | #2
I explain my test code and the problem in detail. This problem is found in 
32-bit user process, because its virtual is limited, 3G or 4G. 

First, I explain the bug I found. Function unmapped_area and 
unmapped_area_topdowns adjust search length to account for worst 
case alignment overhead, the code is ' length = info->length + info->align_mask; '.
The variable info->length is the length we allocate and the variable 
info->align_mask accounts for the alignment, because the gap_start  or gap_end 
value also should be an alignment address, but we can't know the alignment offset.
So in the current algorithm, it uses the max alignment offset, this value maybe zero
or other(0x1ff000 for shmat function). 
Is it reasonable way? The required value is longer than what I allocate.
What's more,  why for the first time I can allocate the memory successfully
Via shmat, but after releasing the memory via shmdt and I want to attach
again, it fails. This is not acceptable for many people.

Second, I explain my test code. The code I have sent an email. The following is
the step. I don't think it's something unusual or unreasonable, because the virtual
memory space is enough, but the process can allocate from it. And we can't pass
explicit addresses to function mmap or shmat, the address is getting from the left
vma gap.
 1, we allocat large virtual memory;
 2, we allocate hugepage memory via shmat, and release one
 of the hugepage memory block;
 3, we allocate hugepage memory by shmat again, this will fail.

Third, I want to introduce my change in the current algorithm. I don't change the
current algorithm. Also, I think there maybe a better way to fix this error. Nowadays,
I can just adjust the gap_start value. Take unmapped_area function an example:
The current algorithm:
1, Adjust search length to account for worst case alignment overhead
|----------------------------------------- -----------|----------------------------------------|
gap_start                                                                                                                             gap_end
|____________info->length_____________|___ info->align_mask _________|    
|______________________________length_______________________  ____|                 
2, search for the suitable gap, the required length is the sum of info->length and
info->align_mask
3, Adjust gap address to the desired alignment
My changed:
1, Adjust the gap_start address to the desired alignment
|---------------------------------|--------------------- -----------|--------------------------|
gap_start                            gap_start'                                                                        gap_end
|____ gap_start_offset____|
                                                    |__________________ info->length ___________|
2, search for the suitable gap, from gap_start' to gap_end. the required length is
info->length.

Test code:
#include <stdio.h>
#include <stdlib.h>
#include <sys/shm.h>
#include <errno.h>

int size = 0x4000000;
int shmid[5];
void *shm[5];
int key[5] = {234, 235, 236, 237, 238};
unsigned long seg_size[5] = {0x200000, 0x4000000, 0x1000000,
                        0x400000, 0xa00000};

int init_memory(void)
{
        int i,j;
        for (i = 0; i < 5; i++) {
                shmid[i] = shmget((key_t)key[i], seg_size[i], 0666
                                | IPC_CREAT | SHM_HUGETLB);
                if (shmid[i] == -1) {
                        fprintf(stderr, "shmget[%d] error(%d)\n",
                                i, errno);
                        goto failed;
                }
        }
        return 0;
failed:
        for (j = 0; j < i; j++) {
                shmctl(shmid[j], IPC_RMID, 0);
        }
        return -1;
}

int del_segmem(void)
{
        int i = 0;
        for (i = 0; i < 5; i++) {
                shmdt(shm[i]);
                shmctl(shmid[i], IPC_RMID, 0);
        }
        return 0;
}

int fun_C(void)
{
        int i = 0;
        printf("-----------------------fun_C shmat-----------------------\n");
        for (i = 0; i < 5; i+=1) {
                shm[i] = shmat(shmid[i], 0, 0);
                if (shm[i] == (void *)-1) {
                        fprintf(stderr, "shmat[%d] failed %d\n", i, errno);
                        return -1;
                }
        }
        sleep(2);
        system("pid=`ps -e | grep memory | awk '{print $1}'`;cat /proc/$pid/maps");
        shmdt(shm[1]);
        printf("-----------------------after fun_C shmdt-----------------------\n");
        system("pid=`ps -e | grep memory | awk '{print $1}'`;cat /proc/$pid/maps");
        printf("-----------------------fun_C ok-----------------------\n");
        return 0;
}

int fun_A(void)
{
        int i = 1;
        shm[i] = shmat(shmid[1], 0, 0);
        printf("-----------------------fun_A shmat-----------------------\n");
        if (shm[i] == (void *)-1) {
                fprintf(stderr, "funa shmat[%d] size(0x%08x)failed %d\n",
                        i, seg_size[i], errno);
                return -1;
        }
        system("pid=`ps -e | grep memory | awk '{print $1}'`;cat /proc/$pid/maps");
        sleep(2);
        shmdt(shm[1]);
        printf("-----------------------fun_A shmdt-----------------------\n");
        system("pid=`ps -e | grep memory | awk '{print $1}'`;cat /proc/$pid/maps");
        printf("-----------------------fun_A ok-----------------------\n");
        return 0;
}

/*
 * first, we allocat large virtual memory;
 * second, we allocate hugepage memory by shmat, and release one
 * of the hugepage memory block;
 * third, we allocate hugepage memory by shmat again, this will fail.
 */

int main(int argc,char * argv[])
{
        int i;
        int ret = 0;
        for (i == 0; i < 52; i++) {
                malloc(size);//first
        }
        if (init_memory() != 0) {
                ret = -1;
                goto failed_memory;
        }
        fun_C();//second
        sleep(5);
        ret = fun_A();//third
        if (ret != 0) {
                goto failed_memory;
        }
        sleep(3);
failed_memory:
        del_segmem();
        return ret;
}


-----Original Message-----
From: Michel Lespinasse [mailto:walken@google.com] 
Sent: Saturday, May 18, 2019 8:13 AM
To: chenjianhong (A) <chenjianhong2@huawei.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>; Andrew Morton <akpm@linux-foundation.org>; mhocko@suse.com; Vlastimil Babka <vbabka@suse.cz>; Kirill A. Shutemov <kirill.shutemov@linux.intel.com>; Yang Shi <yang.shi@linux.alibaba.com>; jannh@google.com; steve.capper@arm.com; tiny.windzz@gmail.com; LKML <linux-kernel@vger.kernel.org>; linux-mm <linux-mm@kvack.org>; stable@vger.kernel.org
Subject: Re: [PATCH] mm/mmap: fix the adjusted length error

I worry that the proposed change turns the search from an O(log N)
worst case into a O(N) one.

To see why the current search is O(log N), it is easiest to start by
imagining a simplified search algorithm that wouldn't include the low
and high address limits. In that algorithm, backtracking through the
vma tree is never necessary - the tree walk can always know, prior to
going left or right, if a suitable gap will be found in the
corresponding subtree.

The code we have today does have to respect the low and high address
limits, so it does need to implement backtracking - but this
backtracking only occurs to back out of subtrees that include the low
address limit (the search went 'left' into a subtree that has a large
enough gap, but the gap turns out to be below the limit so it can't be
used and the search needs to go 'right' instead). Because of this, the
amount of backtracking that can occur is very limited, and the search
is still O(log N) in the worst case.

With your proposed change, backtracking could occur not only around
the low address limit, but also at any node within the search tree,
when it turns out that a gap that seemed large enough actually isn't
due to alignment constraints. So, the code should still work, but it
could backtrack more in the worst case, turning the worst case search
into an O(N) thing.

I am not sure what to do about this. First I would want to understand
more about your test case; is this something that you stumbled upon
without expecting it or was it an artificially constructed case to
show the limitations of the current search algorithm ? Also, if your
process does something unusual and expects to be able to map (close
to) the entirety of its address space, would it be reasonable for it
to manually manage the address space and pass explicit addresses to
mmap / shmat ?

On Thu, May 16, 2019 at 11:02 PM jianhong chen <chenjianhong2@huawei.com> wrote:
> In linux version 4.4, a 32-bit process may fail to allocate 64M hugepage
> memory by function shmat even though there is a 64M memory gap in
> the process.
>
> It is the adjusted length that causes the problem, introduced from
> commit db4fbfb9523c935 ("mm: vm_unmapped_area() lookup function").
> Accounting for the worst case alignment overhead, function unmapped_area
> and unmapped_area_topdown adjust the search length before searching
> for available vma gap. This is an estimated length, sum of the desired
> length and the longest alignment offset, which can cause misjudgement
> if the system has very few virtual memory left. For example, if the
> longest memory gap available is 64M, we can’t get it from the system
> by allocating 64M hugepage memory via shmat function. The reason is
> that it requires a longger length, the sum of the desired length(64M)
> and the longest alignment offset.
>
> To fix this error ,we can calculate the alignment offset of
> gap_start or gap_end to get a desired gap_start or gap_end value,
> before searching for the available gap. In this way, we don't
> need to adjust the search length.
>
> Problem reproduces procedure:
> 1. allocate a lot of virtual memory segments via shmat and malloc
> 2. release one of the biggest memory segment via shmdt
> 3. attach the biggest memory segment via shmat
>
> e.g.
> process maps:
> 00008000-00009000 r-xp 00000000 00:12 3385    /tmp/memory_mmap
> 00011000-00012000 rw-p 00001000 00:12 3385    /tmp/memory_mmap
> 27536000-f756a000 rw-p 00000000 00:00 0
> f756a000-f7691000 r-xp 00000000 01:00 560     /lib/libc-2.11.1.so
> f7691000-f7699000 ---p 00127000 01:00 560     /lib/libc-2.11.1.so
> f7699000-f769b000 r--p 00127000 01:00 560     /lib/libc-2.11.1.so
> f769b000-f769c000 rw-p 00129000 01:00 560     /lib/libc-2.11.1.so
> f769c000-f769f000 rw-p 00000000 00:00 0
> f769f000-f76c0000 r-xp 00000000 01:00 583     /lib/libgcc_s.so.1
> f76c0000-f76c7000 ---p 00021000 01:00 583     /lib/libgcc_s.so.1
> f76c7000-f76c8000 rw-p 00020000 01:00 583     /lib/libgcc_s.so.1
> f76c8000-f76e5000 r-xp 00000000 01:00 543     /lib/ld-2.11.1.so
> f76e9000-f76ea000 rw-p 00000000 00:00 0
> f76ea000-f76ec000 rw-p 00000000 00:00 0
> f76ec000-f76ed000 r--p 0001c000 01:00 543     /lib/ld-2.11.1.so
> f76ed000-f76ee000 rw-p 0001d000 01:00 543     /lib/ld-2.11.1.so
> f7800000-f7a00000 rw-s 00000000 00:0e 0       /SYSV000000ea (deleted)
> fba00000-fca00000 rw-s 00000000 00:0e 65538   /SYSV000000ec (deleted)
> fca00000-fce00000 rw-s 00000000 00:0e 98307   /SYSV000000ed (deleted)
> fce00000-fd800000 rw-s 00000000 00:0e 131076  /SYSV000000ee (deleted)
> ff913000-ff934000 rw-p 00000000 00:00 0       [stack]
> ffff0000-ffff1000 r-xp 00000000 00:00 0       [vectors]
>
> from 0xf7a00000 to fba00000, it has 64M memory gap, but we can't get
> it from kernel.
Andrew Morton July 12, 2019, 1:20 a.m. UTC | #3
On Sat, 18 May 2019 07:05:07 +0000 "chenjianhong (A)" <chenjianhong2@huawei.com> wrote:

> I explain my test code and the problem in detail. This problem is found in 
> 32-bit user process, because its virtual is limited, 3G or 4G. 
> 
> First, I explain the bug I found. Function unmapped_area and 
> unmapped_area_topdowns adjust search length to account for worst 
> case alignment overhead, the code is ' length = info->length + info->align_mask; '.
> The variable info->length is the length we allocate and the variable 
> info->align_mask accounts for the alignment, because the gap_start  or gap_end 
> value also should be an alignment address, but we can't know the alignment offset.
> So in the current algorithm, it uses the max alignment offset, this value maybe zero
> or other(0x1ff000 for shmat function). 
> Is it reasonable way? The required value is longer than what I allocate.
> What's more,  why for the first time I can allocate the memory successfully
> Via shmat, but after releasing the memory via shmdt and I want to attach
> again, it fails. This is not acceptable for many people.
> 
> Second, I explain my test code. The code I have sent an email. The following is
> the step. I don't think it's something unusual or unreasonable, because the virtual
> memory space is enough, but the process can allocate from it. And we can't pass
> explicit addresses to function mmap or shmat, the address is getting from the left
> vma gap.
>  1, we allocat large virtual memory;
>  2, we allocate hugepage memory via shmat, and release one
>  of the hugepage memory block;
>  3, we allocate hugepage memory by shmat again, this will fail.

How significant is this problem in real-world use cases?  How much
trouble is it causing?

> Third, I want to introduce my change in the current algorithm. I don't change the
> current algorithm. Also, I think there maybe a better way to fix this error. Nowadays,
> I can just adjust the gap_start value.

Have you looked further into this?  Michel is concerned about the
performance cost of the current solution.
chenjianhong (A) July 12, 2019, 10:53 a.m. UTC | #4
Thank you for your reply! 
> How significant is this problem in real-world use cases?  How much trouble is it causing?
   In my opinion, this problem is very rare in real-world use cases. In arm64
   or x86 environment, the virtual memory is enough. In arm32 environment,
   each process has only 3G or 4G or less, but we seldom use out all of the virtual memory,
   it only happens in some special environment. They almost use out all the virtual memory, and
   in some moment, they will change their working mode so they will release and allocate
   memory again. This current length limitation will cause this problem. I explain it's the memory
   length limitation. But they can't accept the reason, it is unreasonable that we fail to allocate
   memory even though the memory gap is enough.

> Have you looked further into this?  Michel is concerned about the performance cost of the current solution.
   The current algorithm(change before) is wonderful, and it has been used for a long time, I don't
   think it is worthy to change the whole algorithm in order to fix this problem. Therefore, I just
   adjust the gap_start and gap_end value in place of the length. My change really affects the
   performance because I calculate the gap_start and gap_end value again and again. Does it affect
   too much performance?  I have no complex environment, so I can't test it, but I don't think it will cause
   too much performance loss. First, I don't change the whole algorithm. Second, unmapped_area and
   unmapped_area_topdown function aren't used frequently. Maybe there are some big performance problems
   I'm not concerned about. But I think if that's not a problem, there should be a limitation description.

-----Original Message-----
From: Andrew Morton [mailto:akpm@linux-foundation.org] 
Sent: Friday, July 12, 2019 9:20 AM
To: chenjianhong (A) <chenjianhong2@huawei.com>
Cc: Michel Lespinasse <walken@google.com>; Greg Kroah-Hartman <gregkh@linuxfoundation.org>; mhocko@suse.com; Vlastimil Babka <vbabka@suse.cz>; Kirill A. Shutemov <kirill.shutemov@linux.intel.com>; Yang Shi <yang.shi@linux.alibaba.com>; jannh@google.com; steve.capper@arm.com; tiny.windzz@gmail.com; LKML <linux-kernel@vger.kernel.org>; linux-mm <linux-mm@kvack.org>; stable@vger.kernel.org; willy@infradead.org
Subject: Re: [PATCH] mm/mmap: fix the adjusted length error

On Sat, 18 May 2019 07:05:07 +0000 "chenjianhong (A)" <chenjianhong2@huawei.com> wrote:

> I explain my test code and the problem in detail. This problem is 
> found in 32-bit user process, because its virtual is limited, 3G or 4G.
> 
> First, I explain the bug I found. Function unmapped_area and 
> unmapped_area_topdowns adjust search length to account for worst case 
> alignment overhead, the code is ' length = info->length + info->align_mask; '.
> The variable info->length is the length we allocate and the variable
> info->align_mask accounts for the alignment, because the gap_start  or 
> info->gap_end
> value also should be an alignment address, but we can't know the alignment offset.
> So in the current algorithm, it uses the max alignment offset, this 
> value maybe zero or other(0x1ff000 for shmat function).
> Is it reasonable way? The required value is longer than what I allocate.
> What's more,  why for the first time I can allocate the memory 
> successfully Via shmat, but after releasing the memory via shmdt and I 
> want to attach again, it fails. This is not acceptable for many people.
> 
> Second, I explain my test code. The code I have sent an email. The 
> following is the step. I don't think it's something unusual or 
> unreasonable, because the virtual memory space is enough, but the 
> process can allocate from it. And we can't pass explicit addresses to 
> function mmap or shmat, the address is getting from the left vma gap.
>  1, we allocat large virtual memory;
>  2, we allocate hugepage memory via shmat, and release one  of the 
> hugepage memory block;  3, we allocate hugepage memory by shmat again, 
> this will fail.

How significant is this problem in real-world use cases?  How much trouble is it causing?

> Third, I want to introduce my change in the current algorithm. I don't 
> change the current algorithm. Also, I think there maybe a better way 
> to fix this error. Nowadays, I can just adjust the gap_start value.

Have you looked further into this?  Michel is concerned about the performance cost of the current solution.
Michel Lespinasse July 12, 2019, 5 p.m. UTC | #5
On Fri, Jul 12, 2019 at 3:53 AM chenjianhong (A)
<chenjianhong2@huawei.com> wrote:
> Thank you for your reply!
> > How significant is this problem in real-world use cases?  How much trouble is it causing?
>    In my opinion, this problem is very rare in real-world use cases. In arm64
>    or x86 environment, the virtual memory is enough. In arm32 environment,
>    each process has only 3G or 4G or less, but we seldom use out all of the virtual memory,
>    it only happens in some special environment. They almost use out all the virtual memory, and
>    in some moment, they will change their working mode so they will release and allocate
>    memory again. This current length limitation will cause this problem. I explain it's the memory
>    length limitation. But they can't accept the reason, it is unreasonable that we fail to allocate
>    memory even though the memory gap is enough.

Right. So to summarize, you have a customer accidentally hitting this
and asking you about it ? and I assume their workload is not public ?

> > Have you looked further into this?  Michel is concerned about the performance cost of the current solution.
>    The current algorithm(change before) is wonderful, and it has been used for a long time, I don't
>    think it is worthy to change the whole algorithm in order to fix this problem. Therefore, I just
>    adjust the gap_start and gap_end value in place of the length. My change really affects the
>    performance because I calculate the gap_start and gap_end value again and again. Does it affect
>    too much performance?  I have no complex environment, so I can't test it, but I don't think it will cause
>    too much performance loss. First, I don't change the whole algorithm. Second, unmapped_area and
>    unmapped_area_topdown function aren't used frequently. Maybe there are some big performance problems
>    I'm not concerned about. But I think if that's not a problem, there should be a limitation description.

The case I am worried about is if there are a lot of gaps that are
large enough for an unaligned allocation, but too small for an aligned
one.

You could create the bad case as follows:
- Allocate a huge memory block (no need to populate it, so it can
really be as large as virtual memory will allow)
- Free a bunch of 2M holes in that block, but none of them are aligned
- Try to force allocation of a 2M aligned block

With the current code, the allocation will quickly skip over the
unaligned 2M holes. It will either find a 4M gap and allocate a 2M
aligned block from it, or it will fail, but it will be quick in either
case. With the suggested change, the allocation would try each of the
unaligned 2M holes, which could take a long time, before eventually
either finding a large enough aligned gap, or failing.

I can see two ways around this:
- the code could search for a 4M gap at first, like it currently does,
and that fails it could look at all 2M gaps and see if one of them is
aligned. So, there would still be the slow case, but only if the
initial (fast) check failed. Maybe there should be a sysfs setting to
enable the second pass, which would be disabled by default at least on
64-bit systems.
- If the issue only happens when allocating huge pages, and we know
the possible huge page sizes for a process from the start, we could
maintain more information about the gaps so that we could quickly
search for a suitable aligned gaps. that is, for each subtree we would
store both the highest 4K aligned size that can be allocated, and the
highest 2M aligned size as well. That makes a more complete solution
but probably overkill as we are not hitting this frequently enough to
justify the complications.

>
> -----Original Message-----
> From: Andrew Morton [mailto:akpm@linux-foundation.org]
> Sent: Friday, July 12, 2019 9:20 AM
> To: chenjianhong (A) <chenjianhong2@huawei.com>
> Cc: Michel Lespinasse <walken@google.com>; Greg Kroah-Hartman <gregkh@linuxfoundation.org>; mhocko@suse.com; Vlastimil Babka <vbabka@suse.cz>; Kirill A. Shutemov <kirill.shutemov@linux.intel.com>; Yang Shi <yang.shi@linux.alibaba.com>; jannh@google.com; steve.capper@arm.com; tiny.windzz@gmail.com; LKML <linux-kernel@vger.kernel.org>; linux-mm <linux-mm@kvack.org>; stable@vger.kernel.org; willy@infradead.org
> Subject: Re: [PATCH] mm/mmap: fix the adjusted length error
>
> On Sat, 18 May 2019 07:05:07 +0000 "chenjianhong (A)" <chenjianhong2@huawei.com> wrote:
>
> > I explain my test code and the problem in detail. This problem is
> > found in 32-bit user process, because its virtual is limited, 3G or 4G.
> >
> > First, I explain the bug I found. Function unmapped_area and
> > unmapped_area_topdowns adjust search length to account for worst case
> > alignment overhead, the code is ' length = info->length + info->align_mask; '.
> > The variable info->length is the length we allocate and the variable
> > info->align_mask accounts for the alignment, because the gap_start  or
> > info->gap_end
> > value also should be an alignment address, but we can't know the alignment offset.
> > So in the current algorithm, it uses the max alignment offset, this
> > value maybe zero or other(0x1ff000 for shmat function).
> > Is it reasonable way? The required value is longer than what I allocate.
> > What's more,  why for the first time I can allocate the memory
> > successfully Via shmat, but after releasing the memory via shmdt and I
> > want to attach again, it fails. This is not acceptable for many people.
> >
> > Second, I explain my test code. The code I have sent an email. The
> > following is the step. I don't think it's something unusual or
> > unreasonable, because the virtual memory space is enough, but the
> > process can allocate from it. And we can't pass explicit addresses to
> > function mmap or shmat, the address is getting from the left vma gap.
> >  1, we allocat large virtual memory;
> >  2, we allocate hugepage memory via shmat, and release one  of the
> > hugepage memory block;  3, we allocate hugepage memory by shmat again,
> > this will fail.
>
> How significant is this problem in real-world use cases?  How much trouble is it causing?
>
> > Third, I want to introduce my change in the current algorithm. I don't
> > change the current algorithm. Also, I think there maybe a better way
> > to fix this error. Nowadays, I can just adjust the gap_start value.
>
> Have you looked further into this?  Michel is concerned about the performance cost of the current solution.
>
diff mbox series

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index bd7b9f2..c5a5782 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1865,6 +1865,22 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 	return error;
 }
 
+static inline unsigned long gap_start_offset(struct vm_unmapped_area_info *info,
+					unsigned long addr)
+{
+	/* get gap_start offset to adjust gap address to the
+	 * desired alignment
+	 */
+	return (info->align_offset - addr) & info->align_mask;
+}
+
+static inline unsigned long gap_end_offset(struct vm_unmapped_area_info *info,
+					unsigned long addr)
+{
+	/* get gap_end offset to adjust gap address to the desired alignment */
+	return (addr - info->align_offset) & info->align_mask;
+}
+
 unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 {
 	/*
@@ -1879,10 +1895,7 @@  unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 	struct vm_area_struct *vma;
 	unsigned long length, low_limit, high_limit, gap_start, gap_end;
 
-	/* Adjust search length to account for worst case alignment overhead */
-	length = info->length + info->align_mask;
-	if (length < info->length)
-		return -ENOMEM;
+	length = info->length;
 
 	/* Adjust search limits by the desired length */
 	if (info->high_limit < length)
@@ -1914,6 +1927,7 @@  unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 		}
 
 		gap_start = vma->vm_prev ? vm_end_gap(vma->vm_prev) : 0;
+		gap_start += gap_start_offset(info, gap_start);
 check_current:
 		/* Check if current node has a suitable gap */
 		if (gap_start > high_limit)
@@ -1942,6 +1956,7 @@  unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 				       struct vm_area_struct, vm_rb);
 			if (prev == vma->vm_rb.rb_left) {
 				gap_start = vm_end_gap(vma->vm_prev);
+				gap_start += gap_start_offset(info, gap_start);
 				gap_end = vm_start_gap(vma);
 				goto check_current;
 			}
@@ -1951,17 +1966,17 @@  unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 check_highest:
 	/* Check highest gap, which does not precede any rbtree node */
 	gap_start = mm->highest_vm_end;
+	gap_start += gap_start_offset(info, gap_start);
 	gap_end = ULONG_MAX;  /* Only for VM_BUG_ON below */
 	if (gap_start > high_limit)
 		return -ENOMEM;
 
 found:
 	/* We found a suitable gap. Clip it with the original low_limit. */
-	if (gap_start < info->low_limit)
+	if (gap_start < info->low_limit) {
 		gap_start = info->low_limit;
-
-	/* Adjust gap address to the desired alignment */
-	gap_start += (info->align_offset - gap_start) & info->align_mask;
+		gap_start += gap_start_offset(info, gap_start);
+	}
 
 	VM_BUG_ON(gap_start + info->length > info->high_limit);
 	VM_BUG_ON(gap_start + info->length > gap_end);
@@ -1974,16 +1989,14 @@  unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
 	struct vm_area_struct *vma;
 	unsigned long length, low_limit, high_limit, gap_start, gap_end;
 
-	/* Adjust search length to account for worst case alignment overhead */
-	length = info->length + info->align_mask;
-	if (length < info->length)
-		return -ENOMEM;
+	length = info->length;
 
 	/*
 	 * Adjust search limits by the desired length.
 	 * See implementation comment at top of unmapped_area().
 	 */
 	gap_end = info->high_limit;
+	gap_end -= gap_end_offset(info, gap_end);
 	if (gap_end < length)
 		return -ENOMEM;
 	high_limit = gap_end - length;
@@ -2020,6 +2033,7 @@  unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
 check_current:
 		/* Check if current node has a suitable gap */
 		gap_end = vm_start_gap(vma);
+		gap_end -= gap_end_offset(info, gap_end);
 		if (gap_end < low_limit)
 			return -ENOMEM;
 		if (gap_start <= high_limit &&
@@ -2054,13 +2068,14 @@  unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
 
 found:
 	/* We found a suitable gap. Clip it with the original high_limit. */
-	if (gap_end > info->high_limit)
+	if (gap_end > info->high_limit) {
 		gap_end = info->high_limit;
+		gap_end -= gap_end_offset(info, gap_end);
+	}
 
 found_highest:
 	/* Compute highest gap address at the desired alignment */
 	gap_end -= info->length;
-	gap_end -= (gap_end - info->align_offset) & info->align_mask;
 
 	VM_BUG_ON(gap_end < info->low_limit);
 	VM_BUG_ON(gap_end < gap_start);