
[RFC] mm: support multi_freearea for the reduction of external fragmentation

Message ID 20210414023803.937-1-lipeifeng@oppo.com (mailing list archive)
State New, archived
Series [RFC] mm: support multi_freearea for the reduction of external fragmentation

Commit Message

李培锋 April 14, 2021, 2:38 a.m. UTC
From: lipeifeng <lipeifeng@oppo.com>

This patch "sorts" the free pages in buddy by page PFN to concentrate
low-order-page allocations in the front area of memory and high-order-page
allocations in the back, so that there is little memory pollution in the
back area of memory and the probability of high-order-page allocation
would be increased significantly.
-----------------------------------------------------------------------

  1) Divide memory into several segments by page PFN
     "Multi_freearea" divides memory into FREE_AREA_COUNTS segments by
     page PFN; each memory segment corresponds to one free_area.

     Example: a machine with 4G of physical memory and FREE_AREA_COUNTS(4):
        page-PFN:0x0     0x40000(1G)   0x80000(2G)  0xc0000(3G) 0xFFFFF(4G)
                 |------------|--------------|--------------|-------------|
        free_area:  [0][]           [1][]           [2][]        [3][]

     NOTE: Selecting the corresponding free_area when pages are freed back
           to buddy:
        - page PFN in [0, free_area_segment[0].max_pfn] -> free_area[0][]
        - page PFN in (free_area_segment[flc - 1].max_pfn,
                       free_area_segment[flc].max_pfn] -> free_area[flc][]
                      (flc > 0)

     This way, all pages in the same segment/free_area fall within a
     certain range of PFNs.

  2) Select the corresponding free_area when allocating pages
     "Multi_freearea" selects the free_areas to search by the
     allocation order:
        - order <  HIGH_ORDER_TO_FLC:
	        free_area[0] -> ... -> free_area[FREE_AREA_COUNTS - 1]
        - order >= HIGH_ORDER_TO_FLC:
	        free_area[FREE_AREA_COUNTS - 1] -> ... -> free_area[0]

     Example:
        A machine with 4G of physical memory, FREE_AREA_COUNTS(4)
        and HIGH_ORDER_TO_FLC(3).
	If the user allocates a page (order = 0), it is taken from
	free_area[0][] first; if that fails, free_area[1][] is tried,
	and so on.
	If the user allocates a page (order = 4), it is taken from
	free_area[3][] first; if that fails, free_area[2][] is tried,
	and so on.

     This way, low-order pages are concentrated in the front area of
     memory. Because there is little memory pollution in the back area
     of memory, the success probability of high-order allocation is
     improved.

  3) Adjust the location of free pages in the free_list
     "Multi_freearea" places free pages at the head of the free_list if
     the page PFN is smaller than free_area_segment[flc]->median_pfn,
     and at the tail otherwise.

     Example:
        page-PFN:        free_area_segment[flc]->median_pfn
                                        |
        free_list: page->page->page->...|...page->page->page
        pages-PFN:|   < median_pfn      |  >= median_pfn    |

     Because the buddy system takes pages from the head of the free_list
     first, free pages at the tail are more likely to stay in the buddy
     system. The closer the PFNs of the pages kept in the buddy system,
     the greater the probability of merging them into high-order pages.
     (A stand-alone sketch of rules 1)-3) follows.)
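
     For illustration, a small stand-alone C sketch of rules 1)-3) above
     (the names mirror the patch, but this is user-space demonstration
     code, not the patch itself; the max_pfn/median_pfn values assume
     the 4G / FREE_AREA_COUNTS(4) example):

	#include <stdbool.h>
	#include <stdio.h>

	#define FREE_AREA_COUNTS	4
	#define HIGH_ORDER_TO_FLC	3

	struct free_area_segment {
		unsigned long max_pfn;
		unsigned long median_pfn;
	};

	/* the 4G / FREE_AREA_COUNTS(4) example from above */
	static const struct free_area_segment seg[FREE_AREA_COUNTS] = {
		{ 0x40000, 0x20000 },	/* [0]: 0..1G, median 512M */
		{ 0x80000, 0x60000 },	/* [1]: 1..2G, median 1.5G */
		{ 0xc0000, 0xa0000 },	/* [2]: 2..3G, median 2.5G */
		{ 0xfffff, 0xe0000 },	/* [3]: 3..4G, median 3.5G */
	};

	/* 1) freeing: pick the free_area (flc) a page belongs to by PFN */
	static int pfn_to_flc(unsigned long pfn)
	{
		int flc;

		for (flc = 0; flc < FREE_AREA_COUNTS - 1; flc++)
			if (pfn <= seg[flc].max_pfn)
				break;
		return flc;
	}

	/* 2) allocating: low orders scan front-to-back, high orders back-to-front */
	static int flc_to_try(unsigned int order, int attempt)
	{
		if (order < HIGH_ORDER_TO_FLC)
			return attempt;			/* 0, 1, 2, 3 */
		return FREE_AREA_COUNTS - 1 - attempt;	/* 3, 2, 1, 0 */
	}

	/* 3) freeing: below the segment median -> list head, else -> tail */
	static bool add_to_tail(unsigned long pfn, int flc)
	{
		return pfn >= seg[flc].median_pfn;
	}

	int main(void)
	{
		unsigned long pfn = 0x90000;	/* a page at ~2.25G */
		int flc = pfn_to_flc(pfn);

		printf("pfn %#lx -> free_area[%d], %s of free_list\n", pfn,
		       flc, add_to_tail(pfn, flc) ? "tail" : "head");
		printf("order-0 tries flc %d first, order-4 tries flc %d first\n",
		       flc_to_try(0, 0), flc_to_try(4, 0));
		return 0;
	}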
--------------------------------------------------------------------------

We tested the patch on linux-4.4/linux-4.9/linux-4.14/linux-4.19 and the
results show that the patch is effective in reducing COMPACTSTALL.

Example: a machine with 4 gigabytes of physical memory, running linux-4.19.
	- run monkey for 12 hours.
	- record COMPACTSTALL after the test.
	Test result: COMPACTSTALL reduced by 95.6% with the patch.
	---------------------------------
	             |   COMPACTSTALL
	---------------------------------
	ori          |     2189
	---------------------------------
	optimization |       95
	---------------------------------

Note: this is a patch against linux-4.19, posted to consult the experts:
is it possible to merge this optimization approach into the baseline?
I would develop a formal patch based on the latest Linux if there
is a chance.

Thank you very much indeed for your help in checking the patch.

Signed-off-by: lipeifeng <lipeifeng@oppo.com>
---
 arch/s390/mm/page-states.c |   7 +-
 include/linux/mmzone.h     |  48 +++++++++++-
 mm/Makefile                |   2 +-
 mm/compaction.c            |   6 +-
 mm/memory_hotplug.c        |   4 +
 mm/multi_freearea.c        | 177 +++++++++++++++++++++++++++++++++++++++++++++
 mm/multi_freearea.h        |  18 +++++
 mm/page_alloc.c            | 119 +++++++++++++++++++++---------
 mm/vmstat.c                |  21 +++++-
 9 files changed, 357 insertions(+), 45 deletions(-)
 create mode 100644 mm/multi_freearea.c
 create mode 100644 mm/multi_freearea.h

Comments

Vlastimil Babka April 16, 2021, 11:06 a.m. UTC | #1
On 4/14/21 4:38 AM, lipeifeng@oppo.com wrote:
> From: lipeifeng <lipeifeng@oppo.com>
> 
> This patch "sorts" the free pages in buddy by page PFN to concentrate
> low-order-page allocations in the front area of memory and high-order-page
> allocations in the back, so that there is little memory pollution in the
> back area of memory and the probability of high-order-page allocation
> would be increased significantly.
> -----------------------------------------------------------------------
> 
>   1) Divide memory into several segments by page PFN
>      "Multi_freearea" divides memory into FREE_AREA_COUNTS segments by
>      page PFN; each memory segment corresponds to one free_area.
> 
>      Example: a machine with 4G of physical memory and FREE_AREA_COUNTS(4):
>         page-PFN:0x0     0x40000(1G)   0x80000(2G)  0xc0000(3G) 0xFFFFF(4G)
>                  |------------|--------------|--------------|-------------|
>         free_area:  [0][]           [1][]           [2][]        [3][]
> 
>      NOTE: Selecting the corresponding free_area when pages are freed back
>            to buddy:
>         - page PFN in [0, free_area_segment[0].max_pfn] -> free_area[0][]
>         - page PFN in (free_area_segment[flc - 1].max_pfn,
>                        free_area_segment[flc].max_pfn] -> free_area[flc][]
>                       (flc > 0)
> 
>      This way, all pages in the same segment/free_area fall within a
>      certain range of PFNs.
> 
>   2) Select the corresponding free_area when allocating pages
>      "Multi_freearea" selects the free_areas to search by the
>      allocation order:
>         - order <  HIGH_ORDER_TO_FLC:
>                 free_area[0] -> ... -> free_area[FREE_AREA_COUNTS - 1]
>         - order >= HIGH_ORDER_TO_FLC:
>                 free_area[FREE_AREA_COUNTS - 1] -> ... -> free_area[0]
> 
>      Example:
>         A machine with 4G of physical memory, FREE_AREA_COUNTS(4)
>         and HIGH_ORDER_TO_FLC(3).
>         If the user allocates a page (order = 0), it is taken from
>         free_area[0][] first; if that fails, free_area[1][] is tried,
>         and so on.
>         If the user allocates a page (order = 4), it is taken from
>         free_area[3][] first; if that fails, free_area[2][] is tried,
>         and so on.
> 
>      This way, low-order pages are concentrated in the front area of
>      memory. Because there is little memory pollution in the back area
>      of memory, the success probability of high-order allocation is
>      improved.
> 
>   3) Adjust the location of free pages in the free_list
>      "Multi_freearea" places free pages at the head of the free_list if
>      the page PFN is smaller than free_area_segment[flc]->median_pfn,
>      and at the tail otherwise.
> 
>      Example:
>         page-PFN:        free_area_segment[flc]->median_pfn
>                                         |
>         free_list: page->page->page->...|...page->page->page
>         pages-PFN:|   < median_pfn      |  >= median_pfn    |
> 
>      Because the buddy system takes pages from the head of the free_list
>      first, free pages at the tail are more likely to stay in the buddy
>      system. The closer the PFNs of the pages kept in the buddy system,
>      the greater the probability of merging them into high-order pages.

I think this part 3) would be worth trying separately first, as it's not a
big change compared to the other ones.
李培锋 April 19, 2021, 2:37 a.m. UTC | #2
Hi Vlastimil Babka:
Thank you very much indeed for your advice.


Hi Vlastimil Babka, schwidefsky, heiko.carstens:

This is a temporary patch, posted to consult the experts:
is it possible to merge the optimization idea and the implementation
method into the baseline?

If there is a chance, I would develop a formal patch based on the
latest Linux and split it into several patches.

I hope you can give us your opinions, thank you very much.




lipeifeng@oppo.com
 
David Hildenbrand April 22, 2021, 9:29 a.m. UTC | #3
On 16.04.21 13:06, Vlastimil Babka wrote:
> On 4/14/21 4:38 AM, lipeifeng@oppo.com wrote:
>> From: lipeifeng <lipeifeng@oppo.com>
>>
>> This patch "sorts" the free pages in buddy by page PFN to concentrate
>> low-order-page allocations in the front area of memory and high-order-page
>> allocations in the back, so that there is little memory pollution in the
>> back area of memory and the probability of high-order-page allocation
>> would be increased significantly.
>> -----------------------------------------------------------------------
>>
>>   1) Divide memory into several segments by page PFN
>>      "Multi_freearea" divides memory into FREE_AREA_COUNTS segments by
>>      page PFN; each memory segment corresponds to one free_area.
>>
>>      Example: a machine with 4G of physical memory and FREE_AREA_COUNTS(4):
>>         page-PFN:0x0     0x40000(1G)   0x80000(2G)  0xc0000(3G) 0xFFFFF(4G)
>>                  |------------|--------------|--------------|-------------|
>>         free_area:  [0][]           [1][]           [2][]        [3][]
>>
>>      NOTE: Selecting the corresponding free_area when pages are freed back
>>            to buddy:
>>         - page PFN in [0, free_area_segment[0].max_pfn] -> free_area[0][]
>>         - page PFN in (free_area_segment[flc - 1].max_pfn,
>>                        free_area_segment[flc].max_pfn] -> free_area[flc][]
>>                       (flc > 0)
>>
>>      This way, all pages in the same segment/free_area fall within a
>>      certain range of PFNs.
>>
>>   2) Select the corresponding free_area when allocating pages
>>      "Multi_freearea" selects the free_areas to search by the
>>      allocation order:
>>         - order <  HIGH_ORDER_TO_FLC:
>>                 free_area[0] -> ... -> free_area[FREE_AREA_COUNTS - 1]
>>         - order >= HIGH_ORDER_TO_FLC:
>>                 free_area[FREE_AREA_COUNTS - 1] -> ... -> free_area[0]
>>
>>      Example:
>>         A machine with 4G of physical memory, FREE_AREA_COUNTS(4)
>>         and HIGH_ORDER_TO_FLC(3).
>>         If the user allocates a page (order = 0), it is taken from
>>         free_area[0][] first; if that fails, free_area[1][] is tried,
>>         and so on.
>>         If the user allocates a page (order = 4), it is taken from
>>         free_area[3][] first; if that fails, free_area[2][] is tried,
>>         and so on.
>>
>>      This way, low-order pages are concentrated in the front area of
>>      memory. Because there is little memory pollution in the back area
>>      of memory, the success probability of high-order allocation is
>>      improved.
>>
>>   3) Adjust the location of free pages in the free_list
>>      "Multi_freearea" places free pages at the head of the free_list if
>>      the page PFN is smaller than free_area_segment[flc]->median_pfn,
>>      and at the tail otherwise.
>>
>>      Example:
>>         page-PFN:        free_area_segment[flc]->median_pfn
>>                                         |
>>         free_list: page->page->page->...|...page->page->page
>>         pages-PFN:|   < median_pfn      |  >= median_pfn    |
>>
>>      Because the buddy system takes pages from the head of the free_list
>>      first, free pages at the tail are more likely to stay in the buddy
>>      system. The closer the PFNs of the pages kept in the buddy system,
>>      the greater the probability of merging them into high-order pages.
> 
> I think this part 3) would be worth trying separately first, as it's not a
> big change compared to the other ones.
> 

Let's consider part 3 only and ignore the 1) multi freearea (which might 
be problematic with sparsity) and 2) the modified allocation scheme 
(which doesn't quite make sense to me yet, e.g., because we group by 
mobility and have compaction in place; I assume this really only helps 
in some special cases -- like the test case you are giving; I might be 
wrong)

Right now, we decide whether to put to head or tail based on how likely 
it is that we might merge to a higher-order page (buddy_merge_likely()) 
in the future. So we only consider the current "neighborhood" of the 
page we're freeing. We restrict our neighborhood to MAX_ORDER - 1 
pages (what we can actually merge). Of course, we can easily be wrong 
here. Grouping by movability and compaction only helps to some degree I 
guess.

AFAIK, what you propose is basing the decisions where to place a page 
(in addition?) on a median_pfn. Without 1) and 2) I cannot completely 
understand if 3) itself would help at all (and how to set the 
median_pfn). But it would certainly be interesting if we can tweak the 
current logic to better identify merge targets simply by tweaking 
buddy_merge_likely() or the assumptions it is making.
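
(A rough stand-alone sketch of that "neighborhood" heuristic -- to be
clear, an illustration only, not the kernel's actual
buddy_merge_likely() implementation; buddy_is_free() is a hypothetical
stand-in for the kernel's real free-buddy check:)

	#include <stdbool.h>

	#define MAX_ORDER	11	/* the common kernel default */

	/*
	 * When freeing a block of @order at @pfn, check whether the buddy
	 * of the next-higher order is free; if so, the block being freed
	 * is likely to take part in a further merge, so it should go to
	 * the tail of the free list, where it survives longer before
	 * being handed out again.
	 */
	static bool merge_likely_sketch(unsigned long pfn, unsigned int order,
					bool (*buddy_is_free)(unsigned long,
							      unsigned int))
	{
		unsigned long combined_pfn, higher_buddy_pfn;

		if (order + 1 >= MAX_ORDER)
			return false;

		/* start of the order+1 block that would contain this page */
		combined_pfn = pfn & ~((1UL << (order + 1)) - 1);
		/* that block's buddy at order+1 */
		higher_buddy_pfn = combined_pfn ^ (1UL << (order + 1));

		return buddy_is_free(higher_buddy_pfn, order + 1);
	}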
David Hildenbrand April 22, 2021, 9:37 a.m. UTC | #4
On 19.04.21 04:37, lipeifeng@oppo.com wrote:
> Hi Vlastimil Babka:
> Thank you very much indeed for your advice.
> 
> 
> Hi Vlastimil Babka, schwidefsky,heiko.carstens:
> 
> It is a temporary patch to consult experts:
> Is it possible to merge the optimization idea and the implementation
> method to the baseline?

Well, we cannot really say that :)

History taught that merging large and invasive buddy changes is a 
tedious task, can take a long time, and can fail even after a lot of 
discussions and patch series.

Further, usually there has to be a very compelling reason to merge 
large, invasive buddy changes (read: not only optimizing very specific 
scenarios); otherwise there will just be pushback, because the invasive 
changes might introduce additional problems or degrade other special 
cases or even the general case.

Last but not least, there have to be more benchmarks and test cases that 
prove that other workloads won't be degraded to a degree that people 
care about; as one example, this includes runtime overhead when 
allocating/freeing pages.

What usually works best is improving the code in small steps, doing 
minor adjustments but moving into the desired direction.
李培锋 April 26, 2021, 3:19 a.m. UTC | #5
>> Let's consider part 3 only and ignore the 1) multi freearea (which might
>> be problematic with sparsity) and 2) the modified allocation scheme
>> (which doesn't quite make sense to me yet, e.g., because we group by
>> mobility and have compaction in place; I assume this really only helps
>> in some special cases -- like the test case you are giving; I might be
>> wrong)

>> Right now, we decide whether to put to head or tail based on how likely
>> it is that we might merge to a higher-order page (buddy_merge_likely())
>> in the future. So we only consider the current "neighborhood" of the
>> page we're freeing. We restrict our neighborhood to MAX_ORDER - 1
>> pages (what we can actually merge). Of course, we can easily be wrong
>> here. Grouping by movability and compaction only helps to some degree I
>> guess.

>> AFAIK, what you propose is basing the decisions where to place a page
>> (in addition?) on a median_pfn. Without 1) and 2) I cannot completely
>> understand if 3) itself would help at all (and how to set the
>> median_pfn). But it would certainly be interesting if we can tweak the
>> current logic to better identify merge targets simply by tweaking
>> buddy_merge_likely() or the assumptions it is making.



Hi David Hildenbrand, Vlastimil Babka:
    Thank you very much indeed for the advice.

>> 2) the modified allocation scheme
>> (which doesn't quite make sense to me yet, e.g., because we group by
>> mobility and have compaction in place; I assume this really only helps
>> in some special cases -- like the test case you are giving;
 ---------------------------------------------------------------------------------
1) Divide memory into several segments by page PFN
2) Select the corresponding free_area when allocating pages
    These two parts are for the same purpose:
    low-order-page allocations are concentrated in the front area of physical
    memory, so that there is little memory pollution in the back area and the
    success probability of high-order allocation is improved.

    I think it would help in almost all cases of high-order-page allocation,
    not just special cases, because it keeps more high-order free pages in
    buddy, for example (see the sketch below):
	when a user allocates 64K bytes, if the unit is a page (4K bytes),
	it takes 16 allocations; if the unit is 64K bytes, it takes only once.
	if there are more free high-order pages in buddy, there are fewer
	compaction stalls in the allocation process, so the allocation-stall
	time is shortened.
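
    (As a hedged illustration using the standard kernel interfaces
    alloc_pages()/alloc_page(): the 64K case above is the difference
    between one order-4 call and sixteen order-0 calls; a fragment,
    not a stand-alone program:)

	/* 64K in one shot: one order-4 block of 16 contiguous 4K pages */
	struct page *block = alloc_pages(GFP_KERNEL, 4);

	/* versus 64K as sixteen separate order-0 allocations */
	struct page *pages[16];
	int i;

	for (i = 0; i < 16; i++)
		pages[i] = alloc_page(GFP_KERNEL);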

    We tested the speed of high-order-page allocation (order = 4 and
    order = 8) after monkey and found that it increased by more than 18%.

3) Adjust the location of free-pages in the free_list
>> Without 1) and 2) I cannot completely
>> understand if 3) itself would help at all (and how to set the median_pfn)
-----------------------------------------------------------------------------------------------------
    median_pfn is set from the range of page PFNs of the free_area. If part 3)
    were tried separately without 1) and 2), the simple setting is the median
    of the entire memory. But I think it will play a better role in the
    optimization on top of 1) and 2).



>> Last but not least, there have to be more benchmarks and test cases that
>> prove that other workloads won't be degraded to a degree that people
>> care about; as one example, this includes runtime overhead when
>> allocating/freeing pages.
---------------------------------------------
1. For modification of buddy: the modified allocation scheme 1)+2)
    Is there any standard, detailed test list for allocator modifications in
    the community, like benchmarks or any other tests? If I pass the tests
    required by the community, does that prove the patch will not degrade
    things to a degree that people care about, so that it can be merged into
    the baseline?

    Thanks.
    
    



lipeifeng@oppo.com
 
David Hildenbrand April 26, 2021, 8:37 a.m. UTC | #6
On 26.04.21 05:19, lipeifeng@oppo.com wrote:
> 
>  >> Let's consider part 3 only and ignore the 1) multi freearea (which might
>  >> be problematic with sparsity) and 2) the modified allocation scheme
>  >> (which doesn't quite make sense to me yet, e.g., because we group by
>  >> mobility and have compaction in place; I assume this really only helps
>  >> in some special cases -- like the test case you are giving; I might be
>  >> wrong)
>  >> Right now, we decide whether to put to head or tail based on how likely
>  >> it is that we might merge to a higher-order page (buddy_merge_likely())
>  >> in the future. So we only consider the current "neighborhood" of the
>  >> page we're freeing. We restrict our neighborhood to MAX_ORDER - 1
>  >> pages (what we can actually merge). Of course, we can easily be wrong
>  >> here. Grouping by movability and compaction only helps to some degree I
>  >> guess.
>  >> AFAIK, what you propose is basing the decisions where to place a page
>  >> (in addition?) on a median_pfn. Without 1) and 2) I cannot completely
>  >> understand if 3) itself would help at all (and how to set the
>  >> median_pfn). But it would certainly be interesting if we can tweak the
>  >> current logic to better identify merge targets simply by tweaking
>  >> buddy_merge_likely() or the assumptions it is making.
> 
> 
> 
> Hi David Hildenbrand, Vlastimil Babka:
>      Thank you very much indeed for the advice.
> 
>>> 2) the modified allocation scheme
>  >> (which doesn't quite make sense to me yet, e.g., because we group by
>  >> mobility and have compaction in place; I assume this really only helps
>  >> in some special cases -- like the test case you are giving;
>   ---------------------------------------------------------------------------------
> 1) Divide memory into several segments by page PFN
> 2) Select the corresponding free_area when allocating pages
>      These two parts are for the same purpose:
> low-order-page allocations are concentrated in the front area of
> physical memory, so that there is little memory pollution in the back
> area and the success probability of high-order allocation is improved.
> 
>      I think it would help in almost all cases of high-order-page
> allocation, not just special cases, because it keeps more high-order
> free pages in buddy, for example:

See, and I am not convinced that this is the case, because you really 
only report one example (Monkey) and I have to assume it is a special 
case then.

> 
>   * when a user allocates 64K bytes, if the unit is a page (4K bytes),
>     it takes 16 allocations; if the unit is 64K bytes, it takes only once.
> 
>   * if there are more free high-order pages in buddy, there are fewer
>     compaction stalls in the allocation process, so the allocation-stall
>     time is shortened.
> 
>      We tested the speed of high-order-page allocation (order = 4 and
> order = 8) after monkey and found that it increased by more than 18%.
> 

And you don't mention what the baseline configuration was. For example, 
how was compaction configured?

Just to clarify, what is monkey?

Monkey HTTP server? MonkeyTest disk benchmark? UI/Application Exerciser 
Monkey?

> 3) Adjust the location of free-pages in the free_list
>>> Without 1) and 2) I cannot completely
>  >> understand if 3) itself would help at all (and how to set the median_pfn)
> -----------------------------------------------------------------------------------------------------
>      median_pfn is set from the range of page PFNs of the free_area. If
> part 3) were tried separately without 1) and 2), the simple setting is
> the median of the entire memory. But I think it will play a better role
> in the optimization on top of 1) and 2).
> 
> 
> 
>  >> Last but not least, there have to be more benchmarks and test cases that
>  >> prove that other workloads won't be degraded to a degree that people
>  >> care about; as one example, this includes runtime overhead when
>>> allocating/freeing pages.
> ---------------------------------------------
> 1. For modification of buddy: the modified allocation scheme 1)+2)
>      Is there any standard, detailed test list for allocator modifications
> in the community, like benchmarks or any other tests? If I pass the tests
> required by the community, does that prove the patch will not degrade
> things to a degree that people care about, so that it can be merged into
> the baseline?

IIRC, there are plenty. One example is will-it-scale.

Have a look at https://github.com/intel/lkp-tests.git
李培锋 April 26, 2021, 10:19 a.m. UTC | #7
Hi David Hildenbrand:

>> And you don't mention what the baseline configuration was. For example,
>> how was compaction configured?
 
>> Just to clarify, what is monkey?
 
>> Monkey HTTP server? MonkeyTest disk benchmark? UI/Application Exerciser
>> Monkey?
-------------------------------------------------------------------------------------
I am sorry that I didn't give a clear explanation of Monkey.
It means the "UI/Application Exerciser Monkey" from Google.

Excuse me, let me introduce our test:

1. record COMPACT_STALL
We tested the patch on linux-4.4/linux-4.9/linux-4.14/linux-4.19 and the
results show that the patch is effective in reducing COMPACTSTALL.
    - run monkey for 12 hours.
    - record COMPACTSTALL after the test.

Test result: COMPACTSTALL reduced by 95.6% with the patch.
(a machine with 4 gigabytes of physical memory, running linux-4.19)
---------------------------------
             |   COMPACTSTALL
---------------------------------
ori          |     2189
---------------------------------
optimization |       95
---------------------------------

I fully agree with the value of compaction, but compaction also brings CPU
consumption and increases the alloc-stall time. So if we can keep more
free high-order pages in buddy instead of single pages, it will decrease
COMPACT_STALL and speed up memory allocation.

2. record the speed of high-order-page allocation (order = 4 and order = 8)
Before and after the optimization, we tested the speed of high-order-page
allocation after 120 hours of Monkey on 10 Android mobile phones, and the
results show that the speed increased by more than 18%.

Also, we ran some tests designed by us
(a machine with 4 gigabytes of physical memory, running linux-4.19):
modeling user behavior by constantly starting and operating different
applications for 120 hours; we recorded that COMPACT_STALL decreased by
90+% and the speed of high-order-page allocation increased by 15+%.

And I have some questions; I hope you can guide me when you are free.
1) What does "how compaction is configured" mean?
    Does it mean the members in struct zone, as follows:
        unsigned int compact_considered;
        unsigned int compact_defer_shift;
        int compact_order_failed;
        bool compact_blockskip_failed;
    Or some macro variables, as follows:
        PAGE_ALLOC_COSTLY_ORDER = 3
        MIN_COMPACT_PRIORITY = 1
        MAX_COMPACT_RETRIES = 16

>> 1) multi freearea (which might
>> be problematic with sparsity)
2) Can you please tell me what sparsity is and what its impact is,
    and whether there are any documents about it?




>> IIRC, there are plenty. One example is will-it-scale.

>> Have a look at https://github.com/intel/lkp-tests.git

Thank you very much indeed; we will run the tests and see if there is any
benefit.



lipeifeng@oppo.com
 
David Hildenbrand April 27, 2021, 12:46 p.m. UTC | #8
On 26.04.21 12:19, lipeifeng@oppo.com wrote:
> Hi David Hildenbrand:
> 
>  >> And you don't mention what the baseline configuration was. For example,
>  >> how was compaction configured?
>  >> Just to clarify, what is monkey?
>  >> Monkey HTTP server? MonkeyTest disk benchmark? UI/Application Exerciser
>  >> Monkey?
> -------------------------------------------------------------------------------------
> I am sorry that I didn't give a clear explanation of Monkey.
> It means the "UI/Application Exerciser Monkey" from Google.
> 
> Excuse me, let me introduce our test:
> 

Thanks for more details on the test.

> 1. record COMPACT_STALL
> We tested the patch on linux-4.4/linux-4.9/linux-4.14/linux-4.19 and the
> results show that the patch is effective in reducing COMPACTSTALL.
>      - run monkey for 12 hours.
>      - record COMPACTSTALL after the test.
> 
> Test result: COMPACTSTALL reduced by 95.6% with the patch.
> (a machine with 4 gigabytes of physical memory, running linux-4.19)
> ---------------------------------
>              |   COMPACTSTALL
> ---------------------------------
> ori          |     2189
> ---------------------------------
> optimization |       95
> ---------------------------------
> 
> I fully agree with the value of compaction, but compaction also brings CPU
> consumption and increases the alloc-stall time. So if we can keep more
> free high-order pages in buddy instead of single pages, it will decrease
> COMPACT_STALL and speed up memory allocation.

Okay, but then I assume the target goal of your patch set is to minimize 
CPU consumption/allocation stall time when allocating larger order pages.

Currently you state "the probability of high-order-page allocation would 
be increased significantly", but I assume that's then not 100% correct. 
What you measure is the stall time to allocate higher order pages, not 
that you can allocate them.

> 
> 2. record the speed of high-order-page allocation (order = 4 and
> order = 8)
> Before and after the optimization, we tested the speed of
> high-order-page allocation after 120 hours of Monkey on 10 Android
> mobile phones, and the results show that the speed increased by more
> than 18%.
> 
> Also, we ran some tests designed by us
> (a machine with 4 gigabytes of physical memory, running linux-4.19):
> modeling user behavior by constantly starting and operating different
> applications for 120 hours; we recorded that COMPACT_STALL decreased
> by 90+% and the speed of high-order-page allocation increased by 15+%.

Okay, again, this is then some optimization for allocation speed; which 
makes it less attractive IMHO (at least for more invasive changes), 
because I suspect this mostly helps in corner cases (Monkey benchmarks 
corner cases AFAIU).

> 
> And I have some questions; I hope you can guide me when you are free.
> 1) What does "how compaction is configured" mean?
>      Does it mean the members in struct zone, as follows:
>      unsigned int compact_considered;
>      unsigned int compact_defer_shift;
>      int compact_order_failed;
>      bool compact_blockskip_failed;
>      Or some macro variables, as follows:
>      PAGE_ALLOC_COSTLY_ORDER = 3
>      MIN_COMPACT_PRIORITY = 1
>      MAX_COMPACT_RETRIES = 16
> 

Rather, whether you have proactive compaction 
(/proc/sys/vm/compaction_proactiveness). But I assume, because you're 
messing with older kernels, that you didn't compare against that yet. 
Would be worth a comparison.

>>> 1) multi freearea (which might
>  >> be problematic with sparsity)
> 2) Can you please tell me what sparsity is and what its impact is,
>      and whether there are any documents about it?

Essentially CONFIG_SPARSEMEM, whereby we can have huge holes in physical 
memory layout and memory areas coming/going with memory hot(un)plug. 
Usually we manage all metadata per section. For example, pageblocks are 
allocated per section. We avoid arrays that depend on the 
initial/maximum physical memory size.
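
(A hedged sketch of why sparsity pushes the kernel toward per-section
metadata rather than one big PFN-indexed array; the section size and
array bound below are illustrative, not the kernel's actual structures:)

	#define SECTION_SIZE_BITS	27	/* e.g. 128M sections */
	#define PAGE_SHIFT		12
	#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
	#define NR_SECTIONS		1024	/* illustrative bound */

	/* per-section metadata, allocated only for sections that exist */
	struct mem_section_meta {
		unsigned long *pageblock_flags;
		/* ... */
	};

	/* sparse: a NULL entry is a hole or not-yet-hotplugged memory */
	static struct mem_section_meta *sections[NR_SECTIONS];

	static struct mem_section_meta *pfn_to_meta(unsigned long pfn)
	{
		return sections[pfn >> PFN_SECTION_SHIFT];
	}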
李培锋 April 28, 2021, 4:03 a.m. UTC | #9
Hi David Hildenbrand:

>> Okay, but then I assume the target goal of your patch set is to minimize
>> CPU consumption/allocation stall time when allocating larger order pages.
 
>> Currently you state "the probability of high-order-page allocation would
>> be increased significantly", but I assume that's then not 100% correct.
>> What you measure is the stall time to allocate higher order pages, not
>> that you can allocate them.

You are right that multi_freearea is meant to speed up large-order-page
allocation rather than to increase its success probability when
CONFIG_COMPACTION is enabled.

>> Okay, again, this is then some optimization for allocation speed; which
>> makes it less attractive IMHO (at least for more invasive changes),
>> because I suspect this mostly helps in corner cases (Monkey benchmarks
>> corner cases AFAIU).

I fully agree with the rigor of the community: if we want to merge the
patch into the baseline, we should do more testing to prove that the patch
does not degrade things to a degree that people care about. But unless
there is a detailed test list for us to run, we really don't know how to
prove that there are no side effects in every memory situation.
It is a question that has puzzled me for a long time.

>> Essentially CONFIG_SPARSEMEM, whereby we can have huge holes in physical
>> memory layout and memory areas coming/going with memory hot(un)plug.
>> Usually we manage all metadata per section. For example, pageblocks are
>> allocated per section. We avoid arrays that depend on the
>> initial/maximum physical memory size.

CONFIG_SPARSEMEM is enabled in some of our products on Qualcomm and MTK
platforms. AFAIK, multi_freearea should not cause problems there, because
the patch just divides the physical memory of a zone into several sections
(free_areas) and adjusts the range of page PFNs used for buddy allocation
according to the allocation order. On memory hot(un)plug, we would
reinitialize the "multi_freearea" members in the zone.

The patch has been merged into the baseline of our products, which have
been sold all over the world with Linux-4.4/4.9/4.19, so I don't think
there will be too much risk. Of course, I might be wrong.




lipeifeng@oppo.com
 
David Hildenbrand April 28, 2021, 9:04 a.m. UTC | #10
>  >> Essentially CONFIG_SPARSEMEM, whereby we can have huge holes in physical
>  >> memory layout and memory areas coming/going with memory hot(un)plug.
>  >> Usually we manage all metadata per section. For example, pageblocks are
>  >> allocated per section. We avoid arrays that depend on the
>  >> initial/maximum physical memory size.
> 
> CONFIG_SPARSEMEM is enabled in some of our products on Qualcomm and
> MTK platforms. AFAIK, multi_freearea should not cause problems there,
> because the patch just divides the physical memory of a zone into
> several sections (free_areas) and adjusts the range of page PFNs used
> for buddy allocation according to the allocation order. On memory
> hot(un)plug, we would reinitialize the "multi_freearea" members in the
> zone.

 From your description only I cannot tell how that would really work. 
Your description of 1) indicated that we are dealing with an array to 
manage memory segments, and arrays are a bad data structure when it 
comes to sparsity.

> 
> The patch has been merged into the baseline of our products, which
> have been sold all over the world with Linux-4.4/4.9/4.19, so I don't
> think there will be too much risk. Of course, I might be wrong.

Just always keep in mind that upstream Linux has a very broad community. 
What might be "good enough" for smartphones might not be well suited for 
servers, VMs, embedded devices, other archs ... just imagine the RAM 
size differences, sparse layout, dynamic memory changes, ...

Adding additional complexity to the buddy has to have a compelling 
benefit; keep in mind that any complexity we introduce has to be 
maintained in the long term.

Having that said, starting with small steps is IMHO the right approach.
李培锋 April 28, 2021, 10:53 a.m. UTC | #11
Hi David Hildenbrand:

>> From your description only I cannot tell how that would really work.
>> Your description of 1) indicated that we are dealing with an array to
>> manage memory segments, and arrays are a bad data structure when it
>> comes to sparsity.

In the baseline, Linux also manages memory with an array (zone->free_area), as follows:
struct zone {
......
    struct free_area free_area[MAX_ORDER];
}

"Multi_freearea" divides the physical memory of a zone into several parts by page PFN,
and it also uses a free_area to manage each part of memory:
struct zone {
......
    struct free_area free_area[FREE_AREA_COUNTS][MAX_ORDER];  /* indexed by flc; FREE_AREA_COUNTS is 4 */
}

All the memory-management logic is unchanged except for this: the corresponding
free_area is selected for an allocation by its order, which concentrates most
low-order-page (mostly single-page) allocations in the front area of physical
memory and leaves little memory pollution in the back area. Theoretically,
there will not be too much risk, I think.

Because I am not an expert, there may be some cases I have not considered;
and if anything in my email displeased you, please don't take it ill.


>> Just always keep in mind that upstream Linux has a very broad community.
>> What might be "good enough" for smartphones might not be well suited for
>> servers, VMs, embedded devices, other archs ... just imagine the RAM
>> size differences, sparse layout, dynamic memory changes, ...
 
>> Adding additional complexity to the buddy has to have a compelling
>> benefit; keep in mind that any complexity we introduce has to be
>> maintained in the long term.
 
>> Having that said, starting with small steps is IMHO the right approach.

I will prepare the patch series, try to make each patch smaller and more
readable, and thank you very much indeed for your advice.





lipeifeng@oppo.com
 

Patch

diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index dc3cede..0865772 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -260,6 +260,7 @@  void arch_set_page_states(int make_stable)
 	struct list_head *l;
 	struct page *page;
 	struct zone *zone;
+	int flc = 0;
 
 	if (!cmma_flag)
 		return;
@@ -267,8 +268,10 @@  void arch_set_page_states(int make_stable)
 		drain_local_pages(NULL);
 	for_each_populated_zone(zone) {
 		spin_lock_irqsave(&zone->lock, flags);
+traversal_multi_freearea:
 		for_each_migratetype_order(order, t) {
-			list_for_each(l, &zone->free_area[order].free_list[t]) {
+			list_for_each(l,
+				&zone->free_area[flc][order].free_list[t]) {
 				page = list_entry(l, struct page, lru);
 				if (make_stable)
 					set_page_stable_dat(page, order);
@@ -276,6 +279,8 @@  void arch_set_page_states(int make_stable)
 					set_page_unused(page, order);
 			}
 		}
+		if (++flc < FREE_AREA_COUNTS)
+			goto traversal_multi_freearea;
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 }
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fa02014..4c6ed49 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -29,6 +29,18 @@ 
 #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
 
 /*
+ * FREE_AREA_COUNTS is the number of free_areas per zone.
+ * Multi_freearea divides memory into FREE_AREA_COUNTS segments by PFN;
+ * each memory segment corresponds to one free_area. For a low-order
+ * allocation the buddy system hands out pages from the front free_areas
+ * first, and for a high-order allocation from the back ones. Low-order
+ * pages are thus concentrated in the front area of memory, which
+ * reduces memory pollution in the back area and increases the success
+ * probability of high-order allocations.
+ */
+#define FREE_AREA_COUNTS 4
+
+/*
  * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
  * costly to service.  That is between allocation orders which should
  * coalesce naturally under reasonable reclaim pressure and those which
@@ -98,6 +110,35 @@  struct free_area {
 	unsigned long		nr_free;
 };
 
+/*
+ * free_area_segment - records the max PFN and median PFN of a freearea.
+ * max_pfn: the max page PFN in the free_area.
+ * median_pfn: the median page PFN in the free_area.
+ *
+ * "Multi_freearea" divides memory into several segments by comparing a
+ * page's PFN against free_area_segment.max_pfn; each memory segment
+ * corresponds to one free_area.
+ *
+ * "Multi_freearea" adjusts the location of free pages in the free_list
+ * by comparing the PFN against free_area_segment->median_pfn, and picks
+ * the corresponding freearea by comparing it against ->max_pfn.
+ *
+ * Example: a machine with 4G of physical memory and FREE_AREA_COUNTS(4):
+ *	- free_area_segment[0].max_pfn = 0x40000(1G)
+ *	- free_area_segment[0].median_pfn = 0x20000(512M)
+ *	- free_area_segment[1].max_pfn = 0x80000(2G)
+ *	- free_area_segment[1].median_pfn = 0x60000(1.5G)
+ *	- free_area_segment[2].max_pfn = 0xc0000(3G)
+ *	- free_area_segment[2].median_pfn = 0xA0000(2.5G)
+ *	- free_area_segment[3].max_pfn = 0xFFFFF(4G)
+ *	- free_area_segment[3].median_pfn = 0xE0000(3.5G)
+ */
+
+struct free_area_segment {
+	unsigned long max_pfn;
+	unsigned long median_pfn;
+};
+
 struct pglist_data;
 
 /*
@@ -355,7 +396,6 @@  enum zone_type {
 };
 
 #ifndef __GENERATING_BOUNDS_H
-
 struct zone {
 	/* Read-mostly fields */
 
@@ -436,7 +476,6 @@  struct zone {
 	unsigned long		managed_pages;
 	unsigned long		spanned_pages;
 	unsigned long		present_pages;
-
 	const char		*name;
 
 #ifdef CONFIG_MEMORY_ISOLATION
@@ -458,8 +497,9 @@  struct zone {
 	/* Write-intensive fields used from the page allocator */
 	ZONE_PADDING(_pad1_)
 
-	/* free areas of different sizes */
-	struct free_area	free_area[MAX_ORDER];
+	/* free areas of different segment of memory and size */
+	struct free_area	free_area[FREE_AREA_COUNTS][MAX_ORDER];
+	struct free_area_segment	free_area_segment[FREE_AREA_COUNTS];
 
 	/* zone flags, see below */
 	unsigned long		flags;
diff --git a/mm/Makefile b/mm/Makefile
index 26ef77a..b70ed15 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,7 +39,7 @@  obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o $(mmu-y)
+			   debug.o multi_freearea.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 5079ddb..57b6b74 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1314,6 +1314,7 @@  static enum compact_result __compact_finished(struct zone *zone,
 {
 	unsigned int order;
 	const int migratetype = cc->migratetype;
+	int flc = 0;
 
 	if (cc->contended || fatal_signal_pending(current))
 		return COMPACT_CONTENDED;
@@ -1353,8 +1354,9 @@  static enum compact_result __compact_finished(struct zone *zone,
 	}
 
 	/* Direct compactor: Is a suitable page free? */
+traversal_multi_freearea:
 	for (order = cc->order; order < MAX_ORDER; order++) {
-		struct free_area *area = &zone->free_area[order];
+		struct free_area *area = &zone->free_area[flc][order];
 		bool can_steal;
 
 		/* Job done if page is free of the right migratetype */
@@ -1396,6 +1398,8 @@  static enum compact_result __compact_finished(struct zone *zone,
 			return COMPACT_CONTINUE;
 		}
 	}
+	if (++flc < FREE_AREA_COUNTS)
+		goto traversal_multi_freearea;
 
 	return COMPACT_NO_SUITABLE_PAGE;
 }
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e60e281..4cfeb66 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -41,6 +41,7 @@ 
 
 #include "internal.h"
 
+#include "multi_freearea.h"
 /*
  * online_page_callback contains pointer to current page onlining function.
  * Initially it is generic_online_page(). If it is required it could be
@@ -391,6 +392,7 @@  static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 			zone->spanned_pages = pfn - zone_start_pfn + 1;
 	}
 
+	adjust_freearea_segment(zone);
 	/*
 	 * The section is not biggest or smallest mem_section in the zone, it
 	 * only creates a hole in the zone. So in this case, we need not
@@ -417,6 +419,7 @@  static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
 	/* The zone has no valid section */
 	zone->zone_start_pfn = 0;
 	zone->spanned_pages = 0;
+	adjust_freearea_segment(zone);
 	zone_span_writeunlock(zone);
 }
 
@@ -693,6 +696,7 @@  static void __meminit resize_zone_range(struct zone *zone, unsigned long start_p
 		zone->zone_start_pfn = start_pfn;
 
 	zone->spanned_pages = max(start_pfn + nr_pages, old_end_pfn) - zone->zone_start_pfn;
+	adjust_freearea_segment(zone);
 }
 
 static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned long start_pfn,
diff --git a/mm/multi_freearea.c b/mm/multi_freearea.c
new file mode 100644
index 0000000..e442bfe
--- /dev/null
+++ b/mm/multi_freearea.c
@@ -0,0 +1,173 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2018-2020 Oplus. All rights reserved.
+ * Written by Peifeng Li (peifengl55@gmail.com)
+ */
+
+#include "internal.h"
+#include "multi_freearea.h"
+
+/**
+ * list_sort_add - add a free page back to the buddy free lists.
+ * @page: pages to put back to buddy
+ * @zone: zone that free pages belong to
+ * @order: the order of free pages
+ * @mt: the migrate type of free pages
+ *
+ * 1) Select the corresponding freearea to alloc-pages
+ *	"Multi_freearea" would select the corresponding free_area by the
+ *	allocation-order.
+ *		- order <  HIGH_ORDER_TO_FLC: free_area[0] -> ... ->
+ *				free_area[FREE_AREA_COUNTS - 1]
+ *		- order >= HIGH_ORDER_TO_FLC: free_area[FREE_AREA_COUNTS - 1]
+ *				-> ... -> free_area[0]
+ *
+ *	Example:
+ *		A machine with 4G of physical memory, FREE_AREA_COUNTS(4)
+ *		and HIGH_ORDER_TO_FLC(3):
+ *		If the user allocates a page (order < 3), it is taken from
+ *		free_area[0][] first; if that fails, free_area[1][] is
+ *		tried, and so on.
+ *		If the user allocates a page (order >= 3), it is taken from
+ *		free_area[3][] first; if that fails, free_area[2][] is
+ *		tried, and so on.
+ *
+ *	This way, low-order pages are concentrated in the front area of
+ *	memory. Since there is little memory pollution in the back area,
+ *	the success probability of high-order allocations improves.
+ *
+ * 2) Adjust the location of free pages in the free_list
+ *	"Multi_freearea" places a page at the head of the free_list if
+ *	its PFN is smaller than free_area_segment->median_pfn, and at
+ *	the tail of the free_list otherwise.
+ *
+ *	Free_list "sorted" by free_area_segment->median_pfn:
+ *		page-PFN:	free_area_segment->median_pfn
+ *						|
+ *		free_list: page - page - page ..|...- page - page - page
+ *			| pages-PFN < median_pfn| pages-PFN >= median_pfn |
+ *
+ *	Because the buddy system takes pages from the head of a free_list
+ *	first, the free pages at the tail are more likely to stay in the
+ *	buddy system. The closer the PFNs of the pages kept in buddy, the
+ *	greater the probability of merging them into high-order pages.
+ */
+void list_sort_add(struct page *page, struct zone *zone,
+				unsigned int order, int mt)
+{
+	struct list_head *list = &(zone->free_area[0][order].free_list[mt]);
+	unsigned long pfn = 0, segment = 0;
+	int i = 0, max_flc = FREE_AREA_COUNTS;
+
+	pfn = page_to_pfn(page);
+
+	if (unlikely(pfn > zone->free_area_segment[max_flc - 1].max_pfn)) {
+		list = &(zone->free_area[max_flc - 1][order].free_list[mt]);
+		segment = zone->free_area_segment[max_flc - 1].median_pfn;
+		goto add_page;
+	}
+
+	for (i = 0; i < max_flc; i++) {
+		if (pfn <= zone->free_area_segment[i].max_pfn) {
+			list = &(zone->free_area[i][order].free_list[mt]);
+			segment = zone->free_area_segment[i].median_pfn;
+			break;
+		}
+	}
+
+add_page:
+	if (pfn >= segment)
+		list_add_tail(&page->lru, list);
+	else
+		list_add(&page->lru, list);
+}
+
+/**
+ * page_to_flc - find the flc of free_area the page belongs to.
+ * @page: the page whose free_area index (flc) is wanted
+ *
+ * Return: the flc of freearea that page belongs to.
+ */
+int page_to_flc(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long pfn = page_to_pfn(page);
+	int flc = 0, max_flc = FREE_AREA_COUNTS;
+
+	if (unlikely(pfn > zone->free_area_segment[max_flc - 1].max_pfn))
+		return FREE_AREA_COUNTS - 1;
+
+	for (flc = 0; flc < FREE_AREA_COUNTS; flc++) {
+		if (pfn <= zone->free_area_segment[flc].max_pfn)
+			return flc;
+	}
+
+	return FREE_AREA_COUNTS - 1;
+}
+
+/**
+ * adjust_freearea_segment - adjust the max page-PFN and the median page-PFN
+ * in free_area.
+ * @zone: the zone that free_area belongs to.
+ *
+ * "Multi_freearea" divides memory into several segments by comparing a
+ * page's PFN against free_area_segment.max_pfn; each memory segment
+ * corresponds to one free_area.
+ *
+ * Example: a machine with 4G of physical memory and FREE_AREA_COUNTS(4):
+ *	- free_area_segment[0].max_pfn = 0x40000(1G)
+ *	- free_area_segment[0].median_pfn = 0x20000(512M)
+ *	- free_area_segment[1].max_pfn = 0x80000(2G)
+ *	- free_area_segment[1].median_pfn = 0x60000(1.5G)
+ *	- free_area_segment[2].max_pfn = 0xc0000(3G)
+ *	- free_area_segment[2].median_pfn = 0xA0000(2.5G)
+ *	- free_area_segment[3].max_pfn = 0xFFFFF(4G)
+ *	- free_area_segment[3].median_pfn = 0xE0000(3.5G)
+ *
+ * "Multi_freearea" adjusts the location of free pages in the free_list by
+ * comparing the PFN against free_area_segment->median_pfn, and selects the
+ * freearea in which to place a free page by comparing it against ->max_pfn.
+ */
+void adjust_freearea_segment(struct zone *zone)
+{
+	int i, max_flc = FREE_AREA_COUNTS;
+	unsigned long prev_base;
+
+	for (i = 0; i < max_flc; i++) {
+		zone->free_area_segment[i].max_pfn = zone->zone_start_pfn +
+			zone->spanned_pages * (i + 1) / max_flc;
+	}
+
+	for (i = 0; i < max_flc; i++) {
+		if (i == 0)
+			prev_base = zone->zone_start_pfn;
+		else
+			prev_base = zone->free_area_segment[i - 1].max_pfn;
+
+		zone->free_area_segment[i].median_pfn = prev_base +
+			((zone->free_area_segment[i].max_pfn - prev_base) >> 1);
+	}
+}
+
+/**
+ * adjust_flc - select which free_area to alloc-pages.
+ * @current_flc: the flc of current free_area
+ * @order: allocation order
+ *
+ * allocation-order >= HIGH_ORDER_TO_FLC:
+ *	free_area[FREE_AREA_COUNTS - 1] -> free_area[0]
+ * allocation-order <  HIGH_ORDER_TO_FLC:
+ *	free_area[0] -> free_area[FREE_AREA_COUNTS - 1]
+ *
+ * Return: the flc adjusted by alloc order.
+ */
+unsigned int adjust_flc(unsigned int current_flc, unsigned int order)
+{
+	/* When alloc_order >= HIGH_ORDER_TO_FLC, we want to allocate
+	 * from the free_areas in reverse order: 3 -> 2 -> 1 -> 0.
+	 */
+	if (order >= HIGH_ORDER_TO_FLC)
+		return (FREE_AREA_COUNTS - 1 - current_flc);
+
+	return current_flc;
+}
diff --git a/mm/multi_freearea.h b/mm/multi_freearea.h
new file mode 100644
index 0000000..8043aad
--- /dev/null
+++ b/mm/multi_freearea.h
@@ -0,0 +1,18 @@ 
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2018-2020 Oplus. All rights reserved.
+ * Written by Peifeng Li (peifengl55@gmail.com)
+ */
+
+#ifndef __MULTI_FREEAREA_H__
+#define __MULTI_FREEAREA_H__
+
+#define HIGH_ORDER_TO_FLC 3
+
+extern void list_sort_add(struct page *page, struct zone *zone,
+				unsigned int order, int mt);
+extern int page_to_flc(struct page *page);
+extern void adjust_freearea_segment(struct zone *zone);
+extern unsigned int adjust_flc(unsigned int current_flc, unsigned int order);
+
+#endif /* __MULTI_FREEAREA_H__ */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4446a52..8e23400 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -72,6 +72,7 @@ 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
 #include "internal.h"
+#include "multi_freearea.h"
 
 /* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
 static DEFINE_MUTEX(pcp_batch_high_lock);
@@ -132,6 +133,7 @@  unsigned long totalcma_pages __read_mostly;
 int percpu_pagelist_fraction;
 gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 
+
 /*
  * A cached value of the page's pageblock's migratetype, used when the page is
  * put on a pcplist. Used to avoid the pageblock migratetype lookup when
@@ -806,6 +808,7 @@  static inline void __free_one_page(struct page *page,
 	unsigned long uninitialized_var(buddy_pfn);
 	struct page *buddy;
 	unsigned int max_order;
+	unsigned int flc;
 
 	max_order = min_t(unsigned int, MAX_ORDER, pageblock_order + 1);
 
@@ -836,7 +839,8 @@  static inline void __free_one_page(struct page *page,
 			clear_page_guard(zone, buddy, order, migratetype);
 		} else {
 			list_del(&buddy->lru);
-			zone->free_area[order].nr_free--;
+			flc = page_to_flc(buddy);
+			zone->free_area[flc][order].nr_free--;
 			rmv_page_order(buddy);
 		}
 		combined_pfn = buddy_pfn & pfn;
@@ -888,15 +892,16 @@  static inline void __free_one_page(struct page *page,
 		higher_buddy = higher_page + (buddy_pfn - combined_pfn);
 		if (pfn_valid_within(buddy_pfn) &&
 		    page_is_buddy(higher_page, higher_buddy, order + 1)) {
+			flc = page_to_flc(page);
 			list_add_tail(&page->lru,
-				&zone->free_area[order].free_list[migratetype]);
+			   &zone->free_area[flc][order].free_list[migratetype]);
 			goto out;
 		}
 	}
-
-	list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
+	list_sort_add(page, zone, order, migratetype);
 out:
-	zone->free_area[order].nr_free++;
+	flc = page_to_flc(page);
+	zone->free_area[flc][order].nr_free++;
 }
 
 /*
@@ -1820,6 +1825,7 @@  static inline void expand(struct zone *zone, struct page *page,
 	int migratetype)
 {
 	unsigned long size = 1 << high;
+	unsigned int flc = 0;
 
 	while (high > low) {
 		area--;
@@ -1835,9 +1841,9 @@  static inline void expand(struct zone *zone, struct page *page,
 		 */
 		if (set_page_guard(zone, &page[size], high, migratetype))
 			continue;
-
-		list_add(&page[size].lru, &area->free_list[migratetype]);
-		area->nr_free++;
+		list_sort_add(&page[size], zone, high, migratetype);
+		flc = page_to_flc(&page[size]);
+		zone->free_area[flc][high].nr_free++;
 		set_page_order(&page[size], high);
 	}
 }
@@ -1974,22 +1980,27 @@  struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	unsigned int current_order;
 	struct free_area *area;
 	struct page *page;
+	unsigned int flc = 0, flc_tmp = 0, flc_last = 0;
 
 	/* Find a page of the appropriate size in the preferred list */
+traversal_multi_freearea:
+	flc_tmp = adjust_flc(flc, order);
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = &(zone->free_area[current_order]);
+		area = &(zone->free_area[flc_tmp][current_order]);
 		page = list_first_entry_or_null(&area->free_list[migratetype],
 							struct page, lru);
 		if (!page)
 			continue;
 		list_del(&page->lru);
 		rmv_page_order(page);
-		area->nr_free--;
+		flc_last = page_to_flc(page);
+		zone->free_area[flc_last][current_order].nr_free--;
 		expand(zone, page, order, current_order, area, migratetype);
 		set_pcppage_migratetype(page, migratetype);
 		return page;
 	}
-
+	if (++flc < FREE_AREA_COUNTS)
+		goto traversal_multi_freearea;
 	return NULL;
 }
 
@@ -2074,8 +2085,8 @@  static int move_freepages(struct zone *zone,
 		}
 
 		order = page_order(page);
-		list_move(&page->lru,
-			  &zone->free_area[order].free_list[migratetype]);
+		__list_del_entry(&page->lru);
+		list_sort_add(page, zone, order, migratetype);
 		page += 1 << order;
 		pages_moved += 1 << order;
 	}
@@ -2161,7 +2172,6 @@  static void steal_suitable_fallback(struct zone *zone, struct page *page,
 					int start_type, bool whole_block)
 {
 	unsigned int current_order = page_order(page);
-	struct free_area *area;
 	int free_pages, movable_pages, alike_pages;
 	int old_block_type;
 
@@ -2223,8 +2233,8 @@  static void steal_suitable_fallback(struct zone *zone, struct page *page,
 	return;
 
 single_page:
-	area = &zone->free_area[current_order];
-	list_move(&page->lru, &area->free_list[start_type]);
+	__list_del_entry(&page->lru);
+	list_sort_add(page, zone, current_order, start_type);
 }
 
 /*
@@ -2320,6 +2330,7 @@  static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 	struct page *page;
 	int order;
 	bool ret;
+	int flc = 0;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
 								ac->nodemask) {
@@ -2332,8 +2343,9 @@  static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 			continue;
 
 		spin_lock_irqsave(&zone->lock, flags);
+traversal_multi_freearea:
 		for (order = 0; order < MAX_ORDER; order++) {
-			struct free_area *area = &(zone->free_area[order]);
+			struct free_area *area = &(zone->free_area[flc][order]);
 
 			page = list_first_entry_or_null(
 					&area->free_list[MIGRATE_HIGHATOMIC],
@@ -2378,6 +2390,8 @@  static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
 				return ret;
 			}
 		}
+		if (++flc < FREE_AREA_COUNTS)
+			goto traversal_multi_freearea;
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
@@ -2402,15 +2416,18 @@  __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 	struct page *page;
 	int fallback_mt;
 	bool can_steal;
+	unsigned int flc = 0, flc_tmp = 0;
 
 	/*
 	 * Find the largest available free page in the other list. This roughly
 	 * approximates finding the pageblock with the most free pages, which
 	 * would be too costly to do exactly.
 	 */
+find_traversal_multi_freearea:
+	flc_tmp = adjust_flc(flc, order);
 	for (current_order = MAX_ORDER - 1; current_order >= order;
 				--current_order) {
-		area = &(zone->free_area[current_order]);
+		area = &(zone->free_area[flc_tmp][current_order]);
 		fallback_mt = find_suitable_fallback(area, current_order,
 				start_migratetype, false, &can_steal);
 		if (fallback_mt == -1)
@@ -2430,18 +2447,24 @@  __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
 
 		goto do_steal;
 	}
-
+	if (++flc < FREE_AREA_COUNTS)
+		goto find_traversal_multi_freearea;
 	return false;
 
 find_smallest:
+	flc = 0;
+steal_traversal_multi_freearea:
+	flc_tmp = adjust_flc(flc, order);
 	for (current_order = order; current_order < MAX_ORDER;
 							current_order++) {
-		area = &(zone->free_area[current_order]);
+		area = &(zone->free_area[flc_tmp][current_order]);
 		fallback_mt = find_suitable_fallback(area, current_order,
 				start_migratetype, false, &can_steal);
 		if (fallback_mt != -1)
-			break;
+			goto do_steal;
 	}
+	if (++flc < FREE_AREA_COUNTS)
+		goto steal_traversal_multi_freearea;
 
 	/*
 	 * This should not happen - we already found a suitable fallback
@@ -2714,6 +2737,7 @@  void mark_free_pages(struct zone *zone)
 	unsigned long flags;
 	unsigned int order, t;
 	struct page *page;
+	unsigned int flc = 0;
 
 	if (zone_is_empty(zone))
 		return;
@@ -2737,9 +2761,10 @@  void mark_free_pages(struct zone *zone)
 				swsusp_unset_page_free(page);
 		}
 
+traversal_multi_freearea:
 	for_each_migratetype_order(order, t) {
 		list_for_each_entry(page,
-				&zone->free_area[order].free_list[t], lru) {
+			&zone->free_area[flc][order].free_list[t], lru) {
 			unsigned long i;
 
 			pfn = page_to_pfn(page);
@@ -2752,6 +2777,8 @@  void mark_free_pages(struct zone *zone)
 			}
 		}
 	}
+	if (++flc < FREE_AREA_COUNTS)
+		goto traversal_multi_freearea;
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 #endif /* CONFIG_PM */
@@ -2881,6 +2908,7 @@  int __isolate_free_page(struct page *page, unsigned int order)
 	unsigned long watermark;
 	struct zone *zone;
 	int mt;
+	unsigned int flc;
 
 	BUG_ON(!PageBuddy(page));
 
@@ -2903,7 +2931,9 @@  int __isolate_free_page(struct page *page, unsigned int order)
 
 	/* Remove page from free list */
 	list_del(&page->lru);
-	zone->free_area[order].nr_free--;
+	flc = page_to_flc(page);
+	zone->free_area[flc][order].nr_free--;
+
 	rmv_page_order(page);
 
 	/*
@@ -3143,6 +3173,7 @@  bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 	long min = mark;
 	int o;
 	const bool alloc_harder = (alloc_flags & (ALLOC_HARDER|ALLOC_OOM));
+	int flc = 0;
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
@@ -3190,8 +3221,9 @@  bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 		return true;
 
 	/* For a high-order request, check at least one suitable page is free */
+traversal_multi_freearea:
 	for (o = order; o < MAX_ORDER; o++) {
-		struct free_area *area = &z->free_area[o];
+		struct free_area *area = &z->free_area[flc][o];
 		int mt;
 
 		if (!area->nr_free)
@@ -3212,6 +3244,8 @@  bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 			!list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
 			return true;
 	}
+	if (++flc < FREE_AREA_COUNTS)
+		goto traversal_multi_freearea;
 	return false;
 }
 
@@ -4887,6 +4921,7 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 	int cpu;
 	struct zone *zone;
 	pg_data_t *pgdat;
+	int flc = 0;
 
 	for_each_populated_zone(zone) {
 		if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
@@ -5032,7 +5067,7 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 
 	for_each_populated_zone(zone) {
 		unsigned int order;
-		unsigned long nr[MAX_ORDER], flags, total = 0;
+		unsigned long nr[FREE_AREA_COUNTS][MAX_ORDER], flags, total = 0;
 		unsigned char types[MAX_ORDER];
 
 		if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
@@ -5041,12 +5076,14 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		printk(KERN_CONT "%s: ", zone->name);
 
 		spin_lock_irqsave(&zone->lock, flags);
+		flc = 0;
+nr_traversal_multi_freearea:
 		for (order = 0; order < MAX_ORDER; order++) {
-			struct free_area *area = &zone->free_area[order];
+			struct free_area *area = &zone->free_area[flc][order];
 			int type;
 
-			nr[order] = area->nr_free;
-			total += nr[order] << order;
+			nr[flc][order] = area->nr_free;
+			total += nr[flc][order] << order;
 
 			types[order] = 0;
 			for (type = 0; type < MIGRATE_TYPES; type++) {
@@ -5054,13 +5090,20 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 					types[order] |= 1 << type;
 			}
 		}
+		if (++flc < FREE_AREA_COUNTS)
+			goto nr_traversal_multi_freearea;
 		spin_unlock_irqrestore(&zone->lock, flags);
+
+		flc = 0;
+printk_traversal_multi_freearea:
 		for (order = 0; order < MAX_ORDER; order++) {
 			printk(KERN_CONT "%lu*%lukB ",
-			       nr[order], K(1UL) << order);
-			if (nr[order])
+			       nr[flc][order], K(1UL) << order);
+			if (nr[flc][order])
 				show_migration_types(types[order]);
 		}
+		if (++flc < FREE_AREA_COUNTS)
+			goto printk_traversal_multi_freearea;
 		printk(KERN_CONT "= %lukB\n", K(total));
 	}
 
@@ -5576,9 +5619,14 @@  void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 static void __meminit zone_init_free_lists(struct zone *zone)
 {
 	unsigned int order, t;
-	for_each_migratetype_order(order, t) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
-		zone->free_area[order].nr_free = 0;
+	int flc;
+
+	for (flc = 0; flc < FREE_AREA_COUNTS; flc++) {
+		for_each_migratetype_order(order, t) {
+			INIT_LIST_HEAD(
+				&zone->free_area[flc][order].free_list[t]);
+			zone->free_area[flc][order].nr_free = 0;
+		}
 	}
 }
 
@@ -6365,6 +6413,7 @@  static void __init free_area_init_core(struct pglist_data *pgdat)
 		set_pageblock_order();
 		setup_usemap(pgdat, zone, zone_start_pfn, size);
 		init_currently_empty_zone(zone, zone_start_pfn, size);
+		adjust_freearea_segment(zone);
 		memmap_init(size, nid, j, zone_start_pfn);
 	}
 }
@@ -8102,6 +8151,7 @@  __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 	unsigned int order, i;
 	unsigned long pfn;
 	unsigned long flags;
+	unsigned int flc;
 	/* find the first valid pfn */
 	for (pfn = start_pfn; pfn < end_pfn; pfn++)
 		if (pfn_valid(pfn))
@@ -8137,7 +8187,8 @@  __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
 #endif
 		list_del(&page->lru);
 		rmv_page_order(page);
-		zone->free_area[order].nr_free--;
+		flc = page_to_flc(page);
+		zone->free_area[flc][order].nr_free--;
 		for (i = 0; i < (1 << order); i++)
 			SetPageReserved((page+i));
 		pfn += (1 << order);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ce81b0a..68d280a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -30,6 +30,8 @@ 
 
 #include "internal.h"
 
+#include "multi_freearea.h"
+
 #define NUMA_STATS_THRESHOLD (U16_MAX - 2)
 
 #ifdef CONFIG_NUMA
@@ -1021,16 +1023,18 @@  static void fill_contig_page_info(struct zone *zone,
 				struct contig_page_info *info)
 {
 	unsigned int order;
+	int flc = 0;
 
 	info->free_pages = 0;
 	info->free_blocks_total = 0;
 	info->free_blocks_suitable = 0;
 
+traversal_multi_freearea:
 	for (order = 0; order < MAX_ORDER; order++) {
 		unsigned long blocks;
 
 		/* Count number of free blocks */
-		blocks = zone->free_area[order].nr_free;
+		blocks = zone->free_area[flc][order].nr_free;
 		info->free_blocks_total += blocks;
 
 		/* Count free base pages */
@@ -1041,6 +1045,8 @@  static void fill_contig_page_info(struct zone *zone,
 			info->free_blocks_suitable += blocks <<
 						(order - suitable_order);
 	}
+	if (++flc < FREE_AREA_COUNTS)
+		goto traversal_multi_freearea;
 }
 
 /*
@@ -1347,10 +1353,13 @@  static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
 						struct zone *zone)
 {
 	int order;
+	int flc = 0;
 
 	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
-	for (order = 0; order < MAX_ORDER; ++order)
-		seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+	for (flc = 0; flc < FREE_AREA_COUNTS; flc++)
+		for (order = 0; order < MAX_ORDER; ++order)
+			seq_printf(m, "%6lu ",
+				zone->free_area[flc][order].nr_free);
 	seq_putc(m, '\n');
 }
 
@@ -1368,7 +1377,9 @@  static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
 	int order, mtype;
+	int flc = 0;
 
+traversal_multi_freearea:
 	for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
 		seq_printf(m, "Node %4d, zone %8s, type %12s ",
 					pgdat->node_id,
@@ -1379,7 +1390,7 @@  static void pagetypeinfo_showfree_print(struct seq_file *m,
 			struct free_area *area;
 			struct list_head *curr;
 
-			area = &(zone->free_area[order]);
+			area = &(zone->free_area[flc][order]);
 
 			list_for_each(curr, &area->free_list[mtype])
 				freecount++;
@@ -1387,6 +1398,8 @@  static void pagetypeinfo_showfree_print(struct seq_file *m,
 		}
 		seq_putc(m, '\n');
 	}
+	if (++flc < FREE_AREA_COUNTS)
+		goto traversal_multi_freearea;
 }
 
 /* Print out the free pages at each order for each migatetype */