[RFC,00/31] Generating physically contiguous memory after page allocation

Message-Id: <20190215220856.29749-1-zi.yan@sent.com>
From: Zi Yan <zi.yan@sent.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
    Michal Hocko <mhocko@kernel.org>,
    "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
    Andrew Morton <akpm@linux-foundation.org>,
    Vlastimil Babka <vbabka@suse.cz>,
    Mel Gorman <mgorman@techsingularity.net>,
    John Hubbard <jhubbard@nvidia.com>,
    Mark Hairgrove <mhairgrove@nvidia.com>,
    Nitin Gupta <nigupta@nvidia.com>,
    David Nellans <dnellans@nvidia.com>,
    Zi Yan <ziy@nvidia.com>
Subject: [RFC PATCH 00/31] Generating physically contiguous memory after page allocation
Date: Fri, 15 Feb 2019 14:08:25 -0800
Reply-To: ziy@nvidia.com

Zi Yan Feb. 15, 2019, 10:08 p.m. UTC

From: Zi Yan <ziy@nvidia.com>

Hi all,

This patchset produces physically contiguous memory by moving in-use pages
without allocating any new pages. It targets two scenarios that complement
khugepaged use cases: 1) avoiding page reclaim and memory compaction when the
system is under memory pressure, because this patchset does not allocate any new
pages; 2) generating pages larger than 2^MAX_ORDER without changing the buddy
allocator.

To demonstrate its use, I add very basic 1GB THP support and enable promoting
512 2MB THPs to a 1GB THP in my patchset. Promoting 512 4KB pages to a 2MB
THP is also implemented.
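
The promotion sizes above follow from simple arithmetic. The sketch below is
only an illustration, assuming 4KB base pages as on x86-64; the names
SKETCH_PAGE_SHIFT and sketch_order_of() are made up for this example and do not
come from the patchset. It computes the buddy-style order of a page size, so a
2MB THP is order 9, a 1GB THP is order 18, and the difference of 9 orders is the
factor of 512 mentioned above.

```c
#include <assert.h>

/* Assumption: 4KB base pages (PAGE_SHIFT == 12), as on x86-64. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Return the smallest order such that (base page size << order) >= bytes.
 * Hypothetical helper for illustration only.
 */
static unsigned int sketch_order_of(unsigned long long bytes)
{
	unsigned int order = 0;

	assert(bytes >= (1ULL << SKETCH_PAGE_SHIFT));
	while ((1ULL << (SKETCH_PAGE_SHIFT + order)) < bytes)
		order++;
	return order;
}
```

For example, sketch_order_of(2ULL << 20) gives the order of a 2MB page and
sketch_order_of(1ULL << 30) the order of a 1GB page.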

The patches are on top of v5.0-rc5. They are posted as part of my upcoming
LSF/MM proposal.

Motivation 
---- 

The goal of this patchset is to provide an alternative way of generating physically
contiguous memory and making it available as arbitrarily sized large pages. This
patchset generates physically contiguous memory/arbitrarily sized pages after pages
are allocated, by moving virtually contiguous pages so that they become physically
contiguous at any size; thus it does not require changes to memory allocators.
On the other hand, it works only for movable pages, so it faces the same
fragmentation issues as memory compaction, i.e., if non-movable pages are spread
across the entire memory, this patchset can only generate contiguity between
any two non-movable pages.

Large pages and physically contiguous memory are important to devices, such as
GPUs, FPGAs, NICs and RDMA controllers, because they can often achieve better
performance when operating on large pages. The same can be said of CPU
performance, of course, but there is an important difference: GPUs and
high-throughput devices often take a more severe performance hit, in the event
of a TLB miss and subsequent page table walks, as compared to a CPU. The effect
is sufficiently large that such devices *really* want a highly reliable way to
allocate large pages to minimize the number of potential TLB misses and the time
spent on the induced page table walks. 

Vendors (like Oracle, Mellanox, IBM, NVIDIA) are interested in generating
physically contiguous memory beyond THP sizes and are looking for solutions [1],[2],[3].
This patchset provides an alternative approach, compared to allocating
physically contiguous memory at page allocation time, to generating physically
contiguous memory after pages are allocated. This approach can avoid page
reclaim and memory compaction, which happen during the process of page
allocation, but still produces comparable physically contiguous memory. 

THPs help, but we are interested in even larger contiguous
ranges (or page size support) to further reduce address translation overheads.
With this patchset, we can generate pages larger than PMD-level THPs without
requiring MAX_ORDER changes in the buddy allocator.


Patch structure 
---- 

The patchset I developed to generate physically contiguous memory/arbitrary
sized pages merely moves pages around. There are three components in this
patchset:

1) a new page migration mechanism, called exchange pages, that exchanges the
content of two in-use pages instead of performing two back-to-back page
migrations. It saves on overheads and avoids page reclaim and memory compaction
in the page allocation path, although it is not strictly required if enough
free memory is available in the system.

2) a new mechanism that utilizes both page migration and exchange pages to
produce physically contiguous memory/arbitrary sized pages without allocating
any new pages, unlike what khugepaged does. It works on per-VMA basis, creating
physically contiguous memory out of each VMA, which is virtually contiguous.
A simple range tree is used to ensure no two VMAs are overlapping with each
other in the physical address space.

3) a use case of the new physically contiguous memory producing mechanism that
generates 1GB THPs by migrating and exchanging pages and promoting 512
contiguous 2MB THPs to a 1GB THP, although even larger physically contiguous
memory ranges can be generated. The 1GB THP implementation is very basic: it can
handle 1GB THP faults when the buddy allocator is modified to allocate 1GB pages,
and it supports splitting a 1GB THP into 2MB THPs, in-place promotion from 2MB
THPs to a 1GB THP, and PMD/PTE-mapped 1GB THPs. These are not fully tested.
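
To illustrate the idea behind component 1), here is a minimal userspace sketch,
not the code in mm/exchange.c. It swaps the contents of two page-sized buffers
through a small bounce buffer, so no third page has to be allocated, which is
the property the cover letter relies on. The names sketch_exchange_page(),
SKETCH_PAGE_SIZE and SKETCH_CHUNK are assumptions made for this example; the
real exchange_pages() also has to deal with mappings, locking and page state,
none of which is modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4KB base pages */
#define SKETCH_CHUNK       64u	/* small bounce buffer, not a full page */

/*
 * Exchange the contents of two in-use "pages" in place.
 * Only SKETCH_CHUNK bytes of temporary storage are used, so no
 * additional page is needed, unlike two back-to-back migrations
 * that each require a freshly allocated destination page.
 */
static void sketch_exchange_page(unsigned char *a, unsigned char *b)
{
	unsigned char tmp[SKETCH_CHUNK];
	size_t off;

	assert(a != NULL && b != NULL && a != b);

	for (off = 0; off < SKETCH_PAGE_SIZE; off += SKETCH_CHUNK) {
		memcpy(tmp, a + off, SKETCH_CHUNK);
		memcpy(a + off, b + off, SKETCH_CHUNK);
		memcpy(b + off, tmp, SKETCH_CHUNK);
	}
}
```

A caller would pass two distinct SKETCH_PAGE_SIZE-byte buffers; after the call
each buffer holds what the other one held before.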


[1] https://lwn.net/Articles/736170/ 
[2] https://lwn.net/Articles/753167/ 
[3] https://blogs.nvidia.com/blog/2018/06/08/worlds-fastest-exascale-ai-supercomputer-summit/ 

Zi Yan (31):
  mm: migrate: Add exchange_pages to exchange two lists of pages.
  mm: migrate: Add THP exchange support.
  mm: migrate: Add tmpfs exchange support.
  mm: add mem_defrag functionality.
  mem_defrag: split a THP if either src or dst is THP only.
  mm: Make MAX_ORDER configurable in Kconfig for buddy allocator.
  mm: deallocate pages with order > MAX_ORDER.
  mm: add pagechain container for storing multiple pages.
  mm: thp: 1GB anonymous page implementation.
  mm: proc: add 1GB THP kpageflag.
  mm: debug: print compound page order in dump_page().
  mm: stats: Separate PMD THP and PUD THP stats.
  mm: thp: 1GB THP copy on write implementation.
  mm: thp: handling 1GB THP reference bit.
  mm: thp: add 1GB THP split_huge_pud_page() function.
  mm: thp: check compound_mapcount of PMD-mapped PUD THPs at free time.
  mm: thp: split properly PMD-mapped PUD THP to PTE-mapped PUD THP.
  mm: page_vma_walk: teach it about PMD-mapped PUD THP.
  mm: thp: 1GB THP support in try_to_unmap().
  mm: thp: split 1GB THPs at page reclaim.
  mm: thp: 1GB zero page shrinker.
  mm: thp: 1GB THP follow_p*d_page() support.
  mm: support 1GB THP pagemap support.
  sysctl: add an option to only print the head page virtual address.
  mm: thp: add a knob to enable/disable 1GB THPs.
  mm: thp: promote PTE-mapped THP to PMD-mapped THP.
  mm: thp: promote PMD-mapped PUD pages to PUD-mapped PUD pages.
  mm: vmstats: add page promotion stats.
  mm: madvise: add madvise options to split PMD and PUD THPs.
  mm: mem_defrag: thp: PMD THP and PUD THP in-place promotion support.
  sysctl: toggle to promote PUD-mapped 1GB THP or not.

 arch/x86/Kconfig                       |   15 +
 arch/x86/entry/syscalls/syscall_64.tbl |    1 +
 arch/x86/include/asm/pgalloc.h         |   69 +
 arch/x86/include/asm/pgtable.h         |   20 +
 arch/x86/include/asm/sparsemem.h       |    4 +-
 arch/x86/mm/pgtable.c                  |   38 +
 drivers/base/node.c                    |    3 +
 fs/exec.c                              |    4 +
 fs/proc/meminfo.c                      |    2 +
 fs/proc/page.c                         |    2 +
 fs/proc/task_mmu.c                     |   47 +-
 include/asm-generic/pgtable.h          |  110 +
 include/linux/huge_mm.h                |   78 +-
 include/linux/khugepaged.h             |    1 +
 include/linux/ksm.h                    |    5 +
 include/linux/mem_defrag.h             |   60 +
 include/linux/memcontrol.h             |    5 +
 include/linux/mm.h                     |   34 +
 include/linux/mm_types.h               |    5 +
 include/linux/mmu_notifier.h           |   13 +
 include/linux/mmzone.h                 |    1 +
 include/linux/page-flags.h             |   79 +-
 include/linux/pagechain.h              |   73 +
 include/linux/rmap.h                   |   10 +-
 include/linux/sched/coredump.h         |    4 +
 include/linux/swap.h                   |    2 +
 include/linux/syscalls.h               |    3 +
 include/linux/vm_event_item.h          |   33 +
 include/uapi/asm-generic/mman-common.h |   15 +
 include/uapi/linux/kernel-page-flags.h |    2 +
 kernel/events/uprobes.c                |    4 +-
 kernel/fork.c                          |   14 +
 kernel/sysctl.c                        |  101 +-
 mm/Makefile                            |    2 +
 mm/compaction.c                        |   17 +-
 mm/debug.c                             |    8 +-
 mm/exchange.c                          |  878 +++++++
 mm/filemap.c                           |    8 +
 mm/gup.c                               |   60 +-
 mm/huge_memory.c                       | 3360 ++++++++++++++++++++----
 mm/hugetlb.c                           |    4 +-
 mm/internal.h                          |   46 +
 mm/khugepaged.c                        |    7 +-
 mm/ksm.c                               |   39 +-
 mm/madvise.c                           |  121 +
 mm/mem_defrag.c                        | 1941 ++++++++++++++
 mm/memcontrol.c                        |   13 +
 mm/memory.c                            |   55 +-
 mm/migrate.c                           |   14 +-
 mm/mmap.c                              |   29 +
 mm/page_alloc.c                        |  108 +-
 mm/page_vma_mapped.c                   |  129 +-
 mm/pgtable-generic.c                   |   78 +-
 mm/rmap.c                              |  283 +-
 mm/swap.c                              |   38 +
 mm/swap_slots.c                        |    2 +
 mm/swapfile.c                          |    4 +-
 mm/userfaultfd.c                       |    2 +-
 mm/util.c                              |    7 +
 mm/vmscan.c                            |   55 +-
 mm/vmstat.c                            |   32 +
 61 files changed, 7452 insertions(+), 745 deletions(-)
 create mode 100644 include/linux/mem_defrag.h
 create mode 100644 include/linux/pagechain.h
 create mode 100644 mm/exchange.c
 create mode 100644 mm/mem_defrag.c

--
2.20.1

Mike Kravetz Feb. 20, 2019, 1:42 a.m. UTC | #1

On 2/15/19 2:08 PM, Zi Yan wrote:

Thanks for working on this issue!

I have not yet had a chance to take a look at the code.  However, I do have
some general questions/comments on the approach.

> Patch structure 
> ---- 
> 
> The patchset I developed to generate physically contiguous memory/arbitrary
> sized pages merely moves pages around. There are three components in this
> patchset:
> 
> 1) a new page migration mechanism, called exchange pages, that exchanges the
> content of two in-use pages instead of performing two back-to-back page
> migration. It saves on overheads and avoids page reclaim and memory compaction
> in the page allocation path, although it is not strictly required if enough
> free memory is available in the system.
> 
> 2) a new mechanism that utilizes both page migration and exchange pages to
> produce physically contiguous memory/arbitrary sized pages without allocating
> any new pages, unlike what khugepaged does. It works on per-VMA basis, creating
> physically contiguous memory out of each VMA, which is virtually contiguous.
> A simple range tree is used to ensure no two VMAs are overlapping with each
> other in the physical address space.

This appears to be a new approach to generating contiguous areas.  Previous
attempts had relied on finding a contiguous area that can then be used for
various purposes including user mappings.  Here, you take an existing mapping
and make it contiguous.  [RFC PATCH 04/31] mm: add mem_defrag functionality
talks about creating a (VPN, PFN) anchor pair for each vma and then using
this pair as the base for creating a contiguous area.

I'm curious, how 'fixed' is the anchor?  As you know, there could be a
non-movable page in the PFN range.  As a result, you will not be able to
create a contiguous area starting at that PFN.  In such a case, do we try
another PFN?  I know this could result in much page shuffling.  I'm just
trying to figure out how we satisfy a user who really wants a contiguous
area.  Is there some method to keep trying?

My apologies if this is addressed in the code.  This was just one of the
first thoughts that came to mind when giving the series a quick look.

Zi Yan Feb. 20, 2019, 2:33 a.m. UTC | #2

On 19 Feb 2019, at 17:42, Mike Kravetz wrote:

> On 2/15/19 2:08 PM, Zi Yan wrote:
>
> Thanks for working on this issue!
>
> I have not yet had a chance to take a look at the code.  However, I do have
> some general questions/comments on the approach.

Thanks for replying. The code is very intrusive and has a lot of hacks, so it is
OK for us to discuss the general idea first. :)

> This appears to be a new approach to generating contiguous areas.  Previous
> attempts had relied on finding a contiguous area that can then be used for
> various purposes including user mappings.  Here, you take an existing mapping
> and make it contiguous.  [RFC PATCH 04/31] mm: add mem_defrag functionality
> talks about creating a (VPN, PFN) anchor pair for each vma and then using
> this pair as the base for creating a contiguous area.
>
> I'm curious, how 'fixed' is the anchor?  As you know, there could be a
> non-movable page in the PFN range.  As a result, you will not be able to
> create a contiguous area starting at that PFN.  In such a case, do we try
> another PFN?  I know this could result in much page shuffling.  I'm just
> trying to figure out how we satisfy a user who really wants a contiguous
> area.  Is there some method to keep trying?

Good question. The anchor is determined on a per-VMA basis and can be changed
easily, but in this patchset I used a very simple strategy: make all VMAs
non-overlapping in the physical address space to get maximum overall contiguity,
and do not change anchors even if non-movable pages are encountered when
generating physically contiguous pages.

Basically, the first VMA in the virtual address space, VMA1, has its anchor at
(VMA1_start_VPN, ZONE_start_PFN), the second, VMA2, has its anchor at
(VMA2_start_VPN, ZONE_start_PFN + VMA1_size), and so on. This keeps all VMAs
from overlapping in the physical address space during contiguous memory
generation. When there is a non-movable page, the anchor is not changed,
because whether or not we assign a new anchor, the contiguous pages stop at
the non-movable page. Picking a new anchor would also take more effort to
keep it from overlapping existing contiguous pages, since any overlap would
nullify the existing contiguous pages.
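
The anchor assignment described above can be sketched as follows. This is a
simplified userspace model, not the mem_defrag code: struct sketch_vma,
sketch_assign_anchors() and sketch_target_pfn() are hypothetical names, and
sizes are counted in pages. Each VMA gets the next unused PFN range in the
zone, and the target PFN of any page is its offset from the anchor VPN added
to the anchor PFN.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a VMA; sizes are in pages. */
struct sketch_vma {
	unsigned long start_vpn;	/* first virtual page number */
	unsigned long nr_pages;		/* length of the VMA in pages */
	unsigned long anchor_pfn;	/* PFN paired with start_vpn */
};

/*
 * Give each VMA, in virtual address order, the next free PFN range
 * starting at zone_start_pfn, so that no two VMAs overlap in the
 * physical address space.
 */
static void sketch_assign_anchors(struct sketch_vma *vmas, size_t nr_vmas,
				  unsigned long zone_start_pfn)
{
	unsigned long next_pfn = zone_start_pfn;
	size_t i;

	for (i = 0; i < nr_vmas; i++) {
		vmas[i].anchor_pfn = next_pfn;
		next_pfn += vmas[i].nr_pages;
	}
}

/* PFN that the page at @vpn should end up at, given the VMA's anchor. */
static unsigned long sketch_target_pfn(const struct sketch_vma *vma,
				       unsigned long vpn)
{
	assert(vpn >= vma->start_vpn);
	assert(vpn - vma->start_vpn < vma->nr_pages);
	return vma->anchor_pfn + (vpn - vma->start_vpn);
}
```

With two VMAs of 512 and 256 pages and a zone starting at PFN 0x100000, the
anchors come out as 0x100000 and 0x100200, matching the
ZONE_start_PFN + VMA1_size rule above.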

To satisfy a user who wants a contiguous area with N pages, the distance between
two neighboring non-movable pages must be bigger than N pages somewhere in
system memory. Otherwise, nothing would work. If there is such an area
(PFN1, PFN1+N) in the physical address space, you can set the anchor to
(VPN_USER, PFN1) and use exchange_pages() to generate a contiguous area with
N pages. Alternatively, alloc_contig_pages(PFN1, PFN1+N, …) could also work,
but only at page allocation time. It also requires that the system have N free
pages while alloc_contig_pages() is migrating the pages in (PFN1, PFN1+N) away;
otherwise you need to swap pages out to make space.
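
The feasibility condition above amounts to finding a run of N consecutive
movable pages. A minimal sketch, assuming movability is given as a plain
boolean array indexed by PFN offset (the kernel does not expose it this way;
the array and the name sketch_find_movable_run() are made up for
illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the index of the first run of @n consecutive movable pages in
 * movable[0..nr_pages), or -1 if no such run exists. The returned index
 * plays the role of PFN1 in the text above.
 */
static long sketch_find_movable_run(const bool *movable, size_t nr_pages,
				    size_t n)
{
	size_t run = 0;
	size_t i;

	assert(movable != NULL);
	if (n == 0 || n > nr_pages)
		return -1;

	for (i = 0; i < nr_pages; i++) {
		run = movable[i] ? run + 1 : 0;	/* a non-movable page resets the run */
		if (run == n)
			return (long)(i + 1 - n);
	}
	return -1;
}
```

If the function returns -1, no anchor choice can produce N contiguous pages,
which is the "nothing would work" case.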

Let me know if this makes sense to you.

--
Best Regards,
Yan Zi

Mike Kravetz Feb. 20, 2019, 3:18 a.m. UTC | #3

On 2/19/19 6:33 PM, Zi Yan wrote:
> Good question. The anchor is determined on a per-VMA basis and can be changed
> easily, but in this patchset I used a very simple strategy: make all VMAs
> non-overlapping in the physical address space to get maximum overall
> contiguity, and do not change anchors even if non-movable pages are
> encountered when generating physically contiguous pages.
>
> Basically, the first VMA in the virtual address space, VMA1, has its anchor at
> (VMA1_start_VPN, ZONE_start_PFN), the second, VMA2, has its anchor at
> (VMA2_start_VPN, ZONE_start_PFN + VMA1_size), and so on. This keeps all VMAs
> from overlapping in the physical address space during contiguous memory
> generation. When there is a non-movable page, the anchor is not changed,
> because whether or not we assign a new anchor, the contiguous pages stop at
> the non-movable page. Picking a new anchor would also take more effort to
> keep it from overlapping existing contiguous pages, since any overlap would
> nullify the existing contiguous pages.
>
> To satisfy a user who wants a contiguous area with N pages, the distance
> between two neighboring non-movable pages must be bigger than N pages
> somewhere in system memory. Otherwise, nothing would work. If there is such
> an area (PFN1, PFN1+N) in the physical address space, you can set the anchor
> to (VPN_USER, PFN1) and use exchange_pages() to generate a contiguous area
> with N pages. Alternatively, alloc_contig_pages(PFN1, PFN1+N, …) could also
> work, but only at page allocation time. It also requires that the system have
> N free pages while alloc_contig_pages() is migrating the pages in
> (PFN1, PFN1+N) away; otherwise you need to swap pages out to make space.
>
> Let me know if this makes sense to you.
> 

Yes, that is how I expected the implementation would work.  Thank you.

Another high level question.  One of the benefits of this approach is
that exchanging pages does not require N free pages as you describe
above.  This assumes that the vma which we are trying to make contiguous
is already populated.  If it is not populated, then you also need to
have N free pages.  Correct?  If this is true, then is the expected use
case to first populate a vma, and then try to make contiguous?  I would
assume that if it is not populated and a request to make contiguous is
given, we should try to allocate/populate the vma with contiguous pages
at that time?

Zi Yan Feb. 20, 2019, 5:19 a.m. UTC | #4

On 19 Feb 2019, at 19:18, Mike Kravetz wrote:

> Yes, that is how I expected the implementation would work.  Thank you.
>
> Another high level question.  One of the benefits of this approach is
> that exchanging pages does not require N free pages as you describe
> above.  This assumes that the vma which we are trying to make contiguous
> is already populated.  If it is not populated, then you also need to
> have N free pages.  Correct?  If this is true, then is the expected use
> case to first populate a vma, and then try to make contiguous?  I would
> assume that if it is not populated and a request to make contiguous is
> given, we should try to allocate/populate the vma with contiguous pages
> at that time?

Yes, I assume the pages within the VMA are already populated but not
contiguous yet.

My approach treats memory contiguity as an on-demand resource. In some phases
of an application, accelerators or RDMA controllers process or transfer data in
one or more VMAs, and at that time contiguous memory can help reduce address
translation overheads or lift certain constraints. Different VMAs could be
processed in different program phases, so it might be hard to get contiguous
memory for all these VMAs at allocation time using alloc_contig_pages(). My
approach can help get contiguous memory later, when the demand comes.

In some cases, you definitely can use alloc_contig_pages() to give users a
contiguous area at page allocation time, if you know the user is going to use
this area for accelerator data processing or as an RDMA buffer and the area
size is fixed.

In addition, we could also take a khugepaged-like approach, having a daemon
periodically scan VMAs and use alloc_contig_pages() to convert non-contiguous
pages in a VMA to contiguous pages, but it would require N free pages during
the conversion.

In sum, my approach complements alloc_contig_pages() and provides more
flexibility. It is not trying to replace alloc_contig_pages().


--
Best Regards,
Yan Zi

Mike Kravetz Feb. 20, 2019, 5:27 a.m. UTC | #5

On 2/19/19 9:19 PM, Zi Yan wrote:
> On 19 Feb 2019, at 19:18, Mike Kravetz wrote:
>> Another high level question.  One of the benefits of this approach is
>> that exchanging pages does not require N free pages as you describe
>> above.  This assumes that the vma which we are trying to make contiguous
>> is already populated.  If it is not populated, then you also need to
>> have N free pages.  Correct?  If this is true, then is the expected use
>> case to first populate a vma, and then try to make contiguous?  I would
>> assume that if it is not populated and a request to make contiguous is
>> given, we should try to allocate/populate the vma with contiguous pages
>> at that time?
> 
> Yes, I assume the pages within the VMA are already populated but not
> contiguous yet.
>
> My approach treats memory contiguity as an on-demand resource. In some phases
> of an application, accelerators or RDMA controllers process or transfer data
> in one or more VMAs, and at that time contiguous memory can help reduce
> address translation overheads or lift certain constraints. Different VMAs
> could be processed in different program phases, so it might be hard to get
> contiguous memory for all these VMAs at allocation time using
> alloc_contig_pages(). My approach can help get contiguous memory later, when
> the demand comes.
>
> In some cases, you definitely can use alloc_contig_pages() to give users a
> contiguous area at page allocation time, if you know the user is going to use
> this area for accelerator data processing or as an RDMA buffer and the area
> size is fixed.
>
> In addition, we could also take a khugepaged-like approach, having a daemon
> periodically scan VMAs and use alloc_contig_pages() to convert non-contiguous
> pages in a VMA to contiguous pages, but it would require N free pages during
> the conversion.
>
> In sum, my approach complements alloc_contig_pages() and provides more
> flexibility. It is not trying to replace alloc_contig_pages().

Thank you for the explanation.  That makes sense.  I have mostly been
thinking about contiguous memory from an allocation perspective and did
not really consider other use cases.
