
[RFC,0/4] Multiple consecutive page for anonymous mapping

Message ID 20230109072232.2398464-1-fengwei.yin@intel.com (mailing list archive)

Message

Yin, Fengwei Jan. 9, 2023, 7:22 a.m. UTC
In a nutshell:  4k is too small and 2M is too big.  We started
asking ourselves whether there was something in the middle that
we could do.  This series shows what that middle ground might
look like.  It provides some of the benefits of THP while
eliminating some of the downsides.

This series uses "multiple consecutive pages" (mcpages), blocks of
base pages between 8K and 2M in size, for anonymous user space
mappings. This leads to less internal fragmentation than 2M mappings,
and thus less memory consumption and less CPU time wasted zeroing
memory which will never be used.

In the implementation, we allocate a high order page with the mcpage
order (e.g., order 2 for a 16KB mcpage). This ensures physically
contiguous memory is used, which benefits sequential memory access
latency.

We then split the high order page. By doing this, each sub-page of the
mcpage is just a normal 4K page, and the current kernel page
management applies to mcpages without any changes. Page faults can be
batched within an mcpage, which reduces the number of page faults.
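
A minimal, hypothetical sketch of these steps (not the code in this
series; locking, rmap, memcg charging and error handling are omitted,
and the helper name is made up). It assumes the mcpage does not cross
a PMD boundary, so the PTEs it maps are contiguous:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Illustrative only: allocate one high order block, split it into
 * ordinary 4K pages, then batch-map all of them in a single fault.
 */
static void mcpage_fault_sketch(struct vm_area_struct *vma,
				unsigned long fault_addr, pte_t *ptep,
				unsigned int order)
{
	unsigned long addr = fault_addr & ~((PAGE_SIZE << order) - 1);
	struct page *page;
	int i;

	/* One physically contiguous allocation covering the whole mcpage. */
	page = alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, order);
	if (!page)
		return;

	/* Split immediately so every sub-page is a normal order-0 page. */
	split_page(page, order);

	/* Batch the fault: install one PTE per sub-page in a single pass. */
	for (i = 0; i < (1 << order); i++) {
		pte_t entry = mk_pte(page + i, vma->vm_page_prot);

		if (vma->vm_flags & VM_WRITE)
			entry = pte_mkwrite(pte_mkdirty(entry));
		set_pte_at(vma->vm_mm, addr + i * PAGE_SIZE, ptep + i, entry);
	}
}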

There are costs with mcpage. Besides lacking the TLB benefit THP
brings, it increases memory consumption and page allocation latency
compared to a 4K base page.

This series is the first step for mcpage. Future work can enable
mcpage for more components like the page cache, swapping, etc.
Eventually, most pages in the system will be allocated/freed/reclaimed
at mcpage order.

The series is constructed as follows:
Patch 1 adds the mcpage size related definitions and a Kconfig entry
Patch 2 is specific to x86_64 and aligns the mmap start address to the
        mcpage size
Patch 3 is the main change. It hooks into anonymous page fault
        handling and applies mcpage to anonymous mappings
Patch 4 adds some statistics for mcpage

The overall code change is quite straightforward. What I would most
like to hear here is whether this is the right direction to take
further.

This series does not leverage compound pages.  This means that
normal kernel code that encounters an 'mcpage' region does not
need to do anything special.  It also does not leverage folios,
although trying to leverage folios is something that we would
like to explore.  We would welcome input on how that might
happen.

Some performance data was collected with a 16K mcpage size and is
shown in patches 2/4 and 4/4. If you have another workload and would
like to know the impact, just let me know. I can set up the
environment and run the test.


Yin Fengwei (4):
  mcpage: add size/mask/shift definition for multiple consecutive page
  mcpage: anon page: Use mcpage for anonymous mapping
  mcpage: add vmstat counters for mcpages
  mcpage: get_unmapped_area return mcpage size aligned addr

 arch/x86/kernel/sys_x86_64.c  |   8 ++
 include/linux/gfp.h           |   5 ++
 include/linux/mcpage_mm.h     |  35 +++++++++
 include/linux/mm_types.h      |  11 +++
 include/linux/vm_event_item.h |  10 +++
 mm/Kconfig                    |  19 +++++
 mm/Makefile                   |   1 +
 mm/mcpage_memory.c            | 140 ++++++++++++++++++++++++++++++++++
 mm/memory.c                   |  12 +++
 mm/mempolicy.c                |  51 +++++++++++++
 mm/vmstat.c                   |   7 ++
 11 files changed, 299 insertions(+)
 create mode 100644 include/linux/mcpage_mm.h
 create mode 100644 mm/mcpage_memory.c


base-commit: b7bfaa761d760e72a969d116517eaa12e404c262

Comments

Kirill A. Shutemov Jan. 9, 2023, 8:37 a.m. UTC | #1
On Mon, Jan 09, 2023 at 03:22:28PM +0800, Yin Fengwei wrote:
> In a nutshell:  4k is too small and 2M is too big.  We started
> asking ourselves whether there was something in the middle that
> we could do.  This series shows what that middle ground might
> look like.  It provides some of the benefits of THP while
> eliminating some of the downsides.
> 
> This series uses "multiple consecutive pages" (mcpages) of
> between 8K and 2M of base pages for anonymous user space mappings.
> This will lead to less internal fragmentation versus 2M mappings
> and thus less memory consumption and wasted CPU time zeroing
> memory which will never be used.
> 
> In the implementation, we allocate high order page with order of
> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
> physical contiguous memory is used and benefit sequential memory
> access latency.
> 
> Then split the high order page. By doing this, the sub-page of
> mcpage is just 4K normal page. The current kernel page
> management is applied to "mc" pages without any changes. Batching
> page faults is allowed with mcpage and reduce page faults number.
> 
> There are costs with mcpage. Besides no TLB benefit THP brings, it
> increases memory consumption and latency of allocation page
> comparing to 4K base page.
> 
> This series is the first step of mcpage. The furture work can be
> enable mcpage for more components like page cache, swapping etc.
> Finally, most pages in system will be allocated/free/reclaimed
> with mcpage order.

It isn't worth adding a new path in page fault handling. We need to make
existing mechanisms more flexible.

I think it has to be done on top of folios:

1. Convert anonymous memory to folios. Only order-9 (HPAGE_PMD_ORDER) and
   order-0 at first.
2. Remove assumption of THP being order-9.
3. Start allocating THPs <order-9.
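
As an illustration of step 3 (a sketch, not code from this series; the
helper name and the fallback policy are assumptions), allocating a
sub-PMD-order anonymous folio could look roughly like:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Illustrative only: try a sub-PMD order first (anything below
 * HPAGE_PMD_ORDER), and fall back to a base page on failure.
 */
static struct folio *alloc_small_anon_folio_sketch(struct vm_area_struct *vma,
						   unsigned long addr, int order)
{
	struct folio *folio;

	folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vma, addr, false);
	if (folio)
		return folio;

	/* Under memory pressure, a plain order-0 folio still works. */
	return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);
}
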
David Hildenbrand Jan. 9, 2023, 5:33 p.m. UTC | #2
On 09.01.23 08:22, Yin Fengwei wrote:
> In a nutshell:  4k is too small and 2M is too big.  We started
> asking ourselves whether there was something in the middle that
> we could do.  This series shows what that middle ground might
> look like.  It provides some of the benefits of THP while
> eliminating some of the downsides.
> 
> This series uses "multiple consecutive pages" (mcpages) of
> between 8K and 2M of base pages for anonymous user space mappings.
> This will lead to less internal fragmentation versus 2M mappings
> and thus less memory consumption and wasted CPU time zeroing
> memory which will never be used.

Hi,

what I understand is that this is some form of faultaround for anonymous 
memory, with the special-case that we try to allocate the pages 
consecutively.

Some thoughts:

(1) Faultaround might be unexpected for some workloads and increase
     memory consumption unnecessarily.

Yes, something like that can happen with THP BUT

(a) THP can be disabled or is frequently only enabled for madvised
     regions -- for example, exactly for this reason.
(b) Some workloads (especially memory ballooning) rely on memory not
     suddenly re-appearing after MADV_DONTNEED. This works even with THP,
     because the 4k MADV_DONTNEED will first PTE-map the THP. Because
     there is a PTE page table, we won't suddenly get a THP populated
     again (unless khugepaged is configured to fill holes).
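
To illustrate (b), a minimal userspace sketch (illustrative only;
assumes THP is enabled for the mapping):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;			/* one PMD-sized region */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	madvise(p, len, MADV_HUGEPAGE);		/* hint: back with a THP */
	memset(p, 0xaa, len);			/* populate the region */

	/*
	 * Punch a single 4k hole: the kernel PTE-maps the THP first and
	 * zaps just this page, so it does not silently reappear as part
	 * of a huge page later (unless khugepaged collapses it again).
	 */
	madvise(p + 4096, 4096, MADV_DONTNEED);

	/* The dropped page now reads back as zero fill. */
	printf("byte after MADV_DONTNEED: 0x%02x\n", (unsigned char)p[4096]);

	munmap(p, len);
	return 0;
}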


I strongly assume we will need something similar to force-disable, 
selectively-enable etc.


(2) This steals consecutive pages to immediately split them up

I know, everybody thinks it might be valuable for their use case to grab 
all higher-order pages :) It will be "fun" once all these cases start 
competing. TBH, splitting them up immediately again smells like being 
the lowest priority among all higher-order users.


(3) All effort will be lost once page compaction gets active, compacts,
     and simply migrates to random 4k pages. This is most probably the
     biggest "issue" of the whole approach AFAIKS: it's only temporary
     because there is no notion of these pages belonging together
     anymore.

> 
> In the implementation, we allocate high order page with order of
> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
> physical contiguous memory is used and benefit sequential memory
> access latency.
> 
> Then split the high order page. By doing this, the sub-page of
> mcpage is just 4K normal page. The current kernel page
> management is applied to "mc" pages without any changes. Batching
> page faults is allowed with mcpage and reduce page faults number.
> 
> There are costs with mcpage. Besides no TLB benefit THP brings, it
> increases memory consumption and latency of allocation page
> comparing to 4K base page.
> 
> This series is the first step of mcpage. The furture work can be
> enable mcpage for more components like page cache, swapping etc.
> Finally, most pages in system will be allocated/free/reclaimed
> with mcpage order.

I think avoiding new, hard-to-get terminology ("mcpage") might be 
better. I know, everybody wants to give their child a name, but the name 
is not really future proof: "multiple consecutive pages" might at one 
point be just a folio.

I'd summarize the ideas as "faultaround" whereby we try optimizing for 
locality.

Note that a similar (but different) concept already exists (hidden) for 
hugetlb e.g., on arm64. The feature is called "cont-pte" -- a sequence 
of PTEs that logically map a hugetlb page.
Matthew Wilcox Jan. 9, 2023, 7:11 p.m. UTC | #3
On Mon, Jan 09, 2023 at 06:33:09PM +0100, David Hildenbrand wrote:
> (2) This steals consecutive pages to immediately split them up
> 
> I know, everybody thinks it might be valuable for their use case to grab all
> higher-order pages :) It will be "fun" once all these cases start competing.
> TBH, splitting up them immediately again smells like being the lowest
> priority among all higher-order users.

Actually, it is good for everybody to allocate higher-order pages, if they
can make use of them.  It has the end effect of reducing fragmentation
(imagine if the base unit of allocation were 512 bytes; every page fault
would have to do an order-3 allocation, and it wouldn't be long until
order-0 allocations had fragmented memory such that we could no longer
service a page fault).

Splitting them again is clearly one of the bad things done in this
proof-of-concept.  Anything that goes upstream won't do that, but I
suspect it was necessary to avoid fixing all the places in the kernel
that assume anon memory is either order-0 or -9.

> (3) All effort will be lost once page compaction gets active, compacts,
>     and simply migrates to random 4k pages. This is most probably the
>     biggest "issue" of the whole approach AFAIKS: it's only temporary
>     because there is no notion of these pages belonging together
>     anymore.

Sure, page compaction / migration is going to have to learn how to handle
order 1-8 folios.  Again, not needed for the PoC.
Yin, Fengwei Jan. 10, 2023, 3:57 a.m. UTC | #4
On 1/10/2023 1:33 AM, David Hildenbrand wrote:
> On 09.01.23 08:22, Yin Fengwei wrote:
>> In a nutshell:  4k is too small and 2M is too big.  We started
>> asking ourselves whether there was something in the middle that
>> we could do.  This series shows what that middle ground might
>> look like.  It provides some of the benefits of THP while
>> eliminating some of the downsides.
>>
>> This series uses "multiple consecutive pages" (mcpages) of
>> between 8K and 2M of base pages for anonymous user space mappings.
>> This will lead to less internal fragmentation versus 2M mappings
>> and thus less memory consumption and wasted CPU time zeroing
>> memory which will never be used.
> 
> Hi,
> 
> what I understand is that this is some form of faultaround for anonymous memory, with the special-case that we try to allocate the pages consecutively.
For this patchset, yes. But mcpage can be enabled for page cache,
swapping etc.

> 
> Some thoughts:
> 
> (1) Faultaround might be unexpected for some workloads and increase
>     memory consumption unnecessarily.
Compared to THP, the memory consumption and latency introduced by
mcpage are minor.

> 
> Yes, something like that can happen with THP BUT
> 
> (a) THP can be disabled or is frequently only enabled for madvised
>     regions -- for example, exactly for this reason.
> (b) Some workloads (especially memory ballooning) rely on memory not
>     suddenly re-appearing after MADV_DONTNEED. This works even with THP,
>     because the 4k MADV_DONTNEED will first PTE-map the THP. Because
>     there is a PTE page table, we won't suddenly get a THP populated
>     again (unless khugepaged is configured to fill holes).
> 
> 
> I strongly assume we will need something similar to force-disable, selectively-enable etc.
Agree.

> 
> 
> (2) This steals consecutive pages to immediately split them up
> 
> I know, everybody thinks it might be valuable for their use case to grab all higher-order pages :) It will be "fun" once all these cases start competing. TBH, splitting up them immediately again smells like being the lowest priority among all higher-order users.
> 
The motivations for splitting it immediately are:
1. All the sub-pages are just normal 4K pages. No other changes need to
   be added to handle them.
2. Splitting it before use doesn't involve complicated page lock handling.

> 
> (3) All effort will be lost once page compaction gets active, compacts,
>     and simply migrates to random 4k pages. This is most probably the
>     biggest "issue" of the whole approach AFAIKS: it's only temporary
>     because there is no notion of these pages belonging together
>     anymore.
Yes. But I suppose page compaction could be updated to handle mcpage,
e.g., by always handling all sub-pages together. We experimented with
this for reclaim.

> 
>>
>> In the implementation, we allocate high order page with order of
>> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
>> physical contiguous memory is used and benefit sequential memory
>> access latency.
>>
>> Then split the high order page. By doing this, the sub-page of
>> mcpage is just 4K normal page. The current kernel page
>> management is applied to "mc" pages without any changes. Batching
>> page faults is allowed with mcpage and reduce page faults number.
>>
>> There are costs with mcpage. Besides no TLB benefit THP brings, it
>> increases memory consumption and latency of allocation page
>> comparing to 4K base page.
>>
>> This series is the first step of mcpage. The furture work can be
>> enable mcpage for more components like page cache, swapping etc.
>> Finally, most pages in system will be allocated/free/reclaimed
>> with mcpage order.
> 
> I think avoiding new, herd-to-get terminology ("mcpage") might be better. I know, everybody wants to give its child a name, but the name us not really future proof: "multiple consecutive pages" might at one point be maybe just a folio.
> 
> I'd summarize the ideas as "faultaround" whereby we try optimizing for locality.
> 
> Note that a similar (but different) concept already exists (hidden) for hugetlb e.g., on arm64. The feature is called "cont-pte" -- a sequence of PTEs that logically map a hugetlb page.
"cont-pte" on ARM64 has fixed size which match the silicon definition.
mcpage allows software/user to define the size which is not necessary
to be exact same as silicon defined. Thanks.

Regards
Yin, Fengwei 

>
David Hildenbrand Jan. 10, 2023, 2:13 p.m. UTC | #5
On 09.01.23 20:11, Matthew Wilcox wrote:
> On Mon, Jan 09, 2023 at 06:33:09PM +0100, David Hildenbrand wrote:
>> (2) This steals consecutive pages to immediately split them up
>>
>> I know, everybody thinks it might be valuable for their use case to grab all
>> higher-order pages :) It will be "fun" once all these cases start competing.
>> TBH, splitting up them immediately again smells like being the lowest
>> priority among all higher-order users.
> 
> Actually, it is good for everybody to allocate higher-order pages, if they
> can make use of them.  It has the end effect of reducing fragmentation
> (imagine if the base unit of allocation were 512 bytes; every page fault
> would have to do an order-3 allocation, and it wouldn't be long until
> order-0 allocations had fragmented memory such that we could no longer
> service a page fault).

I don't believe that this reasoning is universally true. But I can see 
some of it being true if everybody were allocating higher-order pages 
and there were no memory pressure.

Simple example of why I am skeptical: our free lists hold an order-9 page 
and 4 order-0 pages.

It's counter-intuitive to split (fragment!) the order-9 page to allocate 
an order-2 page instead of just "consuming the leftover" and letting 
somebody else make use of the full order-9 page (e.g., a proper THP).

Now, reality will tell us if we're handing out 
higher-order-but-not-thp-order pages too easily and end up fragmenting 
the wrong orders. IMHO, fragmentation is and remains a challenge ... 
and I don't think that will improve, especially once we have more 
consumers of higher-order pages -- especially where they might not be 
that beneficial.

I'm happy to be wrong on this one.

> 
> Splitting them again is clearly one of the bad things done in this
> proof-of-concept.  Anything that goes upstream won't do that, but I
> suspect it was necessary to avoid fixing all the places in the kernel
> that assume anon memory is either order-0 or -9.

Agreed. An upstream version shouldn't perform this split -- which will 
require more work.
David Hildenbrand Jan. 10, 2023, 2:40 p.m. UTC | #6
On 10.01.23 04:57, Yin, Fengwei wrote:
> 
> 
> On 1/10/2023 1:33 AM, David Hildenbrand wrote:
>> On 09.01.23 08:22, Yin Fengwei wrote:
>>> In a nutshell:  4k is too small and 2M is too big.  We started
>>> asking ourselves whether there was something in the middle that
>>> we could do.  This series shows what that middle ground might
>>> look like.  It provides some of the benefits of THP while
>>> eliminating some of the downsides.
>>>
>>> This series uses "multiple consecutive pages" (mcpages) of
>>> between 8K and 2M of base pages for anonymous user space mappings.
>>> This will lead to less internal fragmentation versus 2M mappings
>>> and thus less memory consumption and wasted CPU time zeroing
>>> memory which will never be used.
>>
>> Hi,

Hi,

>>
>> what I understand is that this is some form of faultaround for anonymous memory, with the special-case that we try to allocate the pages consecutively.
> For this patchset, yes. But mcpage can be enabled for page cache,
> swapping etc.

Right, PTE-mapping higher-order pages, in a faultaround fashion. But for 
pagecache etc. that doesn't require mcpage IMHO. I think it's the 
natural evolution of folios that Willy envisioned at some point.

> 
>>
>> Some thoughts:
>>
>> (1) Faultaround might be unexpected for some workloads and increase
>>      memory consumption unnecessarily.
> Comparing to THP, the memory consumption and latency introduced by
> mcpage is minor.

But it exists :)

> 
>>
>> Yes, something like that can happen with THP BUT
>>
>> (a) THP can be disabled or is frequently only enabled for madvised
>>      regions -- for example, exactly for this reason.
>> (b) Some workloads (especially memory ballooning) rely on memory not
>>      suddenly re-appearing after MADV_DONTNEED. This works even with THP,
>>      because the 4k MADV_DONTNEED will first PTE-map the THP. Because
>>      there is a PTE page table, we won't suddenly get a THP populated
>>      again (unless khugepaged is configured to fill holes).
>>
>>
>> I strongly assume we will need something similar to force-disable, selectively-enable etc.
> Agree.

Thinking again, we might want to piggy-back on the THP machinery/config 
knobs completely, hmm. After all, it's a similar concept to a THP (once 
we properly handle folios), just that we are not able to PMD-map the 
folio because it is too small.

Some applications that trigger MADV_NOHUGEPAGE don't want to get more 
pages populated than actually accessed. userfaultfd users come to mind, 
where we might not even have the guarantee of seeing a UFFD registration 
before enabling MADV_NOHUGEPAGE and filling out some pages ... if we'd 
populate too many PTEs, we could miss uffd faults later ...

> 
>>
>>
>> (2) This steals consecutive pages to immediately split them up
>>
>> I know, everybody thinks it might be valuable for their use case to grab all higher-order pages :) It will be "fun" once all these cases start competing. TBH, splitting up them immediately again smells like being the lowest priority among all higher-order users.
>>
> The motivations to split it immediately are:
> 1. All the sub-pages is just normal 4K page. No other changes need be
>     added to handle it.
> 2. splitting it before use doesn't involved complicated page lock handling.

I think for an upstream version we really want to avoid these splits.

>>>
>>> In the implementation, we allocate high order page with order of
>>> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
>>> physical contiguous memory is used and benefit sequential memory
>>> access latency.
>>>
>>> Then split the high order page. By doing this, the sub-page of
>>> mcpage is just 4K normal page. The current kernel page
>>> management is applied to "mc" pages without any changes. Batching
>>> page faults is allowed with mcpage and reduce page faults number.
>>>
>>> There are costs with mcpage. Besides no TLB benefit THP brings, it
>>> increases memory consumption and latency of allocation page
>>> comparing to 4K base page.
>>>
>>> This series is the first step of mcpage. The furture work can be
>>> enable mcpage for more components like page cache, swapping etc.
>>> Finally, most pages in system will be allocated/free/reclaimed
>>> with mcpage order.
>>
>> I think avoiding new, herd-to-get terminology ("mcpage") might be better. I know, everybody wants to give its child a name, but the name us not really future proof: "multiple consecutive pages" might at one point be maybe just a folio.
>>
>> I'd summarize the ideas as "faultaround" whereby we try optimizing for locality.
>>
>> Note that a similar (but different) concept already exists (hidden) for hugetlb e.g., on arm64. The feature is called "cont-pte" -- a sequence of PTEs that logically map a hugetlb page.
> "cont-pte" on ARM64 has fixed size which match the silicon definition.
> mcpage allows software/user to define the size which is not necessary
> to be exact same as silicon defined. Thanks.

Yes. And the whole concept is abstracted away: it's logically a single, 
larger PTE, and we can only map/unmap in that PTE granularity.
Yin, Fengwei Jan. 11, 2023, 6:12 a.m. UTC | #7
On 1/10/2023 10:40 PM, David Hildenbrand wrote:
> On 10.01.23 04:57, Yin, Fengwei wrote:
>>
>>
>> On 1/10/2023 1:33 AM, David Hildenbrand wrote:
>>> On 09.01.23 08:22, Yin Fengwei wrote:
>>>> In a nutshell:  4k is too small and 2M is too big.  We started
>>>> asking ourselves whether there was something in the middle that
>>>> we could do.  This series shows what that middle ground might
>>>> look like.  It provides some of the benefits of THP while
>>>> eliminating some of the downsides.
>>>>
>>>> This series uses "multiple consecutive pages" (mcpages) of
>>>> between 8K and 2M of base pages for anonymous user space mappings.
>>>> This will lead to less internal fragmentation versus 2M mappings
>>>> and thus less memory consumption and wasted CPU time zeroing
>>>> memory which will never be used.
>>>
>>> Hi,
> 
> Hi,
> 
>>>
>>> what I understand is that this is some form of faultaround for anonymous memory, with the special-case that we try to allocate the pages consecutively.
>> For this patchset, yes. But mcpage can be enabled for page cache,
>> swapping etc.
> 
> Right, PTE-mapping higher-order pages, in a faultaround fashion. But for pagecache etc. that doesn't require mcpage IMHO. I think it's the natural evolution of folios that Willy envisioned at some point.
Agree.

> 
>>
>>>
>>> Some thoughts:
>>>
>>> (1) Faultaround might be unexpected for some workloads and increase
>>>      memory consumption unnecessarily.
>> Comparing to THP, the memory consumption and latency introduced by
>> mcpage is minor.
> 
> But it exists :)
Yes. There is extra memory consumption, even if it's minor.

> 
>>
>>>
>>> Yes, something like that can happen with THP BUT
>>>
>>> (a) THP can be disabled or is frequently only enabled for madvised
>>>      regions -- for example, exactly for this reason.
>>> (b) Some workloads (especially memory ballooning) rely on memory not
>>>      suddenly re-appearing after MADV_DONTNEED. This works even with THP,
>>>      because the 4k MADV_DONTNEED will first PTE-map the THP. Because
>>>      there is a PTE page table, we won't suddenly get a THP populated
>>>      again (unless khugepaged is configured to fill holes).
>>>
>>>
>>> I strongly assume we will need something similar to force-disable, selectively-enable etc.
>> Agree.
> 
> Thinking again, we might want to piggy-back on the THP machinery/config knobs completely, hmm. After all, it's a similar concept to a THP (once we properly handle folios), just that we are not able to PMD-map the folio because it is too small.
> 
> Some applications that trigger MADV_NOHUGEPAGE don't want to get more pages populated than actually accessed. userfaultfd users come to mind, where we might not even have the guaranteed to see a UFFD registration before enabling MADV_NOHUGEPAGE and filling out some pages ... if we'd populate too many PTEs, we could miss uffd faults later ...
This is a good point.

> 
>>
>>>
>>>
>>> (2) This steals consecutive pages to immediately split them up
>>>
>>> I know, everybody thinks it might be valuable for their use case to grab all higher-order pages :) It will be "fun" once all these cases start competing. TBH, splitting up them immediately again smells like being the lowest priority among all higher-order users.
>>>
>> The motivations to split it immediately are:
>> 1. All the sub-pages is just normal 4K page. No other changes need be
>>     added to handle it.
>> 2. splitting it before use doesn't involved complicated page lock handling.
> 
> I think for an upstream version we really want to avoid these splits.
OK.

> 
>>>>
>>>> In the implementation, we allocate high order page with order of
>>>> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
>>>> physical contiguous memory is used and benefit sequential memory
>>>> access latency.
>>>>
>>>> Then split the high order page. By doing this, the sub-page of
>>>> mcpage is just 4K normal page. The current kernel page
>>>> management is applied to "mc" pages without any changes. Batching
>>>> page faults is allowed with mcpage and reduce page faults number.
>>>>
>>>> There are costs with mcpage. Besides no TLB benefit THP brings, it
>>>> increases memory consumption and latency of allocation page
>>>> comparing to 4K base page.
>>>>
>>>> This series is the first step of mcpage. The furture work can be
>>>> enable mcpage for more components like page cache, swapping etc.
>>>> Finally, most pages in system will be allocated/free/reclaimed
>>>> with mcpage order.
>>>
>>> I think avoiding new, herd-to-get terminology ("mcpage") might be better. I know, everybody wants to give its child a name, but the name us not really future proof: "multiple consecutive pages" might at one point be maybe just a folio.
>>>
>>> I'd summarize the ideas as "faultaround" whereby we try optimizing for locality.
>>>
>>> Note that a similar (but different) concept already exists (hidden) for hugetlb e.g., on arm64. The feature is called "cont-pte" -- a sequence of PTEs that logically map a hugetlb page.
>> "cont-pte" on ARM64 has fixed size which match the silicon definition.
>> mcpage allows software/user to define the size which is not necessary
>> to be exact same as silicon defined. Thanks.
> 
> Yes. And the whole concept is abstracted away: it's logically a single, larger PTE, and we can only map/unmap in that PTE granularity.
David, thanks a lot for the comments.


Regards
Yin, Fengwei

>
Yin, Fengwei Jan. 11, 2023, 6:13 a.m. UTC | #8
On 1/9/2023 4:37 PM, Kirill A. Shutemov wrote:
> On Mon, Jan 09, 2023 at 03:22:28PM +0800, Yin Fengwei wrote:
>> In a nutshell:  4k is too small and 2M is too big.  We started
>> asking ourselves whether there was something in the middle that
>> we could do.  This series shows what that middle ground might
>> look like.  It provides some of the benefits of THP while
>> eliminating some of the downsides.
>>
>> This series uses "multiple consecutive pages" (mcpages) of
>> between 8K and 2M of base pages for anonymous user space mappings.
>> This will lead to less internal fragmentation versus 2M mappings
>> and thus less memory consumption and wasted CPU time zeroing
>> memory which will never be used.
>>
>> In the implementation, we allocate high order page with order of
>> mcpage (e.g., order 2 for 16KB mcpage). This makes sure the
>> physical contiguous memory is used and benefit sequential memory
>> access latency.
>>
>> Then split the high order page. By doing this, the sub-page of
>> mcpage is just 4K normal page. The current kernel page
>> management is applied to "mc" pages without any changes. Batching
>> page faults is allowed with mcpage and reduce page faults number.
>>
>> There are costs with mcpage. Besides no TLB benefit THP brings, it
>> increases memory consumption and latency of allocation page
>> comparing to 4K base page.
>>
>> This series is the first step of mcpage. The furture work can be
>> enable mcpage for more components like page cache, swapping etc.
>> Finally, most pages in system will be allocated/free/reclaimed
>> with mcpage order.
> 
> It doesn't worth adding a new path in page fault handing. We need to make
> existing mechanisms more flexible.
> 
> I think it has to be done on top of folios:
> 
> 1. Converts anonymous memory to folios. Only order-9 (HPAGE_PMD_ORDER) and
>    order-0 at first.
> 2. Remove assumption of THP being order-9.
> 3. Start allocating THPs <order-9.
Thanks a lot for the comments. Really appreciate it.


Regards
Yin, Fengwei

>