[RFC,v2,00/17] variable-order, large folios for anonymous memory

Message ID: 20230414130303.2345383-1-ryan.roberts@arm.com

Message

Ryan Roberts April 14, 2023, 1:02 p.m. UTC
Hi All,

This is a second RFC and my first proper attempt at implementing variable order,
large folios for anonymous memory. The first RFC [1] was a partial
implementation and a plea for help in debugging an issue I was hitting; thanks
to Yin Fengwei and Matthew Wilcox for their advice in solving that!

The objective of variable order anonymous folios is to improve performance by
allocating larger chunks of memory during anonymous page faults:

 - Since SW (the kernel) is dealing with larger chunks of memory than base
   pages, there are efficiency savings to be had; fewer page faults, batched PTE
   and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
   overhead. This should benefit all architectures.
 - Since we are now mapping physically contiguous chunks of memory, we can take
   advantage of HW TLB compression techniques. A reduction in TLB pressure
   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
   TLB entries: "the contiguous bit" (architectural) and HPA (uarch) - see [2].

This patch set deals with the SW side of things only but sets us up nicely for
taking advantage of the HW improvements in the near future.

I'm not yet benchmarking a wide variety of use cases, but those that I have
looked at are positive; I see kernel compilation time improved by up to 10%,
which I expect to improve further once I add in the arm64 "contiguous bit".
Memory consumption is somewhere between 1% less and 2% more, depending on how
it's measured. More on perf and memory below.

The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
conflict resolution). I have a tree at [4].

[1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
[3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2

Approach
========

There are 4 fault paths that have been modified:
 - write fault on unallocated address: do_anonymous_page()
 - write fault on zero page: wp_page_copy()
 - write fault on non-exclusive CoW page: wp_page_copy()
 - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()

In the first 2 cases, we will determine the preferred folio order to allocate,
limited by a max order (currently order-4; see below), by VMA and PMD bounds, and
by the state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
folio as the source, subject to constraints that may arise if the source has
been mremapped or partially munmapped. And in the 4th case, we reuse as much of
the folio as we can, subject to the same mremap/munmap constraints.

If allocation of our preferred folio order fails, we gracefully fall back to
lower orders all the way to 0.
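
To illustrate, the fallback logic is roughly the sketch below. The function name
comes from the "Introduce try_vma_alloc_movable_folio()" patch in this series,
but the signature and body here are simplified for illustration rather than
being the exact code:

static struct folio *try_vma_alloc_movable_folio(struct vm_area_struct *vma,
                                                 unsigned long addr, int order)
{
        struct folio *folio;

        /* Try the preferred order first, then fall back one order at a time. */
        for (; order > 0; order--) {
                folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, order, vma,
                                        addr & ~((PAGE_SIZE << order) - 1),
                                        false);
                if (folio)
                        return folio;
        }

        /* Final fallback: a single base page, exactly as before this series. */
        return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);
}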

Note that none of this affects the behavior of traditional PMD-sized THP. If we
take a fault in an MADV_HUGEPAGE region, you still get PMD-sized mappings.

Open Questions
==============

How to Move Forwards
--------------------

While the series is a small-ish code change, it represents a big shift in the
way things are done. So I'd appreciate any help in scaling up performance
testing, review, and general advice on how best to guide a change like this into
the kernel.

Folio Allocation Order Policy
-----------------------------

The current code is hardcoded to use a maximum order of 4. This was chosen for a
few reasons:
 - From the SW performance perspective, I see a knee around here where
   increasing it doesn't lead to much more performance gain.
 - Intuitively I assume that higher orders become increasingly difficult to
   allocate.
 - From the HW performance perspective, arm64's HPA works on order-2 blocks and
   "the contiguous bit" works on order-4 for 4KB base pages (although it's
   order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
   any higher.

I suggest that ultimately setting the max order should be left to the
architecture. arm64 would take advantage of this and set it to the order
required for the contiguous bit for the configured base page size.
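
To make that concrete, the kind of interface I'm imagining looks something like
the sketch below. The macro names are invented for illustration (the series
currently hardcodes 4 inside its max-order routine); the arm64 value falls out
of the contiguous-bit geometry and matches the orders quoted above:

/* Generic default: the order-4 maximum this series currently hardcodes. */
#ifndef arch_max_anon_folio_order
#define arch_max_anon_folio_order()     4
#endif

/*
 * In arm64's headers, derived from the contiguous bit for the configured base
 * page size: order-4 for 4KB pages, order-7 for 16KB, order-5 for 64KB.
 */
#define arch_max_anon_folio_order()     (CONT_PTE_SHIFT - PAGE_SHIFT)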

However, I also have a (mild) concern about increased memory consumption. If an
app has a pathological fault pattern (e.g. sparsely touches memory every 64KB),
we would end up allocating 16x as much memory as we used to. One potential
approach I see here is to track fault addresses per-VMA, and increase a per-VMA
max allocation order for consecutive faults that extend a contiguous range, and
decrement when discontiguous. Alternatively/additionally, we could use the VMA
size as an indicator. I'd be interested in your thoughts/opinions.
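
As a straw-man for the per-VMA tracking idea, I'm thinking of something along
these lines. The structure and names below are entirely hypothetical and exist
nowhere today; they just show the "increase on contiguous, back off on
discontiguous" behaviour described above:

/* Hypothetical per-VMA state; not part of this series. */
struct anon_fault_hint {
        unsigned long   last_fault_end; /* end of the last faulted range */
        int             max_order;      /* current per-VMA order cap */
};

static int anon_folio_order_hint(struct anon_fault_hint *hint,
                                 unsigned long addr, int arch_max_order)
{
        if (addr == hint->last_fault_end) {
                /* Fault extends a contiguous range: allow a larger order. */
                if (hint->max_order < arch_max_order)
                        hint->max_order++;
        } else if (hint->max_order > 0) {
                /* Discontiguous fault: back off towards order-0. */
                hint->max_order--;
        }

        hint->last_fault_end = addr + (PAGE_SIZE << hint->max_order);
        return hint->max_order;
}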

Deferred Split Queue Lock Contention
------------------------------------

The results below show that we are spending a much greater proportion of time in
the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.

I think this is (at least partially) related to contention on the deferred
split queue lock. This is a per-memcg spinlock, which means a single spinlock
shared among all 160 CPUs. I've solved part of the problem with the last patch
in the series (which cuts down the need to take the lock), but at folio free
time (free_transhuge_page()), the lock is still taken and I think this could be
a problem. Now that most anonymous pages are large folios, this lock is taken a
lot more.

I think we could probably avoid taking the lock unless !list_empty(), but I
haven't convinced myself it's definitely safe, so haven't applied it yet.
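
Concretely, the change I have in mind is roughly the following, based on
free_transhuge_page() as it looks in v6.3 (paraphrased from memory, so treat it
as a sketch). The open question is whether the unlocked list_empty() check can
race with a concurrent add to the deferred split list:

void free_transhuge_page(struct page *page)
{
        struct deferred_split *ds_queue = get_deferred_split_queue(page);
        unsigned long flags;

        /*
         * Only take the per-memcg split queue lock if the folio appears to
         * be on the deferred split list; re-check under the lock to handle
         * a racing removal.
         */
        if (!list_empty(page_deferred_list(page))) {
                spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
                if (!list_empty(page_deferred_list(page))) {
                        ds_queue->split_queue_len--;
                        list_del(page_deferred_list(page));
                }
                spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
        }
        free_compound_page(page);
}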

Roadmap
=======

Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
bit" on arm64 to validate predictions about HW speedups.

I also think there are some opportunities with madvise to split folios to non-0
orders, which might improve performance in some cases. madvise is also mistaking
exclusive large folios for non-exclusive ones at the moment (due to the "small
pages" mapcount scheme), so that needs to be fixed so that MADV_FREE correctly
frees the folio.

Results
=======

Performance
-----------

Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
before each run.

make defconfig && time make -jN Image

First with -j8:

|           | baseline time  | anonfolio time | percent change |
|           | to compile (s) | to compile (s) | SMALLER=better |
|-----------|---------------:|---------------:|---------------:|
| real-time |          373.0 |          342.8 |          -8.1% |
| user-time |         2333.9 |         2275.3 |          -2.5% |
| sys-time  |          510.7 |          340.9 |         -33.3% |

The above shows an 8.1% improvement in real-time execution and a 33.3% saving in kernel
execution. The next 2 tables show a breakdown of the cycles spent in the kernel
for the 8 job config:

|                      | baseline | anonfolio | percent change |
|                      | (cycles) | (cycles)  | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| data abort           |     683B |      316B |         -53.8% |
| instruction abort    |      93B |       76B |         -18.4% |
| syscall              |     887B |      767B |         -13.6% |

|                      | baseline | anonfolio | percent change |
|                      | (cycles) | (cycles)  | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| arm64_sys_openat     |     194B |      188B |          -3.3% |
| arm64_sys_exit_group |     192B |      124B |         -35.7% |
| arm64_sys_read       |     124B |      108B |         -12.7% |
| arm64_sys_execve     |      75B |       67B |         -11.0% |
| arm64_sys_mmap       |      51B |       50B |          -3.0% |
| arm64_sys_mprotect   |      15B |       13B |         -12.0% |
| arm64_sys_write      |      43B |       42B |          -2.9% |
| arm64_sys_munmap     |      15B |       12B |         -17.0% |
| arm64_sys_newfstatat |      46B |       41B |          -9.7% |
| arm64_sys_clone      |      26B |       24B |         -10.0% |

And now with -j160:

|           | baseline time  | anonfolio time | percent change |
|           | to compile (s) | to compile (s) | SMALLER=better |
|-----------|---------------:|---------------:|---------------:|
| real-time |           53.7 |           48.2 |         -10.2% |
| user-time |         2705.8 |         2842.1 |           5.0% |
| sys-time  |         1370.4 |         1064.3 |         -22.3% |

The above shows a 10.2% improvement in real-time execution, but ~3x more time is
spent in the kernel than for the -j8 config. I think this is related to the lock
contention issue I highlighted above, but haven't bottomed it out yet. It's also
not yet clear to me why user-time increases by 5%.

I've also run all the will-it-scale microbenchmarks for a single task, using the
process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
fluctuation. So I'm just calling out tests whose results show more than a 5%
improvement or more than a 5% regression. Results are the average of 3 runs.
Only 2 tests regressed:

| benchmark            | baseline | anonfolio | percent change |
|                      | ops/s    | ops/s     | BIGGER=better  |
| ---------------------|---------:|----------:|---------------:|
| context_switch1.csv  |   328744 |    351150 |          6.8%  |
| malloc1.csv          |    96214 |     50890 |        -47.1%  |
| mmap1.csv            |   410253 |    375746 |         -8.4%  |
| page_fault1.csv      |   624061 |   3185678 |        410.5%  |
| page_fault2.csv      |   416483 |    557448 |         33.8%  |
| page_fault3.csv      |   724566 |   1152726 |         59.1%  |
| read1.csv            |  1806908 |   1905752 |          5.5%  |
| read2.csv            |   587722 |   1942062 |        230.4%  |
| tlb_flush1.csv       |   143910 |    152097 |          5.7%  |
| tlb_flush2.csv       |   266763 |    322320 |         20.8%  |

I believe malloc1 is an unrealistic test, since it does malloc/free of a 128M
object in a loop and never touches the allocated memory. I think the malloc
implementation is maintaining a header just before the allocated object, which
causes a single page fault. Previously that page fault allocated 1 page. Now it
is allocating 16 pages. This cost would be repaid if the test code wrote to the
allocated object. Alternatively the folio allocation order policy described
above would also solve this.

It is not clear to me why mmap1 has slowed down. This remains a todo.

Memory
------

I measured memory consumption while doing a kernel compile with 8 jobs on a
system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
workload, then calculated "memory used" high and low watermarks using both
MemFree and MemAvailable. If there is a better way of measuring system memory
consumption, please let me know!

mem-used = 4GB - /proc/meminfo:MemFree

|                      | baseline | anonfolio | percent change |
|                      | (MB)     | (MB)      | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| mem-used-low         |      825 |       842 |           2.1% |
| mem-used-high        |     2697 |      2672 |          -0.9% |

mem-used = 4GB - /proc/meminfo:MemAvailable

|                      | baseline | anonfolio | percent change |
|                      | (MB)     | (MB)      | SMALLER=better |
|----------------------|---------:|----------:|---------------:|
| mem-used-low         |      518 |       530 |           2.3% |
| mem-used-high        |     1522 |      1537 |           1.0% |

For the high watermark, the methods disagree; we are either saving 1% or using
1% more. For the low watermark, both methods agree that we are using about 2%
more. I plan to investigate whether the proposed folio allocation order policy
can reduce this to zero.

Thanks for making it this far!
Ryan


Ryan Roberts (17):
  mm: Expose clear_huge_page() unconditionally
  mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  mm: Introduce try_vma_alloc_movable_folio()
  mm: Implement folio_add_new_anon_rmap_range()
  mm: Routines to determine max anon folio allocation order
  mm: Allocate large folios for anonymous memory
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Implement folio_move_anon_rmap_range()
  mm: Update wp_page_reuse() to operate on range of pages
  mm: Reuse large folios for anonymous memory
  mm: Split __wp_page_copy_user() into 2 variants
  mm: ptep_clear_flush_range_notify() macro for batch operation
  mm: Implement folio_remove_rmap_range()
  mm: Copy large folios for anonymous memory
  mm: Convert zero page to large folios on write
  mm: mmap: Align unhinted maps to highest anon folio order
  mm: Batch-zap large anonymous folio PTE mappings

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 +-
 include/linux/mm.h              |   8 +-
 include/linux/mmu_notifier.h    |  31 ++
 include/linux/rmap.h            |   6 +
 mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
 mm/mmap.c                       |   4 +-
 mm/rmap.c                       | 147 +++++-
 14 files changed, 1000 insertions(+), 133 deletions(-)

--
2.25.1

Comments

Yin Fengwei April 17, 2023, 8:04 a.m. UTC | #1
On 4/14/2023 9:02 PM, Ryan Roberts wrote:
> [...]
> Deferred Split Queue Lock Contention
> ------------------------------------
> 
> The results below show that we are spending a much greater proportion of time in
> the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
> 
> I think this is (at least partially) related to contention on the deferred
> split queue lock. This is a per-memcg spinlock, which means a single spinlock
> shared among all 160 CPUs. I've solved part of the problem with the last patch
> in the series (which cuts down the need to take the lock), but at folio free
> time (free_transhuge_page()), the lock is still taken and I think this could be
> a problem. Now that most anonymous pages are large folios, this lock is taken a
> lot more.
> 
> I think we could probably avoid taking the lock unless !list_empty(), but I
> haven't convinced myself it's definitely safe, so haven't applied it yet.
Yes. It's safe. We also identified other lock contention with large folios
for anonymous mappings, such as the lru lock and the zone lock. My understanding
is that anonymous pages have a much higher alloc/free frequency than page cache
pages.

So the lock contention was not exposed by large folios for the page cache.


I posted the related patch to:
https://lore.kernel.org/linux-mm/20230417075643.3287513-1-fengwei.yin@intel.com/T/#t


Regards
Yin, Fengwei


Yin Fengwei April 17, 2023, 8:19 a.m. UTC | #2
On 4/14/2023 9:02 PM, Ryan Roberts wrote:
> [...]
> For the high watermark, the methods disagree; we are either saving 1% or using
> 1% more. For the low watermark, both methods agree that we are using about 2%
> more. I plan to investigate whether the proposed folio allocation order policy
> can reduce this to zero.

Besides memory consumption, large folios could also impact the tail latency of
page allocation (extra memory zeroing, more operations on the page allocation
slow path).

Again, my understanding is that page allocation tail latency matters more for
anonymous pages than for page cache pages, because anonymous pages are
allocated/freed more frequently.


Regards
Yin, Fengwei

Ryan Roberts April 17, 2023, 10:19 a.m. UTC | #3
On 17/04/2023 09:04, Yin, Fengwei wrote:
> On 4/14/2023 9:02 PM, Ryan Roberts wrote:
>> [...]
>> I think we could probably avoid taking the lock unless !list_empty(), but I
>> haven't convinced myself it's definitely safe, so haven't applied it yet.
> Yes. It's safe. We also identified other lock contention with large folios
> for anonymous mappings, such as the lru lock and the zone lock. My understanding
> is that anonymous pages have a much higher alloc/free frequency than page cache
> pages.
> 
> So the lock contention was not exposed by large folios for the page cache.
> 
> 
> I posted the related patch to:
> https://lore.kernel.org/linux-mm/20230417075643.3287513-1-fengwei.yin@intel.com/T/#t

Thanks for posting! I added these patches on top of mine and reran my perf test
for -j160. Results as follows:

|           | baseline time  | anonfolio time | percent change |
|           | to compile (s) | to compile (s) | SMALLER=better |
|-----------|---------------:|---------------:|---------------:|
| real-time |           53.7 |           49.1 |          -8.4% |
| user-time |         2705.8 |         2939.1 |           8.6% |
| sys-time  |         1370.4 |          993.4 |         -27.5% |

So compared with the original -j160 table in the cover letter, this has
definitely reduced the time spent in the kernel, but time in user space has
increased, which is unexpected! I think this is something I will have to dig
into more deeply, but I don't have access to the HW for the next couple of
weeks, so will come back to it then.



Ryan Roberts April 17, 2023, 10:28 a.m. UTC | #4
On 17/04/2023 09:19, Yin, Fengwei wrote:
> 
> 
> On 4/14/2023 9:02 PM, Ryan Roberts wrote:
>> Hi All,
>>
>> This is a second RFC and my first proper attempt at implementing variable order,
>> large folios for anonymous memory. The first RFC [1], was a partial
>> implementation and a plea for help in debugging an issue I was hitting; thanks
>> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>>
>> The objective of variable order anonymous folios is to improve performance by
>> allocating larger chunks of memory during anonymous page faults:
>>
>>  - Since SW (the kernel) is dealing with larger chunks of memory than base
>>    pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>    and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>    overhead. This should benefit all architectures.
>>  - Since we are now mapping physically contiguous chunks of memory, we can take
>>    advantage of HW TLB compression techniques. A reduction in TLB pressure
>>    speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>    TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>>
>> This patch set deals with the SW side of things only but sets us up nicely for
>> taking advantage of the HW improvements in the near future.
>>
>> I'm not yet benchmarking a wide variety of use cases, but those that I have
>> looked at are positive; I see kernel compilation time improved by up to 10%,
>> which I expect to improve further once I add in the arm64 "contiguous bit".
>> Memory consumption is somewhere between 1% less and 2% more, depending on how
>> its measured. More on perf and memory below.
>>
>> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
>> conflict resolution). I have a tree at [4].
>>
>> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
>> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>>
>> Approach
>> ========
>>
>> There are 4 fault paths that have been modified:
>>  - write fault on unallocated address: do_anonymous_page()
>>  - write fault on zero page: wp_page_copy()
>>  - write fault on non-exclusive CoW page: wp_page_copy()
>>  - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>>
>> In the first 2 cases, we will determine the preferred order folio to allocate,
>> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
>> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
>> folio as the source, subject to constraints that may arise if the source has
>> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
>> the folio as we can, subject to the same mremap/munmap constraints.
>>
>> If allocation of our preferred folio order fails, we gracefully fall back to
>> lower orders all the way to 0.
>>
>> Note that none of this affects the behavior of traditional PMD-sized THP. If we
>> take a fault in an MADV_HUGEPAGE region, you still get PMD-sized mappings.
>>
>> Open Questions
>> ==============
>>
>> How to Move Forwards
>> --------------------
>>
>> While the series is a small-ish code change, it represents a big shift in the
>> way things are done. So I'd appreciate any help in scaling up performance
>> testing, review and general advice on how best to guide a change like this into
>> the kernel.
>>
>> Folio Allocation Order Policy
>> -----------------------------
>>
>> The current code is hardcoded to use a maximum order of 4. This was chosen for a
>> couple of reasons:
>>  - From the SW performance perspective, I see a knee around here where
>>    increasing it doesn't lead to much more performance gain.
>>  - Intuitively I assume that higher orders become increasingly difficult to
>>    allocate.
>>  - From the HW performance perspective, arm64's HPA works on order-2 blocks and
>>    "the contiguous bit" works on order-4 for 4KB base pages (although it's
>>    order-7 for 16KB and order-5 for 64KB), so there is no HW benefit to going
>>    any higher.
>>
>> I suggest that ultimately setting the max order should be left to the
>> architecture. arm64 would take advantage of this and set it to the order
>> required for the contiguous bit for the configured base page size.
>>
>> However, I also have a (mild) concern about increased memory consumption. If an
>> app has a pathological fault pattern (e.g. sparsely touches memory every 64KB)
>> we would end up allocating 16x as much memory as we used to. One potential
>> approach I see here is to track fault addresses per-VMA, and increase a per-VMA
>> max allocation order for consecutive faults that extend a contiguous range, and
>> decrement when discontiguous. Alternatively/additionally, we could use the VMA
>> size as an indicator. I'd be interested in your thoughts/opinions.
>>
>> Deferred Split Queue Lock Contention
>> ------------------------------------
>>
>> The results below show that we are spending a much greater proportion of time in
>> the kernel when doing a kernel compile using 160 CPUs vs 8 CPUs.
>>
>> I think this is (at least partially) related to contention on the deferred
>> split queue lock. This is a per-memcg spinlock, which means a single spinlock
>> shared among all 160 CPUs. I've solved part of the problem with the last patch
>> in the series (which cuts down the need to take the lock), but at folio free
>> time (free_transhuge_page()), the lock is still taken and I think this could be
>> a problem. Now that most anonymous pages are large folios, this lock is taken a
>> lot more.
>>
>> I think we could probably avoid taking the lock unless !list_empty(), but I
>> haven't convinced myself it's definitely safe, so haven't applied it yet.
>>
>> Roadmap
>> =======
>>
>> Beyond scaling up perf testing, I'm planning to enable use of the "contiguous
>> bit" on arm64 to validate predictions about HW speedups.
>>
>> I also think there are some opportunities with madvise to split folios to non-0
>> orders, which might improve performance in some cases. madvise is also mistaking
>> exclusive large folios for non-exclusive ones at the moment (due to the "small
>> pages" mapcount scheme), so that needs to be fixed so that MADV_FREE correctly
>> frees the folio.
>>
>> Results
>> =======
>>
>> Performance
>> -----------
>>
>> Test: Kernel Compilation, on Ampere Altra (160 CPU machine), with 8 jobs and
>> with 160 jobs. First run discarded, next 3 runs averaged. Git repo cleaned
>> before each run.
>>
>> make defconfig && time make -jN Image
>>
>> First with -j8:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |          373.0 |          342.8 |          -8.1% |
>> | user-time |         2333.9 |         2275.3 |          -2.5% |
>> | sys-time  |          510.7 |          340.9 |         -33.3% |
>>
>> Above shows 8.1% improvement in real time execution, and 33.3% saving in kernel
>> execution. The next 2 tables show a breakdown of the cycles spent in the kernel
>> for the 8 job config:
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | data abort           |     683B |      316B |         -53.8% |
>> | instruction abort    |      93B |       76B |         -18.4% |
>> | syscall              |     887B |      767B |         -13.6% |
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (cycles) | (cycles)  | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | arm64_sys_openat     |     194B |      188B |          -3.3% |
>> | arm64_sys_exit_group |     192B |      124B |         -35.7% |
>> | arm64_sys_read       |     124B |      108B |         -12.7% |
>> | arm64_sys_execve     |      75B |       67B |         -11.0% |
>> | arm64_sys_mmap       |      51B |       50B |          -3.0% |
>> | arm64_sys_mprotect   |      15B |       13B |         -12.0% |
>> | arm64_sys_write      |      43B |       42B |          -2.9% |
>> | arm64_sys_munmap     |      15B |       12B |         -17.0% |
>> | arm64_sys_newfstatat |      46B |       41B |          -9.7% |
>> | arm64_sys_clone      |      26B |       24B |         -10.0% |
>>
>> And now with -j160:
>>
>> |           | baseline time  | anonfolio time | percent change |
>> |           | to compile (s) | to compile (s) | SMALLER=better |
>> |-----------|---------------:|---------------:|---------------:|
>> | real-time |           53.7 |           48.2 |         -10.2% |
>> | user-time |         2705.8 |         2842.1 |           5.0% |
>> | sys-time  |         1370.4 |         1064.3 |         -22.3% |
>>
>> Above shows a 10.2% improvement in real time execution. But ~3x more time is
>> spent in the kernel than for the -j8 config. I think this is related to the lock
>> contention issue I highlighted above, but haven't bottomed it out yet. It's also
>> not yet clear to me why user-time increases by 5%.
>>
>> I've also run all the will-it-scale microbenchmarks for a single task, using the
>> process mode. Results for multiple runs on the same kernel are noisy - I see ~5%
>> fluctuation. So I'm just calling out tests whose results show more than a 5%
>> improvement or more than a 5% regression. Results are the average of 3 runs.
>> Only 2 tests regressed:
>>
>> | benchmark            | baseline | anonfolio | percent change |
>> |                      | ops/s    | ops/s     | BIGGER=better  |
>> | ---------------------|---------:|----------:|---------------:|
>> | context_switch1.csv  |   328744 |    351150 |          6.8%  |
>> | malloc1.csv          |    96214 |     50890 |        -47.1%  |
>> | mmap1.csv            |   410253 |    375746 |         -8.4%  |
>> | page_fault1.csv      |   624061 |   3185678 |        410.5%  |
>> | page_fault2.csv      |   416483 |    557448 |         33.8%  |
>> | page_fault3.csv      |   724566 |   1152726 |         59.1%  |
>> | read1.csv            |  1806908 |   1905752 |          5.5%  |
>> | read2.csv            |   587722 |   1942062 |        230.4%  |
>> | tlb_flush1.csv       |   143910 |    152097 |          5.7%  |
>> | tlb_flush2.csv       |   266763 |    322320 |         20.8%  |
>>
>> I believe malloc1 is an unrealistic test, since it does malloc/free of a 128M
>> object in a loop and never touches the allocated memory. I think the malloc
>> implementation is maintaining a header just before the allocated object, which
>> causes a single page fault. Previously that page fault allocated 1 page. Now it
>> is allocating 16 pages. This cost would be repaid if the test code wrote to the
>> allocated object. Alternatively the folio allocation order policy described
>> above would also solve this.
>>
>> It is not clear to me why mmap1 has slowed down. This remains a todo.
>>
>> Memory
>> ------
>>
>> I measured memory consumption while doing a kernel compile with 8 jobs on a
>> system limited to 4GB RAM. I polled /proc/meminfo every 0.5 seconds during the
>> workload, then calculated "memory used" high and low watermarks using both
>> MemFree and MemAvailable. If there is a better way of measuring system memory
>> consumption, please let me know!
>>
>> mem-used = 4GB - /proc/meminfo:MemFree
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (MB)     | (MB)      | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      825 |       842 |           2.1% |
>> | mem-used-high        |     2697 |      2672 |          -0.9% |
>>
>> mem-used = 4GB - /proc/meminfo:MemAvailable
>>
>> |                      | baseline | anonfolio | percent change |
>> |                      | (MB)     | (MB)      | SMALLER=better |
>> |----------------------|---------:|----------:|---------------:|
>> | mem-used-low         |      518 |       530 |           2.3% |
>> | mem-used-high        |     1522 |      1537 |           1.0% |
>>
>> For the high watermark, the methods disagree; we are either saving 1% or using
>> 1% more. For the low watermark, both methods agree that we are using about 2%
>> more. I plan to investigate whether the proposed folio allocation order policy
>> can reduce this to zero.
> 
> Besides the memory consumption, the large folio could impact the tail latency
> of page allocation also (extra zeroing memory, more operations on slow path of
> page allocation).

I agree; this series could be thought of as trading latency for throughput.
There are a couple of mitigations to try to limit the extra latency; on the
Altra at least, we are taking advantage of the CPU's streaming mode for the
memory zeroing - it's >2x faster to zero a block of 64K than it is to zero 16x
4K. And currently I'm using __GFP_NORETRY for high-order folio allocations,
which I understood to mean we shouldn't suffer high allocation latency due to
reclaim, etc. Although, I know we are discussing whether this is correct in the
thread for patch 3.
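
For concreteness, the fallback I'm describing looks roughly like the sketch
below. This is not the code from the series (the helper name is made up), but
vma_alloc_folio(), GFP_HIGHUSER_MOVABLE and __GFP_NORETRY are the real kernel
interfaces:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Illustrative sketch only: try to allocate an anonymous folio at the
 * preferred order, falling back towards order-0 if high-order
 * allocation fails. __GFP_NORETRY is used for the higher orders so we
 * don't stall in reclaim/compaction; the final order-0 attempt keeps
 * the historical allocation behaviour.
 */
static struct folio *alloc_anon_folio_fallback(struct vm_area_struct *vma,
                                               unsigned long addr, int order)
{
        struct folio *folio;

        for (; order > 0; order--) {
                folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_NORETRY,
                                        order, vma, addr, false);
                if (folio)
                        return folio;
        }

        return vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr, false);
}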

> 
> Again, my understanding is the tail latency of page allocation is more important
> to anonymous page than page cache page because anonymous page is allocated/freed
> more frequently.

I had assumed that any serious application that needs to guarantee no latency
due to page faults would preallocate and lock all memory during init? Perhaps
that's wishful thinking?
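
As a hedged userspace sketch of what I mean by "preallocate and lock" at init
(standard POSIX calls; the size and helper name are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Illustrative only: fault in and pin the working set up front so the
 * latency-critical path never takes an anonymous page fault (and hence
 * never waits on folio allocation or zeroing).
 */
static void *prealloc_pinned(size_t size)
{
        void *buf = aligned_alloc(4096, size);

        if (!buf)
                return NULL;
        memset(buf, 0, size);                   /* touch every page now */
        if (mlockall(MCL_CURRENT | MCL_FUTURE)) /* keep it all resident */
                perror("mlockall");
        return buf;
}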

> 
> 
> Regards
> Yin, Fengwei
> 
>>
>> Thanks for making it this far!
>> Ryan
>>
>>
>> Ryan Roberts (17):
>>   mm: Expose clear_huge_page() unconditionally
>>   mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
>>   mm: Introduce try_vma_alloc_movable_folio()
>>   mm: Implement folio_add_new_anon_rmap_range()
>>   mm: Routines to determine max anon folio allocation order
>>   mm: Allocate large folios for anonymous memory
>>   mm: Allow deferred splitting of arbitrary large anon folios
>>   mm: Implement folio_move_anon_rmap_range()
>>   mm: Update wp_page_reuse() to operate on range of pages
>>   mm: Reuse large folios for anonymous memory
>>   mm: Split __wp_page_copy_user() into 2 variants
>>   mm: ptep_clear_flush_range_notify() macro for batch operation
>>   mm: Implement folio_remove_rmap_range()
>>   mm: Copy large folios for anonymous memory
>>   mm: Convert zero page to large folios on write
>>   mm: mmap: Align unhinted maps to highest anon folio order
>>   mm: Batch-zap large anonymous folio PTE mappings
>>
>>  arch/alpha/include/asm/page.h   |   5 +-
>>  arch/arm64/include/asm/page.h   |   3 +-
>>  arch/arm64/mm/fault.c           |   7 +-
>>  arch/ia64/include/asm/page.h    |   5 +-
>>  arch/m68k/include/asm/page_no.h |   7 +-
>>  arch/s390/include/asm/page.h    |   5 +-
>>  arch/x86/include/asm/page.h     |   5 +-
>>  include/linux/highmem.h         |  23 +-
>>  include/linux/mm.h              |   8 +-
>>  include/linux/mmu_notifier.h    |  31 ++
>>  include/linux/rmap.h            |   6 +
>>  mm/memory.c                     | 877 ++++++++++++++++++++++++++++----
>>  mm/mmap.c                       |   4 +-
>>  mm/rmap.c                       | 147 +++++-
>>  14 files changed, 1000 insertions(+), 133 deletions(-)
>>
>> --
>> 2.25.1
>>
David Hildenbrand April 17, 2023, 10:54 a.m. UTC | #5
On 14.04.23 15:02, Ryan Roberts wrote:
> Hi All,
> 
> This is a second RFC and my first proper attempt at implementing variable order,
> large folios for anonymous memory. The first RFC [1], was a partial
> implementation and a plea for help in debugging an issue I was hitting; thanks
> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
> 
> The objective of variable order anonymous folios is to improve performance by
> allocating larger chunks of memory during anonymous page faults:
> 
>   - Since SW (the kernel) is dealing with larger chunks of memory than base
>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>     and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>     overhead. This should benefit all architectures.
>   - Since we are now mapping physically contiguous chunks of memory, we can take
>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
> 
> This patch set deals with the SW side of things only but sets us up nicely for
> taking advantage of the HW improvements in the near future.
> 
> I'm not yet benchmarking a wide variety of use cases, but those that I have
> looked at are positive; I see kernel compilation time improved by up to 10%,
> which I expect to improve further once I add in the arm64 "contiguous bit".
> Memory consumption is somewhere between 1% less and 2% more, depending on how
> its measured. More on perf and memory below.
> 
> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
> conflict resolution). I have a tree at [4].
> 
> [1] https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
> [3] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
> [4] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
> 
> Approach
> ========
> 
> There are 4 fault paths that have been modified:
>   - write fault on unallocated address: do_anonymous_page()
>   - write fault on zero page: wp_page_copy()
>   - write fault on non-exclusive CoW page: wp_page_copy()
>   - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
> 
> In the first 2 cases, we will determine the preferred order folio to allocate,
> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
> folio as the source, subject to constraints that may arise if the source has
> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
> the folio as we can, subject to the same mremap/munmap constraints.

Just a note (that you maybe already know) that we have to be a bit 
careful in the wp_copy path with replacing sub-pages that are marked 
exclusive.

Currently, we always only replace a single shared anon (sub)page by a 
fresh exclusive base-page during a write-fault/unsharing. As the 
sub-page is already marked "maybe shared", it cannot get pinned 
concurrently and everybody is happy.

If you now decide to replace more subpages, you have to be careful that 
none of them are still exclusive -- because they could get pinned 
concurrently and replacing them would result in memory corruptions.

There are scenarios (most prominently MADV_WIPEONFORK, but also a failed 
partial fork()) that could result in something like that.

Further, we have to be a bit careful regarding replacing ranges that are 
backed by different anon pages (for example, due to fork() deciding to 
copy some sub-pages of a PTE-mapped folio instead of sharing all sub-pages).


So what should be safe is replacing all sub-pages of a folio that are 
marked "maybe shared" by a new folio under PT lock. However, I wonder if 
it's really worth the complexity. For THP we were happy so far to *not* 
optimize this, implying that maybe we shouldn't worry about optimizing 
the fork() case for now that heavily.


One optimization one could think of instead (that I raised previously 
in another context) is the detection of exclusivity after fork()+exit in 
the child (IOW, only the parent continues to exist). Once 
PG_anon_exclusive was cleared for all sub-pages of the THP-mapped folio 
during fork(), we'd always decide to copy instead of reuse (because 
page_count() > 1, as the folio is PTE mapped). Scanning the surrounding 
page table if it makes sense (e.g., page_count() <= folio_nr_pages()), 
to test if all page references are from the current process would allow 
for reusing the folio (setting PG_anon_exclusive) for the sub-pages. The 
smaller the folio order, the cheaper this "scan surrounding PTEs" scan 
is. For THP, which are usually PMD-mapped even after fork()+exit, we 
didn't add this optimization.
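
A heavily simplified sketch of that reuse test (the helper name is made up, and
it ignores extra references such as the fault handler's own folio_get() and any
swapcache reference, which real code would have to account for):

#include <linux/mm.h>

/*
 * Illustrative only: after scanning the nr_ptes PTEs that map this
 * folio in the current process, decide whether every folio reference is
 * accounted for by those mappings; if so, the folio could be reused and
 * PG_anon_exclusive set again on its subpages.
 */
static bool can_reuse_anon_folio(struct folio *folio, int nr_ptes)
{
        /* Scanning can only succeed if the refcount is small enough. */
        if (folio_ref_count(folio) > folio_nr_pages(folio))
                return false;

        return folio_ref_count(folio) == nr_ptes;
}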
Ryan Roberts April 17, 2023, 11:43 a.m. UTC | #6
On 17/04/2023 11:54, David Hildenbrand wrote:
> On 14.04.23 15:02, Ryan Roberts wrote:
>> Hi All,
>>
>> This is a second RFC and my first proper attempt at implementing variable order,
>> large folios for anonymous memory. The first RFC [1], was a partial
>> implementation and a plea for help in debugging an issue I was hitting; thanks
>> to Yin Fengwei and Matthew Wilcox for their advice in solving that!
>>
>> The objective of variable order anonymous folios is to improve performance by
>> allocating larger chunks of memory during anonymous page faults:
>>
>>   - Since SW (the kernel) is dealing with larger chunks of memory than base
>>     pages, there are efficiency savings to be had; fewer page faults, batched PTE
>>     and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
>>     overhead. This should benefit all architectures.
>>   - Since we are now mapping physically contiguous chunks of memory, we can take
>>     advantage of HW TLB compression techniques. A reduction in TLB pressure
>>     speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
>>     TLB entries; "the contiguous bit" (architectural) and HPA (uarch) - see [2].
>>
>> This patch set deals with the SW side of things only but sets us up nicely for
>> taking advantage of the HW improvements in the near future.
>>
>> I'm not yet benchmarking a wide variety of use cases, but those that I have
>> looked at are positive; I see kernel compilation time improved by up to 10%,
>> which I expect to improve further once I add in the arm64 "contiguous bit".
>> Memory consumption is somewhere between 1% less and 2% more, depending on how
>> its measured. More on perf and memory below.
>>
>> The patches are based on v6.3-rc6 + patches 1-31 of [3] (which needed one minor
>> conflict resolution). I have a tree at [4].
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20230317105802.2634004-1-ryan.roberts@arm.com/
>> [2]
>> https://lore.kernel.org/linux-mm/d347c5b0-0c0f-ae50-9613-2cf962d8676e@arm.com/
>> [3]
>> https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
>> [4]
>> https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anon_folio-lkml-rfc2
>>
>> Approach
>> ========
>>
>> There are 4 fault paths that have been modified:
>>   - write fault on unallocated address: do_anonymous_page()
>>   - write fault on zero page: wp_page_copy()
>>   - write fault on non-exclusive CoW page: wp_page_copy()
>>   - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
>>
>> In the first 2 cases, we will determine the preferred order folio to allocate,
>> limited by a max order (currently order-4; see below), VMA and PMD bounds, and
>> state of neighboring PTEs. In the 3rd case, we aim to allocate the same order
>> folio as the source, subject to constraints that may arise if the source has
>> been mremapped or partially munmapped. And in the 4th case, we reuse as much of
>> the folio as we can, subject to the same mremap/munmap constraints.
> 
> Just a note (that you maybe already know) that we have to be a bit careful in
> the wp_copy path with replacing sub-pages that are marked exclusive.

Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I
think I have a bug.

(I'm guessing the GUP fast path assumes that if it sees an exclusive page then
that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
But surely user space can munmap it at any time and the consequences are
similar? It's probably clear that I don't know much about the GUP implementation
details...)

My current patch always prefers reuse over copy, and expands the size of the
reuse to the biggest set of pages that are all exclusive (determined either by
the presence of the anon_exclusive flag or from the refcount), and covered by
the same folio (and a few other bounds constraints - see
calc_anon_folio_range_reuse()).

If I determine I must copy (because the "anchor" page is not exclusive), then I
determine the size of the copy region based on a few constraints (see
calc_anon_folio_order_copy()). But I think you are saying that no pages in that
region are allowed to have the anon_exclusive flag set? In which case, this
could be fixed by adding that check in the function.

> 
> Currently, we always only replace a single shared anon (sub)page by a fresh
> exclusive base-page during a write-fault/unsharing. As the sub-page is already
> marked "maybe shared", it cannot get pinned concurrently and everybody is happy.

When you say, "maybe shared" is that determined by the absence of the
"exclusive" flag?

> 
> If you now decide to replace more subpages, you have to be careful that none of
> them are still exclusive -- because they could get pinned concurrently and
> replacing them would result in memory corruptions.
> 
> There are scenarios (most prominently: MADV_WIPEONFORK), but also failed partial
> fork() that could result in something like that.

Are there any test cases that stress the kernel in this area that I could use to
validate my fix?

> 
> Further, we have to be a bit careful regarding replacing ranges that are backed
> by different anon pages (for example, due to fork() deciding to copy some
> sub-pages of a PTE-mapped folio instead of sharing all sub-pages).

I don't understand this statement; do you mean "different anon _folios_"? I am
scanning the page table to expand the region that I reuse/copy and as part of
that scan, make sure that I only cover a single folio. So I think I conform here
- the scan would give up once it gets to the hole.

> 
> 
> So what should be safe is replacing all sub-pages of a folio that are marked
> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
> worth the complexity. For THP we were happy so far to *not* optimize this,
> implying that maybe we shouldn't worry about optimizing the fork() case for now
> that heavily.

I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
large copies was contributing a measurable amount to the performance
improvement. (Certainly, the zero-page copy case is definitely a big
contributor). I don't have access to the HW at the moment but can rerun later
with and without to double check.

> 
> 
> One optimization once could think of instead (that I raised previously in other
> context) is the detection of exclusivity after fork()+exit in the child (IOW,
> only the parent continues to exist). Once PG_anon_exclusive was cleared for all
> sub-pages of the THP-mapped folio during fork(), we'd always decide to copy
> instead of reuse (because page_count() > 1, as the folio is PTE mapped).
> Scanning the surrounding page table if it makes sense (e.g., page_count() <=
> folio_nr_pages()), to test if all page references are from the current process
> would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages.
> The smaller the folio order, the cheaper this "scan surrounding PTEs" scan is.
> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this
> optimization.

Yes, I have already implemented this in my series; see patch 10.

Thanks,
Ryan
David Hildenbrand April 17, 2023, 2:05 p.m. UTC | #7
[...]

>> Just a note (that you maybe already know) that we have to be a bit careful in
>> the wp_copy path with replacing sub-pages that are marked exclusive.
> 
> Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I
> think I have a bug.
> 
> (I'm guessing the GUP fast path assumes that if it sees an exclusive page then
> that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
> But surely user space can munmap it at any time and the consequences are
> similar? It's probably clear that I don't know much about the GUP implementation
> details...)

If GUP finds a read-only PTE pointing at an exclusive subpage, it 
assumes that this page cannot randomly be replaced by core MM due to 
COW. See gup_must_unshare(). So it can go ahead and pin the page. As 
long as user space doesn't do something stupid with the mapping 
(MADV_DONTNEED, munmap()) the pinned page must correspond to the mapped 
page.

If GUP finds a writeable PTE, it assumes that this page cannot randomly 
be replaced by core MM due to COW -- because writable implies exclusive. 
See, for example the VM_BUG_ON_PAGE() in follow_page_pte(). So, 
similarly, GUP can simply go ahead and pin the page.

GUP-fast runs lockless, not even taking the PT locks. It syncs against 
concurrent fork() using a special seqlock, and essentially unpins 
whatever it temporarily pinned when it detects that fork() was running 
concurrently. But it might result in some pages temporarily being 
flagged as "maybe pinned".

In other cases (!fork()), GUP-fast synchronizes against concurrent 
sharing (KSM) or unmapping (migration, swapout) that implies clearing of 
the PG_anon_flag of the subpage by first unmapping the PTE and 
conditionally remapping it. See mm/ksm.c:write_protect_page() as an 
example for the sharing side (especially: if page_try_share_anon_rmap() 
fails because the page may be pinned).

Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is 
easy. Replacing an exclusive PTE (including writable PTEs) requires some 
work to sync with GUP-fast and falls rather into the "maybe just don't 
bother" territory.

> 
> My current patch always prefers reuse over copy, and expands the size of the
> reuse to the biggest set of pages that are all exclusive (determined either by
> the presence of the anon_exclusive flag or from the refcount), and covered by
> the same folio (and a few other bounds constraints - see
> calc_anon_folio_range_reuse()).
> 
> If I determine I must copy (because the "anchor" page is not exclusive), then I
> determine the size of the copy region based on a few constraints (see
> calc_anon_folio_order_copy()). But I think you are saying that no pages in that
> region are allowed to have the anon_exclusive flag set? In which case, this
> could be fixed by adding that check in the function.

Yes, changing a PTE that points at an anonymous subpage that has the 
"exclusive" flag set requires more care.

> 
>>
>> Currently, we always only replace a single shared anon (sub)page by a fresh
>> exclusive base-page during a write-fault/unsharing. As the sub-page is already
>> marked "maybe shared", it cannot get pinned concurrently and everybody is happy.
> 
> When you say, "maybe shared" is that determined by the absence of the
> "exclusive" flag?

Yes. Semantics of PG_anon_exclusive are "exclusive" vs. "maybe shared". 
Once "maybe shared", we must only go back to "exclusive (set the flag) 
if we are sure that there are no other references to the page.

> 
>>
>> If you now decide to replace more subpages, you have to be careful that none of
>> them are still exclusive -- because they could get pinned concurrently and
>> replacing them would result in memory corruptions.
>>
>> There are scenarios (most prominently: MADV_WIPEONFORK), but also failed partial
>> fork() that could result in something like that.
> 
> Are there any test cases that stress the kernel in this area that I could use to
> validate my fix?

tools/testing/selftests/mm/cow.c does excessive tests (including some 
MADV_DONTFORK -- that's what I actually meant --  and partial mremap 
tests), but mostly focuses on ordinary base pages (order-0), THP, and 
hugetlb.

We don't have any "GUP-fast racing with fork()" tests or similar yet 
(tests that rely on races are not good candidates for selftests).

We might want to extend tools/testing/selftests/mm/cow.c to test for 
some of the cases you extend.

We may also change the detection of THP (I think, by luck, it would 
currently also test your patches to some degree, given the way it tests 
for THP)

if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
	ksft_test_result_skip("Did not get a THP populated\n");
	goto munmap;
}

Would have to be, for example,

if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) {
	ksft_test_result_skip("Did not get a THP populated\n");
	goto munmap;
}

Because we touch the first PTE in a PMD and want to test if core-mm gave 
us a full THP (last PTE also populated).


Extending the tests to cover other anon THP sizes could work by aligning 
a VMA to THP/2 size (so we're sure we don't get a full THP), and then 
testing if we get more PTEs populated -> your code active.
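
A rough sketch of that setup (this is not existing cow.c code; the helper is 
made up and the error handling is minimal):

#include <stdint.h>
#include <sys/mman.h>

/*
 * Illustrative only: reserve enough VA, then pick an address aligned to
 * thpsize/2 but deliberately not to thpsize, so core-mm cannot place a
 * full PMD-sized THP there and any extra populated PTEs must come from
 * the large anon folio code.
 */
static char *mmap_half_thp_aligned(size_t thpsize)
{
        size_t half = thpsize / 2;
        uintptr_t addr;
        char *mem;

        mem = mmap(NULL, thpsize + half, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED)
                return NULL;

        addr = ((uintptr_t)mem + half - 1) & ~(half - 1);
        if (!(addr & (thpsize - 1)))
                addr += half;           /* avoid full-THP alignment */
        return (char *)addr;
}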

> 
>>
>> Further, we have to be a bit careful regarding replacing ranges that are backed
>> by different anon pages (for example, due to fork() deciding to copy some
>> sub-pages of a PTE-mapped folio instead of sharing all sub-pages).
> 
> I don't understand this statement; do you mean "different anon _folios_"? I am
> scanning the page table to expand the region that I reuse/copy and as part of
> that scan, make sure that I only cover a single folio. So I think I conform here
> - the scan would give up once it gets to the hole.

During fork(), what could happen (temporary detection of pinned page 
resulting in a copy) is something weird like:

PTE 0: subpage0 of anon page #1 (maybe shared)
PTE 1: subpage1 of anon page #1 (maybe shared
PTE 2: anon page #2 (exclusive)
PTE 3: subpage2 of anon page #1 (maybe shared

Of course, any combination of above.

Further, with mremap() we might get completely crazy layouts, randomly 
mapping sub-pages of anon pages, mixed with other sub-pages or base-page 
folios.

Maybe it's all handled already by your code, just pointing out which 
kind of mess we might get :)

> 
>>
>>
>> So what should be safe is replacing all sub-pages of a folio that are marked
>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>> worth the complexity. For THP we were happy so far to *not* optimize this,
>> implying that maybe we shouldn't worry about optimizing the fork() case for now
>> that heavily.
> 
> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
> large copies was contributing a measurable amount to the performance
> improvement. (Certainly, the zero-page copy case, is definitely a big
> contributer). I don't have access to the HW at the moment but can rerun later
> with and without to double check.

In which test exactly? Some micro-benchmark?

> 
>>
>>
>> One optimization once could think of instead (that I raised previously in other
>> context) is the detection of exclusivity after fork()+exit in the child (IOW,
>> only the parent continues to exist). Once PG_anon_exclusive was cleared for all
>> sub-pages of the THP-mapped folio during fork(), we'd always decide to copy
>> instead of reuse (because page_count() > 1, as the folio is PTE mapped).
>> Scanning the surrounding page table if it makes sense (e.g., page_count() <=
>> folio_nr_pages()), to test if all page references are from the current process
>> would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages.
>> The smaller the folio order, the cheaper this "scan surrounding PTEs" scan is.
>> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this
>> optimization.
> 
> Yes, I have already implemented this in my series; see patch 10.

Oh, good! That's the most important part.
Ryan Roberts April 17, 2023, 3:38 p.m. UTC | #8
On 17/04/2023 15:05, David Hildenbrand wrote:
> [...]
> 
>>> Just a note (that you maybe already know) that we have to be a bit careful in
>>> the wp_copy path with replacing sub-pages that are marked exclusive.
>>
>> Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I
>> think I have a bug.
>>
>> (I'm guessing the GUP fast path assumes that if it sees an exclusive page then
>> that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
>> But surely user space can munmap it at any time and the consequences are
>> similar? It's probably clear that I don't know much about the GUP implementation
>> details...)
> 
> If GUP finds a read-only PTE pointing at an exclusive subpage, it assumes that
> this page cannot randomly be replaced by core MM due to COW. See
> gup_must_unshare(). So it can go ahead and pin the page. As long as user space
> doesn't do something stupid with the mapping (MADV_DONTNEED, munmap()) the
> pinned page must correspond to the mapped page.
> 
> If GUP finds a writeable PTE, it assumes that this page cannot randomly be
> replaced by core MM due to COW -- because writable implies exclusive. See, for
> example the VM_BUG_ON_PAGE() in follow_page_pte(). So, similarly, GUP can simply
> go ahead and pin the page.
> 
> GUP-fast runs lockless, not even taking the PT locks. It syncs against
> concurrent fork() using a special seqlock, and essentially unpins whatever it
> temporarily pinned when it detects that fork() was running concurrently. But it
> might result in some pages temporarily being flagged as "maybe pinned".
> 
> In other cases (!fork()), GUP-fast synchronizes against concurrent sharing (KSM)
> or unmapping (migration, swapout) that implies clearing of the PG_anon_flag of
> the subpage by first unmapping the PTE and conditionally remapping it. See
> mm/ksm.c:write_protect_page() as an example for the sharing side (especially: if
> page_try_share_anon_rmap() fails because the page may be pinned).
> 
> Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is easy.
> Replacing an exclusive PTE (including writable PTEs) requires some work to sync
> with GUP-fast and falls rather into the "maybe just don't bother" territory.

Yep agreed. I'll plan to fix this by adding the constraint that all pages of the
copy range (calc_anon_folio_order_copy()) must be "maybe shared".
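
i.e. something along these lines in the scan (a sketch of the extra constraint,
not the actual calc_anon_folio_order_copy() change; the helper name is made up):

#include <linux/page-flags.h>

/*
 * Illustrative only: when sizing the copy range, refuse to include any
 * subpage that is still marked exclusive, since replacing an exclusive
 * PTE can race with GUP-fast pinning that subpage.
 */
static bool anon_range_maybe_shared(struct page *first_page, int nr_pages)
{
        int i;

        for (i = 0; i < nr_pages; i++) {
                if (PageAnonExclusive(first_page + i))
                        return false;
        }

        return true;
}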

> 
>>
>> My current patch always prefers reuse over copy, and expands the size of the
>> reuse to the biggest set of pages that are all exclusive (determined either by
>> the presence of the anon_exclusive flag or from the refcount), and covered by
>> the same folio (and a few other bounds constraints - see
>> calc_anon_folio_range_reuse()).
>>
>> If I determine I must copy (because the "anchor" page is not exclusive), then I
>> determine the size of the copy region based on a few constraints (see
>> calc_anon_folio_order_copy()). But I think you are saying that no pages in that
>> region are allowed to have the anon_exclusive flag set? In which case, this
>> could be fixed by adding that check in the function.
> 
> Yes, changing a PTE that points at an anonymous subpage that has the "exclusive"
> flag set requires more care.
> 
>>
>>>
>>> Currently, we always only replace a single shared anon (sub)page by a fresh
>>> exclusive base-page during a write-fault/unsharing. As the sub-page is already
>>> marked "maybe shared", it cannot get pinned concurrently and everybody is happy.
>>
>> When you say, "maybe shared" is that determined by the absence of the
>> "exclusive" flag?
> 
> Yes. Semantics of PG_anon_exclusive are "exclusive" vs. "maybe shared". Once
> "maybe shared", we must only go back to "exclusive (set the flag) if we are sure
> that there are no other references to the page.
> 
>>
>>>
>>> If you now decide to replace more subpages, you have to be careful that none of
>>> them are still exclusive -- because they could get pinned concurrently and
>>> replacing them would result in memory corruptions.
>>>
>>> There are scenarios (most prominently: MADV_WIPEONFORK), but also failed partial
>>> fork() that could result in something like that.
>>
>> Are there any test cases that stress the kernel in this area that I could use to
>> validate my fix?
> 
> tools/testing/selftests/mm/cow.c does excessive tests (including some
> MADV_DONTFORK -- that's what I actually meant --  and partial mremap tests), but
> mostly focuses on ordinary base pages (order-0), THP, and hugetlb.
> 
> We don't have any "GUP-fast racing with fork()" tests or similar yet (tests that
> rely on races are not a good candidate for selftests).
> 
> We might want to extend tools/testing/selftests/mm/cow.c to test for some of the
> cases you extend.
> 
> We may also change the detection of THP (I think, by luck, it would currently
> also test your patches to some degree set the way it tests for THP)
> 
> if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) {
>     ksft_test_result_skip("Did not get a THP populated\n");
>     goto munmap;
> }
> 
> Would have to be, for example,
> 
> if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) {
>     ksft_test_result_skip("Did not get a THP populated\n");
>     goto munmap;
> }
> 
> Because we touch the first PTE in a PMD and want to test if core-mm gave us a
> full THP (last PTE also populated).
> 
> 
> Extending the tests to cover other anon THP sizes could work by aligning a VMA
> to THP/2 size (so we're sure we don't get a full THP), and then testing if we
> get more PTEs populated -> your code active.

Thanks. I'll run all these and make sure they pass and look at adding new
variants for the next rev.

> 
>>
>>>
>>> Further, we have to be a bit careful regarding replacing ranges that are backed
>>> by different anon pages (for example, due to fork() deciding to copy some
>>> sub-pages of a PTE-mapped folio instead of sharing all sub-pages).
>>
>> I don't understand this statement; do you mean "different anon _folios_"? I am
>> scanning the page table to expand the region that I reuse/copy and as part of
>> that scan, make sure that I only cover a single folio. So I think I conform here
>> - the scan would give up once it gets to the hole.
> 
> During fork(), what could happen (temporary detection of pinned page resulting
> in a copy) is something weird like:
> 
> PTE 0: subpage0 of anon page #1 (maybe shared)
> PTE 1: subpage1 of anon page #1 (maybe shared
> PTE 2: anon page #2 (exclusive)
> PTE 3: subpage2 of anon page #1 (maybe shared

Hmm... I can see how this could happen if you mremap PTE2 to PTE3, then mmap
something new in PTE2. But I don't see how it happens at fork. For PTE3, did you
mean subpage _3_?

> 
> Of course, any combination of above.
> 
> Further, with mremap() we might get completely crazy layouts, randomly mapping
> sub-pages of anon pages, mixed with other sub-pages or base-page folios.
> 
> Maybe it's all handled already by your code, just pointing out which kind of
> mess we might get :)

Yep, this is already handled; the scan to expand the range ensures that all the
PTEs map to the expected contiguous pages in the same folio.

> 
>>
>>>
>>>
>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>> implying that maybe we shouldn't worry about optimizing the fork() case for now
>>> that heavily.
>>
>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
>> large copies was contributing a measurable amount to the performance
>> improvement. (Certainly, the zero-page copy case, is definitely a big
>> contributer). I don't have access to the HW at the moment but can rerun later
>> with and without to double check.
> 
> In which test exactly? Some micro-benchmark?

The kernel compile benchmark that I quoted numbers for in the cover letter. I
have some trace points (not part of the submitted series) that tell me how many
mappings of each order we get for each code path. I'm pretty sure I remember all
of these 4 code paths contributing non-negligible amounts.

> 
>>
>>>
>>>
>>> One optimization once could think of instead (that I raised previously in other
>>> context) is the detection of exclusivity after fork()+exit in the child (IOW,
>>> only the parent continues to exist). Once PG_anon_exclusive was cleared for all
>>> sub-pages of the THP-mapped folio during fork(), we'd always decide to copy
>>> instead of reuse (because page_count() > 1, as the folio is PTE mapped).
>>> Scanning the surrounding page table if it makes sense (e.g., page_count() <=
>>> folio_nr_pages()), to test if all page references are from the current process
>>> would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages.
>>> The smaller the folio order, the cheaper this "scan surrounding PTEs" scan is.
>>> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this
>>> optimization.
>>
>> Yes, I have already implemented this in my series; see patch 10.
> 
> Oh, good! That's the most important part.
>
David Hildenbrand April 17, 2023, 3:44 p.m. UTC | #9
>>>
>>>>
>>>> Further, we have to be a bit careful regarding replacing ranges that are backed
>>>> by different anon pages (for example, due to fork() deciding to copy some
>>>> sub-pages of a PTE-mapped folio instead of sharing all sub-pages).
>>>
>>> I don't understand this statement; do you mean "different anon _folios_"? I am
>>> scanning the page table to expand the region that I reuse/copy and as part of
>>> that scan, make sure that I only cover a single folio. So I think I conform here
>>> - the scan would give up once it gets to the hole.
>>
>> During fork(), what could happen (temporary detection of pinned page resulting
>> in a copy) is something weird like:
>>
>> PTE 0: subpage0 of anon page #1 (maybe shared)
>> PTE 1: subpage1 of anon page #1 (maybe shared
>> PTE 2: anon page #2 (exclusive)
>> PTE 3: subpage2 of anon page #1 (maybe shared
> 
> Hmm... I can see how this could happen if you mremap PTE2 to PTE3, then mmap
> something new in PTE2. But I don't see how it happens at fork. For PTE3, did you
> mean subpage _3_?
>

Yes, fat fingers :) Thanks for paying attention!

The above could be optimized by processing all consecutive PTEs at once: 
meaning, we check whether the page may be pinned only once, and then 
either copy all PTEs or share all PTEs. It's unlikely to happen in 
practice, I guess, though.


>>
>> Of course, any combination of above.
>>
>> Further, with mremap() we might get completely crazy layouts, randomly mapping
>> sub-pages of anon pages, mixed with other sub-pages or base-page folios.
>>
>> Maybe it's all handled already by your code, just pointing out which kind of
>> mess we might get :)
> 
> Yep, this is already handled; the scan to expand the range ensures that all the
> PTEs map to the expected contiguous pages in the same folio.

Okay, great.

> 
>>
>>>
>>>>
>>>>
>>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>>> implying that maybe we shouldn't worry about optimizing the fork() case for now
>>>> that heavily.
>>>
>>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
>>> large copies was contributing a measurable amount to the performance
>>> improvement. (Certainly, the zero-page copy case, is definitely a big
>>> contributer). I don't have access to the HW at the moment but can rerun later
>>> with and without to double check.
>>
>> In which test exactly? Some micro-benchmark?
> 
> The kernel compile benchmark that I quoted numbers for in the cover letter. I
> have some trace points (not part of the submitted series) that tell me how many
> mappings of each order we get for each code path. I'm pretty sure I remember all
> of these 4 code paths contributing non-negligible amounts.

Interesting! It would be great to see if there is an actual difference 
after patch #10 was applied without the other COW replacement.
Ryan Roberts April 17, 2023, 4:15 p.m. UTC | #10
On 17/04/2023 16:44, David Hildenbrand wrote:
>>>>
>>>>>
>>>>>
>>>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>>>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>>>> implying that maybe we shouldn't worry about optimizing the fork() case for
>>>>> now
>>>>> that heavily.
>>>>
>>>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
>>>> large copies was contributing a measurable amount to the performance
>>>> improvement. (Certainly, the zero-page copy case, is definitely a big
>>>> contributer). I don't have access to the HW at the moment but can rerun later
>>>> with and without to double check.
>>>
>>> In which test exactly? Some micro-benchmark?
>>
>> The kernel compile benchmark that I quoted numbers for in the cover letter. I
>> have some trace points (not part of the submitted series) that tell me how many
>> mappings of each order we get for each code path. I'm pretty sure I remember all
>> of these 4 code paths contributing non-negligible amounts.
> 
> Interesting! It would be great to see if there is an actual difference after
> patch #10 was applied without the other COW replacement.
> 

I'll aim to get some formal numbers when I next have access to the HW.
Ryan Roberts April 19, 2023, 10:12 a.m. UTC | #11
Hi David,

On 17/04/2023 15:05, David Hildenbrand wrote:
> [...]
> 
>>> Just a note (that you maybe already know) that we have to be a bit careful in
>>> the wp_copy path with replacing sub-pages that are marked exclusive.
>>
>> Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I
>> think I have a bug.
>>
>> (I'm guessing the GUP fast path assumes that if it sees an exclusive page then
>> that page can't go away? And if I then wp_copy it, I'm breaking the assumption?
>> But surely user space can munmap it at any time and the consequences are
>> similar? It's probably clear that I don't know much about the GUP implementation
>> details...)
> 
> If GUP finds a read-only PTE pointing at an exclusive subpage, it assumes that
> this page cannot randomly be replaced by core MM due to COW. See
> gup_must_unshare(). So it can go ahead and pin the page. As long as user space
> doesn't do something stupid with the mapping (MADV_DONTNEED, munmap()) the
> pinned page must correspond to the mapped page.
> 
> If GUP finds a writeable PTE, it assumes that this page cannot randomly be
> replaced by core MM due to COW -- because writable implies exclusive. See, for
> example the VM_BUG_ON_PAGE() in follow_page_pte(). So, similarly, GUP can simply
> go ahead and pin the page.
> 
> GUP-fast runs lockless, not even taking the PT locks. It syncs against
> concurrent fork() using a special seqlock, and essentially unpins whatever it
> temporarily pinned when it detects that fork() was running concurrently. But it
> might result in some pages temporarily being flagged as "maybe pinned".
> 
> In other cases (!fork()), GUP-fast synchronizes against concurrent sharing (KSM)
> or unmapping (migration, swapout) that implies clearing of the PG_anon_flag of
> the subpage by first unmapping the PTE and conditionally remapping it. See
> mm/ksm.c:write_protect_page() as an example for the sharing side (especially: if
> page_try_share_anon_rmap() fails because the page may be pinned).
> 
> Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is easy.
> Replacing an exclusive PTE (including writable PTEs) requires some work to sync
> with GUP-fast and falls rather into the "maybe just don't bother" territory.

I'm looking to fix this problem in my code, but am struggling to see how the
current code is safe. I'm thinking about the following scenario:

 - A page is CoW mapped into processes A and B.
 - The page takes a fault in process A, and do_wp_page() determines that it is
   "maybe-shared" and therefore must copy. So drops the PTL and calls
   wp_page_copy().
 - Process B exits.
 - Another thread in process A faults on the page. This time do_wp_page()
   determines that the page is exclusive (due to the ref count), and reuses it,
   marking it exclusive along the way.
 - wp_page_copy() from the original thread in process A retakes the PTL and
   copies the _now exclusive_ page.

Having typed it up, I guess this can't happen, because wp_page_copy() will only
do the copy if the PTE hasn't changed and it will have changed because it is now
writable? So this is safe?

To make things more convoluted, what happens if the second thread does an
mprotect() to make the page RO after its write fault was handled? I think
mprotect() will serialize on the mmap write lock so this is safe too?

Sorry if this is a bit rambly, just trying to make sure I've understood
everything correctly.

Thanks,
Ryan
David Hildenbrand April 19, 2023, 10:51 a.m. UTC | #12
> I'm looking to fix this problem in my code, but am struggling to see how the
> current code is safe. I'm thinking about the following scenario:
> 

Let's see :)

>   - A page is CoW mapped into processes A and B.
>   - The page takes a fault in process A, and do_wp_page() determines that it is
>     "maybe-shared" and therefore must copy. So drops the PTL and calls
>     wp_page_copy().

Note that before calling wp_page_copy(), we do a folio_get(folio). 
Further, the page table reference is only dropped once we actually 
replace the page in the page table. So while in wp_page_copy(), the 
folio should have at least 2 references if the page is still mapped.

>   - Process B exits.
>   - Another thread in process A faults on the page. This time dw_wp_page()
>     determines that the page is exclusive (due to the ref count), and reuses it,
>     marking it exclusive along the way.

The refcount should not be 1 (there is the extra reference held by the 
wp_page_copy() caller), so A won't be able to reuse it, and ...

>   - wp_page_copy() from the original thread in process A retakes the PTL and
>     copies the _now exclusive_ page.
> 
> Having typed it up, I guess this can't happen, because wp_page_copy() will only
> do the copy if the PTE hasn't changed and it will have changed because it is now
> writable? So this is safe?

This applies as well. If the PTE changed (when reusing due to a write
fault it's now writable, or someone else broke COW), we back off. For
FAULT_FLAG_UNSHARE, however, the PTE may not change. But the additional 
reference should make it work.

I think it works as intended. It would be clearer if we'd also recheck
in wp_page_copy(), under the PT lock, whether the anon page is still not
exclusive -- and if it has become exclusive, back off.
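
Something like the following hunk, as a purely hypothetical sketch of that extra
sanity check (it is not code from the tree or from this series; variable names
only loosely follow the upstream wp_page_copy()):

        /*
         * Hypothetical recheck once wp_page_copy() has re-taken the PTL and
         * found the PTE unchanged: if the old anon page has meanwhile been
         * marked exclusive, don't replace it -- back off instead and let
         * the fault be retried.
         */
        struct folio *old_folio = vmf->page ? page_folio(vmf->page) : NULL;

        if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
                if (old_folio && folio_test_anon(old_folio) &&
                    PageAnonExclusive(vmf->page)) {
                        pte_unmap_unlock(vmf->pte, vmf->ptl);
                        /* drop the references on the new and old folios,
                         * then return without replacing anything */
                        return 0;
                }
                /* ... existing "replace the page in the page table" logic ... */
        }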

> 
> To make things more convoluted, what happens if the second thread does an
> mprotect() to make the page RO after its write fault was handled? I think
> mprotect() will serialize on the mmap write lock so this is safe too?

Yes, mprotect() synchronizes that. There are other mechanisms to 
write-protect a page, though, under mmap lock in read mode (uffd-wp). So 
it's a valid concern.

In all of these cases, reuse should be prevented due to the additional 
reference on the folio when entering wp_page_copy() right from the 
start, not turning the page exclusive but instead replacing it by a 
copy. An additional sanity check sounds like the right thing to do.

> 
> Sorry if this is a bit rambly, just trying to make sure I've understood
> everything correctly.

It's a very interesting corner case, thanks for bringing that up. I 
think the old mapcount based approach could have suffered from this 
theoretical issue, but I might be wrong.
Ryan Roberts April 19, 2023, 11:13 a.m. UTC | #13
On 19/04/2023 11:51, David Hildenbrand wrote:
>> I'm looking to fix this problem in my code, but am struggling to see how the
>> current code is safe. I'm thinking about the following scenario:
>>
> 
> Let's see :)
> 
>>   - A page is CoW mapped into processes A and B.
>>   - The page takes a fault in process A, and do_wp_page() determines that it is
>>     "maybe-shared" and therefore must copy. So drops the PTL and calls
>>     wp_page_copy().
> 
> Note that before calling wp_page_copy(), we do a folio_get(folio); Further, the
> page table reference is only dropped once we actually replace the page in the
> page table. So while in wp_page_copy(), the folio should have at least 2
> references if the page is still mapped.

Ahh, yes, of course!

> 
>>   - Process B exits.
>>   - Another thread in process A faults on the page. This time do_wp_page()
>>     determines that the page is exclusive (due to the ref count), and reuses it,
>>     marking it exclusive along the way.
> 
> The refcount should not be 1 (other reference from the wp_page_copy() caller),
> so A won't be able to reuse it, and ...
> 
>>   - wp_page_copy() from the original thread in process A retakes the PTL and
>>     copies the _now exclusive_ page.
>>
>> Having typed it up, I guess this can't happen, because wp_page_copy() will only
>> do the copy if the PTE hasn't changed and it will have changed because it is now
>> writable? So this is safe?
> 
> This applies as well. If the PTE changed (when reusing due to a write fault it's
> now writable, or someone else broke COW), we back off. For FAULT_FLAG_UNSHARE,
> however, the PTE may not change. But the additional reference should make it work.
> 
> I think it works as intended. It would be clearer if we'd also recheck in
> wp_page_copy(), under the PT lock, whether the anon page is still not exclusive
> -- and if it has become exclusive, back off.

Yes, agreed. I now have a fix for my code that adds a check during the PTE scan
that each page is non-exclusive. The scan is (already) done once without the
PTL to determine the size of the destination folio, and then again with the lock
to ensure there was no race. I'm not seeing any change in the relative counts of
folio orders (which is expected, since this case is rare).
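
For illustration, the scan amounts to something like the sketch below. This is
entirely hypothetical code, not taken from the series; the helper name, parameters
and simplifications are invented for the example:

/*
 * Count how many PTEs, starting at 'pte'/'addr', still map consecutive
 * read-only, non-exclusive subpages of the source folio starting at
 * 'page'. Run once without the PTL to choose the destination folio
 * order, then again under the PTL to confirm nothing raced in between.
 */
static int cow_copy_nr_sketch(struct vm_area_struct *vma, unsigned long addr,
                              pte_t *pte, struct page *page, int max_nr)
{
        int i;

        for (i = 0; i < max_nr; i++, addr += PAGE_SIZE) {
                pte_t ptent = ptep_get(pte + i);

                /* Stop at the first PTE that is no longer a read-only
                 * mapping of the expected subpage... */
                if (!pte_present(ptent) || pte_write(ptent) ||
                    vm_normal_page(vma, addr, ptent) != page + i)
                        break;

                /* ...or whose subpage is exclusive and must therefore
                 * never be replaced by a copy. */
                if (PageAnonExclusive(page + i))
                        break;
        }

        return i;
}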

Thanks!
Ryan Roberts April 26, 2023, 10:41 a.m. UTC | #14
Hi David,

On 17/04/2023 16:44, David Hildenbrand wrote:

>>>>>
>>>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>>>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>>>> implying that maybe we shouldn't worry about optimizing the fork() case for
>>>>> now
>>>>> that heavily.
>>>>
>>>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
>>>> large copies was contributing a measurable amount to the performance
>>>> improvement. (Certainly, the zero-page copy case is definitely a big
>>>> contributor). I don't have access to the HW at the moment but can rerun later
>>>> with and without to double check.
>>>
>>> In which test exactly? Some micro-benchmark?
>>
>> The kernel compile benchmark that I quoted numbers for in the cover letter. I
>> have some trace points (not part of the submitted series) that tell me how many
>> mappings of each order we get for each code path. I'm pretty sure I remember all
>> of these 4 code paths contributing non-negligible amounts.
> 
> Interesting! It would be great to see if there is an actual difference after
> patch #10 was applied without the other COW replacement.
> 

Sorry about the delay. I now have some numbers for this...

I rearranged the patch order so that all the "utility" stuff (new rmap
functions, etc) are first (1, 2, 3, 4, 5, 8, 9, 11, 12, 13), followed by a
couple of general improvements (7, 17), which should be dormant until we have
the final patches, then finally (6, 10, 14, 15), which implement large anon
folios for the allocate, reuse, copy-non-zero and copy-zero paths respectively. I've
dropped patch 16 and fixed the copy-exclusive bug you spotted (by ensuring we
never replace an exclusive page).

I've measured performance at the following locations in the patch set:

- baseline: none of my patches applied
- utility: has utility and general improvement patches applied
- alloc: utility + 6
- reuse: utility + 6 + 10
- copy: utility + 6 + 10 + 14
- zero-alloc: utility + 6 + 10 + 14 + 15

The test is `make defconfig && time make -jN Image` for a clean checkout of
v6.3-rc3. The first result is thrown away, and the next 3 are kept. I saw some
per-boot variance (probably down to kaslr, etc). So have booted each kernel 7
times for a total of 3x7=21 samples per kernel. Then I've taken the mean:


jobs=8:

| label      |   real |   user |   kernel |
|:-----------|-------:|-------:|---------:|
| baseline   |   0.0% |   0.0% |     0.0% |
| utility    |  -2.7% |  -2.8% |    -3.1% |
| alloc      |  -6.0% |  -2.3% |   -24.1% |
| reuse      |  -9.5% |  -5.8% |   -28.5% |
| copy       | -10.6% |  -6.9% |   -29.4% |
| zero-alloc |  -9.2% |  -5.1% |   -29.8% |


jobs=160:

| label      |   real |   user |   kernel |
|:-----------|-------:|-------:|---------:|
| baseline   |   0.0% |   0.0% |     0.0% |
| utility    |  -1.8% |  -0.0% |    -7.7% |
| alloc      |  -6.0% |   1.8% |   -20.9% |
| reuse      |  -7.8% |  -1.6% |   -24.1% |
| copy       |  -7.8% |  -2.5% |   -26.8% |
| zero-alloc |  -7.7% |   1.5% |   -29.4% |


So it looks like patch 10 (reuse) is making a difference, but copy and
zero-alloc are not adding a huge amount, as you hypothesized. Personally I would
prefer not to drop those patches though, as it will all help towards utilization
of contiguous PTEs on arm64, which is the second part of the change that I'm now
working on.


For the final config ("zero-alloc") I also collected stats on how many
operations each of the 4 paths was performing, using ftrace and histograms
("pnr" is the number of pages allocated/reused/copied, and "fnr" is the number of
pages in the source folio):


do_anonymous_page:

{ pnr:          1 } hitcount:    2749722
{ pnr:          4 } hitcount:     387832
{ pnr:          8 } hitcount:     409628
{ pnr:         16 } hitcount:    4296115

pages: 76315914
faults: 7843297
pages per fault: 9.7


wp_page_reuse (anon):

{ pnr:          1, fnr:          1 } hitcount:      47887
{ pnr:          3, fnr:          4 } hitcount:          2
{ pnr:          4, fnr:          4 } hitcount:       6131
{ pnr:          6, fnr:          8 } hitcount:          1
{ pnr:          7, fnr:          8 } hitcount:         10
{ pnr:          8, fnr:          8 } hitcount:       3794
{ pnr:          1, fnr:         16 } hitcount:         36
{ pnr:          2, fnr:         16 } hitcount:         23
{ pnr:          3, fnr:         16 } hitcount:          5
{ pnr:          4, fnr:         16 } hitcount:          9
{ pnr:          5, fnr:         16 } hitcount:          8
{ pnr:          6, fnr:         16 } hitcount:          9
{ pnr:          7, fnr:         16 } hitcount:          3
{ pnr:          8, fnr:         16 } hitcount:         24
{ pnr:          9, fnr:         16 } hitcount:          2
{ pnr:         10, fnr:         16 } hitcount:          1
{ pnr:         11, fnr:         16 } hitcount:          9
{ pnr:         12, fnr:         16 } hitcount:          2
{ pnr:         13, fnr:         16 } hitcount:         27
{ pnr:         14, fnr:         16 } hitcount:          2
{ pnr:         15, fnr:         16 } hitcount:         54
{ pnr:         16, fnr:         16 } hitcount:       6673

pages: 211393
faults: 64712
pages per fault: 3.3


wp_page_copy (anon):

{ pnr:          1, fnr:          1 } hitcount:      81242
{ pnr:          4, fnr:          4 } hitcount:       5974
{ pnr:          1, fnr:          8 } hitcount:          1
{ pnr:          4, fnr:          8 } hitcount:          1
{ pnr:          8, fnr:          8 } hitcount:      12933
{ pnr:          1, fnr:         16 } hitcount:         19
{ pnr:          4, fnr:         16 } hitcount:          3
{ pnr:          8, fnr:         16 } hitcount:          7
{ pnr:         16, fnr:         16 } hitcount:       4106

pages: 274390
faults: 104286
pages per fault: 2.6


wp_page_copy (zero):

{ pnr:          1 } hitcount:     178699
{ pnr:          4 } hitcount:      14498
{ pnr:          8 } hitcount:      23644
{ pnr:         16 } hitcount:     257940

pages: 4552883
faults: 474781
pages per fault: 9.6


Thanks,
Ryan
David Hildenbrand May 17, 2023, 1:58 p.m. UTC | #15
On 26.04.23 12:41, Ryan Roberts wrote:
> Hi David,
> 
> On 17/04/2023 16:44, David Hildenbrand wrote:
> 
>>>>>>
>>>>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's really
>>>>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>>>>> implying that maybe we shouldn't worry about optimizing the fork() case for
>>>>>> now
>>>>>> that heavily.
>>>>>
>>>>> I don't have the exact numbers to hand, but I'm pretty sure I remember enabling
>>>>> large copies was contributing a measurable amount to the performance
>>>>> improvement. (Certainly, the zero-page copy case is definitely a big
>>>>> contributor). I don't have access to the HW at the moment but can rerun later
>>>>> with and without to double check.
>>>>
>>>> In which test exactly? Some micro-benchmark?
>>>
>>> The kernel compile benchmark that I quoted numbers for in the cover letter. I
>>> have some trace points (not part of the submitted series) that tell me how many
>>> mappings of each order we get for each code path. I'm pretty sure I remember all
>>> of these 4 code paths contributing non-negligible amounts.
>>
>> Interesting! It would be great to see if there is an actual difference after
>> patch #10 was applied without the other COW replacement.
>>
> 
> Sorry about the delay. I now have some numbers for this...
> 

Ditto, I'm swamped :) Thanks for running these benchmarks!

As LSF/MM reminded me again of this topic ...

> I rearranged the patch order so that all the "utility" stuff (new rmap
> functions, etc) are first (1, 2, 3, 4, 5, 8, 9, 11, 12, 13), followed by a
> couple of general improvements (7, 17), which should be dormant until we have
> the final patches, then finally (6, 10, 14, 15), which implement large anon
> folios for the allocate, reuse, copy-non-zero and copy-zero paths respectively. I've
> dropped patch 16 and fixed the copy-exclusive bug you spotted (by ensuring we
> never replace an exclusive page).
> 
> I've measured performance at the following locations in the patch set:
> 
> - baseline: none of my patches applied
> - utility: has utility and general improvement patches applied
> - alloc: utility + 6
> - reuse: utility + 6 + 10
> - copy: utility + 6 + 10 + 14
> - zero-alloc: utility + 6 + 10 + 14 + 15
> 
> The test is `make defconfig && time make -jN Image` for a clean checkout of
> v6.3-rc3. The first result is thrown away, and the next 3 are kept. I saw some
> per-boot variance (probably down to kaslr, etc). So have booted each kernel 7
> times for a total of 3x7=21 samples per kernel. Then I've taken the mean:
> 
> 
> jobs=8:
> 
> | label      |   real |   user |   kernel |
> |:-----------|-------:|-------:|---------:|
> | baseline   |   0.0% |   0.0% |     0.0% |
> | utility    |  -2.7% |  -2.8% |    -3.1% |
> | alloc      |  -6.0% |  -2.3% |   -24.1% |
> | reuse      |  -9.5% |  -5.8% |   -28.5% |
> | copy       | -10.6% |  -6.9% |   -29.4% |
> | zero-alloc |  -9.2% |  -5.1% |   -29.8% |
> 
> 
> jobs=160:
> 
> | label      |   real |   user |   kernel |
> |:-----------|-------:|-------:|---------:|
> | baseline   |   0.0% |   0.0% |     0.0% |
> | utility    |  -1.8% |  -0.0% |    -7.7% |
> | alloc      |  -6.0% |   1.8% |   -20.9% |
> | reuse      |  -7.8% |  -1.6% |   -24.1% |
> | copy       |  -7.8% |  -2.5% |   -26.8% |
> | zero-alloc |  -7.7% |   1.5% |   -29.4% |
> 
> 
> So it looks like patch 10 (reuse) is making a difference, but copy and
> zero-alloc are not adding a huge amount, as you hypothesized. Personally I would
> prefer not to drop those patches though, as it will all help towards utilization
> of contiguous PTEs on arm64, which is the second part of the change that I'm now
> working on.

Yes, pretty much what I expected :) I can only suggest to

(1) Make the initial support as simple and minimal as possible. That
     means, strip anything that is not absolutely required. That is,
     exclude *at least* copy and zero-alloc. We can always add selected
     optimizations on top later.

     You'll do yourself a favor by getting as much review coverage as
     possible, faster review for inclusion, and fewer chances of nasty BUGs.

(2) Keep the COW logic simple. We've had too many issues
     in that area for my taste already. As 09854ba94c6a ("mm:
     do_wp_page() simplification") from Linus puts it: "Simplify,
     simplify, simplify.". If it doesn't add significant benefit, rather
     keep it simple.

> 
> 
> For the final config ("zero-alloc") I also collected stats on how many
> operations each of the 4 paths was performing, using ftrace and histograms.
> "pnr" is the number of pages allocated/reused/copied, and "fnr" is the number of
> pages in the source folio):
> 
> 
> do_anonymous_page:
> 
> { pnr:          1 } hitcount:    2749722
> { pnr:          4 } hitcount:     387832
> { pnr:          8 } hitcount:     409628
> { pnr:         16 } hitcount:    4296115
> 
> pages: 76315914
> faults: 7843297
> pages per fault: 9.7
> 
> 
> wp_page_reuse (anon):
> 
> { pnr:          1, fnr:          1 } hitcount:      47887
> { pnr:          3, fnr:          4 } hitcount:          2
> { pnr:          4, fnr:          4 } hitcount:       6131
> { pnr:          6, fnr:          8 } hitcount:          1
> { pnr:          7, fnr:          8 } hitcount:         10
> { pnr:          8, fnr:          8 } hitcount:       3794
> { pnr:          1, fnr:         16 } hitcount:         36
> { pnr:          2, fnr:         16 } hitcount:         23
> { pnr:          3, fnr:         16 } hitcount:          5
> { pnr:          4, fnr:         16 } hitcount:          9
> { pnr:          5, fnr:         16 } hitcount:          8
> { pnr:          6, fnr:         16 } hitcount:          9
> { pnr:          7, fnr:         16 } hitcount:          3
> { pnr:          8, fnr:         16 } hitcount:         24
> { pnr:          9, fnr:         16 } hitcount:          2
> { pnr:         10, fnr:         16 } hitcount:          1
> { pnr:         11, fnr:         16 } hitcount:          9
> { pnr:         12, fnr:         16 } hitcount:          2
> { pnr:         13, fnr:         16 } hitcount:         27
> { pnr:         14, fnr:         16 } hitcount:          2
> { pnr:         15, fnr:         16 } hitcount:         54
> { pnr:         16, fnr:         16 } hitcount:       6673
> 
> pages: 211393
> faults: 64712
> pages per fault: 3.3
> 
> 
> wp_page_copy (anon):
> 
> { pnr:          1, fnr:          1 } hitcount:      81242
> { pnr:          4, fnr:          4 } hitcount:       5974
> { pnr:          1, fnr:          8 } hitcount:          1
> { pnr:          4, fnr:          8 } hitcount:          1
> { pnr:          8, fnr:          8 } hitcount:      12933
> { pnr:          1, fnr:         16 } hitcount:         19
> { pnr:          4, fnr:         16 } hitcount:          3
> { pnr:          8, fnr:         16 } hitcount:          7
> { pnr:         16, fnr:         16 } hitcount:       4106
> 
> pages: 274390
> faults: 104286
> pages per fault: 2.6
> 
> 
> wp_page_copy (zero):
> 
> { pnr:          1 } hitcount:     178699
> { pnr:          4 } hitcount:      14498
> { pnr:          8 } hitcount:      23644
> { pnr:         16 } hitcount:     257940
> 
> pages: 4552883
> faults: 474781
> pages per fault: 9.6

I'll have to set aside more time to digest these values :)
Ryan Roberts May 18, 2023, 11:23 a.m. UTC | #16
On 17/05/2023 14:58, David Hildenbrand wrote:
> On 26.04.23 12:41, Ryan Roberts wrote:
>> Hi David,
>>
>> On 17/04/2023 16:44, David Hildenbrand wrote:
>>
>>>>>>>
>>>>>>> So what should be safe is replacing all sub-pages of a folio that are marked
>>>>>>> "maybe shared" by a new folio under PT lock. However, I wonder if it's
>>>>>>> really
>>>>>>> worth the complexity. For THP we were happy so far to *not* optimize this,
>>>>>>> implying that maybe we shouldn't worry about optimizing the fork() case for
>>>>>>> now
>>>>>>> that heavily.
>>>>>>
>>>>>> I don't have the exact numbers to hand, but I'm pretty sure I remember
>>>>>> enabling
>>>>>> large copies was contributing a measurable amount to the performance
>>>>>> improvement. (Certainly, the zero-page copy case is definitely a big
>>>>>> contributor). I don't have access to the HW at the moment but can rerun later
>>>>>> with and without to double check.
>>>>>
>>>>> In which test exactly? Some micro-benchmark?
>>>>
>>>> The kernel compile benchmark that I quoted numbers for in the cover letter. I
>>>> have some trace points (not part of the submitted series) that tell me how many
>>>> mappings of each order we get for each code path. I'm pretty sure I remember
>>>> all
>>>> of these 4 code paths contributing non-negligible amounts.
>>>
>>> Interesting! It would be great to see if there is an actual difference after
>>> patch #10 was applied without the other COW replacement.
>>>
>>
>> Sorry about the delay. I now have some numbers for this...
>>
> 
> Dito, I'm swamped :) Thanks for running these benchmarks!
> 
> As LSF/MM reminded me again of this topic ...
> 
>> I rearranged the patch order so that all the "utility" stuff (new rmap
>> functions, etc) are first (1, 2, 3, 4, 5, 8, 9, 11, 12, 13), followed by a
>> couple of general improvements (7, 17), which should be dormant until we have
>> the final patches, then finally (6, 10, 14, 15), which implement large anon
>> folios for the allocate, reuse, copy-non-zero and copy-zero paths respectively. I've
>> dropped patch 16 and fixed the copy-exclusive bug you spotted (by ensuring we
>> never replace an exclusive page).
>>
>> I've measured performance at the following locations in the patch set:
>>
>> - baseline: none of my patches applied
>> - utility: has utility and general improvement patches applied
>> - alloc: utility + 6
>> - reuse: utility + 6 + 10
>> - copy: utility + 6 + 10 + 14
>> - zero-alloc: utility + 6 + 10 + 14 + 15
>>
>> The test is `make defconfig && time make -jN Image` for a clean checkout of
>> v6.3-rc3. The first result is thrown away, and the next 3 are kept. I saw some
>> per-boot variance (probably down to kaslr, etc). So have booted each kernel 7
>> times for a total of 3x7=21 samples per kernel. Then I've taken the mean:
>>
>>
>> jobs=8:
>>
>> | label      |   real |   user |   kernel |
>> |:-----------|-------:|-------:|---------:|
>> | baseline   |   0.0% |   0.0% |     0.0% |
>> | utility    |  -2.7% |  -2.8% |    -3.1% |
>> | alloc      |  -6.0% |  -2.3% |   -24.1% |
>> | reuse      |  -9.5% |  -5.8% |   -28.5% |
>> | copy       | -10.6% |  -6.9% |   -29.4% |
>> | zero-alloc |  -9.2% |  -5.1% |   -29.8% |
>>
>>
>> jobs=160:
>>
>> | label      |   real |   user |   kernel |
>> |:-----------|-------:|-------:|---------:|
>> | baseline   |   0.0% |   0.0% |     0.0% |
>> | utility    |  -1.8% |  -0.0% |    -7.7% |
>> | alloc      |  -6.0% |   1.8% |   -20.9% |
>> | reuse      |  -7.8% |  -1.6% |   -24.1% |
>> | copy       |  -7.8% |  -2.5% |   -26.8% |
>> | zero-alloc |  -7.7% |   1.5% |   -29.4% |
>>
>>
>> So it looks like patch 10 (reuse) is making a difference, but copy and
>> zero-alloc are not adding a huge amount, as you hypothesized. Personally I would
>> prefer not to drop those patches though, as it will all help towards utilization
>> of contiguous PTEs on arm64, which is the second part of the change that I'm now
>> working on.
> 
> Yes, pretty much what I expected :) I can only suggest to
> 
> (1) Make the initial support as simple and minimal as possible. That
>     means, strip anything that is not absolutely required. That is,
>     exclude *at least* copy and zero-alloc. We can always add selected
>     optimizations on top later.
> 
>     You'll do yourself a favor by getting as much review coverage as
>     possible, faster review for inclusion, and fewer chances of nasty BUGs.

As I'm building out the testing capability, I'm seeing that with different HW
configs and workloads, things move a bit and zero-alloc can account for up to 1%
in some cases. Copy is still pretty marginal, but I wonder if we might see more
value from it on Android where the Zygote is constantly forked?

Regardless, I appreciate your point about making the initial contribution as
small and simple as possible, so as I get closer to posting a v1, I'll keep it
in mind and make sure I follow your advice.

Thanks,
Ryan

> 
> (2) Keep the COW logic simple. We've had too many issues
>     in that area for my taste already. As 09854ba94c6a ("mm:
>     do_wp_page() simplification") from Linus puts it: "Simplify,
>     simplify, simplify.". If it doesn't add significant benefit, rather
>     keep it simple.
> 
>>
>>
>> For the final config ("zero-alloc") I also collected stats on how many
>> operations each of the 4 paths was performing, using ftrace and histograms.
>> "pnr" is the number of pages allocated/reused/copied, and "fnr" is the number of
>> pages in the source folio):
>>
>>
>> do_anonymous_page:
>>
>> { pnr:          1 } hitcount:    2749722
>> { pnr:          4 } hitcount:     387832
>> { pnr:          8 } hitcount:     409628
>> { pnr:         16 } hitcount:    4296115
>>
>> pages: 76315914
>> faults: 7843297
>> pages per fault: 9.7
>>
>>
>> wp_page_reuse (anon):
>>
>> { pnr:          1, fnr:          1 } hitcount:      47887
>> { pnr:          3, fnr:          4 } hitcount:          2
>> { pnr:          4, fnr:          4 } hitcount:       6131
>> { pnr:          6, fnr:          8 } hitcount:          1
>> { pnr:          7, fnr:          8 } hitcount:         10
>> { pnr:          8, fnr:          8 } hitcount:       3794
>> { pnr:          1, fnr:         16 } hitcount:         36
>> { pnr:          2, fnr:         16 } hitcount:         23
>> { pnr:          3, fnr:         16 } hitcount:          5
>> { pnr:          4, fnr:         16 } hitcount:          9
>> { pnr:          5, fnr:         16 } hitcount:          8
>> { pnr:          6, fnr:         16 } hitcount:          9
>> { pnr:          7, fnr:         16 } hitcount:          3
>> { pnr:          8, fnr:         16 } hitcount:         24
>> { pnr:          9, fnr:         16 } hitcount:          2
>> { pnr:         10, fnr:         16 } hitcount:          1
>> { pnr:         11, fnr:         16 } hitcount:          9
>> { pnr:         12, fnr:         16 } hitcount:          2
>> { pnr:         13, fnr:         16 } hitcount:         27
>> { pnr:         14, fnr:         16 } hitcount:          2
>> { pnr:         15, fnr:         16 } hitcount:         54
>> { pnr:         16, fnr:         16 } hitcount:       6673
>>
>> pages: 211393
>> faults: 64712
>> pages per fault: 3.3
>>
>>
>> wp_page_copy (anon):
>>
>> { pnr:          1, fnr:          1 } hitcount:      81242
>> { pnr:          4, fnr:          4 } hitcount:       5974
>> { pnr:          1, fnr:          8 } hitcount:          1
>> { pnr:          4, fnr:          8 } hitcount:          1
>> { pnr:          8, fnr:          8 } hitcount:      12933
>> { pnr:          1, fnr:         16 } hitcount:         19
>> { pnr:          4, fnr:         16 } hitcount:          3
>> { pnr:          8, fnr:         16 } hitcount:          7
>> { pnr:         16, fnr:         16 } hitcount:       4106
>>
>> pages: 274390
>> faults: 104286
>> pages per fault: 2.6
>>
>>
>> wp_page_copy (zero):
>>
>> { pnr:          1 } hitcount:     178699
>> { pnr:          4 } hitcount:      14498
>> { pnr:          8 } hitcount:      23644
>> { pnr:         16 } hitcount:     257940
>>
>> pages: 4552883
>> faults: 474781
>> pages per fault: 9.6
> 
> I'll have to set aside more time to digest these values :)
>