[v4,0/6] Swap-out mTHP without splitting

Message ID: 20240311150058.1122862-1-ryan.roberts@arm.com

Message

Ryan Roberts March 11, 2024, 3 p.m. UTC
Hi All,

This series adds support for swapping out multi-size THP (mTHP) without needing
to first split the large folio via split_huge_page_to_list_to_order(). It
closely follows the approach already used to swap out PMD-sized THP.

There are a few reasons for swapping out mTHP without splitting:

  - Performance: It is expensive to split a large folio, and under extreme
    memory pressure some workloads saw a performance regression when using
    64K mTHP vs 4K small folios because of this extra cost in the swap-out
    path. This series not only eliminates the regression but makes it faster
    to swap out 64K mTHP vs 4K small folios.

  - Memory fragmentation avoidance: If we can avoid splitting a large folio,
    memory is less likely to become fragmented, making it easier to re-allocate
    a large folio in the future.

  - Performance: Enables a separate series [4] to swap-in whole mTHPs, which
    means we won't lose the TLB-efficiency benefits of mTHP once the memory has
    been through a swap cycle.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). Discussion against the RFC concluded
that this is sufficient.


Performance Testing
===================

I've run some swap performance tests on an Ampere Altra VM (arm64) with 8
CPUs. The VM is set up with a 35G block RAM device as the swap device and the
test is run from inside a memcg limited to 40G of memory. I've then run
`usemem` from vm-scalability with 70 processes, each allocating and writing 1G
of memory. I've repeated everything 6 times and taken the mean performance
improvement relative to the 4K page baseline:

| alloc size |            baseline |       + this series |
|            |  v6.6-rc4+anonfolio |                     |
|:-----------|--------------------:|--------------------:|
| 4K Page    |                0.0% |                1.4% |
| 64K THP    |              -14.6% |               44.2% |
| 2M THP     |               87.4% |               97.7% |

So with this change, the 64K swap performance goes from a 15% regression to a
44% improvement. 4K and 2M swap performance improves slightly too.

This test also acts as a good stress test for swap and, more generally, mm. A
couple of existing bugs were found as a result [5] [6].


---
The series applies against mm-unstable (d7182786dd0a), although I've
additionally been running with a couple of extra fixes to avoid the issues at
[6].


Changes since v3 [3]
====================

 - Renamed SWAP_NEXT_NULL -> SWAP_NEXT_INVALID (per Huang, Ying)
 - Simplified max offset calculation (per Huang, Ying)
 - Reinstated struct percpu_cluster to contain per-cluster, per-order `next`
   offset (per Huang, Ying); see the sketch just after this list
 - Removed swap_alloc_large() and merged its functionality into
   scan_swap_map_slots() (per Huang, Ying)
 - Avoid extra cost of folio ref and lock due to removal of CLUSTER_FLAG_HUGE
   by freeing swap entries in batches (see patch 2) (per DavidH)
 - vmscan splits the folio if it's partially mapped (per Barry Song, DavidH)
 - Avoid splitting in MADV_PAGEOUT path (per Barry Song)
 - Dropped "mm: swap: Simplify ssd behavior when scanner steals entry" patch
   since it's not actually a problem for THP as I first thought.
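
For readers less familiar with the swap allocator, here is a minimal,
standalone sketch of the per-cluster, per-order `next` idea referenced above.
It is an illustration of the description in this cover letter, not an excerpt
from the patches; the SWAP_NR_ORDERS value and the percpu_cluster_advance()
helper are hypothetical, and the real field names and sizing may differ.

/*
 * Illustrative sketch only (not from the patches): a per-cpu cache of the
 * likely next free swap offset, kept per allocation order, so that order-0
 * and each mTHP order can resume scanning where they left off.
 * SWAP_NEXT_INVALID can be 0 because offset 0 of a swap device holds the
 * header and is never handed out.
 */
#include <stdio.h>

#define SWAP_NR_ORDERS    10  /* hypothetical: orders 0..9 (up to 2M with 4K pages) */
#define SWAP_NEXT_INVALID 0   /* "no hint; claim a fresh cluster" */

struct percpu_cluster {
        unsigned int next[SWAP_NR_ORDERS];  /* likely next allocation offset, per order */
};

/* After allocating 1 << order entries at `offset`, remember where to resume. */
static void percpu_cluster_advance(struct percpu_cluster *pcp, int order,
                                   unsigned int offset, unsigned int cluster_end)
{
        unsigned int next = offset + (1u << order);

        pcp->next[order] = (next < cluster_end) ? next : SWAP_NEXT_INVALID;
}

int main(void)
{
        struct percpu_cluster pcp = { { SWAP_NEXT_INVALID } };

        /* e.g. a 64K mTHP (order 4) allocated at offset 256 of a 512-entry cluster */
        percpu_cluster_advance(&pcp, 4, 256, 512);
        printf("next order-4 hint: %u\n", pcp.next[4]);  /* prints 272 */
        return 0;
}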


Changes since v2 [2]
====================

 - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
   allocation. This required some refactoring to make everything work nicely
   (new patches 2 and 3).
 - Fix bug where nr_swap_pages would say there are pages available but the
   scanner would not be able to allocate them because they were reserved for the
   per-cpu allocator. We now allow stealing of order-0 entries from the high
   order per-cpu clusters (in addition to existing stealing from order-0
   per-cpu clusters).


Changes since v1 [1]
====================

 - patch 1:
    - Use cluster_set_count() instead of cluster_set_count_flag() in
      swap_alloc_cluster() since we no longer have any flag to set. I was unable
      to kill cluster_set_count_flag() as proposed against v1 as other call
      sites depend on explicitly setting flags to 0.
 - patch 2:
    - Moved large_next[] array into percpu_cluster to make it per-cpu
      (recommended by Huang, Ying).
    - large_next[] array is dynamically allocated because PMD_ORDER is not a
      compile-time constant on powerpc (fixes build error).


[1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/
[4] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
[5] https://lore.kernel.org/linux-mm/20240311084426.447164-1-ying.huang@intel.com/
[6] https://lore.kernel.org/linux-mm/79dad067-1d26-4867-8eb1-941277b9a77b@arm.com/

Thanks,
Ryan


Ryan Roberts (6):
  mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()
  mm: swap: Simplify struct percpu_cluster
  mm: swap: Allow storage of all mTHP orders
  mm: vmscan: Avoid split during shrink_folio_list()
  mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

 include/linux/pgtable.h |  28 ++++
 include/linux/swap.h    |  33 +++--
 mm/huge_memory.c        |   3 -
 mm/internal.h           |  48 +++++++
 mm/madvise.c            | 101 ++++++++------
 mm/memory.c             |  13 +-
 mm/swapfile.c           | 298 ++++++++++++++++++++++------------------
 mm/vmscan.c             |   9 +-
 8 files changed, 332 insertions(+), 201 deletions(-)

--
2.25.1

Comments

Huang, Ying March 12, 2024, 8:01 a.m. UTC | #1
Ryan Roberts <ryan.roberts@arm.com> writes:

> [...]
> Performance Testing
> ===================
>
> I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The
> VM is set up with a 35G block ram device as the swap device and the test is run
> from inside a memcg limited to 40G memory. I've then run `usemem` from
> vm-scalability with 70 processes, each allocating and writing 1G of memory. I've
> repeated everything 6 times and taken the mean performance improvement relative
> to 4K page baseline:
>
> | alloc size |            baseline |       + this series |
> |            |  v6.6-rc4+anonfolio |                     |
> |:-----------|--------------------:|--------------------:|
> | 4K Page    |                0.0% |                1.4% |
> | 64K THP    |              -14.6% |               44.2% |
> | 2M THP     |               87.4% |               97.7% |
>
> So with this change, the 64K swap performance goes from a 15% regression to a
> 44% improvement. 4K and 2M swap improves slightly too.

I don't understand why the performance of 2M THP improves.  The swap
entry allocation becomes a little slower.  Can you provide some
perf-profile to root cause it?

--
Best Regards,
Huang, Ying

Ryan Roberts March 12, 2024, 8:45 a.m. UTC | #2
On 11/03/2024 15:00, Ryan Roberts wrote:
> [...]
> | alloc size |            baseline |       + this series |
> |            |  v6.6-rc4+anonfolio |                     |

Oops, just noticed I failed to update these column headers. The baseline is
actually mm-unstable (d7182786dd0a), which is based on v6.8-rc5 and already
contains "anonfolio" - now called mTHP.


Ryan Roberts March 12, 2024, 8:49 a.m. UTC | #3
On 12/03/2024 08:01, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> [...]
>> | alloc size |            baseline |       + this series |
>> |            |  v6.6-rc4+anonfolio |                     |
>> |:-----------|--------------------:|--------------------:|
>> | 4K Page    |                0.0% |                1.4% |
>> | 64K THP    |              -14.6% |               44.2% |
>> | 2M THP     |               87.4% |               97.7% |
>>
>> So with this change, the 64K swap performance goes from a 15% regression to a
>> 44% improvement. 4K and 2M swap improves slightly too.
> 
> I don't understand why the performance of 2M THP improves.  The swap
> entry allocation becomes a little slower.  Can you provide some
> perf-profile to root cause it?

I didn't post the stdev, which is quite large (~10%), so that may explain some
of it:

| kernel   |   mean_rel |   std_rel |
|:---------|-----------:|----------:|
| base-4K  |       0.0% |      5.5% |
| base-64K |     -14.6% |      3.8% |
| base-2M  |      87.4% |     10.6% |
| v4-4K    |       1.4% |      3.7% |
| v4-64K   |      44.2% |     11.8% |
| v4-2M    |      97.7% |     13.3% |

Regardless, I'll do some perf profiling and post results shortly.

Ryan Roberts March 12, 2024, 1:56 p.m. UTC | #4
On 12/03/2024 08:49, Ryan Roberts wrote:
> On 12/03/2024 08:01, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> [...]
>>
>> I don't understand why the performance of 2M THP improves.  The swap
>> entry allocation becomes a little slower.  Can you provide some
>> perf-profile to root cause it?
> 
> I didn't post the stdev, which is quite large (~10%), so that may explain some
> of it:
> 
> | kernel   |   mean_rel |   std_rel |
> |:---------|-----------:|----------:|
> | base-4K  |       0.0% |      5.5% |
> | base-64K |     -14.6% |      3.8% |
> | base-2M  |      87.4% |     10.6% |
> | v4-4K    |       1.4% |      3.7% |
> | v4-64K   |      44.2% |     11.8% |
> | v4-2M    |      97.7% |     13.3% |
> 
> Regardless, I'll do some perf profiling and post results shortly.

I did a lot more runs (24 for each config) and averaged them to try to remove
the noise in the measurements. It's now only showing a 4% improvement for 2M.
So I don't think the 2M improvement is real:

| kernel   |   mean_rel |   std_rel |
|:---------|-----------:|----------:|
| base-4K  |       0.0% |      3.2% |
| base-64K |      -9.1% |     10.1% |
| base-2M  |      88.9% |      6.8% |
| v4-4K    |       0.5% |      3.1% |
| v4-64K   |      44.7% |      8.3% |
| v4-2M    |      93.3% |      7.8% |

Looking at the perf data, the only thing that sticks out is that a big chunk of
time is spent in contpte_convert(), called as a result of try_to_unmap_one().
This is present in both the before and after configs.

This is an arm64 function to "unfold" contpte mappings. Essentially, the PMD is
being split during shrink_folio_list() with TTU_SPLIT_HUGE_PMD, meaning the
THPs are PTE-mapped in contpte blocks. Then we are unmapping each PTE
one-by-one, which means the contpte block needs to be unfolded. I think
try_to_unmap_one() could potentially be optimized to batch-unmap a contiguously
mapped folio and avoid this unfold, but that would be an independent and
separate piece of work.
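
To make the cost pattern above concrete, here is a tiny standalone sketch (not
kernel code and not part of this series): a purely hypothetical operation
count contrasting per-PTE unmap of one contpte block, which first pays for an
unfold, with a batched unmap that could skip the unfold entirely. The counts
are illustrative assumptions, not measurements.

/* Hypothetical cost model for one 64K contpte block (16 x 4K PTEs). */
#include <stdio.h>

#define CONT_PTES 16

int main(void)
{
        /*
         * Per-PTE unmap today: the first clear unfolds the block (clear all
         * entries, one range TLB invalidation, rewrite the entries without
         * the contiguous bit), then reclaim clears each PTE individually,
         * each with its own TLB invalidation.
         */
        int per_pte_writes = CONT_PTES + CONT_PTES + CONT_PTES;
        int per_pte_tlbis  = 1 + CONT_PTES;

        /*
         * A batched unmap of the whole contiguously mapped range could clear
         * every entry once and issue a single range invalidation, no unfold.
         */
        int batched_writes = CONT_PTES;
        int batched_tlbis  = 1;

        printf("per-PTE unmap: %d PTE writes, %d TLB invalidations\n",
               per_pte_writes, per_pte_tlbis);
        printf("batched unmap: %d PTE writes, %d TLB invalidations\n",
               batched_writes, batched_tlbis);
        return 0;
}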

Huang, Ying March 13, 2024, 1:15 a.m. UTC | #5
Ryan Roberts <ryan.roberts@arm.com> writes:

> [...]

Thanks for more data and detailed explanation.

--
Best Regards,
Huang, Ying
Ryan Roberts March 13, 2024, 8:50 a.m. UTC | #6
On 13/03/2024 01:15, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> [...]
> 
> Thanks for more data and detailed explanation.

And thanks for your review! I'll address all your comments (and any others that
I get in the meantime) and repost after the merge window. It would be great if
we can get this in for v6.10.
