mbox series

[v3,0/4] Swap-out small-sized THP without splitting

Message ID 20231025144546.577640-1-ryan.roberts@arm.com (mailing list archive)
Headers show
Series Swap-out small-sized THP without splitting | expand

Message

Ryan Roberts Oct. 25, 2023, 2:45 p.m. UTC
Hi All,

This is v3 of a series to add support for swapping out small-sized THP without
needing to first split the large folio via __split_huge_page(). It closely
follows the approach already used by PMD-sized THP.

"Small-sized THP" is an upcoming feature that enables performance improvements
by allocating large folios for anonymous memory, where the large folio size is
smaller than the traditional PMD-size. See [3].

In some circumstances I've observed a performance regression (see patch 2 for
details), and this series is an attempt to fix the regression in advance of
merging small-sized THP support.

I've done what I thought was the smallest change possible, and as a result, this
approach is only employed when the swap is backed by a non-rotating block device
(just as PMD-sized THP is supported today). Discussion against the RFC concluded
that this is probably sufficient.

The series applies against mm-unstable (1a3c85fa684a)


Changes since v2 [2]
====================

 - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
   allocation. This required some refactoring to make everything work nicely
   (new patches 2 and 3).
 - Fix bug where nr_swap_pages would say there are pages available but the
   scanner would not be able to allocate them because they were reserved for the
   per-cpu allocator. We now allow stealing of order-0 entries from the high
   order per-cpu clusters (in addition to exisiting stealing from order-0
   per-cpu clusters).

Thanks to Huang, Ying for the review feedback and suggestions!


Changes since v1 [1]
====================

 - patch 1:
    - Use cluster_set_count() instead of cluster_set_count_flag() in
      swap_alloc_cluster() since we no longer have any flag to set. I was unable
      to kill cluster_set_count_flag() as proposed against v1 as other call
      sites depend explicitly setting flags to 0.
 - patch 2:
    - Moved large_next[] array into percpu_cluster to make it per-cpu
      (recommended by Huang, Ying).
    - large_next[] array is dynamically allocated because PMD_ORDER is not
      compile-time constant for powerpc (fixes build error).


Thanks,
Ryan

P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
but given Huang Ying had provided some review feedback, I wanted to progress it.
All the actual prerequisites are either complete or being worked on by others.


[1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
[3] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@arm.com/


Ryan Roberts (4):
  mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
  mm: swap: Remove struct percpu_cluster
  mm: swap: Simplify ssd behavior when scanner steals entry
  mm: swap: Swap-out small-sized THP without splitting

 include/linux/swap.h |  31 +++---
 mm/huge_memory.c     |   3 -
 mm/swapfile.c        | 232 ++++++++++++++++++++++++-------------------
 mm/vmscan.c          |  10 +-
 4 files changed, 149 insertions(+), 127 deletions(-)

--
2.25.1

Comments

Barry Song Nov. 29, 2023, 7:47 a.m. UTC | #1
> Hi All,
> 
> This is v3 of a series to add support for swapping out small-sized THP without
> needing to first split the large folio via __split_huge_page(). It closely
> follows the approach already used by PMD-sized THP.
> 
> "Small-sized THP" is an upcoming feature that enables performance improvements
> by allocating large folios for anonymous memory, where the large folio size is
> smaller than the traditional PMD-size. See [3].
> 
> In some circumstances I've observed a performance regression (see patch 2 for
> details), and this series is an attempt to fix the regression in advance of
> merging small-sized THP support.
> 
> I've done what I thought was the smallest change possible, and as a result, this
> approach is only employed when the swap is backed by a non-rotating block device
> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
> that this is probably sufficient.
> 
> The series applies against mm-unstable (1a3c85fa684a)
> 
> 
> Changes since v2 [2]
> ====================
> 
>  - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
>    allocation. This required some refactoring to make everything work nicely
>    (new patches 2 and 3).
>  - Fix bug where nr_swap_pages would say there are pages available but the
>    scanner would not be able to allocate them because they were reserved for the
>    per-cpu allocator. We now allow stealing of order-0 entries from the high
>    order per-cpu clusters (in addition to exisiting stealing from order-0
>    per-cpu clusters).
> 
> Thanks to Huang, Ying for the review feedback and suggestions!
> 
> 
> Changes since v1 [1]
> ====================
> 
>  - patch 1:
>     - Use cluster_set_count() instead of cluster_set_count_flag() in
>       swap_alloc_cluster() since we no longer have any flag to set. I was unable
>       to kill cluster_set_count_flag() as proposed against v1 as other call
>       sites depend explicitly setting flags to 0.
>  - patch 2:
>     - Moved large_next[] array into percpu_cluster to make it per-cpu
>       (recommended by Huang, Ying).
>     - large_next[] array is dynamically allocated because PMD_ORDER is not
>       compile-time constant for powerpc (fixes build error).
> 
> 
> Thanks,
> Ryan

> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
> but given Huang Ying had provided some review feedback, I wanted to progress it.
> All the actual prerequisites are either complete or being worked on by others.
> 

Hi Ryan,

this is quite important to a phone and a must-have component, so is large-folio
swapin, as i explained to you in another email. 
Luckily, we are having Chuanhua Han(Cc-ed) to prepare a patchset of largefolio
swapin on top of your this patchset, probably a port and cleanup of our
do_swap_page[1] againest yours.

Another concern is that swapslots can be fragmented, if we place small/large folios
in a swap device, since large folios always require contiguous swapslot, we can
result in failure of getting slots even we still have many free slots which are not
contiguous. To avoid this, [2] dynamic hugepage solution have two swap devices,
one for basepage, the other one for CONTPTE. we have modified the priority-based
selection of swap devices to choose swap devices based on small/large folios.
i realize this approache is super ugly and might be very hard to find a way to
upstream though, it seems not universal especially if you are a linux server (-_-)

two devices are not a nice approach though it works well for a real product,
we might still need some decent way to address this problem while the problem
is for sure not a stopper of your patchset.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
[2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129

> 
> [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
> [3] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@arm.com/
> 
> 
> Ryan Roberts (4):
>   mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
>   mm: swap: Remove struct percpu_cluster
>   mm: swap: Simplify ssd behavior when scanner steals entry
>   mm: swap: Swap-out small-sized THP without splitting
> 
>  include/linux/swap.h |  31 +++---
>  mm/huge_memory.c     |   3 -
>  mm/swapfile.c        | 232 ++++++++++++++++++++++++-------------------
>  mm/vmscan.c          |  10 +-
>  4 files changed, 149 insertions(+), 127 deletions(-)

Thanks
Barry
Ryan Roberts Nov. 29, 2023, 12:06 p.m. UTC | #2
On 29/11/2023 07:47, Barry Song wrote:
>> Hi All,
>>
>> This is v3 of a series to add support for swapping out small-sized THP without
>> needing to first split the large folio via __split_huge_page(). It closely
>> follows the approach already used by PMD-sized THP.
>>
>> "Small-sized THP" is an upcoming feature that enables performance improvements
>> by allocating large folios for anonymous memory, where the large folio size is
>> smaller than the traditional PMD-size. See [3].
>>
>> In some circumstances I've observed a performance regression (see patch 2 for
>> details), and this series is an attempt to fix the regression in advance of
>> merging small-sized THP support.
>>
>> I've done what I thought was the smallest change possible, and as a result, this
>> approach is only employed when the swap is backed by a non-rotating block device
>> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
>> that this is probably sufficient.
>>
>> The series applies against mm-unstable (1a3c85fa684a)
>>
>>
>> Changes since v2 [2]
>> ====================
>>
>>  - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
>>    allocation. This required some refactoring to make everything work nicely
>>    (new patches 2 and 3).
>>  - Fix bug where nr_swap_pages would say there are pages available but the
>>    scanner would not be able to allocate them because they were reserved for the
>>    per-cpu allocator. We now allow stealing of order-0 entries from the high
>>    order per-cpu clusters (in addition to exisiting stealing from order-0
>>    per-cpu clusters).
>>
>> Thanks to Huang, Ying for the review feedback and suggestions!
>>
>>
>> Changes since v1 [1]
>> ====================
>>
>>  - patch 1:
>>     - Use cluster_set_count() instead of cluster_set_count_flag() in
>>       swap_alloc_cluster() since we no longer have any flag to set. I was unable
>>       to kill cluster_set_count_flag() as proposed against v1 as other call
>>       sites depend explicitly setting flags to 0.
>>  - patch 2:
>>     - Moved large_next[] array into percpu_cluster to make it per-cpu
>>       (recommended by Huang, Ying).
>>     - large_next[] array is dynamically allocated because PMD_ORDER is not
>>       compile-time constant for powerpc (fixes build error).
>>
>>
>> Thanks,
>> Ryan
> 
>> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
>> but given Huang Ying had provided some review feedback, I wanted to progress it.
>> All the actual prerequisites are either complete or being worked on by others.
>>
> 
> Hi Ryan,
> 
> this is quite important to a phone and a must-have component, so is large-folio
> swapin, as i explained to you in another email. 

Yes understood; the "prerequisites" are just the things that must be merged
*before* small-sized THP to ensure we don't regress existing behaviour or to
ensure that small-size THP is correct/robust when enabled. Performance
improvements can be merged after the initial small-sized series.

> Luckily, we are having Chuanhua Han(Cc-ed) to prepare a patchset of largefolio
> swapin on top of your this patchset, probably a port and cleanup of our
> do_swap_page[1] againest yours.

That's great to hear - welcome aboard, Chuanhua Han! Feel free to reach out if
you have questions.

I would guess that any large swap-in changes would be independent of this
swap-out patch though? Wouldn't you just be looking for contiguous swap entries
in the page table to determine a suitable folio order, then swap-in each of
those entries into the folio? And if they happen to have contiguous swap offsets
(enabled by this swap-out series) then you potentially get a batched disk access
benefit.

That's just a guess though, perhaps you can describe your proposed approach?

> 
> Another concern is that swapslots can be fragmented, if we place small/large folios
> in a swap device, since large folios always require contiguous swapslot, we can
> result in failure of getting slots even we still have many free slots which are not
> contiguous.

This series tries to mitigate that problem by reserving a swap cluster per
order. That works well until we run out of swap clusters; a cluster can't be
freed until all contained swap entries are swapped back in and deallocated.

But I think we should start with the simple approach first, and only solve the
problems as they arise through real testing.

 To avoid this, [2] dynamic hugepage solution have two swap devices,
> one for basepage, the other one for CONTPTE. we have modified the priority-based
> selection of swap devices to choose swap devices based on small/large folios.
> i realize this approache is super ugly and might be very hard to find a way to
> upstream though, it seems not universal especially if you are a linux server (-_-)
> 
> two devices are not a nice approach though it works well for a real product,
> we might still need some decent way to address this problem while the problem
> is for sure not a stopper of your patchset.

I guess that approach works for your case because A) you only have 2 sizes, and
B) your swap device is zRAM, which dynamically allocate RAM as it needs it.

The upstream small-sized THP solution can support multiple sizes, so you would
need a swap device per size (I think 13 is the limit at the moment - PMD size
for 64K base page). And if your swap device is a physical block device, you
can't dynamically parition it the way you can with zRAM. Nether of those things
scale particularly well IMHO.

> 
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
> [2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129
> 
>>
>> [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
>> [3] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@arm.com/
>>
>>
>> Ryan Roberts (4):
>>   mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
>>   mm: swap: Remove struct percpu_cluster
>>   mm: swap: Simplify ssd behavior when scanner steals entry
>>   mm: swap: Swap-out small-sized THP without splitting
>>
>>  include/linux/swap.h |  31 +++---
>>  mm/huge_memory.c     |   3 -
>>  mm/swapfile.c        | 232 ++++++++++++++++++++++++-------------------
>>  mm/vmscan.c          |  10 +-
>>  4 files changed, 149 insertions(+), 127 deletions(-)
> 
> Thanks
> Barry
Barry Song Nov. 29, 2023, 8:38 p.m. UTC | #3
On Thu, Nov 30, 2023 at 1:06 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 29/11/2023 07:47, Barry Song wrote:
> >> Hi All,
> >>
> >> This is v3 of a series to add support for swapping out small-sized THP without
> >> needing to first split the large folio via __split_huge_page(). It closely
> >> follows the approach already used by PMD-sized THP.
> >>
> >> "Small-sized THP" is an upcoming feature that enables performance improvements
> >> by allocating large folios for anonymous memory, where the large folio size is
> >> smaller than the traditional PMD-size. See [3].
> >>
> >> In some circumstances I've observed a performance regression (see patch 2 for
> >> details), and this series is an attempt to fix the regression in advance of
> >> merging small-sized THP support.
> >>
> >> I've done what I thought was the smallest change possible, and as a result, this
> >> approach is only employed when the swap is backed by a non-rotating block device
> >> (just as PMD-sized THP is supported today). Discussion against the RFC concluded
> >> that this is probably sufficient.
> >>
> >> The series applies against mm-unstable (1a3c85fa684a)
> >>
> >>
> >> Changes since v2 [2]
> >> ====================
> >>
> >>  - Reuse scan_swap_map_try_ssd_cluster() between order-0 and order > 0
> >>    allocation. This required some refactoring to make everything work nicely
> >>    (new patches 2 and 3).
> >>  - Fix bug where nr_swap_pages would say there are pages available but the
> >>    scanner would not be able to allocate them because they were reserved for the
> >>    per-cpu allocator. We now allow stealing of order-0 entries from the high
> >>    order per-cpu clusters (in addition to exisiting stealing from order-0
> >>    per-cpu clusters).
> >>
> >> Thanks to Huang, Ying for the review feedback and suggestions!
> >>
> >>
> >> Changes since v1 [1]
> >> ====================
> >>
> >>  - patch 1:
> >>     - Use cluster_set_count() instead of cluster_set_count_flag() in
> >>       swap_alloc_cluster() since we no longer have any flag to set. I was unable
> >>       to kill cluster_set_count_flag() as proposed against v1 as other call
> >>       sites depend explicitly setting flags to 0.
> >>  - patch 2:
> >>     - Moved large_next[] array into percpu_cluster to make it per-cpu
> >>       (recommended by Huang, Ying).
> >>     - large_next[] array is dynamically allocated because PMD_ORDER is not
> >>       compile-time constant for powerpc (fixes build error).
> >>
> >>
> >> Thanks,
> >> Ryan
> >
> >> P.S. I know we agreed this is not a prerequisite for merging small-sized THP,
> >> but given Huang Ying had provided some review feedback, I wanted to progress it.
> >> All the actual prerequisites are either complete or being worked on by others.
> >>
> >
> > Hi Ryan,
> >
> > this is quite important to a phone and a must-have component, so is large-folio
> > swapin, as i explained to you in another email.
>
> Yes understood; the "prerequisites" are just the things that must be merged
> *before* small-sized THP to ensure we don't regress existing behaviour or to
> ensure that small-size THP is correct/robust when enabled. Performance
> improvements can be merged after the initial small-sized series.

I completely agree. I didn't mean small-THP swap out as a whole  should be
a prerequisite for small-THP initial patchset, just describing how important
it is to a phone :-)

And actually we have done much further than this on phones by optimizing
zsmalloc/zram and allow a large folio compressed and decompressed
as a whole,  we have seen compressing/decompressing a whole large folio
can significantly improve compression ratio and decrease CPU consumption.

so that means large folios can not only save memory but also decrease
CPU consumption.

>
> > Luckily, we are having Chuanhua Han(Cc-ed) to prepare a patchset of largefolio
> > swapin on top of your this patchset, probably a port and cleanup of our
> > do_swap_page[1] againest yours.
>
> That's great to hear - welcome aboard, Chuanhua Han! Feel free to reach out if
> you have questions.
>
> I would guess that any large swap-in changes would be independent of this
> swap-out patch though? Wouldn't you just be looking for contiguous swap entries
> in the page table to determine a suitable folio order, then swap-in each of
> those entries into the folio? And if they happen to have contiguous swap offsets
> (enabled by this swap-out series) then you potentially get a batched disk access
> benefit.

I agree. Maybe we still need to check if the number of contiguous swap entries
is one of those supported large folio sizes?

>
> That's just a guess though, perhaps you can describe your proposed approach?

we have an ugly hack if we are swapping in from the dedicated zRAM for
large folios,
we assume we have a chance to swapin as a whole, but we do also handle corner
cases in which some entries might have been zap_pte_range()-ed.

My current proposal is as below,
A1. we get the number of contiguous swap entries with PTL and find it
is a valid large folio size
A2. we allocate large folio without PTL
A3. after getting PTL again, we re-check PTEs if the situation in A1
have been changed,
if no other threads change those PTEs, we set_ptes and finish the swap-in

but we have a chance to fail in A2, so in this case we still need to
fall back to basepage.

considering the MTE thread[1] I am handling, and MTE tag life cycle is
the same with swap
entry life cycle. it seems we will still need a page-level
arch_swap_restore even after
we support large folio swap-in for the below two reasons

1. contiguous PTEs might be partially dropped by madvise(DONTNEED) etc
2. we can still fall back to basepage for swap-in if we fail to get
large folio even PTEs are all
contiguous swap entries

Of course,  if we succeed in setting all PTEs for a large folio in A3,
we can have
a folio-level arch_swap_restore.

To me, an universal folio-level arch_swap_restore seems not sensible
to handle all kinds of
complex cases.

[1] [RFC V3 PATCH] arm64: mm: swap: save and restore mte tags for large folios
https://lore.kernel.org/linux-mm/20231114014313.67232-1-v-songbaohua@oppo.com/

>
> >
> > Another concern is that swapslots can be fragmented, if we place small/large folios
> > in a swap device, since large folios always require contiguous swapslot, we can
> > result in failure of getting slots even we still have many free slots which are not
> > contiguous.
>
> This series tries to mitigate that problem by reserving a swap cluster per
> order. That works well until we run out of swap clusters; a cluster can't be
> freed until all contained swap entries are swapped back in and deallocated.
>
> But I think we should start with the simple approach first, and only solve the
> problems as they arise through real testing.

I agree.

>
>  To avoid this, [2] dynamic hugepage solution have two swap devices,
> > one for basepage, the other one for CONTPTE. we have modified the priority-based
> > selection of swap devices to choose swap devices based on small/large folios.
> > i realize this approache is super ugly and might be very hard to find a way to
> > upstream though, it seems not universal especially if you are a linux server (-_-)
> >
> > two devices are not a nice approach though it works well for a real product,
> > we might still need some decent way to address this problem while the problem
> > is for sure not a stopper of your patchset.
>
> I guess that approach works for your case because A) you only have 2 sizes, and
> B) your swap device is zRAM, which dynamically allocate RAM as it needs it.
>
> The upstream small-sized THP solution can support multiple sizes, so you would
> need a swap device per size (I think 13 is the limit at the moment - PMD size
> for 64K base page). And if your swap device is a physical block device, you
> can't dynamically parition it the way you can with zRAM. Nether of those things
> scale particularly well IMHO.

right.

>
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L4648
> > [2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/swapfile.c#L1129
> >
> >>
> >> [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/
> >> [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/
> >> [3] https://lore.kernel.org/linux-mm/15a52c3d-9584-449b-8228-1335e0753b04@arm.com/
> >>
> >>
> >> Ryan Roberts (4):
> >>   mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags
> >>   mm: swap: Remove struct percpu_cluster
> >>   mm: swap: Simplify ssd behavior when scanner steals entry
> >>   mm: swap: Swap-out small-sized THP without splitting
> >>
> >>  include/linux/swap.h |  31 +++---
> >>  mm/huge_memory.c     |   3 -
> >>  mm/swapfile.c        | 232 ++++++++++++++++++++++++-------------------
> >>  mm/vmscan.c          |  10 +-
> >>  4 files changed, 149 insertions(+), 127 deletions(-)
> >

Thanks
Barry