
[RFC,v4,0/2] mm: support mTHP swap-in for zRAM-like swapfile

Message ID 20240629111010.230484-1-21cnbao@gmail.com (mailing list archive)

Message

Barry Song June 29, 2024, 11:10 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

In an embedded system like Android, more than half of anonymous memory is
actually stored in swap devices such as zRAM. For instance, when an app 
is switched to the background, most of its memory might be swapped out.

Currently, we have mTHP features, but unfortunately, without support
for large folio swap-ins, once those large folios are swapped out,
we lose them immediately because mTHP is a one-way ticket.

This is unacceptable and reduces mTHP to merely a toy on systems
with significant swap utilization.

This patch introduces mTHP swap-in support. For now, we limit mTHP
swap-ins to contiguous swaps that were likely swapped out from mTHP as
a whole.
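
As a rough illustration of the restriction above, a swap-in path could accept a large folio only when the swap offsets behind the faulting range form a single aligned, contiguous run. The sketch below is plain userspace C, not code from this series; the helper name and the alignment requirement are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the patch): returns true when the nr swap
 * offsets backing a candidate range form one aligned, contiguous run.
 * nr is the number of pages in the candidate mTHP and must be a power of two.
 */
static bool swap_offsets_form_aligned_run(const unsigned long *offsets, size_t nr)
{
    if (nr == 0 || (nr & (nr - 1)) != 0)
        return false;               /* not a valid mTHP page count */
    if (offsets[0] % nr != 0)
        return false;               /* first slot is not aligned to the folio size */
    for (size_t i = 1; i < nr; i++) {
        if (offsets[i] != offsets[0] + i)
            return false;           /* hole or reordering: use small folios instead */
    }
    return true;
}
```

With four pages per folio, offsets 16..19 would pass this check, while 17..20 (unaligned) or 16, 17, 19, 20 (a hole) would fall back to small-folio swap-in.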

Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
case. This is the simplest and most common use case, benefiting millions
of Android phones and similar devices with minimal implementation
cost. In this straightforward scenario, large folios are always exclusive,
eliminating the need to handle complex rmap and swapcache issues.

It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
   swap-out and swap-in.
2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
   without fragmentation. Based on the observed data [1] on Chris's and Ryan's
   THP swap allocation optimization, aligned swap-in plays a crucial role
   in the success of THP_SWPOUT.
3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
   and enhancing compression ratios significantly. We have another patchset
   to enable mTHP compression and decompression in zsmalloc/zRAM[2].

Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
to be an optimal approach. There's a critical distinction between pagecache
and anonymous pages: pagecache can be evicted and later retrieved from disk,
potentially becoming an mTHP upon retrieval, whereas anonymous pages must
always reside in memory or swapfile. If we swap in small folios and identify
adjacent memory suitable for swapping in as mTHP, those pages that have been
converted to small folios may never transition to mTHP. The process of
converting mTHP into small folios remains irreversible. This introduces
the risk of losing all mTHP through several swap-out and swap-in cycles,
let alone losing the benefits of defragmentation, improved compression
ratios, and reduced CPU usage based on mTHP compression/decompression.

Conversely, in deploying mTHP on millions of real-world products with this
feature in OPPO's out-of-tree code[3], we haven't observed any significant
increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64.

[1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@gmail.com/
[2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
[3] OnePlusOSS / android_kernel_oneplus_sm8550 
https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11

-v4:
 Many parts of v3 have been merged into the mm tree with review help
 from Ryan, David, Ying, Chris and others. Thank you very much!
 This is the final part, which allocates large folios and maps them.

 * Use Yosry's zswap_never_enabled(). Note that it has a bug; the fix is
   included in this v4 RFC, though it belongs in Yosry's patch
 * many code improvements (drop large stack, hold ptl etc) according
   to Yosry's and Ryan's feedback
 * rebased on top of the latest mm-unstable and utilized some new helpers
   introduced recently.

-v3:
 https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
 * avoid overwriting err in __swap_duplicate_nr, pointed out by Yosry,
   thanks!
 * fix the folio being charged twice in do_swap_page by separating
   alloc_anon_folio and alloc_swap_folio, as they now differ in
   several ways:
   * memcg charging
   * whether the allocated folio is cleared

-v2:
 https://lore.kernel.org/linux-mm/20240229003753.134193-1-21cnbao@gmail.com/
 * lots of code cleanup according to Chris's comments, thanks!
 * collect Chris's ack tags, thanks!
 * address David's comment on moving to use folio_add_new_anon_rmap
   for !folio_test_anon in do_swap_page, thanks!
 * remove the MADV_PAGEOUT patch from this series as Ryan will
   integrate it into the swap-out series
 * apply Kairui's work of "mm/swap: fix race when skipping swapcache"
   to large folio swap-in as well
 * fix data corruption (zero-filled data) in two races, one with zswap
   and one where some entries are in the swapcache while others are
   not, by checking SWAP_HAS_CACHE while swapping in a large folio
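
The SWAP_HAS_CACHE check in the last item can be pictured as an all-or-nothing claim over the slots of a large folio. The following is a simplified userspace model, not the kernel implementation: the function names are hypothetical, locking is ignored, and only the SWAP_HAS_CACHE flag value (0x40) is taken from the kernel's swap headers.

```c
#include <stdbool.h>
#include <stddef.h>

#define SWAP_HAS_CACHE 0x40  /* flag bit kept in each slot's swap_map byte */

/*
 * Hypothetical model of an all-or-nothing prepare: claim nr consecutive
 * slots only if every slot is in use and none is already marked
 * SWAP_HAS_CACHE. On failure nothing is modified, so the caller can fall
 * back to small-folio swap-in.
 */
static bool model_swapcache_prepare_nr(unsigned char *swap_map, size_t start, size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        unsigned char v = swap_map[start + i];
        if (v == 0 || (v & SWAP_HAS_CACHE))
            return false;
    }
    for (size_t i = 0; i < nr; i++)
        swap_map[start + i] |= SWAP_HAS_CACHE;
    return true;
}

/* Hypothetical counterpart: drop the claim on nr consecutive slots. */
static void model_swapcache_clear_nr(unsigned char *swap_map, size_t start, size_t nr)
{
    for (size_t i = 0; i < nr; i++)
        swap_map[start + i] &= (unsigned char)~SWAP_HAS_CACHE;
}
```

If any one of the slots is already in the swapcache, the model refuses the whole range, which mirrors the mixed case described above where only part of the entries are cached.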

-v1:
 https://lore.kernel.org/all/20240118111036.72641-1-21cnbao@gmail.com/#t

Barry Song (1):
  mm: swap: introduce swapcache_prepare_nr and swapcache_clear_nr for
    large folios swap-in

Chuanhua Han (1):
  mm: support large folios swapin as a whole for zRAM-like swapfile

 include/linux/swap.h  |   4 +-
 include/linux/zswap.h |   2 +-
 mm/memory.c           | 210 +++++++++++++++++++++++++++++++++++-------
 mm/swap.h             |   4 +-
 mm/swap_state.c       |   2 +-
 mm/swapfile.c         | 114 +++++++++++++----------
 6 files changed, 251 insertions(+), 85 deletions(-)

Comments

Huang, Ying July 3, 2024, 6:31 a.m. UTC | #1
Barry Song <21cnbao@gmail.com> writes:

> From: Barry Song <v-songbaohua@oppo.com>
>
> In an embedded system like Android, more than half of anonymous memory is
> actually stored in swap devices such as zRAM. For instance, when an app 
> is switched to the background, most of its memory might be swapped out.
>
> Currently, we have mTHP features, but unfortunately, without support
> for large folio swap-ins, once those large folios are swapped out,
> we lose them immediately because mTHP is a one-way ticket.

Not exactly a one-way ticket; we have (or will have) khugepaged.  But I
admit that it may not be good enough for you.

> This is unacceptable and reduces mTHP to merely a toy on systems
> with significant swap utilization.

That may be true on your systems, but maybe not on some other systems.

> This patch introduces mTHP swap-in support. For now, we limit mTHP
> swap-ins to contiguous swaps that were likely swapped out from mTHP as
> a whole.
>
> Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
> case. This is the simplest and most common use case, benefiting millions

I admit that Android is an important target platform of the Linux kernel.
But I would not claim that it's the MOST common ...

> of Android phones and similar devices with minimal implementation
> cost. In this straightforward scenario, large folios are always exclusive,
> eliminating the need to handle complex rmap and swapcache issues.
>
> It offers several benefits:
> 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
>    swap-out and swap-in.
> 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
>    without fragmentation. Based on the observed data [1] on Chris's and Ryan's
>    THP swap allocation optimization, aligned swap-in plays a crucial role
>    in the success of THP_SWPOUT.
> 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
>    and enhancing compression ratios significantly. We have another patchset
>    to enable mTHP compression and decompression in zsmalloc/zRAM[2].
>
> Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
> to be an optimal approach. There's a critical distinction between pagecache
> and anonymous pages: pagecache can be evicted and later retrieved from disk,
> potentially becoming an mTHP upon retrieval, whereas anonymous pages must
> always reside in memory or swapfile. If we swap in small folios and identify
> adjacent memory suitable for swapping in as mTHP, those pages that have been
> converted to small folios may never transition to mTHP. The process of
> converting mTHP into small folios remains irreversible. This introduces
> the risk of losing all mTHP through several swap-out and swap-in cycles,
> let alone losing the benefits of defragmentation, improved compression
> ratios, and reduced CPU usage based on mTHP compression/decompression.

I understand that the optimal policy for your use cases may be to always
swap in mTHP at the highest order.  But it may not be optimal for some
other use cases, for example with relatively slow swap devices, or when
non-fault sub-pages are swapped out again before being used.

So, IMO, the default policy should be the one that can adapt to the
requirements automatically.  For example, if most non-fault sub-pages
will be read/written before being swapped out again, we should swap-in
in larger order, otherwise in smaller order.  Swap readahead is one
possible way to do that.  But, I admit that this may not work perfectly
in your use cases.
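
One way to picture such an adaptive policy is a small heuristic over usage history. The sketch below is purely illustrative: the counters, the function name, the 50% threshold, and the optimistic choice when there is no history are all assumptions, and nothing like it is part of the posted series.

```c
/* Hypothetical counters; no such structure exists in the series. */
struct swapin_stats {
    unsigned long prefetched;  /* non-fault sub-pages brought in by large swap-ins */
    unsigned long used;        /* of those, sub-pages accessed before the next swap-out */
};

/*
 * Pick a swap-in order: use max_order when at least half of the
 * speculatively swapped-in sub-pages were used, otherwise fall back
 * to order 0. The 50% threshold is an arbitrary placeholder.
 */
static unsigned int pick_swapin_order(const struct swapin_stats *s, unsigned int max_order)
{
    if (s->prefetched == 0)
        return max_order;      /* no history yet: starting large is an assumption */
    if (s->used * 2 >= s->prefetched)
        return max_order;
    return 0;
}
```

A workload that touches most of the sub-pages it gets back keeps swapping in at the large order, while one that rarely touches them drops to single pages.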

Previously I hoped that we could start with this automatic policy that
helps everyone, then check whether it satisfies your requirements
before implementing the optimal policy for you.  But it appears that you
don't agree with this.

Based on the above, IMO, we should not use your policy as the default, at
least for now.  A user space interface can be implemented to select
among different swap-in order policies, similar to that of the mTHP
allocation order policy.  We need a separate policy because the
performance characteristics of memory allocation are quite different
from those of swap-in.  For example, reading from an SSD could be much
slower than memory allocation.  With policy selection, I think that we
can implement mTHP swap-in for non-SWAP_SYNCHRONOUS too.  Users need to
know what they are doing.

> Conversely, in deploying mTHP on millions of real-world products with this
> feature in OPPO's out-of-tree code[3], we haven't observed any significant
> increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64.
>
> [1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@gmail.com/
> [2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> [3] OnePlusOSS / android_kernel_oneplus_sm8550 
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
>

[snip]

--
Best Regards,
Huang, Ying

Barry Song July 3, 2024, 7:58 a.m. UTC | #2
On Wed, Jul 3, 2024 at 6:33 PM Huang, Ying <ying.huang@intel.com> wrote:
>

Ying, thanks!

> Barry Song <21cnbao@gmail.com> writes:
>
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > In an embedded system like Android, more than half of anonymous memory is
> > actually stored in swap devices such as zRAM. For instance, when an app
> > is switched to the background, most of its memory might be swapped out.
> >
> > Currently, we have mTHP features, but unfortunately, without support
> > for large folio swap-ins, once those large folios are swapped out,
> > we lose them immediately because mTHP is a one-way ticket.
>
> Not exactly a one-way ticket; we have (or will have) khugepaged.  But I
> admit that it may not be good enough for you.

That's right. From what I understand, khugepaged only supports PMD-sized
THP so far? Moreover, I have concerns that khugepaged might not be
suitable for all mTHPs, for the following reasons:

1. The lifecycle of an mTHP might not be that long. We pay the cost of
the collapse, but the folio could be swapped out right after that. We
expect THP to be durable and not become obsolete quickly, given the
significant cost we paid for it.

2. An mTHP's size might not be substantial enough to justify a collapse.
For example, if we can find an effective method, such as Yu's TAO or
others, we can achieve a high success rate in mTHP allocations at
minimal cost rather than depending on compaction/collapse.

3. It could be a significant challenge to manage the collapse (unmap and
map) process with respect to phone power consumption, considering that
the number of mTHPs could be much larger than the number of PMD-mapped
THPs. Collapses could happen quite often.

>
> > This is unacceptable and reduces mTHP to merely a toy on systems
> > with significant swap utilization.
>
> That may be true on your systems, but maybe not on some other systems.

I agree that this isn't a concern for systems without significant
swap-out and swap-in activity. However, on Android, where we frequently
switch between applications like YouTube, Chrome, Zoom, WeChat, Alipay,
TikTok, and others, swapping could occur throughout the day :-)

>
> > This patch introduces mTHP swap-in support. For now, we limit mTHP
> > swap-ins to contiguous swaps that were likely swapped out from mTHP as
> > a whole.
> >
> > Additionally, the current implementation only covers the SWAP_SYNCHRONOUS
> > case. This is the simplest and most common use case, benefiting millions
>
> I admit that Android is an important target platform of the Linux kernel.
> But I would not claim that it's the MOST common ...

Okay, I understand that there are still many embedded systems similar to
Android, even if they are not Android :-)

>
> > of Android phones and similar devices with minimal implementation
> > cost. In this straightforward scenario, large folios are always exclusive,
> > eliminating the need to handle complex rmap and swapcache issues.
> >
> > It offers several benefits:
> > 1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
> >    swap-out and swap-in.
> > 2. Eliminates fragmentation in swap slots and supports successful THP_SWPOUT
> >    without fragmentation. Based on the observed data [1] on Chris's and Ryan's
> >    THP swap allocation optimization, aligned swap-in plays a crucial role
> >    in the success of THP_SWPOUT.
> > 3. Enables zRAM/zsmalloc to compress and decompress mTHP, reducing CPU usage
> >    and enhancing compression ratios significantly. We have another patchset
> >    to enable mTHP compression and decompression in zsmalloc/zRAM[2].
> >
> > Using the readahead mechanism to decide whether to swap in mTHP doesn't seem
> > to be an optimal approach. There's a critical distinction between pagecache
> > and anonymous pages: pagecache can be evicted and later retrieved from disk,
> > potentially becoming an mTHP upon retrieval, whereas anonymous pages must
> > always reside in memory or swapfile. If we swap in small folios and identify
> > adjacent memory suitable for swapping in as mTHP, those pages that have been
> > converted to small folios may never transition to mTHP. The process of
> > converting mTHP into small folios remains irreversible. This introduces
> > the risk of losing all mTHP through several swap-out and swap-in cycles,
> > let alone losing the benefits of defragmentation, improved compression
> > ratios, and reduced CPU usage based on mTHP compression/decompression.
>
> I understand that the optimal policy for your use cases may be to always
> swap in mTHP at the highest order.  But it may not be optimal for some
> other use cases, for example with relatively slow swap devices, or when
> non-fault sub-pages are swapped out again before being used.
>
> So, IMO, the default policy should be the one that can adapt to the
> requirements automatically.  For example, if most non-fault sub-pages
> will be read/written before being swapped out again, we should swap-in
> in larger order, otherwise in smaller order.  Swap readahead is one
> possible way to do that.  But, I admit that this may not work perfectly
> in your use cases.
>
> Previously I hoped that we could start with this automatic policy that
> helps everyone, then check whether it satisfies your requirements
> before implementing the optimal policy for you.  But it appears that you
> don't agree with this.
>
> Based on the above, IMO, we should not use your policy as the default, at
> least for now.  A user space interface can be implemented to select
> among different swap-in order policies, similar to that of the mTHP
> allocation order policy.  We need a separate policy because the
> performance characteristics of memory allocation are quite different
> from those of swap-in.  For example, reading from an SSD could be much
> slower than memory allocation.  With policy selection, I think that we
> can implement mTHP swap-in for non-SWAP_SYNCHRONOUS too.  Users need to
> know what they are doing.

Agreed. Ryan also suggested something similar before.
Could we add this user policy via:

/sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled

which could be 0 or 1? I assume we don't need as many values as
"always inherit madvise never".

Do you have any suggestions regarding the user interface?

>
> > Conversely, in deploying mTHP on millions of real-world products with this
> > feature in OPPO's out-of-tree code[3], we haven't observed any significant
> > increase in memory footprint for 64KiB mTHP based on CONT-PTE on ARM64.
> >
> > [1] https://lore.kernel.org/linux-mm/20240622071231.576056-1-21cnbao@gmail.com/
> > [2] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> > [3] OnePlusOSS / android_kernel_oneplus_sm8550
> > https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/tree/oneplus/sm8550_u_14.0.0_oneplus11
> >
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry

Barry Song July 3, 2024, 8:32 a.m. UTC | #3
On Wed, Jul 3, 2024 at 7:58 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jul 3, 2024 at 6:33 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
>
> Ying, thanks!
>
> > Based on the above, IMO, we should not use your policy as the default, at
> > least for now.  A user space interface can be implemented to select
> > among different swap-in order policies, similar to that of the mTHP
> > allocation order policy.  We need a separate policy because the
> > performance characteristics of memory allocation are quite different
> > from those of swap-in.  For example, reading from an SSD could be much
> > slower than memory allocation.  With policy selection, I think that we
> > can implement mTHP swap-in for non-SWAP_SYNCHRONOUS too.  Users need to
> > know what they are doing.
>
> Agreed. Ryan also suggested something similar before.
> Could we add this user policy via:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-<size>/swapin_enabled
>
> which could be 0 or 1? I assume we don't need as many values as
> "always inherit madvise never".

I actually meant:

First, we respect the existing THP policy; then we apply swapin_enabled
after the allowable and suitable checks. Pseudocode:

        orders = thp_vma_allowable_orders(vma, vma->vm_flags,
                        TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);

        orders = thp_swapin_allowable_order(orders);
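
To make the last step of the pseudocode concrete, here is one possible shape for the filter, written as a standalone userspace sketch. The bitmap variable and the helper for picking the highest order are hypothetical, and the body given for thp_swapin_allowable_order() is only a guess at what the proposed function could do: it assumes the per-size swapin_enabled knobs are folded into a bitmap where bit N stands for order N.

```c
/*
 * Hypothetical bitmap of orders whose proposed per-size swapin_enabled
 * knob is set to 1; bit N stands for order N.
 */
static unsigned long swapin_enabled_orders;

/* Illustrative body: keep only the orders that are enabled for swap-in. */
static unsigned long thp_swapin_allowable_order(unsigned long orders)
{
    return orders & swapin_enabled_orders;
}

/* Highest order left in the bitmap, or -1 when no order is set. */
static int highest_order(unsigned long orders)
{
    int order = -1;

    while (orders) {
        order++;
        orders >>= 1;
    }
    return order;
}
```

For instance, with 4KiB base pages a 64KiB mTHP is order 4, so enabling only that size for swap-in would reduce any candidate bitmap to at most bit 4, and an empty result would mean falling back to a small folio.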

>
> Do you have any suggestions regarding the user interface?
>