
[v5,4/4] mm: Introduce per-thpsize swapin control policy

Message ID 20240726094618.401593-5-21cnbao@gmail.com (mailing list archive)
State New
Series mm: support mTHP swap-in for zRAM-like swapfile

Commit Message

Barry Song July 26, 2024, 9:46 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

Quoting Ying's comment:
A user space interface can be implemented to select different swap-in
order policies, similar to the mTHP allocation order policy. We need
a distinct policy because the performance characteristics of memory
allocation differ significantly from those of swap-in. For example,
SSD read speeds can be much slower than memory allocation. With
policy selection, I believe we can implement mTHP swap-in for
non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
the implications of their choices. I think that it's better to start
with at least "always" and "never". I believe that we will add "auto" in
the future to tune automatically, which can eventually be used as the default.

Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  6 +++
 include/linux/huge_mm.h                    |  1 +
 mm/huge_memory.c                           | 44 ++++++++++++++++++++++
 mm/memory.c                                |  3 +-
 4 files changed, 53 insertions(+), 1 deletion(-)

Comments

kernel test robot July 27, 2024, 5:58 a.m. UTC | #1
Hi Barry,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Barry-Song/mm-swap-introduce-swapcache_prepare_nr-and-swapcache_clear_nr-for-large-folios-swap-in/20240726-181412
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240726094618.401593-5-21cnbao%40gmail.com
patch subject: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy
config: x86_64-randconfig-121-20240727 (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/config)
compiler: gcc-11 (Ubuntu 11.4.0-4ubuntu1) 11.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407271351.ffZPMT6W-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> mm/huge_memory.c:83:15: sparse: sparse: symbol 'huge_anon_orders_swapin_always' was not declared. Should it be static?
   mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h):
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   mm/huge_memory.c:1867:20: sparse: sparse: context imbalance in 'madvise_free_huge_pmd' - unexpected unlock
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   mm/huge_memory.c:1905:28: sparse: sparse: context imbalance in 'zap_huge_pmd' - unexpected unlock
   mm/huge_memory.c:2016:28: sparse: sparse: context imbalance in 'move_huge_pmd' - unexpected unlock
   mm/huge_memory.c:2156:20: sparse: sparse: context imbalance in 'change_huge_pmd' - unexpected unlock
   mm/huge_memory.c:2306:12: sparse: sparse: context imbalance in '__pmd_trans_huge_lock' - wrong count at exit
   mm/huge_memory.c:2323:12: sparse: sparse: context imbalance in '__pud_trans_huge_lock' - wrong count at exit
   mm/huge_memory.c:2347:28: sparse: sparse: context imbalance in 'zap_huge_pud' - unexpected unlock
   mm/huge_memory.c:2426:18: sparse: sparse: context imbalance in '__split_huge_zero_page_pmd' - unexpected unlock
   mm/huge_memory.c:2640:18: sparse: sparse: context imbalance in '__split_huge_pmd_locked' - unexpected unlock
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h):
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   mm/huge_memory.c:3031:30: sparse: sparse: context imbalance in '__split_huge_page' - unexpected unlock
   mm/huge_memory.c:3306:17: sparse: sparse: context imbalance in 'split_huge_page_to_list_to_order' - different lock contexts for basic block
   mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h):
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h):
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false

vim +/huge_anon_orders_swapin_always +83 mm/huge_memory.c

    51	
    52	/*
    53	 * By default, transparent hugepage support is disabled in order to avoid
    54	 * risking an increased memory footprint for applications that are not
    55	 * guaranteed to benefit from it. When transparent hugepage support is
    56	 * enabled, it is for all mappings, and khugepaged scans all mappings.
    57	 * Defrag is invoked by khugepaged hugepage allocations and by page faults
    58	 * for all hugepage allocations.
    59	 */
    60	unsigned long transparent_hugepage_flags __read_mostly =
    61	#ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS
    62		(1<<TRANSPARENT_HUGEPAGE_FLAG)|
    63	#endif
    64	#ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
    65		(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
    66	#endif
    67		(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
    68		(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
    69		(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
    70	
    71	static struct shrinker *deferred_split_shrinker;
    72	static unsigned long deferred_split_count(struct shrinker *shrink,
    73						  struct shrink_control *sc);
    74	static unsigned long deferred_split_scan(struct shrinker *shrink,
    75						 struct shrink_control *sc);
    76	
    77	static atomic_t huge_zero_refcount;
    78	struct folio *huge_zero_folio __read_mostly;
    79	unsigned long huge_zero_pfn __read_mostly = ~0UL;
    80	unsigned long huge_anon_orders_always __read_mostly;
    81	unsigned long huge_anon_orders_madvise __read_mostly;
    82	unsigned long huge_anon_orders_inherit __read_mostly;
  > 83	unsigned long huge_anon_orders_swapin_always __read_mostly;
    84
Barry Song July 29, 2024, 1:37 a.m. UTC | #2
On Sat, Jul 27, 2024 at 5:58 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Barry,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on akpm-mm/mm-everything]

Hi, thanks!
Could you check whether the patch below fixes the problem?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41460847988c..06984a325af7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -80,7 +80,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL;
 unsigned long huge_anon_orders_always __read_mostly;
 unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
-unsigned long huge_anon_orders_swapin_always __read_mostly;
+static unsigned long huge_anon_orders_swapin_always __read_mostly;
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long vm_flags,
>
> [...]

Thanks
Barry
Matthew Wilcox (Oracle) July 29, 2024, 3:52 a.m. UTC | #3
On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
> A user space interface can be implemented to select different swap-in
> order policies, similar to the mTHP allocation order policy. We need
> a distinct policy because the performance characteristics of memory
> allocation differ significantly from those of swap-in. For example,
> SSD read speeds can be much slower than memory allocation. With
> policy selection, I believe we can implement mTHP swap-in for
> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
> the implications of their choices. I think that it's better to start
> with at least "always" and "never". I believe that we will add "auto" in
> the future to tune automatically, which can eventually be used as the default.

I strongly disagree.  Use the same sysctl as the other anonymous memory
allocations.
Barry Song July 29, 2024, 4:49 a.m. UTC | #4
On Mon, Jul 29, 2024 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
> > A user space interface can be implemented to select different swap-in
> > order policies, similar to the mTHP allocation order policy. We need
> > a distinct policy because the performance characteristics of memory
> > allocation differ significantly from those of swap-in. For example,
> > SSD read speeds can be much slower than memory allocation. With
> > policy selection, I believe we can implement mTHP swap-in for
> > non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
> > the implications of their choices. I think that it's better to start
> > with at least "always" and "never". I believe that we will add "auto" in
> > the future to tune automatically, which can eventually be used as the default.
>
> I strongly disagree.  Use the same sysctl as the other anonymous memory
> allocations.

In versions v1-v4, we used the same controls as anonymous memory allocations.
Ying expressed concerns that this approach isn't always ideal, especially for
non-zRAM devices, as SSD read speeds can be much slower than memory
allocation. I think his concern is reasonable to some extent.

However, this patchset only addresses scenarios involving zRAM-like devices
and will not impact SSDs. I would like to get Ying's feedback on whether
it's acceptable to drop this one in v6.

Thanks
Barry
Hellwig, Christoph July 29, 2024, 4:11 p.m. UTC | #5
On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote:
> I strongly disagree.  Use the same sysctl as the other anonymous memory
> allocations.

I agree with Matthew here.

We also really need to stop optimizing for this weird zram case and move
people to zswap instead after fixing the various issues.  A special
block device that isn't really a block device and needs various special
hooks isn't the right abstraction for different zswap strategies.
Barry Song July 29, 2024, 8:11 p.m. UTC | #6
On Tue, Jul 30, 2024 at 4:11 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote:
> > I strongly disagree.  Use the same sysctl as the other anonymous memory
> > allocations.
>
> I agree with Matthew here.

The whole anonymous memory allocation control is still honored here; this
is just an additional filter on top of it (allocation control & swap-in
policy). It was added primarily to address the SSD concern raised in the
comments on v4, not for zRAM.
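
Concretely, in the patch at the bottom of this page, the swap-in path
still goes through the same per-size order filter; the new knob only
masks orders out on top of it:

	/* mm/memory.c, alloc_swap_folio() */
	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
			TVA_IN_PF | TVA_IN_SWAPIN | TVA_ENFORCE_SYSFS,
			BIT(PMD_ORDER) - 1);

	/* mm/huge_memory.c, __thp_vma_allowable_orders() */
	if (in_swapin)
		orders &= READ_ONCE(huge_anon_orders_swapin_always);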

>
> We also really need to stop optimizing for this weird zram case and move
> people to zswap instead after fixing the various issues.  A special
> block device that isn't really a block device and needs various special
> hooks isn't the right abstraction for different zswap strategies.

My understanding is that zRAM is much more widely used in embedded
systems than zswap. I seldom (or never) hear of anyone using zswap
on Android. It seems pointless to force people to move to zswap; in
embedded systems we don't have a real backing block device behind zswap.

>
Chuanhua Han July 30, 2024, 2:27 a.m. UTC | #7
Christoph Hellwig <hch@infradead.org> wrote on Tue, Jul 30, 2024 at 00:11:
>
> On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote:
> > I strongly disagree.  Use the same sysctl as the other anonymous memory
> > allocations.
>
> I agree with Matthew here.
>
> We also really need to stop optimizing for this weird zram case and move
> people to zswap instead after fixing the various issues.  A special
> block device that isn't really a block device and needs various special
> > hooks isn't the right abstraction for different zswap strategies.
I disagree; zram is the most popular choice in embedded systems (like Android).
>
>
Ryan Roberts July 30, 2024, 8:36 a.m. UTC | #8
On 29/07/2024 04:52, Matthew Wilcox wrote:
> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
>> A user space interface can be implemented to select different swap-in
>> order policies, similar to the mTHP allocation order policy. We need
>> a distinct policy because the performance characteristics of memory
>> allocation differ significantly from those of swap-in. For example,
>> SSD read speeds can be much slower than memory allocation. With
>> policy selection, I believe we can implement mTHP swap-in for
>> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
>> the implications of their choices. I think that it's better to start
>> with at least "always" and "never". I believe that we will add "auto" in
>> the future to tune automatically, which can eventually be used as the default.
> 
> I strongly disagree.  Use the same sysctl as the other anonymous memory
> allocations.

I vaguely recall arguing in the past that just because the user has requested 2M
THP, that doesn't mean it's the right thing for performance to swap in the
whole 2M in one go. That's potentially a pretty huge latency, depending on where
the backend is, and it could be a waste of IO if the application never touches
most of the 2M. Although the fact that the application hinted for a 2M THP in
the first place hopefully means that they are storing objects that need to be
accessed at similar times. Today it will be swapped in page-by-page then
eventually collapsed by khugepaged.

But I think those arguments become weaker as the THP size gets smaller. 16K/64K
swap-in will likely yield significant performance improvements, and I think
Barry has numbers for this?

So I guess we have a few options:

 - Just use the same sysfs interface as for anon allocation, and see if anyone
reports performance regressions. Investigate one of the options below if an
issue is raised. That's the simplest and cleanest approach, I think.

 - New sysfs interface as Barry has implemented; nobody really wants more
controls if it can be helped.

 - Hardcode a size limit (e.g. 64K); I've tried this in a few different contexts
and never got any traction.

 - Secret option 4: Can we allocate a full-size folio but only choose to swap in
to it bit by bit? You would need a way to mark which pages of the folio are
valid (e.g. a per-page flag), but I guess that's a non-starter given the strategy to
remove per-page flags?

Thanks,
Ryan
David Hildenbrand July 30, 2024, 8:47 a.m. UTC | #9
On 30.07.24 10:36, Ryan Roberts wrote:
> On 29/07/2024 04:52, Matthew Wilcox wrote:
>> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
>>> A user space interface can be implemented to select different swap-in
>>> order policies, similar to the mTHP allocation order policy. We need
>>> a distinct policy because the performance characteristics of memory
>>> allocation differ significantly from those of swap-in. For example,
>>> SSD read speeds can be much slower than memory allocation. With
>>> policy selection, I believe we can implement mTHP swap-in for
>>> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
>>> the implications of their choices. I think that it's better to start
>>> with at least "always" and "never". I believe that we will add "auto" in
>>> the future to tune automatically, which can eventually be used as the default.
>>
>> I strongly disagree.  Use the same sysctl as the other anonymous memory
>> allocations.
> 
> I vaguely recall arguing in the past that just because the user has requested 2M
> THP, that doesn't mean it's the right thing for performance to swap in the
> whole 2M in one go. That's potentially a pretty huge latency, depending on where
> the backend is, and it could be a waste of IO if the application never touches
> most of the 2M. Although the fact that the application hinted for a 2M THP in
> the first place hopefully means that they are storing objects that need to be
> accessed at similar times. Today it will be swapped in page-by-page then
> eventually collapsed by khugepaged.
> 
> But I think those arguments become weaker as the THP size gets smaller. 16K/64K
> swap-in will likely yield significant performance improvements, and I think
> Barry has numbers for this?
> 
> So I guess we have a few options:
> 
>   - Just use the same sysfs interface as for anon allocation, and see if anyone
> reports performance regressions. Investigate one of the options below if an
> issue is raised. That's the simplest and cleanest approach, I think.
> 
>   - New sysfs interface as Barry has implemented; nobody really wants more
> controls if it can be helped.
> 
>   - Hardcode a size limit (e.g. 64K); I've tried this in a few different contexts
> and never got any traction.
> 
>   - Secret option 4: Can we allocate a full-size folio but only choose to swap in
> to it bit by bit? You would need a way to mark which pages of the folio are
> valid (e.g. a per-page flag), but I guess that's a non-starter given the strategy to
> remove per-page flags?

Maybe we could allocate a bitmap for folios in the swapcache to store
that information (folio->private).

But I am not convinced that is the right thing to do.

If we know some basic properties of the backend, can't we automatically 
make a pretty good decision regarding the folio size to use? E.g., slow 
disk, avoid 2M ...

Avoiding sysctls if possible here would really be preferable...
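
A minimal sketch of that bitmap idea, assuming folio->private is free
to carry it for swapcache folios (the swapin_bitmap_*() helpers are
hypothetical; only bitmap_zalloc(), folio_attach_private(),
folio_get_private() and the bit ops are existing kernel API):

	/* One bit per subpage, set once that subpage holds swapped-in data. */
	static int swapin_bitmap_attach(struct folio *folio)
	{
		unsigned long *valid = bitmap_zalloc(folio_nr_pages(folio),
						     GFP_KERNEL);

		if (!valid)
			return -ENOMEM;
		folio_attach_private(folio, valid);
		return 0;
	}

	static void swapin_bitmap_mark_valid(struct folio *folio, unsigned long idx)
	{
		unsigned long *valid = folio_get_private(folio);

		set_bit(idx, valid);
	}

	static bool swapin_bitmap_is_valid(struct folio *folio, unsigned long idx)
	{
		unsigned long *valid = folio_get_private(folio);

		return test_bit(idx, valid);
	}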
Hellwig, Christoph July 30, 2024, 4:30 p.m. UTC | #10
On Tue, Jul 30, 2024 at 08:11:16AM +1200, Barry Song wrote:
> > We also really need to stop optimizing for this weird zram case and move
> > people to zswap instead after fixing the various issues.  A special
> > block device that isn't really a block device and needs various special
> > hooks isn't the right abstraction for different zswap strategies.
> 
> My understanding is that zRAM is much more widely used in embedded
> systems than zswap. I seldom (or never) hear of anyone using zswap
> on Android. It seems pointless to force people to move to zswap; in
> embedded systems we don't have a real backing block device behind zswap.

Well, that is the point.  zram is a horrible hack that abuses a block
device to implement a feature missing from the VM layer.  Right now people
have a reason for it because zswap requires a "real" backing device
and that's fine for them and for now.  But instead of building VM
infrastructure around these kinds of hacks we need to fix the VM
infrastructure.  Chris Li has been talking about and working towards
a proper swap abstraction and that needs to happen.
Nhat Pham July 30, 2024, 7:28 p.m. UTC | #11
On Tue, Jul 30, 2024 at 9:30 AM Christoph Hellwig <hch@infradead.org> wrote:
>
>
> Well, that is the point.  zram is a horrible hack that abuses a block
> device to implement a feature missing from the VM layer.  Right now people
> have a reason for it because zswap requires a "real" backing device
> and that's fine for them and for now.  But instead of building VM

I completely agree with this assessment.

> infrastructure around these kinds of hacks we need to fix the VM
> infrastructure.  Chris Li has been talking about and working towards
> a proper swap abstraction and that needs to happen.

I'm also working towards something along these lines. My design would
add a "virtual" swap ID that will be stored in the page table, and can
refer to either a real, storage-backed swap entry, or a zswap entry.
zswap can then exist without any backing swap device.

There are several additional benefits of this approach:

1. We can optimize swapoff as well - the page table can still refer to
the swap ID, but the ID now points to a physical page frame. swapoff
code just needs to sever the link from the swap ID to the physical
swap entry (which either just requires a swap ID mapping walk, or even
faster if we have a reverse mapping mechanism), and update the link to
the page frame instead.

2. We can take this opportunity to clean up the swap count code.

I'd be happy to collaborate/compare notes :)
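
A purely illustrative sketch of that indirection (every name below is
made up; none of this is existing kernel API):

	/* What a page-table swap entry would reference instead of a slot. */
	struct vswap_desc {
		enum {
			VSWAP_SLOT,	/* slot on a real, storage-backed swap device */
			VSWAP_ZSWAP,	/* zswap entry, no backing device needed */
			VSWAP_FOLIO,	/* re-linked to a page frame by swapoff */
		} type;
		union {
			swp_entry_t slot;
			struct zswap_entry *zentry;
			struct folio *folio;
		};
		atomic_t refcount;	/* could subsume today's swap count code */
	};

With this, swapoff only has to switch a descriptor from VSWAP_SLOT to
VSWAP_FOLIO; the page tables keep referring to the same swap ID.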
Barry Song July 30, 2024, 9:06 p.m. UTC | #12
On Wed, Jul 31, 2024 at 7:28 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> [...]
>
> I'd be happy to collaborate/compare notes :)

I appreciate that you have a good plan, and I welcome the improvements in zswap.
However, we need to face reality. Having a good plan doesn't mean we
should wait for it before proceeding.

In my experience, I've never heard of anyone using zswap in an embedded
system, especially among the billions of Android devices. (Correct me if
you know of one.) How soon do you expect embedded systems and Android to
adopt zswap? In one year, two years, five years, or ten years? Have you
asked whether Google plans to use zswap in Android?

Currently, zswap does not support large folios, which is why Yosry has
introduced an API like zswap_never_enabled() to allow others to explore
parallel options like mTHP swap. Meanwhile, if zswap encounters large
folios, it will trigger a SIGBUS error. I believe you were involved in
those discussions:

mm: zswap: add zswap_never_enabled()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d4d2b1cfb85cc07f6
mm: zswap: handle incorrect attempts to load large folios
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c63f210d4891f5b1

Should everyone around the world hold off on working on mTHP swap until
zswap has added support for large folios? Not to mention whether
people are ready and happy to switch to zswap.

I don't see any reason why we should wait and not start implementing something
that could benefit billions of devices worldwide. Parallel exploration leads to
human progress in different fields. That's why I believe Yosry's patch, which
allows others to move forward, is a more considerate approach.

Thanks
Barry
Nhat Pham July 31, 2024, 6:35 p.m. UTC | #13
On Tue, Jul 30, 2024 at 2:06 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > I'd be happy to collaborate/compare notes :)
>
> I appreciate that you have a good plan, and I welcome the improvements in zswap.
> However, we need to face reality. Having a good plan doesn't mean we
> should wait for it before proceeding.
>
> In my experience, I've never heard of anyone using zswap in an embedded
> system, especially among the billions of Android devices. (Correct me if
> you know of one.) How soon do you expect embedded systems and Android to
> adopt zswap? In one year, two years, five years, or ten years? Have you
> asked whether Google plans to use zswap in Android?

Well, no one uses zswap in an embedded environment precisely because
of the aforementioned issues, which we are working to resolve :)

>
> Currently, zswap does not support large folios, which is why Yosry has
> introduced an API like zswap_never_enabled() to allow others to explore
> parallel options like mTHP swap. Meanwhile, if zswap encounters large
> folios, it will trigger a SIGBUS error. I believe you were involved in
> those discussions:
>
> mm: zswap: add zswap_never_enabled()
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d4d2b1cfb85cc07f6
> mm: zswap: handle incorrect attempts to load large folios
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c63f210d4891f5b1
>

I am, and for the record I reviewed and/or ack-ed all of these
patches, and provided my input on how to move forward with zswap's
support for large folios. I do not want zswap to prevent the
development of the rest of the swap ecosystem.

> Should everyone around the world hold off on working on mTHP swap until
> zswap has added support for large folios? Not to mention whether
> people are ready and happy to switch to zswap.
>

I think you misunderstood my intention. For the record, I'm not trying
to stop you from improving zram, and I'm not proposing that we kill
zram right away. Well, at least not until zswap reaches feature parity
with zram, which, as you point out, will take awhile. Both support for
large folios and swap/zswap decoupling are on our agenda, and you're
welcome to participate in the discussion - for what it's worth, your
attempt with zram (+zstd) is the basis/proof-of-concept for our future
efforts :)

That said, I believe that there is a fundamental redundancy here,
which we (zram and zswap developers) should resolve at some point by
unifying the two memory compression systems. The sooner we can unify
these two, the less effort we will have to spend on developing and
maintaining two separate mechanisms for the same (or very similar)
purpose. For instance, large folio support has to be done twice. The same
goes for writeback/offloading to backing storage, etc. And I
(admittedly with a bias) agree with Christoph that zswap is the way
to go moving forward.

I will not address the rest; it seems like there isn't anything to
disagree with or discuss down there :)
Sergey Senozhatsky Aug. 1, 2024, 3 a.m. UTC | #14
On (24/07/31 11:35), Nhat Pham wrote:
>
> I'm not proposing that we kill zram right away.
>

Just for the record, zram is a generic block device and has
use-cases outside of swap. Just mkfs on /dev/zram0, mount it
and do whatever. The "kill zram" thing is not going to fly.
Chris Li Aug. 1, 2024, 8:55 p.m. UTC | #15
On Tue, Jul 30, 2024 at 9:30 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Jul 30, 2024 at 08:11:16AM +1200, Barry Song wrote:
> > > We also really need to stop optimizing for this weird zram case and move
> > > people to zswap instead after fixing the various issues.  A special
> > > block device that isn't really a block device and needs various special
> > > hooks isn't the right abstraction for different zswap strategies.
> >
> > My understanding is that zRAM is much more widely used in embedded
> > systems than zswap. I seldom (or never) hear of anyone using zswap
> > on Android. It seems pointless to force people to move to zswap; in
> > embedded systems we don't have a real backing block device behind zswap.
>
> Well, that is the point.  zram is a horrible hack that abuses a block
> device to implement a feature missing from the VM layer.  Right now people
> have a reason for it because zswap requires a "real" backing device
> and that's fine for them and for now.  But instead of building VM
> infrastructure around these kinds of hacks we need to fix the VM
> infrastructure.  Chris Li has been talking about and working towards
> a proper swap abstraction and that needs to happen.

Yes, I have been working on the swap allocator for the mTHP use case;
I haven't gotten to the zswap vs. zram question yet.
Currently there is a feature gap between zswap and zram, so zswap
doesn't do all the stuff zram does. For the zswap "real" backend
issue, Google has been using the ghost swapfile for many years. That
can be one way to get around it. The patch is much smaller than
overhauling the swap backend abstraction.

Currently Android uses zram and it needs to be the Android team's
decision to move from zram to something else. I don't see that
happening any time soon. There are practical limitations.
Personally, I have been using zram to provide a block-like device as
my go-to route for testing the swap stack. I still do an SSD swap
test, but at the same time I want to reduce the SSD swap usage to
avoid wear on my SSD drive. I already destroyed two of my old HDD
drives during swap testing. Swap's random seeks are very unfriendly
to HDDs; I'm not sure who is still using an HDD for swap any more.

Anyway, removing zram has never been a goal of the swap abstraction,
because I am still using it. We can start by reducing the feature gap
between zswap and zram. At the end of the day, it is the Android
team's call whether to use zram or not.

Chris
Huang, Ying Aug. 5, 2024, 6:10 a.m. UTC | #16
Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
>> A user space interface can be implemented to select different swap-in
>> order policies, similar to the mTHP allocation order policy. We need
>> a distinct policy because the performance characteristics of memory
>> allocation differ significantly from those of swap-in. For example,
>> SSD read speeds can be much slower than memory allocation. With
>> policy selection, I believe we can implement mTHP swap-in for
>> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
>> the implications of their choices. I think that it's better to start
>> with at least "always" and "never". I believe that we will add "auto" in
>> the future to tune automatically, which can eventually be used as the default.
>
> I strongly disagree.  Use the same sysctl as the other anonymous memory
> allocations.

I still believe we have some reasons for this tunable.

1. As Ryan pointed out in [1], swap-in with large mTHP orders may cause
   long latency, which some users might want to avoid.

[1] https://lore.kernel.org/lkml/f0c7f061-6284-4fe5-8cbf-93281070895b@arm.com/

2. We have readahead information available for swap-in, which is
   unavailable for anonymous page allocation.  This enables us to build
   an automatic swap-in order policy similar to the readahead-based
   policy used to pick page cache folio orders.

3. Swap-out/swap-in cycles present an opportunity to identify hot pages.
   In many use cases, we can utilize mTHP for hot pages and order-0 pages
   for cold pages, especially under memory pressure.  When an mTHP has
   been swapped out, it indicates that it could be a cold page.
   Converting it to order-0 pages might be a beneficial policy.
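
A hypothetical sketch of the "auto" policy in point 2, deriving the
swap-in order from readahead feedback (swapin_ra_hits() is an invented
helper; only ilog2() and clamp() are existing kernel API):

	static int swapin_auto_order(struct vm_fault *vmf, int max_order)
	{
		/* hypothetical: number of recent swap readahead hits */
		unsigned int hits = swapin_ra_hits(vmf);
		/* grow the order while readahead keeps proving useful */
		int order = hits ? ilog2(hits) : 0;

		return clamp(order, 0, max_order);
	}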

--
Best Regards,
Huang, Ying
Hellwig, Christoph Aug. 12, 2024, 8:27 a.m. UTC | #17
On Thu, Aug 01, 2024 at 01:55:51PM -0700, Chris Li wrote:
> Currently Android uses zram and it needs to be the Android team's
> decision to move from zram to something else. I don't see that
> happening any time soon. There are practical limitations.

No one can tell anyone to stop using things.  But we can stop adding
new hacks for this, and especially user-facing controls.
Barry Song Aug. 12, 2024, 8:44 a.m. UTC | #18
On Mon, Aug 12, 2024 at 8:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Aug 01, 2024 at 01:55:51PM -0700, Chris Li wrote:
> > Currently Android uses zram and it needs to be the Android team's
> > decision to move from zram to something else. I don't see that
> > happening any time soon. There are practical limitations.
>
> No one can tell anyone to stop using things.  But we can stop adding
> new hacks for this, and especially user-facing controls.

Well, this user-facing control has absolutely nothing to do with zram-related
hacks.  It's meant to address a general issue, mainly concerning slow
swap devices like SSDs, as suggested in Ying's comment on v4.

Thanks
Barry

Patch

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 058485daf186..2e94e956ee12 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -144,6 +144,12 @@  hugepage sizes have enabled="never". If enabling multiple hugepage
 sizes, the kernel will select the most appropriate enabled size for a
 given allocation.
 
+Transparent Hugepage Swap-in for anonymous memory can be disabled or enabled
+by per-supported-THP-size with one of::
+
+	echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/swapin_enabled
+	echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/swapin_enabled
+
 It's also possible to limit defrag efforts in the VM to generate
 anonymous hugepages in case they're not immediately free to madvise
 regions or to never try to defrag memory and simply fallback to regular
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e25d9ebfdf89..25174305b17f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,6 +92,7 @@  extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define TVA_SMAPS		(1 << 0)	/* Will be used for procfs */
 #define TVA_IN_PF		(1 << 1)	/* Page fault handler */
 #define TVA_ENFORCE_SYSFS	(1 << 2)	/* Obey sysfs configuration */
+#define TVA_IN_SWAPIN          (1 << 3)        /* Do swap-in */
 
 #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0167dc27e365..41460847988c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -80,6 +80,7 @@  unsigned long huge_zero_pfn __read_mostly = ~0UL;
 unsigned long huge_anon_orders_always __read_mostly;
 unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
+unsigned long huge_anon_orders_swapin_always __read_mostly;
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long vm_flags,
@@ -88,6 +89,7 @@  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 {
 	bool smaps = tva_flags & TVA_SMAPS;
 	bool in_pf = tva_flags & TVA_IN_PF;
+	bool in_swapin = tva_flags & TVA_IN_SWAPIN;
 	bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
 	unsigned long supported_orders;
 
@@ -100,6 +102,8 @@  unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 		supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;
 
 	orders &= supported_orders;
+	if (in_swapin)
+		orders &= READ_ONCE(huge_anon_orders_swapin_always);
 	if (!orders)
 		return 0;
 
@@ -523,8 +527,48 @@  static ssize_t thpsize_enabled_store(struct kobject *kobj,
 static struct kobj_attribute thpsize_enabled_attr =
 	__ATTR(enabled, 0644, thpsize_enabled_show, thpsize_enabled_store);
 
+static DEFINE_SPINLOCK(huge_anon_orders_swapin_lock);
+
+static ssize_t thpsize_swapin_enabled_show(struct kobject *kobj,
+				    struct kobj_attribute *attr, char *buf)
+{
+	int order = to_thpsize(kobj)->order;
+	const char *output;
+
+	if (test_bit(order, &huge_anon_orders_swapin_always))
+		output = "[always] never";
+	else
+		output = "always [never]";
+
+	return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t thpsize_swapin_enabled_store(struct kobject *kobj,
+				     struct kobj_attribute *attr,
+				     const char *buf, size_t count)
+{
+	int order = to_thpsize(kobj)->order;
+	ssize_t ret = count;
+
+	if (sysfs_streq(buf, "always")) {
+		spin_lock(&huge_anon_orders_swapin_lock);
+		set_bit(order, &huge_anon_orders_swapin_always);
+		spin_unlock(&huge_anon_orders_swapin_lock);
+	} else if (sysfs_streq(buf, "never")) {
+		spin_lock(&huge_anon_orders_swapin_lock);
+		clear_bit(order, &huge_anon_orders_swapin_always);
+		spin_unlock(&huge_anon_orders_swapin_lock);
+	} else
+		ret = -EINVAL;
+
+	return ret;
+}
+static struct kobj_attribute thpsize_swapin_enabled_attr =
+	__ATTR(swapin_enabled, 0644, thpsize_swapin_enabled_show, thpsize_swapin_enabled_store);
+
 static struct attribute *thpsize_attrs[] = {
 	&thpsize_enabled_attr.attr,
+	&thpsize_swapin_enabled_attr.attr,
 #ifdef CONFIG_SHMEM
 	&thpsize_shmem_enabled_attr.attr,
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index 14048e9285d4..27c77f739a2c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4091,7 +4091,8 @@  static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	 * and suitable for swapping THP.
 	 */
 	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
-			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+			TVA_IN_PF | TVA_IN_SWAPIN | TVA_ENFORCE_SYSFS,
+			BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 	orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address, orders);