Message ID | 20240726094618.401593-5-21cnbao@gmail.com (mailing list archive)
---|---
State | New
Series | mm: support mTHP swap-in for zRAM-like swapfile
Hi Barry,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Barry-Song/mm-swap-introduce-swapcache_prepare_nr-and-swapcache_clear_nr-for-large-folios-swap-in/20240726-181412
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240726094618.401593-5-21cnbao%40gmail.com
patch subject: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy
config: x86_64-randconfig-121-20240727 (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/config)
compiler: gcc-11 (Ubuntu 11.4.0-4ubuntu1) 11.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407271351.ffZPMT6W-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> mm/huge_memory.c:83:15: sparse: sparse: symbol 'huge_anon_orders_swapin_always' was not declared. Should it be static?
   mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h):
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   mm/huge_memory.c:1867:20: sparse: sparse: context imbalance in 'madvise_free_huge_pmd' - unexpected unlock
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   mm/huge_memory.c:1905:28: sparse: sparse: context imbalance in 'zap_huge_pmd' - unexpected unlock
   mm/huge_memory.c:2016:28: sparse: sparse: context imbalance in 'move_huge_pmd' - unexpected unlock
   mm/huge_memory.c:2156:20: sparse: sparse: context imbalance in 'change_huge_pmd' - unexpected unlock
   mm/huge_memory.c:2306:12: sparse: sparse: context imbalance in '__pmd_trans_huge_lock' - wrong count at exit
   mm/huge_memory.c:2323:12: sparse: sparse: context imbalance in '__pud_trans_huge_lock' - wrong count at exit
   mm/huge_memory.c:2347:28: sparse: sparse: context imbalance in 'zap_huge_pud' - unexpected unlock
   mm/huge_memory.c:2426:18: sparse: sparse: context imbalance in '__split_huge_zero_page_pmd' - unexpected unlock
   mm/huge_memory.c:2640:18: sparse: sparse: context imbalance in '__split_huge_pmd_locked' - unexpected unlock
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h):
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   mm/huge_memory.c:3031:30: sparse: sparse: context imbalance in '__split_huge_page' - unexpected unlock
   mm/huge_memory.c:3306:17: sparse: sparse: context imbalance in 'split_huge_page_to_list_to_order' - different lock contexts for basic block
   mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h):
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   mm/huge_memory.c: note: in included file (through include/linux/smp.h, include/linux/lockdep.h, include/linux/spinlock.h, ...):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   mm/huge_memory.c: note: in included file (through include/linux/mmzone.h, include/linux/gfp.h, include/linux/mm.h):
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false
   include/linux/page-flags.h:235:46: sparse: sparse: self-comparison always evaluates to false

vim +/huge_anon_orders_swapin_always +83 mm/huge_memory.c

    51	
    52	/*
    53	 * By default, transparent hugepage support is disabled in order to avoid
    54	 * risking an increased memory footprint for applications that are not
    55	 * guaranteed to benefit from it. When transparent hugepage support is
    56	 * enabled, it is for all mappings, and khugepaged scans all mappings.
    57	 * Defrag is invoked by khugepaged hugepage allocations and by page faults
    58	 * for all hugepage allocations.
    59	 */
    60	unsigned long transparent_hugepage_flags __read_mostly =
    61	#ifdef CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS
    62		(1<<TRANSPARENT_HUGEPAGE_FLAG)|
    63	#endif
    64	#ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
    65		(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
    66	#endif
    67		(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
    68	​	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
    69		(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
    70	
    71	static struct shrinker *deferred_split_shrinker;
    72	static unsigned long deferred_split_count(struct shrinker *shrink,
    73						  struct shrink_control *sc);
    74	static unsigned long deferred_split_scan(struct shrinker *shrink,
    75						 struct shrink_control *sc);
    76	
    77	static atomic_t huge_zero_refcount;
    78	struct folio *huge_zero_folio __read_mostly;
    79	unsigned long huge_zero_pfn __read_mostly = ~0UL;
    80	unsigned long huge_anon_orders_always __read_mostly;
    81	unsigned long huge_anon_orders_madvise __read_mostly;
    82	unsigned long huge_anon_orders_inherit __read_mostly;
  > 83	unsigned long huge_anon_orders_swapin_always __read_mostly;
    84	
On Sat, Jul 27, 2024 at 5:58 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Barry,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on akpm-mm/mm-everything]

Hi,

Thanks! Would you check if the below patch fixes the problem?

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 41460847988c..06984a325af7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -80,7 +80,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL;
 unsigned long huge_anon_orders_always __read_mostly;
 unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
-unsigned long huge_anon_orders_swapin_always __read_mostly;
+static unsigned long huge_anon_orders_swapin_always __read_mostly;
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long vm_flags,

> url:    https://github.com/intel-lab-lkp/linux/commits/Barry-Song/mm-swap-introduce-swapcache_prepare_nr-and-swapcache_clear_nr-for-large-folios-swap-in/20240726-181412
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20240726094618.401593-5-21cnbao%40gmail.com
> patch subject: [PATCH v5 4/4] mm: Introduce per-thpsize swapin control policy
> config: x86_64-randconfig-121-20240727 (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/config)
> compiler: gcc-11 (Ubuntu 11.4.0-4ubuntu1) 11.4.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240727/202407271351.ffZPMT6W-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202407271351.ffZPMT6W-lkp@intel.com/
>
> sparse warnings: (new ones prefixed by >>)
> >> mm/huge_memory.c:83:15: sparse: sparse: symbol 'huge_anon_orders_swapin_always' was not declared. Should it be static?
>
> [...]
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki

Thanks
Barry
On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
> A user space interface can be implemented to select different swap-in
> order policies, similar to the mTHP allocation order policy. We need
> a distinct policy because the performance characteristics of memory
> allocation differ significantly from those of swap-in. For example,
> SSD read speeds can be much slower than memory allocation. With
> policy selection, I believe we can implement mTHP swap-in for
> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
> the implications of their choices. I think that it's better to start
> with at least always never. I believe that we will add auto in the
> future to tune automatically, which can be used as default finally.

I strongly disagree. Use the same sysctl as the other anonymous memory
allocations.
On Mon, Jul 29, 2024 at 3:52 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
> > A user space interface can be implemented to select different swap-in
> > order policies, similar to the mTHP allocation order policy. We need
> > a distinct policy because the performance characteristics of memory
> > allocation differ significantly from those of swap-in. For example,
> > SSD read speeds can be much slower than memory allocation. With
> > policy selection, I believe we can implement mTHP swap-in for
> > non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
> > the implications of their choices. I think that it's better to start
> > with at least always never. I believe that we will add auto in the
> > future to tune automatically, which can be used as default finally.
>
> I strongly disagree. Use the same sysctl as the other anonymous memory
> allocations.

In versions v1-v4, we used the same controls as anonymous memory
allocations. Ying expressed concerns that this approach isn't always
ideal, especially for non-zRAM devices, as SSD read speeds can be much
slower than memory allocation. I think his concern is reasonable to
some extent.

However, this patchset only addresses scenarios involving zRAM-like
devices and will not impact SSDs. I would like to get Ying's feedback
on whether it's acceptable to drop this one in v6.

Thanks
Barry
On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote:
> I strongly disagree. Use the same sysctl as the other anonymous memory
> allocations.

I agree with Matthew here.

We also really need to stop optimizing for this weird zram case and move
people to zswap instead after fixing the various issues. A special
block device that isn't really a block device and needs various special
hooks isn't the right abstraction for different zswap strategies.
On Tue, Jul 30, 2024 at 4:11 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote:
> > I strongly disagree. Use the same sysctl as the other anonymous memory
> > allocations.
>
> I agree with Matthew here.

The whole anonymous memory allocation control is still used here; this
is just an addition on top of it: anonymous memory allocation control &
swap-in policy. It primarily addresses the SSD concern raised in the
comments on v4, not zRAM.

> We also really need to stop optimizing for this weird zram case and move
> people to zswap instead after fixing the various issues. A special
> block device that isn't really a block device and needs various special
> hooks isn't the right abstraction for different zswap strategies.

My understanding is that zRAM is much more popular in embedded systems
than zswap. I seldom (or never) hear of anyone using zswap in Android.
It seems pointless to force people to move to zswap; in embedded systems
we don't have a real backing block device behind zswap.
Christoph Hellwig <hch@infradead.org> wrote on Tue, Jul 30, 2024 at 00:11:
>
> On Mon, Jul 29, 2024 at 04:52:30AM +0100, Matthew Wilcox wrote:
> > I strongly disagree. Use the same sysctl as the other anonymous memory
> > allocations.
>
> I agree with Matthew here.
>
> We also really need to stop optimizing for this weird zram case and move
> people to zswap instead after fixing the various issues. A special
> block device that isn't really a block device and needs various special
> hooks isn't the right abstraction for different zswap strategies.

I disagree; zram is most popular in embedded systems (like Android).
On 29/07/2024 04:52, Matthew Wilcox wrote:
> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
>> A user space interface can be implemented to select different swap-in
>> order policies, similar to the mTHP allocation order policy. We need
>> a distinct policy because the performance characteristics of memory
>> allocation differ significantly from those of swap-in. For example,
>> SSD read speeds can be much slower than memory allocation. With
>> policy selection, I believe we can implement mTHP swap-in for
>> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
>> the implications of their choices. I think that it's better to start
>> with at least always never. I believe that we will add auto in the
>> future to tune automatically, which can be used as default finally.
>
> I strongly disagree. Use the same sysctl as the other anonymous memory
> allocations.

I vaguely recall arguing in the past that just because the user has
requested 2M THP, that doesn't mean it's the right thing for performance
to swap in the whole 2M in one go. That's potentially a pretty huge
latency, depending on where the backend is, and it could be a waste of
IO if the application never touches most of the 2M. Although the fact
that the application hinted for a 2M THP in the first place hopefully
means that it is storing objects that need to be accessed at similar
times. Today it will be swapped in page-by-page then eventually
collapsed by khugepaged.

But I think those arguments become weaker as the THP size gets smaller.
16K/64K swap-in will likely yield significant performance improvements,
and I think Barry has numbers for this?

So I guess we have a few options:

- Just use the same sysfs interface as for anon allocation, and see if
  anyone reports performance regressions. Investigate one of the options
  below if an issue is raised. That's the simplest and cleanest approach,
  I think.

- New sysfs interface as Barry has implemented; nobody really wants more
  controls if it can be helped.

- Hardcode a size limit (e.g. 64K); I've tried this in a few different
  contexts and never got any traction.

- Secret option 4: can we allocate a full-size folio but only choose to
  swap in to it bit-by-bit? You would need a way to mark which pages of
  the folio are valid (e.g. a per-page flag), but I guess that's a
  non-starter given the strategy to remove per-page flags?

Thanks,
Ryan
On 30.07.24 10:36, Ryan Roberts wrote:
> On 29/07/2024 04:52, Matthew Wilcox wrote:
>> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
>>> A user space interface can be implemented to select different swap-in
>>> order policies, similar to the mTHP allocation order policy. We need
>>> a distinct policy because the performance characteristics of memory
>>> allocation differ significantly from those of swap-in. For example,
>>> SSD read speeds can be much slower than memory allocation. With
>>> policy selection, I believe we can implement mTHP swap-in for
>>> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
>>> the implications of their choices. I think that it's better to start
>>> with at least always never. I believe that we will add auto in the
>>> future to tune automatically, which can be used as default finally.
>>
>> I strongly disagree. Use the same sysctl as the other anonymous memory
>> allocations.
>
> I vaguely recall arguing in the past that just because the user has
> requested 2M THP, that doesn't mean it's the right thing for performance
> to swap in the whole 2M in one go. That's potentially a pretty huge
> latency, depending on where the backend is, and it could be a waste of
> IO if the application never touches most of the 2M. Although the fact
> that the application hinted for a 2M THP in the first place hopefully
> means that it is storing objects that need to be accessed at similar
> times. Today it will be swapped in page-by-page then eventually
> collapsed by khugepaged.
>
> But I think those arguments become weaker as the THP size gets smaller.
> 16K/64K swap-in will likely yield significant performance improvements,
> and I think Barry has numbers for this?
>
> So I guess we have a few options:
>
> - Just use the same sysfs interface as for anon allocation, and see if
>   anyone reports performance regressions. Investigate one of the options
>   below if an issue is raised. That's the simplest and cleanest approach,
>   I think.
>
> - New sysfs interface as Barry has implemented; nobody really wants more
>   controls if it can be helped.
>
> - Hardcode a size limit (e.g. 64K); I've tried this in a few different
>   contexts and never got any traction.
>
> - Secret option 4: can we allocate a full-size folio but only choose to
>   swap in to it bit-by-bit? You would need a way to mark which pages of
>   the folio are valid (e.g. a per-page flag), but I guess that's a
>   non-starter given the strategy to remove per-page flags?

Maybe we could allocate a bitmap for folios in the swapcache to store
that information (folio->private). But I am not convinced that is the
right thing to do.

If we know some basic properties of the backend, can't we automatically
make a pretty good decision regarding the folio size to use? E.g., slow
disk, avoid 2M ...

Avoiding sysctls if possible here would really be preferable...
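[To make the folio->private bitmap idea above concrete, here is a minimal
C sketch. folio_attach_private()/folio_get_private() and bitmap_zalloc()
are existing kernel helpers; the swapin_* function names are hypothetical,
purely illustrative, and not from any posted patch.]

/*
 * Sketch only: track which subpages of a large swapcache folio already
 * contain valid swapped-in data, via a bitmap hung off folio->private.
 * Real usage would also need folio_detach_private()/bitmap_free() on
 * the teardown path.
 */
#include <linux/bitmap.h>
#include <linux/pagemap.h>

static int swapin_valid_bitmap_alloc(struct folio *folio)
{
        unsigned long *valid;

        /* one bit per subpage of the large folio */
        valid = bitmap_zalloc(folio_nr_pages(folio), GFP_KERNEL);
        if (!valid)
                return -ENOMEM;
        folio_attach_private(folio, valid);
        return 0;
}

/* mark subpage @idx as populated once its read completes */
static void swapin_mark_subpage_valid(struct folio *folio, unsigned int idx)
{
        set_bit(idx, folio_get_private(folio));
}

/* fault path: can subpage @idx be mapped, or must it be read in first? */
static bool swapin_subpage_valid(struct folio *folio, unsigned int idx)
{
        return test_bit(idx, folio_get_private(folio));
}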
On Tue, Jul 30, 2024 at 08:11:16AM +1200, Barry Song wrote:
> > We also really need to stop optimizing for this weird zram case and move
> > people to zswap instead after fixing the various issues. A special
> > block device that isn't really a block device and needs various special
> > hooks isn't the right abstraction for different zswap strategies.
>
> My understanding is that zRAM is much more popular in embedded systems
> than zswap. I seldom (or never) hear of anyone using zswap in Android.
> It seems pointless to force people to move to zswap; in embedded systems
> we don't have a real backing block device behind zswap.

Well, that is the point. zram is a horrible hack that abuses a block
device to implement a feature missing from the VM layer. Right now people
have a reason for it because zswap requires a "real" backing device
and that's fine for them and for now. But instead of building VM
infrastructure around these kinds of hacks we need to fix the VM
infrastructure. Chris Li has been talking about and working towards
a proper swap abstraction, and that needs to happen.
On Tue, Jul 30, 2024 at 9:30 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> Well, that is the point. zram is a horrible hack that abuses a block
> device to implement a feature missing from the VM layer. Right now people
> have a reason for it because zswap requires a "real" backing device
> and that's fine for them and for now. But instead of building VM

I completely agree with this assessment.

> infrastructure around these kinds of hacks we need to fix the VM
> infrastructure. Chris Li has been talking about and working towards
> a proper swap abstraction, and that needs to happen.

I'm also working towards something along this line. My design would
add a "virtual" swap ID that will be stored in the page table, and can
refer to either a real, storage-backed swap entry, or a zswap entry.
zswap can then exist without any backing swap device.

There are several additional benefits of this approach:

1. We can optimize swapoff as well - the page table can still refer to
   the swap ID, but the ID now points to a physical page frame. swapoff
   code just needs to sever the link from the swap ID to the physical
   swap entry (which requires either a walk of the swap ID mapping, or
   something even faster if we have a reverse mapping mechanism), and
   update the link to point to the page frame instead.

2. We can take this opportunity to clean up the swap count code.

I'd be happy to collaborate/compare notes :)
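[For illustration only, the "virtual swap ID" indirection described above
might be shaped roughly like the sketch below. Every name here is
hypothetical - none of this comes from a posted patch.]

/*
 * Sketch only: the page table would store an ID resolving to this
 * descriptor, which in turn points at whatever currently backs the
 * entry - a real swap slot, a zswap entry, or (after swapoff has
 * severed the device link) an in-memory folio.
 */
#include <linux/swap.h>

enum vswap_backing {
        VSWAP_SWAPDEV,          /* physical slot on a real swap device */
        VSWAP_ZSWAP,            /* compressed zswap entry, no backing dev */
        VSWAP_FOLIO,            /* page frame, e.g. re-linked by swapoff */
};

struct vswap_desc {
        enum vswap_backing backing;
        union {
                swp_entry_t slot;       /* VSWAP_SWAPDEV */
                void *zswap_entry;      /* VSWAP_ZSWAP */
                struct folio *folio;    /* VSWAP_FOLIO */
        };
        atomic_t swap_count;            /* consolidated count (point 2) */
};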
On Wed, Jul 31, 2024 at 7:28 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> I'm also working towards something along this line. My design would
> add a "virtual" swap ID that will be stored in the page table, and can
> refer to either a real, storage-backed swap entry, or a zswap entry.
> zswap can then exist without any backing swap device.
>
> There are several additional benefits of this approach:
>
> 1. We can optimize swapoff as well - the page table can still refer to
>    the swap ID, but the ID now points to a physical page frame. swapoff
>    code just needs to sever the link from the swap ID to the physical
>    swap entry (which requires either a walk of the swap ID mapping, or
>    something even faster if we have a reverse mapping mechanism), and
>    update the link to point to the page frame instead.
>
> 2. We can take this opportunity to clean up the swap count code.
>
> I'd be happy to collaborate/compare notes :)

I appreciate that you have a good plan, and I welcome the improvements
in zswap. However, we need to face reality. Having a good plan doesn't
mean we should wait for you to proceed.

In my experience, I've never heard of anyone using zswap in an embedded
system, especially among the billions of Android devices. (Correct me
if you know of one.) How soon do you expect embedded systems and
Android to adopt zswap? In one year, two years, five years, or ten
years? Have you asked if Google plans to use zswap in Android?

Currently, zswap does not support large folios, which is why Yosry has
introduced an API like zswap_never_enabled() to allow others to explore
parallel options like mTHP swap. Meanwhile, if zswap encounters large
folios, it will trigger a SIGBUS error. I believe you were involved in
those discussions:

mm: zswap: add zswap_never_enabled()
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d4d2b1cfb85cc07f6
mm: zswap: handle incorrect attempts to load large folios
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c63f210d4891f5b1

Should everyone around the world hold off on working on mTHP swap until
zswap has addressed the issue to support large folios? Not to mention
whether people are ready and happy to switch to zswap. I don't see any
reason why we should wait and not start implementing something that
could benefit billions of devices worldwide. Parallel exploration leads
to human progress in different fields.

That's why I believe Yosry's patch, which allows others to move forward,
is a more considerate approach.

Thanks
Barry
On Tue, Jul 30, 2024 at 2:06 PM Barry Song <21cnbao@gmail.com> wrote:
> > I'd be happy to collaborate/compare notes :)
>
> I appreciate that you have a good plan, and I welcome the improvements
> in zswap. However, we need to face reality. Having a good plan doesn't
> mean we should wait for you to proceed.
>
> In my experience, I've never heard of anyone using zswap in an embedded
> system, especially among the billions of Android devices. (Correct me
> if you know of one.) How soon do you expect embedded systems and
> Android to adopt zswap? In one year, two years, five years, or ten
> years? Have you asked if Google plans to use zswap in Android?

Well, no one uses zswap in an embedded environment precisely because of
the aforementioned issues, which we are working to resolve :)

> Currently, zswap does not support large folios, which is why Yosry has
> introduced an API like zswap_never_enabled() to allow others to explore
> parallel options like mTHP swap. Meanwhile, if zswap encounters large
> folios, it will trigger a SIGBUS error. I believe you were involved in
> those discussions:
>
> mm: zswap: add zswap_never_enabled()
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2d4d2b1cfb85cc07f6
> mm: zswap: handle incorrect attempts to load large folios
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c63f210d4891f5b1

I am, and for the record I reviewed and/or acked all of these patches,
and provided my input on how to move forward with zswap's support for
large folios. I do not want zswap to prevent the development of the rest
of the swap ecosystem.

> Should everyone around the world hold off on working on mTHP swap until
> zswap has addressed the issue to support large folios? Not to mention
> whether people are ready and happy to switch to zswap.

I think you misunderstood my intention. For the record, I'm not trying
to stop you from improving zram, and I'm not proposing that we kill zram
right away. Well, at least not until zswap reaches feature parity with
zram, which, as you point out, will take a while. Both support for large
folios and swap/zswap decoupling are on our agenda, and you're welcome
to participate in the discussion - for what it's worth, your attempt
with zram (+zstd) is the basis/proof-of-concept for our future efforts :)

That said, I believe there is a fundamental redundancy here, which we
(zram and zswap developers) should resolve at some point by unifying the
two memory compression systems. The sooner we can unify these two, the
less effort we will have to spend on developing and maintaining two
separate mechanisms for the same (or very similar) purpose. For
instance, large folio support has to be done twice. Same goes for
writeback/offloading to backend storage, etc. And I (admittedly with a
bias) agree with Christoph that zswap is the way to go moving forward.

I will not address the rest - it seems there isn't anything to disagree
with or discuss down there :)
On (24/07/31 11:35), Nhat Pham wrote:
> I'm not proposing that we kill zram right away.

Just for the record, zram is a generic block device and has use-cases
outside of swap. Just mkfs on /dev/zram0, mount it and do whatever. The
"kill zram" thing is not going to fly.
On Tue, Jul 30, 2024 at 9:30 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Jul 30, 2024 at 08:11:16AM +1200, Barry Song wrote:
> > > We also really need to stop optimizing for this weird zram case and move
> > > people to zswap instead after fixing the various issues. A special
> > > block device that isn't really a block device and needs various special
> > > hooks isn't the right abstraction for different zswap strategies.
> >
> > My understanding is that zRAM is much more popular in embedded systems
> > than zswap. I seldom (or never) hear of anyone using zswap in Android.
> > It seems pointless to force people to move to zswap; in embedded systems
> > we don't have a real backing block device behind zswap.
>
> Well, that is the point. zram is a horrible hack that abuses a block
> device to implement a feature missing from the VM layer. Right now people
> have a reason for it because zswap requires a "real" backing device
> and that's fine for them and for now. But instead of building VM
> infrastructure around these kinds of hacks we need to fix the VM
> infrastructure. Chris Li has been talking about and working towards
> a proper swap abstraction, and that needs to happen.

Yes, I have been working on the swap allocator for the mTHP usage case.
I haven't got to the zswap vs zram question yet. Currently there is a
feature gap between zswap and zram, so zswap doesn't do all the stuff
zram does. For the zswap "real" backend issue, Google has been using the
ghost swapfile for many years; that can be one way to get around it, and
the patch is much smaller than overhauling the swap backend abstraction.

Currently Android uses zram, and it needs to be the Android team's
decision to move from zram to something else. I don't see that happening
any time soon. There are practical limitations.

Personally, I have been using zram to provide a block-like device as my
goto route for testing the swap stack. I still do an SSD drive swap
test, but at the same time I want to reduce the SSD swap usage to avoid
the wear on my SSD drive. I already destroyed two of my old HDD drives
during swap testing; the random seeks of swap are very unfriendly to
HDDs, and I'm not sure who is still using HDDs for swap any more.
Anyway, removing zram was never a goal of the swap abstraction, because
I am still using it.

We can start by reducing the feature gap between zswap and zram. At the
end of the day, it is the Android team's call whether to use zram or
not.

Chris
Matthew Wilcox <willy@infradead.org> writes:

> On Fri, Jul 26, 2024 at 09:46:18PM +1200, Barry Song wrote:
>> A user space interface can be implemented to select different swap-in
>> order policies, similar to the mTHP allocation order policy. We need
>> a distinct policy because the performance characteristics of memory
>> allocation differ significantly from those of swap-in. For example,
>> SSD read speeds can be much slower than memory allocation. With
>> policy selection, I believe we can implement mTHP swap-in for
>> non-SWAP_SYNCHRONOUS scenarios as well. However, users need to understand
>> the implications of their choices. I think that it's better to start
>> with at least always never. I believe that we will add auto in the
>> future to tune automatically, which can be used as default finally.
>
> I strongly disagree. Use the same sysctl as the other anonymous memory
> allocations.

I still believe we have some reasons for this tunable.

1. As Ryan pointed out in [1], swap-in with large mTHP orders may cause
   long latency, which some users might want to avoid.

   [1] https://lore.kernel.org/lkml/f0c7f061-6284-4fe5-8cbf-93281070895b@arm.com/

2. We have readahead information available for swap-in, which is
   unavailable for anonymous page allocation. This enables us to build
   an automatic swap-in order policy based on readahead, similar to the
   readahead-driven folio order policy for the page cache.

3. Swap-out/swap-in cycles present an opportunity to identify hot
   pages. In many use cases, we can use mTHP for hot pages and order-0
   pages for cold pages, especially under memory pressure. When an mTHP
   has been swapped out, that indicates it could be cold; converting it
   to order-0 pages might be a beneficial policy.

--
Best Regards,
Huang, Ying
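[As a rough illustration of point 2 above, an "auto" policy could scale
the swap-in order with readahead feedback, loosely mirroring how the
page cache grows folio order on readahead success. A minimal sketch;
the function and parameter names are hypothetical, not from any patch.]

/*
 * Sketch only: pick a swap-in folio order from readahead feedback.
 * @ra_hits:   how many recently readahead-swapped pages were actually
 *             referenced before being reclaimed again
 * @max_order: upper bound, e.g. from the enabled mTHP sizes
 */
static int swapin_auto_order(unsigned int ra_hits, int max_order)
{
        int order = 0;

        /* grow roughly with log2 of the readahead hit count */
        while (order < max_order && (ra_hits >> (order + 1)))
                order++;

        return order;
}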
On Thu, Aug 01, 2024 at 01:55:51PM -0700, Chris Li wrote:
> Currently Android uses zram, and it needs to be the Android team's
> decision to move from zram to something else. I don't see that happening
> any time soon. There are practical limitations.

No one can tell anyone to stop using things. But we can stop adding
new hacks for this, and especially user-facing controls.
On Mon, Aug 12, 2024 at 8:27 PM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Aug 01, 2024 at 01:55:51PM -0700, Chris Li wrote:
> > Currently Android uses zram, and it needs to be the Android team's
> > decision to move from zram to something else. I don't see that happening
> > any time soon. There are practical limitations.
>
> No one can tell anyone to stop using things. But we can stop adding
> new hacks for this, and especially user-facing controls.

Well, this user-facing control has absolutely nothing to do with
zram-related hacks. It's meant to address a general issue, mainly
concerning slow swap devices like SSDs, as suggested in Ying's comments
on v4.

Thanks
Barry
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 058485daf186..2e94e956ee12 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -144,6 +144,12 @@ hugepage sizes have enabled="never". If enabling multiple hugepage
 sizes, the kernel will select the most appropriate enabled size for a
 given allocation.
 
+Transparent Hugepage Swap-in for anonymous memory can be disabled or enabled
+by per-supported-THP-size with one of::
+
+	echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/swapin_enabled
+	echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/swapin_enabled
+
 It's also possible to limit defrag efforts in the VM to generate
 anonymous hugepages in case they're not immediately free to madvise
 regions or to never try to defrag memory and simply fallback to regular
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e25d9ebfdf89..25174305b17f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,6 +92,7 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define TVA_SMAPS		(1 << 0)	/* Will be used for procfs */
 #define TVA_IN_PF		(1 << 1)	/* Page fault handler */
 #define TVA_ENFORCE_SYSFS	(1 << 2)	/* Obey sysfs configuration */
+#define TVA_IN_SWAPIN		(1 << 3)	/* Do swap-in */
 
 #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
 	(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0167dc27e365..41460847988c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -80,6 +80,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL;
 unsigned long huge_anon_orders_always __read_mostly;
 unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
+unsigned long huge_anon_orders_swapin_always __read_mostly;
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long vm_flags,
@@ -88,6 +89,7 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 {
 	bool smaps = tva_flags & TVA_SMAPS;
 	bool in_pf = tva_flags & TVA_IN_PF;
+	bool in_swapin = tva_flags & TVA_IN_SWAPIN;
 	bool enforce_sysfs = tva_flags & TVA_ENFORCE_SYSFS;
 	unsigned long supported_orders;
 
@@ -100,6 +102,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 		supported_orders = THP_ORDERS_ALL_FILE_DEFAULT;
 
 	orders &= supported_orders;
+	if (in_swapin)
+		orders &= READ_ONCE(huge_anon_orders_swapin_always);
 	if (!orders)
 		return 0;
 
@@ -523,8 +527,48 @@ static ssize_t thpsize_enabled_store(struct kobject *kobj,
 static struct kobj_attribute thpsize_enabled_attr =
 	__ATTR(enabled, 0644, thpsize_enabled_show, thpsize_enabled_store);
 
+static DEFINE_SPINLOCK(huge_anon_orders_swapin_lock);
+
+static ssize_t thpsize_swapin_enabled_show(struct kobject *kobj,
+					   struct kobj_attribute *attr, char *buf)
+{
+	int order = to_thpsize(kobj)->order;
+	const char *output;
+
+	if (test_bit(order, &huge_anon_orders_swapin_always))
+		output = "[always] never";
+	else
+		output = "always [never]";
+
+	return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t thpsize_swapin_enabled_store(struct kobject *kobj,
+					    struct kobj_attribute *attr,
+					    const char *buf, size_t count)
+{
+	int order = to_thpsize(kobj)->order;
+	ssize_t ret = count;
+
+	if (sysfs_streq(buf, "always")) {
+		spin_lock(&huge_anon_orders_swapin_lock);
+		set_bit(order, &huge_anon_orders_swapin_always);
+		spin_unlock(&huge_anon_orders_swapin_lock);
+	} else if (sysfs_streq(buf, "never")) {
+		spin_lock(&huge_anon_orders_swapin_lock);
+		clear_bit(order, &huge_anon_orders_swapin_always);
+		spin_unlock(&huge_anon_orders_swapin_lock);
+	} else
+		ret = -EINVAL;
+
+	return ret;
+}
+static struct kobj_attribute thpsize_swapin_enabled_attr =
+	__ATTR(swapin_enabled, 0644, thpsize_swapin_enabled_show, thpsize_swapin_enabled_store);
+
 static struct attribute *thpsize_attrs[] = {
 	&thpsize_enabled_attr.attr,
+	&thpsize_swapin_enabled_attr.attr,
 #ifdef CONFIG_SHMEM
 	&thpsize_shmem_enabled_attr.attr,
 #endif
diff --git a/mm/memory.c b/mm/memory.c
index 14048e9285d4..27c77f739a2c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4091,7 +4091,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
 	 * and suitable for swapping THP.
 	 */
 	orders = thp_vma_allowable_orders(vma, vma->vm_flags,
-			TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+			TVA_IN_PF | TVA_IN_SWAPIN | TVA_ENFORCE_SYSFS,
+			BIT(PMD_ORDER) - 1);
 	orders = thp_vma_suitable_orders(vma, vmf->address, orders);
 	orders = thp_swap_suitable_orders(swp_offset(entry),
 					  vmf->address, orders);
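[For reference, with this patch applied the knob is exercised exactly as
the Documentation hunk above describes. The hugepages-64kB directory is
just one example; the sizes actually exposed depend on the architecture
and base page size.]

# enable 64KB mTHP swap-in, then read the setting back; the show()
# handler above prints the selected value in brackets
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/swapin_enabled
cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/swapin_enabled
# -> [always] never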