
[v3,3/3] mm: THP low utilization shrinker

Message ID cb85939d1c13547fca4eecf71d5d041fc62433d4.1665614216.git.alexlzhu@fb.com (mailing list archive)
State New
Series THP Shrinker

Commit Message

Alex Zhu (Kernel) Oct. 12, 2022, 10:51 p.m. UTC
From: Alexander Zhu <alexlzhu@fb.com>

This patch introduces a shrinker that will remove THPs in the lowest
utilization bucket. As previously mentioned, we have observed that
almost all of the memory waste when THPs are always enabled
is contained in the lowest utilization bucket. The shrinker will
add these THPs to a list_lru and split anonymous THPs based off
information from kswapd. It requires the changes from
thp_utilization to identify the least utilized THPs, and the
changes to split_huge_page to identify and free zero pages
within THPs.

Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
---
v2 to v3
-put_page() after trylock_page in low_util_free_page. put() to be called after get() call 
-removed spin_unlock_irq in low_util_free_page above LRU_SKIP. There was a double unlock.    
-moved spin_unlock_irq() to below list_lru_isolate() in low_util_free_page. This is to shorten the critical section.
-moved lock_page in add_underutilized_thp such that we only lock when allocating and adding to the list_lru  
-removed list_lru_alloc in list_lru_add_page and list_lru_delete_page as these are no longer needed. 

v1 to v2
-Changed lru_lock to be irq safe. Added irq_save and restore around list_lru adds/deletes.
-Changed low_util_free_page() to trylock the page, and if it fails, unlock lru_lock and return LRU_SKIP. This is to avoid deadlock between reclaim, which calls split_huge_page(), and the THP Shrinker.
-Changed low_util_free_page() to unlock lru_lock, split_huge_page, then lock lru_lock. This way split_huge_page is not called with the lru_lock held. That leads to deadlock as split_huge_page calls on_each_cpu_mask 
-Changed list_lru_shrink_walk to list_lru_shrink_walk_irq. 

RFC to v1
-Remove all THPs that are not in the top utilization bucket. This is what we have found to perform the best in production testing; there are an almost trivial number of THPs in the middle range of buckets, and the lowest bucket accounts for almost all of the memory waste. 
-Added check for THP utilization prior to split_huge_page for the THP Shrinker. This is to account for THPs that move to the top bucket, but were underutilized at the time they were added to the list_lru. 
-Multiply the shrink_count and scan_count by HPAGE_PMD_NR. This is because a THP is 512 pages, and should count as 512 objects in reclaim. This way reclaim is triggered at a more appropriate frequency than in the RFC. 

 include/linux/huge_mm.h  |   7 +++
 include/linux/list_lru.h |  24 +++++++++
 include/linux/mm_types.h |   5 ++
 mm/huge_memory.c         | 114 ++++++++++++++++++++++++++++++++++++++-
 mm/list_lru.c            |  49 +++++++++++++++++
 mm/page_alloc.c          |   6 +++
 6 files changed, 203 insertions(+), 2 deletions(-)

Comments

kernel test robot Oct. 13, 2022, 1:48 p.m. UTC | #1
Hi,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on next-20221013]
[cannot apply to shuah-kselftest/next v6.0]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/alexlzhu-fb-com/THP-Shrinker/20221013-065303
config: x86_64-randconfig-a011
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/7b8b0daade87dcdb8158c1fcc16fcc7963a8869c
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review alexlzhu-fb-com/THP-Shrinker/20221013-065303
        git checkout 7b8b0daade87dcdb8158c1fcc16fcc7963a8869c
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   ld: mm/huge_memory.o: in function `add_underutilized_thp':
>> mm/huge_memory.c:3033: undefined reference to `memcg_list_lru_alloc'


vim +3033 mm/huge_memory.c

  3010	
  3011	void add_underutilized_thp(struct page *page)
  3012	{
  3013		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
  3014	
  3015		if (PageSwapCache(page))
  3016			return;
  3017	
  3018		/*
  3019		 * Need to take a reference on the page to prevent the page from getting free'd from
  3020		 * under us while we are adding the THP to the shrinker.
  3021		 */
  3022		if (!get_page_unless_zero(page))
  3023			return;
  3024	
  3025		if (!is_anon_transparent_hugepage(page))
  3026			goto out_put;
  3027	
  3028		if (is_huge_zero_page(page))
  3029			goto out_put;
  3030	
  3031		lock_page(page);
  3032	
> 3033		if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
  3034			goto out_unlock;
  3035	
  3036		list_lru_add_page(&huge_low_util_page_lru, page, page_underutilized_thp_list(page));
  3037	
  3038	out_unlock:
  3039		unlock_page(page);
  3040	out_put:
  3041		put_page(page);
  3042	}
  3043
Johannes Weiner Oct. 13, 2022, 4:39 p.m. UTC | #2
On Wed, Oct 12, 2022 at 03:51:47PM -0700, alexlzhu@fb.com wrote:
> From: Alexander Zhu <alexlzhu@fb.com>
> 
> This patch introduces a shrinker that will remove THPs in the lowest
> utilization bucket. As previously mentioned, we have observed that
> almost all of the memory waste when THPs are always enabled
> is contained in the lowest utilization bucket. The shrinker will
> add these THPs to a list_lru and split anonymous THPs based off
> information from kswapd. It requires the changes from
> thp_utilization to identify the least utilized THPs, and the
> changes to split_huge_page to identify and free zero pages
> within THPs.
> 
> Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
> ---
> v2 to v3
> -put_page() after trylock_page in low_util_free_page. put() to be called after get() call 
> -removed spin_unlock_irq in low_util_free_page above LRU_SKIP. There was a double unlock.    
> -moved spin_unlock_irq() to below list_lru_isolate() in low_util_free_page. This is to shorten the critical section.
> -moved lock_page in add_underutilized_thp such that we only lock when allocating and adding to the list_lru  
> -removed list_lru_alloc in list_lru_add_page and list_lru_delete_page as these are no longer needed. 
> 
> v1 to v2
> -Changed lru_lock to be irq safe. Added irq_save and restore around list_lru adds/deletes.
> -Changed low_util_free_page() to trylock the page, and if it fails, unlock lru_lock and return LRU_SKIP. This is to avoid deadlock between reclaim, which calls split_huge_page() and the THP Shrinker
> -Changed low_util_free_page() to unlock lru_lock, split_huge_page, then lock lru_lock. This way split_huge_page is not called with the lru_lock held. That leads to deadlock as split_huge_page calls on_each_cpu_mask 
> -Changed list_lru_shrink_walk to list_lru_shrink_walk_irq. 
> 
> RFC to v1
> -Remove all THPs that are not in the top utilization bucket. This is what we have found to perform the best in production testing, we have found that there are an almost trivial number of THPs in the middle range of buckets that account for most of the memory waste. 
> -Added check for THP utilization prior to split_huge_page for the THP Shrinker. This is to account for THPs that move to the top bucket, but were underutilized at the time they were added to the list_lru. 
> -Multiply the shrink_count and scan_count by HPAGE_PMD_NR. This is because a THP is 512 pages, and should count as 512 objects in reclaim. This way reclaim is triggered at a more appropriate frequency than in the RFC. 
> 
>  include/linux/huge_mm.h  |   7 +++
>  include/linux/list_lru.h |  24 +++++++++
>  include/linux/mm_types.h |   5 ++
>  mm/huge_memory.c         | 114 ++++++++++++++++++++++++++++++++++++++-
>  mm/list_lru.c            |  49 +++++++++++++++++
>  mm/page_alloc.c          |   6 +++
>  6 files changed, 203 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 13ac7b2f29ae..75e4080256be 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -192,6 +192,8 @@ static inline int split_huge_page(struct page *page)
>  }
>  void deferred_split_huge_page(struct page *page);
>  
> +void add_underutilized_thp(struct page *page);
> +
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long address, bool freeze, struct folio *folio);
>  
> @@ -305,6 +307,11 @@ static inline struct list_head *page_deferred_list(struct page *page)
>  	return &page[2].deferred_list;
>  }
>  
> +static inline struct list_head *page_underutilized_thp_list(struct page *page)
> +{
> +	return &page[3].underutilized_thp_list;
> +}
> +
>  #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>  #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index b35968ee9fb5..c2cf146ea880 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -89,6 +89,18 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
>   */
>  bool list_lru_add(struct list_lru *lru, struct list_head *item);
>  
> +/**
> + * list_lru_add_page: add an element to the lru list's tail
> + * @list_lru: the lru pointer
> + * @page: the page containing the item
> + * @item: the item to be deleted.
> + *
> + * This function works the same as list_lru_add in terms of list
> + * manipulation. Used for non slab objects contained in the page.
> + *
> + * Return value: true if the list was updated, false otherwise
> + */
> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item);
>  /**
>   * list_lru_del: delete an element to the lru list
>   * @list_lru: the lru pointer
> @@ -102,6 +114,18 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
>   */
>  bool list_lru_del(struct list_lru *lru, struct list_head *item);
>  
> +/**
> + * list_lru_del_page: delete an element to the lru list
> + * @list_lru: the lru pointer
> + * @page: the page containing the item
> + * @item: the item to be deleted.
> + *
> + * This function works the same as list_lru_del in terms of list
> + * manipulation. Used for non slab objects contained in the page.
> + *
> + * Return value: true if the list was updated, false otherwise
> + */
> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item);
>  /**
>   * list_lru_count_one: return the number of objects currently held by @lru
>   * @lru: the lru pointer.
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..da1d1cf42158 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -152,6 +152,11 @@ struct page {
>  			/* For both global and memcg */
>  			struct list_head deferred_list;
>  		};
> +		struct { /* Third tail page of compound page */
> +			unsigned long _compound_pad_3; /* compound_head */
> +			unsigned long _compound_pad_4;
> +			struct list_head underutilized_thp_list;
> +		};
>  		struct {	/* Page table pages */
>  			unsigned long _pt_pad_1;	/* compound_head */
>  			pgtable_t pmd_huge_pte; /* protected by page->ptl */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a08885228cb2..362df977cc73 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -81,6 +81,8 @@ static atomic_t huge_zero_refcount;
>  struct page *huge_zero_page __read_mostly;
>  unsigned long huge_zero_pfn __read_mostly = ~0UL;
>  
> +static struct list_lru huge_low_util_page_lru;
> +
>  static void thp_utilization_workfn(struct work_struct *work);
>  static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
>  
> @@ -263,6 +265,57 @@ static struct shrinker huge_zero_page_shrinker = {
>  	.seeks = DEFAULT_SEEKS,
>  };
>  
> +static enum lru_status low_util_free_page(struct list_head *item,
> +					  struct list_lru_one *lru,
> +					  spinlock_t *lock,

lru_lock would be a tad more descriptive.

> +					  void *cb_arg)
> +{
> +	int bucket, num_utilized_pages;
> +	struct page *head = compound_head(list_entry(item,
> +									struct page,
> +									underutilized_thp_list));

You should already be getting head pages here since you walk the pfns
in huge page steps. It would be better to do pfn_folio() in the
scanner and just work with folios here throughout.
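
Something along these lines, perhaps (rough, untested sketch, reusing the
thp_number_utilized_pages()/thp_utilization_bucket() helpers from earlier in
the series; note the unconditional relock, more on that below):

static enum lru_status low_util_free_page(struct list_head *item,
					  struct list_lru_one *lru,
					  spinlock_t *lru_lock,
					  void *cb_arg)
{
	struct folio *folio = page_folio(list_entry(item, struct page,
						    underutilized_thp_list));
	int bucket, num_utilized_pages;

	if (folio_try_get(folio)) {
		if (!folio_trylock(folio)) {
			folio_put(folio);
			return LRU_SKIP;
		}
		list_lru_isolate(lru, item);
		spin_unlock_irq(lru_lock);

		num_utilized_pages = thp_number_utilized_pages(&folio->page);
		bucket = thp_utilization_bucket(num_utilized_pages);
		if (bucket < THP_UTIL_BUCKET_NR - 1)
			split_huge_page(&folio->page);

		folio_unlock(folio);
		folio_put(folio);
		/* the walker expects the lru lock held again for LRU_REMOVED_RETRY */
		spin_lock_irq(lru_lock);
	}

	return LRU_REMOVED_RETRY;
}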

The indentation is a little messed up.

> +
> +	if (get_page_unless_zero(head)) {
> +		if (!trylock_page(head)) {
> +			put_page(head);
> +			return LRU_SKIP;
> +		}

The trylock could use a comment:

		/* 

> +		list_lru_isolate(lru, item);
> +		spin_unlock_irq(lock);
> +		num_utilized_pages = thp_number_utilized_pages(head);
> +		bucket = thp_utilization_bucket(num_utilized_pages);
> +		if (bucket < THP_UTIL_BUCKET_NR - 1) {
> +			split_huge_page(head);
> +			spin_lock_irq(lock);

> +		}
> +		unlock_page(head);
> +		put_page(head);
> +	}
> +
> +	return LRU_REMOVED_RETRY;
> +}
> +
> +static unsigned long shrink_huge_low_util_page_count(struct shrinker *shrink,
> +						     struct shrink_control *sc)
> +{
> +	return HPAGE_PMD_NR * list_lru_shrink_count(&huge_low_util_page_lru, sc);
> +}
> +
> +static unsigned long shrink_huge_low_util_page_scan(struct shrinker *shrink,
> +						    struct shrink_control *sc)
> +{
> +	return HPAGE_PMD_NR * list_lru_shrink_walk_irq(&huge_low_util_page_lru,
> +							sc, low_util_free_page, NULL);
> +}
> +
> +static struct shrinker huge_low_util_page_shrinker = {
> +	.count_objects = shrink_huge_low_util_page_count,
> +	.scan_objects = shrink_huge_low_util_page_scan,
> +	.seeks = DEFAULT_SEEKS,
> +	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
> +		SHRINKER_NONSLAB,
> +};
> +
>  #ifdef CONFIG_SYSFS
>  static ssize_t enabled_show(struct kobject *kobj,
>  			    struct kobj_attribute *attr, char *buf)
> @@ -515,6 +568,9 @@ static int __init hugepage_init(void)
>  		goto err_slab;
>  
>  	schedule_delayed_work(&thp_utilization_work, HZ);
> +	err = register_shrinker(&huge_low_util_page_shrinker, "thp-low-util");
> +	if (err)
> +		goto err_low_util_shrinker;
>  	err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");
>  	if (err)
>  		goto err_hzp_shrinker;
> @@ -522,6 +578,9 @@ static int __init hugepage_init(void)
>  	if (err)
>  		goto err_split_shrinker;
>  
> +	err = list_lru_init_memcg(&huge_low_util_page_lru, &huge_low_util_page_shrinker);
> +	if (err)
> +		goto err_low_util_list_lru;
>  	/*
>  	 * By default disable transparent hugepages on smaller systems,
>  	 * where the extra memory used could hurt more than TLB overhead
> @@ -538,10 +597,14 @@ static int __init hugepage_init(void)
>  
>  	return 0;
>  err_khugepaged:
> +	list_lru_destroy(&huge_low_util_page_lru);
> +err_low_util_list_lru:
>  	unregister_shrinker(&deferred_split_shrinker);
>  err_split_shrinker:
>  	unregister_shrinker(&huge_zero_page_shrinker);
>  err_hzp_shrinker:
> +	unregister_shrinker(&huge_low_util_page_shrinker);
> +err_low_util_shrinker:
>  	khugepaged_destroy();
>  err_slab:
>  	hugepage_exit_sysfs(hugepage_kobj);
> @@ -616,6 +679,7 @@ void prep_transhuge_page(struct page *page)
>  	 */
>  
>  	INIT_LIST_HEAD(page_deferred_list(page));
> +	INIT_LIST_HEAD(page_underutilized_thp_list(page));
>  	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>  }
>  
> @@ -2529,8 +2593,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  			 LRU_GEN_MASK | LRU_REFS_MASK));
>  
>  	/* ->mapping in first tail page is compound_mapcount */
> -	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
> -			page_tail);
> +	VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail);
>  	page_tail->mapping = head->mapping;
>  	page_tail->index = head->index + tail;
>  	page_tail->private = 0;
> @@ -2737,6 +2800,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  	struct folio *folio = page_folio(page);
>  	struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page);
>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page);
>  	struct anon_vma *anon_vma = NULL;
>  	struct address_space *mapping = NULL;
>  	int extra_pins, ret;
> @@ -2844,6 +2908,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  			list_del(page_deferred_list(&folio->page));
>  		}
>  		spin_unlock(&ds_queue->split_queue_lock);
> +		if (!list_empty(underutilized_thp_list))
> +			list_lru_del_page(&huge_low_util_page_lru, &folio->page,
> +					  underutilized_thp_list);
>  		if (mapping) {
>  			int nr = folio_nr_pages(folio);
>  
> @@ -2886,6 +2953,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  void free_transhuge_page(struct page *page)
>  {
>  	struct deferred_split *ds_queue = get_deferred_split_queue(page);
> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(page);
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> @@ -2894,6 +2962,12 @@ void free_transhuge_page(struct page *page)
>  		list_del(page_deferred_list(page));
>  	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +	if (!list_empty(underutilized_thp_list))
> +		list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list);
> +
> +	if (PageLRU(page))
> +		__folio_clear_lru_flags(page_folio(page));
> +
>  	free_compound_page(page);
>  }
>  
> @@ -2934,6 +3008,39 @@ void deferred_split_huge_page(struct page *page)
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  }
>  
> +void add_underutilized_thp(struct page *page)
> +{
> +	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> +
> +	if (PageSwapCache(page))
> +		return;
> +
> +	/*
> +	 * Need to take a reference on the page to prevent the page from getting free'd from
> +	 * under us while we are adding the THP to the shrinker.
> +	 */
> +	if (!get_page_unless_zero(page))
> +		return;
> +
> +	if (!is_anon_transparent_hugepage(page))
> +		goto out_put;
> +
> +	if (is_huge_zero_page(page))
> +		goto out_put;
> +
> +	lock_page(page);
> +
> +	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
> +		goto out_unlock;
> +
> +	list_lru_add_page(&huge_low_util_page_lru, page, page_underutilized_thp_list(page));
> +
> +out_unlock:
> +	unlock_page(page);
> +out_put:
> +	put_page(page);
> +}
> +
>  static unsigned long deferred_split_count(struct shrinker *shrink,
>  		struct shrink_control *sc)
>  {
> @@ -3478,6 +3585,9 @@ static void thp_util_scan(unsigned long pfn_end)
>  		if (bucket < 0)
>  			continue;
>  
> +		if (bucket < THP_UTIL_BUCKET_NR - 1)
> +			add_underutilized_thp(page);
> +
>  		thp_scan.buckets[bucket].nr_thps++;
>  		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
>  	}
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index a05e5bef3b40..8cc56a84b554 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -140,6 +140,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>  }
>  EXPORT_SYMBOL_GPL(list_lru_add);
>  
> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
> +{
> +	int nid = page_to_nid(page);
> +	struct list_lru_node *nlru = &lru->node[nid];
> +	struct list_lru_one *l;
> +	struct mem_cgroup *memcg;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&nlru->lock, flags);
> +	if (list_empty(item)) {
> +		memcg = page_memcg(page);
> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> +		list_add_tail(item, &l->list);
> +		/* Set shrinker bit if the first element was added */
> +		if (!l->nr_items++)
> +			set_shrinker_bit(memcg, nid,
> +					 lru_shrinker_id(lru));
> +		nlru->nr_items++;
> +		spin_unlock_irqrestore(&nlru->lock, flags);
> +		return true;
> +	}
> +	spin_unlock_irqrestore(&nlru->lock, flags);
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_add_page);
> +
>  bool list_lru_del(struct list_lru *lru, struct list_head *item)
>  {
>  	int nid = page_to_nid(virt_to_page(item));
> @@ -160,6 +186,29 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
>  }
>  EXPORT_SYMBOL_GPL(list_lru_del);
>  
> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item)
> +{
> +	int nid = page_to_nid(page);
> +	struct list_lru_node *nlru = &lru->node[nid];
> +	struct list_lru_one *l;
> +	struct mem_cgroup *memcg;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&nlru->lock, flags);
> +	if (!list_empty(item)) {
> +		memcg = page_memcg(page);
> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> +		list_del_init(item);
> +		l->nr_items--;
> +		nlru->nr_items--;
> +		spin_unlock_irqrestore(&nlru->lock, flags);
> +		return true;
> +	}
> +	spin_unlock_irqrestore(&nlru->lock, flags);
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_del_page);
> +
>  void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
>  {
>  	list_del_init(item);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ac2c9f12a7b2..468eaaade7fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1335,6 +1335,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
>  		 * deferred_list.next -- ignore value.
>  		 */
>  		break;
> +	case 3:
> +		/*
> +		 * the third tail page: ->mapping is
> +		 * underutilized_thp_list.next -- ignore value.
> +		 */
> +		break;
>  	default:
>  		if (page->mapping != TAIL_MAPPING) {
>  			bad_page(page, "corrupted mapping in tail page");
> -- 
> 2.30.2
>
Johannes Weiner Oct. 13, 2022, 4:56 p.m. UTC | #3
Oof, fatfingered send vs postpone. Here is the rest ;)

On Thu, Oct 13, 2022 at 12:39:53PM -0400, Johannes Weiner wrote:
> On Wed, Oct 12, 2022 at 03:51:47PM -0700, alexlzhu@fb.com wrote:
> > +
> > +	if (get_page_unless_zero(head)) {
> > +		if (!trylock_page(head)) {
> > +			put_page(head);
> > +			return LRU_SKIP;
> > +		}
> 
> The trylock could use a comment:

		/* Inverse lock order from add_underutilized_thp() */

> > +		list_lru_isolate(lru, item);
> > +		spin_unlock_irq(lock);
> > +		num_utilized_pages = thp_number_utilized_pages(head);
> > +		bucket = thp_utilization_bucket(num_utilized_pages);
> > +		if (bucket < THP_UTIL_BUCKET_NR - 1) {
> > +			split_huge_page(head);
> > +			spin_lock_irq(lock);

The re-lock also needs to be unconditional, or you'll get a double
unlock in the not-split case.
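
I.e. something along these lines (untested):

		bucket = thp_utilization_bucket(num_utilized_pages);
		if (bucket < THP_UTIL_BUCKET_NR - 1)
			split_huge_page(head);
		/* re-take unconditionally: LRU_REMOVED_RETRY asserts the lock is held */
		spin_lock_irq(lock);
		unlock_page(head);
		put_page(head);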

> > @@ -2737,6 +2800,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
> >  	struct folio *folio = page_folio(page);
> >  	struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page);
> >  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> > +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page);
> >  	struct anon_vma *anon_vma = NULL;
> >  	struct address_space *mapping = NULL;
> >  	int extra_pins, ret;
> > @@ -2844,6 +2908,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
> >  			list_del(page_deferred_list(&folio->page));
> >  		}
> >  		spin_unlock(&ds_queue->split_queue_lock);

A brief comment would be useful:

		/* Frozen refs lock out additions, test can be lockless */

> > +		if (!list_empty(underutilized_thp_list))
> > +			list_lru_del_page(&huge_low_util_page_lru, &folio->page,
> > +					  underutilized_thp_list);
> >  		if (mapping) {
> >  			int nr = folio_nr_pages(folio);
> >  
> > @@ -2886,6 +2953,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
> >  void free_transhuge_page(struct page *page)
> >  {
> >  	struct deferred_split *ds_queue = get_deferred_split_queue(page);
> > +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(page);
> >  	unsigned long flags;
> >  
> >  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> > @@ -2894,6 +2962,12 @@ void free_transhuge_page(struct page *page)
> >  		list_del(page_deferred_list(page));
> >  	}
> >  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

Here as well:

	/* A dead page cannot be re-added, test can be lockless */

> > +	if (!list_empty(underutilized_thp_list))
> > +		list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list);
> > +
> > +	if (PageLRU(page))
> > +		__folio_clear_lru_flags(page_folio(page));
> > +
> >  	free_compound_page(page);
> >  }
> >  
> > @@ -2934,6 +3008,39 @@ void deferred_split_huge_page(struct page *page)
> >  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> >  }
> >  
> > +void add_underutilized_thp(struct page *page)
> > +{
> > +	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> > +
> > +	if (PageSwapCache(page))
> > +		return;

Why is that?

> > +
> > +	/*
> > +	 * Need to take a reference on the page to prevent the page from getting free'd from
> > +	 * under us while we are adding the THP to the shrinker.
> > +	 */
> > +	if (!get_page_unless_zero(page))
> > +		return;
> > +
> > +	if (!is_anon_transparent_hugepage(page))
> > +		goto out_put;
> > +
> > +	if (is_huge_zero_page(page))
> > +		goto out_put;
> > +

And here:

	/* Stabilize page->memcg to allocate and add to the same list */

> > +	lock_page(page);
> > +
> > +	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
> > +		goto out_unlock;

The testbot pointed this out already, but this allocation call needs
an #ifdef CONFIG_MEMCG_KMEM guard.
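
E.g. (one possible shape, untested):

#ifdef CONFIG_MEMCG_KMEM
	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru,
				 GFP_KERNEL))
		goto out_unlock;
#endif
	list_lru_add_page(&huge_low_util_page_lru, page,
			  page_underutilized_thp_list(page));

Alternatively, a !CONFIG_MEMCG_KMEM stub for memcg_list_lru_alloc() in
list_lru.h that simply returns 0 would avoid the ifdef at the call site.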
Alex Zhu (Kernel) Oct. 17, 2022, 6:19 p.m. UTC | #4
> On Oct 13, 2022, at 9:56 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> Oof, fatfingered send vs postpone. Here is the rest ;)
> 
> On Thu, Oct 13, 2022 at 12:39:53PM -0400, Johannes Weiner wrote:
>> On Wed, Oct 12, 2022 at 03:51:47PM -0700, alexlzhu@fb.com wrote:
>>> +
>>> +	if (get_page_unless_zero(head)) {
>>> +		if (!trylock_page(head)) {
>>> +			put_page(head);
>>> +			return LRU_SKIP;
>>> +		}
>> 
>> The trylock could use a comment:
> 
> 		/* Inverse lock order from add_underutilized_thp() */
> 
>>> +		list_lru_isolate(lru, item);
>>> +		spin_unlock_irq(lock);
>>> +		num_utilized_pages = thp_number_utilized_pages(head);
>>> +		bucket = thp_utilization_bucket(num_utilized_pages);
>>> +		if (bucket < THP_UTIL_BUCKET_NR - 1) {
>>> +			split_huge_page(head);
>>> +			spin_lock_irq(lock);
> 
> The re-lock also needs to be unconditional, or you'll get a double
> unlock in the not-split case.
> 
>>> @@ -2737,6 +2800,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>> 	struct folio *folio = page_folio(page);
>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page);
>>> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page);
>>> 	struct anon_vma *anon_vma = NULL;
>>> 	struct address_space *mapping = NULL;
>>> 	int extra_pins, ret;
>>> @@ -2844,6 +2908,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>> 			list_del(page_deferred_list(&folio->page));
>>> 		}
>>> 		spin_unlock(&ds_queue->split_queue_lock);
> 
> A brief comment would be useful:
> 
> 		/* Frozen refs lock out additions, test can be lockless */
> 
>>> +		if (!list_empty(underutilized_thp_list))
>>> +			list_lru_del_page(&huge_low_util_page_lru, &folio->page,
>>> +					  underutilized_thp_list);
>>> 		if (mapping) {
>>> 			int nr = folio_nr_pages(folio);
>>> 
>>> @@ -2886,6 +2953,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>> void free_transhuge_page(struct page *page)
>>> {
>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
>>> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(page);
>>> 	unsigned long flags;
>>> 
>>> 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>> @@ -2894,6 +2962,12 @@ void free_transhuge_page(struct page *page)
>>> 		list_del(page_deferred_list(page));
>>> 	}
>>> 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> 
> Here as well:
> 
> 	/* A dead page cannot be re-added, test can be lockless */
> 
>>> +	if (!list_empty(underutilized_thp_list))
>>> +		list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list);
>>> +
>>> +	if (PageLRU(page))
>>> +		__folio_clear_lru_flags(page_folio(page));
>>> +
>>> 	free_compound_page(page);
>>> }
>>> 
>>> @@ -2934,6 +3008,39 @@ void deferred_split_huge_page(struct page *page)
>>> 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>> }
>>> 
>>> +void add_underutilized_thp(struct page *page)
>>> +{
>>> +	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>>> +
>>> +	if (PageSwapCache(page))
>>> +		return;
> 
> Why is that?
> 
>>> +
>>> +	/*
>>> +	 * Need to take a reference on the page to prevent the page from getting free'd from
>>> +	 * under us while we are adding the THP to the shrinker.
>>> +	 */
>>> +	if (!get_page_unless_zero(page))
>>> +		return;
>>> +
>>> +	if (!is_anon_transparent_hugepage(page))
>>> +		goto out_put;
>>> +
>>> +	if (is_huge_zero_page(page))
>>> +		goto out_put;
>>> +
> 
> And here:
> 
> 	/* Stabilize page->memcg to allocate and add to the same list */
> 
>>> +	lock_page(page);
>>> +
>>> +	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
>>> +		goto out_unlock;
> 
> The testbot pointed this out already, but this allocation call needs
> an #ifdef CONFIG_MEMCG_KMEM guard.

Sounds good. I made the changes on Friday. Testing them out today. Will send out v4 soon.
Huang, Ying Oct. 19, 2022, 7:04 a.m. UTC | #5
<alexlzhu@fb.com> writes:

> From: Alexander Zhu <alexlzhu@fb.com>
>
> This patch introduces a shrinker that will remove THPs in the lowest
> utilization bucket. As previously mentioned, we have observed that
> almost all of the memory waste when THPs are always enabled
> is contained in the lowest utilization bucket. The shrinker will
> add these THPs to a list_lru and split anonymous THPs based off
> information from kswapd. It requires the changes from
> thp_utilization to identify the least utilized THPs, and the
> changes to split_huge_page to identify and free zero pages
> within THPs.
>
> Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
> ---
> v2 to v3
> -put_page() after trylock_page in low_util_free_page. put() to be called after get() call 
> -removed spin_unlock_irq in low_util_free_page above LRU_SKIP. There was a double unlock.    
> -moved spin_unlock_irq() to below list_lru_isolate() in low_util_free_page. This is to shorten the critical section.
> -moved lock_page in add_underutilized_thp such that we only lock when allocating and adding to the list_lru  
> -removed list_lru_alloc in list_lru_add_page and list_lru_delete_page as these are no longer needed. 
>
> v1 to v2
> -Changed lru_lock to be irq safe. Added irq_save and restore around list_lru adds/deletes.
> -Changed low_util_free_page() to trylock the page, and if it fails, unlock lru_lock and return LRU_SKIP. This is to avoid deadlock between reclaim, which calls split_huge_page() and the THP Shrinker
> -Changed low_util_free_page() to unlock lru_lock, split_huge_page, then lock lru_lock. This way split_huge_page is not called with the lru_lock held. That leads to deadlock as split_huge_page calls on_each_cpu_mask 
> -Changed list_lru_shrink_walk to list_lru_shrink_walk_irq. 
>
> RFC to v1
> -Remove all THPs that are not in the top utilization bucket. This is what we have found to perform the best in production testing, we have found that there are an almost trivial number of THPs in the middle range of buckets that account for most of the memory waste. 
> -Added check for THP utilization prior to split_huge_page for the THP Shrinker. This is to account for THPs that move to the top bucket, but were underutilized at the time they were added to the list_lru. 
> -Multiply the shrink_count and scan_count by HPAGE_PMD_NR. This is because a THP is 512 pages, and should count as 512 objects in reclaim. This way reclaim is triggered at a more appropriate frequency than in the RFC. 
>
>  include/linux/huge_mm.h  |   7 +++
>  include/linux/list_lru.h |  24 +++++++++
>  include/linux/mm_types.h |   5 ++
>  mm/huge_memory.c         | 114 ++++++++++++++++++++++++++++++++++++++-
>  mm/list_lru.c            |  49 +++++++++++++++++
>  mm/page_alloc.c          |   6 +++
>  6 files changed, 203 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 13ac7b2f29ae..75e4080256be 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -192,6 +192,8 @@ static inline int split_huge_page(struct page *page)
>  }
>  void deferred_split_huge_page(struct page *page);
>  
> +void add_underutilized_thp(struct page *page);
> +
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long address, bool freeze, struct folio *folio);
>  
> @@ -305,6 +307,11 @@ static inline struct list_head *page_deferred_list(struct page *page)
>  	return &page[2].deferred_list;
>  }
>  
> +static inline struct list_head *page_underutilized_thp_list(struct page *page)
> +{
> +	return &page[3].underutilized_thp_list;
> +}
> +
>  #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>  #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
> index b35968ee9fb5..c2cf146ea880 100644
> --- a/include/linux/list_lru.h
> +++ b/include/linux/list_lru.h
> @@ -89,6 +89,18 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
>   */
>  bool list_lru_add(struct list_lru *lru, struct list_head *item);
>  
> +/**
> + * list_lru_add_page: add an element to the lru list's tail
> + * @list_lru: the lru pointer
> + * @page: the page containing the item
> + * @item: the item to be deleted.
> + *
> + * This function works the same as list_lru_add in terms of list
> + * manipulation. Used for non slab objects contained in the page.
> + *
> + * Return value: true if the list was updated, false otherwise
> + */
> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item);
>  /**
>   * list_lru_del: delete an element to the lru list
>   * @list_lru: the lru pointer
> @@ -102,6 +114,18 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
>   */
>  bool list_lru_del(struct list_lru *lru, struct list_head *item);
>  
> +/**
> + * list_lru_del_page: delete an element to the lru list
> + * @list_lru: the lru pointer
> + * @page: the page containing the item
> + * @item: the item to be deleted.
> + *
> + * This function works the same as list_lru_del in terms of list
> + * manipulation. Used for non slab objects contained in the page.
> + *
> + * Return value: true if the list was updated, false otherwise
> + */
> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item);
>  /**
>   * list_lru_count_one: return the number of objects currently held by @lru
>   * @lru: the lru pointer.
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..da1d1cf42158 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -152,6 +152,11 @@ struct page {
>  			/* For both global and memcg */
>  			struct list_head deferred_list;
>  		};
> +		struct { /* Third tail page of compound page */
> +			unsigned long _compound_pad_3; /* compound_head */
> +			unsigned long _compound_pad_4;
> +			struct list_head underutilized_thp_list;
> +		};
>  		struct {	/* Page table pages */
>  			unsigned long _pt_pad_1;	/* compound_head */
>  			pgtable_t pmd_huge_pte; /* protected by page->ptl */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a08885228cb2..362df977cc73 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -81,6 +81,8 @@ static atomic_t huge_zero_refcount;
>  struct page *huge_zero_page __read_mostly;
>  unsigned long huge_zero_pfn __read_mostly = ~0UL;
>  
> +static struct list_lru huge_low_util_page_lru;
> +
>  static void thp_utilization_workfn(struct work_struct *work);
>  static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
>  
> @@ -263,6 +265,57 @@ static struct shrinker huge_zero_page_shrinker = {
>  	.seeks = DEFAULT_SEEKS,
>  };
>  
> +static enum lru_status low_util_free_page(struct list_head *item,
> +					  struct list_lru_one *lru,
> +					  spinlock_t *lock,
> +					  void *cb_arg)
> +{
> +	int bucket, num_utilized_pages;
> +	struct page *head = compound_head(list_entry(item,
> +									struct page,
> +									underutilized_thp_list));
> +
> +	if (get_page_unless_zero(head)) {
> +		if (!trylock_page(head)) {
> +			put_page(head);
> +			return LRU_SKIP;
> +		}
> +		list_lru_isolate(lru, item);
> +		spin_unlock_irq(lock);
> +		num_utilized_pages = thp_number_utilized_pages(head);
> +		bucket = thp_utilization_bucket(num_utilized_pages);
> +		if (bucket < THP_UTIL_BUCKET_NR - 1) {

If my understanding is correct, a THP will be considered underutilized if its
utilization percentage is < 90%.  Right?  If so, I think something like a
utilization percentage < 50% would be more appropriate.

> +			split_huge_page(head);
> +			spin_lock_irq(lock);
> +		}
> +		unlock_page(head);
> +		put_page(head);
> +	}
> +
> +	return LRU_REMOVED_RETRY;
> +}
> +
> +static unsigned long shrink_huge_low_util_page_count(struct shrinker *shrink,
> +						     struct shrink_control *sc)
> +{
> +	return HPAGE_PMD_NR * list_lru_shrink_count(&huge_low_util_page_lru, sc);
> +}
> +
> +static unsigned long shrink_huge_low_util_page_scan(struct shrinker *shrink,
> +						    struct shrink_control *sc)
> +{
> +	return HPAGE_PMD_NR * list_lru_shrink_walk_irq(&huge_low_util_page_lru,
> +							sc, low_util_free_page, NULL);
> +}
> +
> +static struct shrinker huge_low_util_page_shrinker = {
> +	.count_objects = shrink_huge_low_util_page_count,
> +	.scan_objects = shrink_huge_low_util_page_scan,
> +	.seeks = DEFAULT_SEEKS,
> +	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
> +		SHRINKER_NONSLAB,
> +};
> +
>  #ifdef CONFIG_SYSFS
>  static ssize_t enabled_show(struct kobject *kobj,
>  			    struct kobj_attribute *attr, char *buf)
> @@ -515,6 +568,9 @@ static int __init hugepage_init(void)
>  		goto err_slab;
>  
>  	schedule_delayed_work(&thp_utilization_work, HZ);
> +	err = register_shrinker(&huge_low_util_page_shrinker, "thp-low-util");
> +	if (err)
> +		goto err_low_util_shrinker;
>  	err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");
>  	if (err)
>  		goto err_hzp_shrinker;
> @@ -522,6 +578,9 @@ static int __init hugepage_init(void)
>  	if (err)
>  		goto err_split_shrinker;
>  
> +	err = list_lru_init_memcg(&huge_low_util_page_lru, &huge_low_util_page_shrinker);
> +	if (err)
> +		goto err_low_util_list_lru;
>  	/*
>  	 * By default disable transparent hugepages on smaller systems,
>  	 * where the extra memory used could hurt more than TLB overhead
> @@ -538,10 +597,14 @@ static int __init hugepage_init(void)
>  
>  	return 0;
>  err_khugepaged:
> +	list_lru_destroy(&huge_low_util_page_lru);
> +err_low_util_list_lru:
>  	unregister_shrinker(&deferred_split_shrinker);
>  err_split_shrinker:
>  	unregister_shrinker(&huge_zero_page_shrinker);
>  err_hzp_shrinker:
> +	unregister_shrinker(&huge_low_util_page_shrinker);
> +err_low_util_shrinker:
>  	khugepaged_destroy();
>  err_slab:
>  	hugepage_exit_sysfs(hugepage_kobj);
> @@ -616,6 +679,7 @@ void prep_transhuge_page(struct page *page)
>  	 */
>  
>  	INIT_LIST_HEAD(page_deferred_list(page));
> +	INIT_LIST_HEAD(page_underutilized_thp_list(page));
>  	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>  }
>  
> @@ -2529,8 +2593,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>  			 LRU_GEN_MASK | LRU_REFS_MASK));
>  
>  	/* ->mapping in first tail page is compound_mapcount */
> -	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
> -			page_tail);
> +	VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail);
>  	page_tail->mapping = head->mapping;
>  	page_tail->index = head->index + tail;
>  	page_tail->private = 0;
> @@ -2737,6 +2800,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  	struct folio *folio = page_folio(page);
>  	struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page);
>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page);
>  	struct anon_vma *anon_vma = NULL;
>  	struct address_space *mapping = NULL;
>  	int extra_pins, ret;
> @@ -2844,6 +2908,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  			list_del(page_deferred_list(&folio->page));
>  		}
>  		spin_unlock(&ds_queue->split_queue_lock);
> +		if (!list_empty(underutilized_thp_list))
> +			list_lru_del_page(&huge_low_util_page_lru, &folio->page,
> +					  underutilized_thp_list);
>  		if (mapping) {
>  			int nr = folio_nr_pages(folio);
>  
> @@ -2886,6 +2953,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>  void free_transhuge_page(struct page *page)
>  {
>  	struct deferred_split *ds_queue = get_deferred_split_queue(page);
> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(page);
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> @@ -2894,6 +2962,12 @@ void free_transhuge_page(struct page *page)
>  		list_del(page_deferred_list(page));
>  	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +	if (!list_empty(underutilized_thp_list))
> +		list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list);
> +
> +	if (PageLRU(page))
> +		__folio_clear_lru_flags(page_folio(page));
> +
>  	free_compound_page(page);
>  }
>  
> @@ -2934,6 +3008,39 @@ void deferred_split_huge_page(struct page *page)
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  }
>  
> +void add_underutilized_thp(struct page *page)
> +{
> +	VM_BUG_ON_PAGE(!PageTransHuge(page), page);

Because we haven't taken a reference on the page yet, the page may be split or
freed under us, so the VM_BUG_ON_PAGE() here may be triggered wrongly?
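
One possible rework (untested sketch) is to take the reference first and then
bail out instead of asserting:

	/* Take the reference before inspecting the page at all */
	if (!get_page_unless_zero(page))
		return;

	/* The THP may already have been split or freed; just bail out */
	if (!PageTransHuge(page) || PageSwapCache(page) ||
	    !is_anon_transparent_hugepage(page) || is_huge_zero_page(page))
		goto out_put;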

> +
> +	if (PageSwapCache(page))
> +		return;
> +
> +	/*
> +	 * Need to take a reference on the page to prevent the page from getting free'd from
> +	 * under us while we are adding the THP to the shrinker.
> +	 */
> +	if (!get_page_unless_zero(page))
> +		return;
> +
> +	if (!is_anon_transparent_hugepage(page))
> +		goto out_put;
> +
> +	if (is_huge_zero_page(page))
> +		goto out_put;

Could the is_huge_zero_page() check be done in thp_util_scan() too?
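
E.g. (sketch):

		if (bucket < THP_UTIL_BUCKET_NR - 1 && !is_huge_zero_page(page))
			add_underutilized_thp(page);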

> +
> +	lock_page(page);
> +
> +	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
> +		goto out_unlock;
> +
> +	list_lru_add_page(&huge_low_util_page_lru, page, page_underutilized_thp_list(page));
> +
> +out_unlock:
> +	unlock_page(page);
> +out_put:
> +	put_page(page);
> +}
> +
>  static unsigned long deferred_split_count(struct shrinker *shrink,
>  		struct shrink_control *sc)
>  {
> @@ -3478,6 +3585,9 @@ static void thp_util_scan(unsigned long pfn_end)
>  		if (bucket < 0)
>  			continue;
>  
> +		if (bucket < THP_UTIL_BUCKET_NR - 1)
> +			add_underutilized_thp(page);
> +
>  		thp_scan.buckets[bucket].nr_thps++;
>  		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
>  	}
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index a05e5bef3b40..8cc56a84b554 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -140,6 +140,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>  }
>  EXPORT_SYMBOL_GPL(list_lru_add);
>  
> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
> +{
> +	int nid = page_to_nid(page);
> +	struct list_lru_node *nlru = &lru->node[nid];
> +	struct list_lru_one *l;
> +	struct mem_cgroup *memcg;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&nlru->lock, flags);
> +	if (list_empty(item)) {
> +		memcg = page_memcg(page);
> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> +		list_add_tail(item, &l->list);
> +		/* Set shrinker bit if the first element was added */
> +		if (!l->nr_items++)
> +			set_shrinker_bit(memcg, nid,
> +					 lru_shrinker_id(lru));
> +		nlru->nr_items++;
> +		spin_unlock_irqrestore(&nlru->lock, flags);
> +		return true;
> +	}
> +	spin_unlock_irqrestore(&nlru->lock, flags);
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_add_page);
> +

It appears that only two lines are different from list_lru_add().  Is it
possible for us to share code?  For example, by adding another flag for the
page_memcg() case?
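
For instance, a common helper could take the nid and memcg explicitly
(untested sketch; note that list_lru_add() currently takes nlru->lock without
disabling interrupts, so the two locking flavours would need to be reconciled):

static bool __list_lru_add(struct list_lru *lru, struct list_head *item,
			   int nid, struct mem_cgroup *memcg)
{
	struct list_lru_node *nlru = &lru->node[nid];
	struct list_lru_one *l;
	unsigned long flags;

	spin_lock_irqsave(&nlru->lock, flags);
	if (list_empty(item)) {
		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
		list_add_tail(item, &l->list);
		/* Set shrinker bit if the first element was added */
		if (!l->nr_items++)
			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
		nlru->nr_items++;
		spin_unlock_irqrestore(&nlru->lock, flags);
		return true;
	}
	spin_unlock_irqrestore(&nlru->lock, flags);
	return false;
}

bool list_lru_add_page(struct list_lru *lru, struct page *page,
		       struct list_head *item)
{
	return __list_lru_add(lru, item, page_to_nid(page), page_memcg(page));
}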

>  bool list_lru_del(struct list_lru *lru, struct list_head *item)
>  {
>  	int nid = page_to_nid(virt_to_page(item));
> @@ -160,6 +186,29 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
>  }
>  EXPORT_SYMBOL_GPL(list_lru_del);
>  
> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item)
> +{
> +	int nid = page_to_nid(page);
> +	struct list_lru_node *nlru = &lru->node[nid];
> +	struct list_lru_one *l;
> +	struct mem_cgroup *memcg;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&nlru->lock, flags);
> +	if (!list_empty(item)) {
> +		memcg = page_memcg(page);
> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
> +		list_del_init(item);
> +		l->nr_items--;
> +		nlru->nr_items--;
> +		spin_unlock_irqrestore(&nlru->lock, flags);
> +		return true;
> +	}
> +	spin_unlock_irqrestore(&nlru->lock, flags);
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(list_lru_del_page);
> +
>  void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
>  {
>  	list_del_init(item);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ac2c9f12a7b2..468eaaade7fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1335,6 +1335,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
>  		 * deferred_list.next -- ignore value.
>  		 */
>  		break;
> +	case 3:
> +		/*
> +		 * the third tail page: ->mapping is
> +		 * underutilized_thp_list.next -- ignore value.
> +		 */
> +		break;
>  	default:
>  		if (page->mapping != TAIL_MAPPING) {
>  			bad_page(page, "corrupted mapping in tail page");

Best Regards,
Huang, Ying
Alex Zhu (Kernel) Oct. 19, 2022, 7:08 p.m. UTC | #6
> On Oct 19, 2022, at 12:04 AM, Huang, Ying <ying.huang@intel.com> wrote:
> 
> > 
> <alexlzhu@fb.com> writes:
> 
>> From: Alexander Zhu <alexlzhu@fb.com>
>> 
>> This patch introduces a shrinker that will remove THPs in the lowest
>> utilization bucket. As previously mentioned, we have observed that
>> almost all of the memory waste when THPs are always enabled
>> is contained in the lowest utilization bucket. The shrinker will
>> add these THPs to a list_lru and split anonymous THPs based off
>> information from kswapd. It requires the changes from
>> thp_utilization to identify the least utilized THPs, and the
>> changes to split_huge_page to identify and free zero pages
>> within THPs.
>> 
>> Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
>> ---
>> v2 to v3
>> -put_page() after trylock_page in low_util_free_page. put() to be called after get() call 
>> -removed spin_unlock_irq in low_util_free_page above LRU_SKIP. There was a double unlock.    
>> -moved spin_unlock_irq() to below list_lru_isolate() in low_util_free_page. This is to shorten the critical section.
>> -moved lock_page in add_underutilized_thp such that we only lock when allocating and adding to the list_lru  
>> -removed list_lru_alloc in list_lru_add_page and list_lru_delete_page as these are no longer needed. 
>> 
>> v1 to v2
>> -Changed lru_lock to be irq safe. Added irq_save and restore around list_lru adds/deletes.
>> -Changed low_util_free_page() to trylock the page, and if it fails, unlock lru_lock and return LRU_SKIP. This is to avoid deadlock between reclaim, which calls split_huge_page() and the THP Shrinker
>> -Changed low_util_free_page() to unlock lru_lock, split_huge_page, then lock lru_lock. This way split_huge_page is not called with the lru_lock held. That leads to deadlock as split_huge_page calls on_each_cpu_mask 
>> -Changed list_lru_shrink_walk to list_lru_shrink_walk_irq. 
>> 
>> RFC to v1
>> -Remove all THPs that are not in the top utilization bucket. This is what we have found to perform the best in production testing, we have found that there are an almost trivial number of THPs in the middle range of buckets that account for most of the memory waste. 
>> -Added check for THP utilization prior to split_huge_page for the THP Shrinker. This is to account for THPs that move to the top bucket, but were underutilized at the time they were added to the list_lru. 
>> -Multiply the shrink_count and scan_count by HPAGE_PMD_NR. This is because a THP is 512 pages, and should count as 512 objects in reclaim. This way reclaim is triggered at a more appropriate frequency than in the RFC. 
>> 
>> include/linux/huge_mm.h  |   7 +++
>> include/linux/list_lru.h |  24 +++++++++
>> include/linux/mm_types.h |   5 ++
>> mm/huge_memory.c         | 114 ++++++++++++++++++++++++++++++++++++++-
>> mm/list_lru.c            |  49 +++++++++++++++++
>> mm/page_alloc.c          |   6 +++
>> 6 files changed, 203 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 13ac7b2f29ae..75e4080256be 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -192,6 +192,8 @@ static inline int split_huge_page(struct page *page)
>> }
>> void deferred_split_huge_page(struct page *page);
>> 
>> +void add_underutilized_thp(struct page *page);
>> +
>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> 		unsigned long address, bool freeze, struct folio *folio);
>> 
>> @@ -305,6 +307,11 @@ static inline struct list_head *page_deferred_list(struct page *page)
>> 	return &page[2].deferred_list;
>> }
>> 
>> +static inline struct list_head *page_underutilized_thp_list(struct page *page)
>> +{
>> +	return &page[3].underutilized_thp_list;
>> +}
>> +
>> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>> #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>> #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>> index b35968ee9fb5..c2cf146ea880 100644
>> --- a/include/linux/list_lru.h
>> +++ b/include/linux/list_lru.h
>> @@ -89,6 +89,18 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
>>  */
>> bool list_lru_add(struct list_lru *lru, struct list_head *item);
>> 
>> +/**
>> + * list_lru_add_page: add an element to the lru list's tail
>> + * @list_lru: the lru pointer
>> + * @page: the page containing the item
>> + * @item: the item to be deleted.
>> + *
>> + * This function works the same as list_lru_add in terms of list
>> + * manipulation. Used for non slab objects contained in the page.
>> + *
>> + * Return value: true if the list was updated, false otherwise
>> + */
>> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item);
>> /**
>>  * list_lru_del: delete an element to the lru list
>>  * @list_lru: the lru pointer
>> @@ -102,6 +114,18 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
>>  */
>> bool list_lru_del(struct list_lru *lru, struct list_head *item);
>> 
>> +/**
>> + * list_lru_del_page: delete an element to the lru list
>> + * @list_lru: the lru pointer
>> + * @page: the page containing the item
>> + * @item: the item to be deleted.
>> + *
>> + * This function works the same as list_lru_del in terms of list
>> + * manipulation. Used for non slab objects contained in the page.
>> + *
>> + * Return value: true if the list was updated, false otherwise
>> + */
>> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item);
>> /**
>>  * list_lru_count_one: return the number of objects currently held by @lru
>>  * @lru: the lru pointer.
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 500e536796ca..da1d1cf42158 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -152,6 +152,11 @@ struct page {
>> 			/* For both global and memcg */
>> 			struct list_head deferred_list;
>> 		};
>> +		struct { /* Third tail page of compound page */
>> +			unsigned long _compound_pad_3; /* compound_head */
>> +			unsigned long _compound_pad_4;
>> +			struct list_head underutilized_thp_list;
>> +		};
>> 		struct {	/* Page table pages */
>> 			unsigned long _pt_pad_1;	/* compound_head */
>> 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index a08885228cb2..362df977cc73 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -81,6 +81,8 @@ static atomic_t huge_zero_refcount;
>> struct page *huge_zero_page __read_mostly;
>> unsigned long huge_zero_pfn __read_mostly = ~0UL;
>> 
>> +static struct list_lru huge_low_util_page_lru;
>> +
>> static void thp_utilization_workfn(struct work_struct *work);
>> static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
>> 
>> @@ -263,6 +265,57 @@ static struct shrinker huge_zero_page_shrinker = {
>> 	.seeks = DEFAULT_SEEKS,
>> };
>> 
>> +static enum lru_status low_util_free_page(struct list_head *item,
>> +					  struct list_lru_one *lru,
>> +					  spinlock_t *lock,
>> +					  void *cb_arg)
>> +{
>> +	int bucket, num_utilized_pages;
>> +	struct page *head = compound_head(list_entry(item,
>> +									struct page,
>> +									underutilized_thp_list));
>> +
>> +	if (get_page_unless_zero(head)) {
>> +		if (!trylock_page(head)) {
>> +			put_page(head);
>> +			return LRU_SKIP;
>> +		}
>> +		list_lru_isolate(lru, item);
>> +		spin_unlock_irq(lock);
>> +		num_utilized_pages = thp_number_utilized_pages(head);
>> +		bucket = thp_utilization_bucket(num_utilized_pages);
>> +		if (bucket < THP_UTIL_BUCKET_NR - 1) {
> 
> If my understanding were correct, the THP will be considered under
> utilized if utilization percentage < 90%.  Right?  If so, I thought
> something like utilization percentage < 50% appears more appropriate.

Yes, < 90%. That is just the best number we have found so far from our experiments; it seems almost all THPs are either <10% or >90% utilized. There is an almost trivial number of pages in between. 
> 
>> +			split_huge_page(head);
>> +			spin_lock_irq(lock);
>> +		}
>> +		unlock_page(head);
>> +		put_page(head);
>> +	}
>> +
>> +	return LRU_REMOVED_RETRY;
>> +}
>> +
>> +static unsigned long shrink_huge_low_util_page_count(struct shrinker *shrink,
>> +						     struct shrink_control *sc)
>> +{
>> +	return HPAGE_PMD_NR * list_lru_shrink_count(&huge_low_util_page_lru, sc);
>> +}
>> +
>> +static unsigned long shrink_huge_low_util_page_scan(struct shrinker *shrink,
>> +						    struct shrink_control *sc)
>> +{
>> +	return HPAGE_PMD_NR * list_lru_shrink_walk_irq(&huge_low_util_page_lru,
>> +							sc, low_util_free_page, NULL);
>> +}
>> +
>> +static struct shrinker huge_low_util_page_shrinker = {
>> +	.count_objects = shrink_huge_low_util_page_count,
>> +	.scan_objects = shrink_huge_low_util_page_scan,
>> +	.seeks = DEFAULT_SEEKS,
>> +	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
>> +		SHRINKER_NONSLAB,
>> +};
>> +
>> #ifdef CONFIG_SYSFS
>> static ssize_t enabled_show(struct kobject *kobj,
>> 			    struct kobj_attribute *attr, char *buf)
>> @@ -515,6 +568,9 @@ static int __init hugepage_init(void)
>> 		goto err_slab;
>> 
>> 	schedule_delayed_work(&thp_utilization_work, HZ);
>> +	err = register_shrinker(&huge_low_util_page_shrinker, "thp-low-util");
>> +	if (err)
>> +		goto err_low_util_shrinker;
>> 	err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");
>> 	if (err)
>> 		goto err_hzp_shrinker;
>> @@ -522,6 +578,9 @@ static int __init hugepage_init(void)
>> 	if (err)
>> 		goto err_split_shrinker;
>> 
>> +	err = list_lru_init_memcg(&huge_low_util_page_lru, &huge_low_util_page_shrinker);
>> +	if (err)
>> +		goto err_low_util_list_lru;
>> 	/*
>> 	 * By default disable transparent hugepages on smaller systems,
>> 	 * where the extra memory used could hurt more than TLB overhead
>> @@ -538,10 +597,14 @@ static int __init hugepage_init(void)
>> 
>> 	return 0;
>> err_khugepaged:
>> +	list_lru_destroy(&huge_low_util_page_lru);
>> +err_low_util_list_lru:
>> 	unregister_shrinker(&deferred_split_shrinker);
>> err_split_shrinker:
>> 	unregister_shrinker(&huge_zero_page_shrinker);
>> err_hzp_shrinker:
>> +	unregister_shrinker(&huge_low_util_page_shrinker);
>> +err_low_util_shrinker:
>> 	khugepaged_destroy();
>> err_slab:
>> 	hugepage_exit_sysfs(hugepage_kobj);
>> @@ -616,6 +679,7 @@ void prep_transhuge_page(struct page *page)
>> 	 */
>> 
>> 	INIT_LIST_HEAD(page_deferred_list(page));
>> +	INIT_LIST_HEAD(page_underutilized_thp_list(page));
>> 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>> }
>> 
>> @@ -2529,8 +2593,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>> 			 LRU_GEN_MASK | LRU_REFS_MASK));
>> 
>> 	/* ->mapping in first tail page is compound_mapcount */
>> -	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
>> -			page_tail);
>> +	VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail);
>> 	page_tail->mapping = head->mapping;
>> 	page_tail->index = head->index + tail;
>> 	page_tail->private = 0;
>> @@ -2737,6 +2800,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>> 	struct folio *folio = page_folio(page);
>> 	struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page);
>> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page);
>> 	struct anon_vma *anon_vma = NULL;
>> 	struct address_space *mapping = NULL;
>> 	int extra_pins, ret;
>> @@ -2844,6 +2908,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>> 			list_del(page_deferred_list(&folio->page));
>> 		}
>> 		spin_unlock(&ds_queue->split_queue_lock);
>> +		if (!list_empty(underutilized_thp_list))
>> +			list_lru_del_page(&huge_low_util_page_lru, &folio->page,
>> +					  underutilized_thp_list);
>> 		if (mapping) {
>> 			int nr = folio_nr_pages(folio);
>> 
>> @@ -2886,6 +2953,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>> void free_transhuge_page(struct page *page)
>> {
>> 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
>> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(page);
>> 	unsigned long flags;
>> 
>> 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> @@ -2894,6 +2962,12 @@ void free_transhuge_page(struct page *page)
>> 		list_del(page_deferred_list(page));
>> 	}
>> 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> +	if (!list_empty(underutilized_thp_list))
>> +		list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list);
>> +
>> +	if (PageLRU(page))
>> +		__folio_clear_lru_flags(page_folio(page));
>> +
>> 	free_compound_page(page);
>> }
>> 
>> @@ -2934,6 +3008,39 @@ void deferred_split_huge_page(struct page *page)
>> 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> }
>> 
>> +void add_underutilized_thp(struct page *page)
>> +{
>> +	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> 
> Because we haven't taken a reference on the page, the page may be split or
> freed under us, so VM_BUG_ON_PAGE() here may be triggered wrongly?
> 
>> +
>> +	if (PageSwapCache(page))
>> +		return;
>> +
>> +	/*
>> +	 * Need to take a reference on the page to prevent the page from getting free'd from
>> +	 * under us while we are adding the THP to the shrinker.
>> +	 */
>> +	if (!get_page_unless_zero(page))
>> +		return;
>> +
>> +	if (!is_anon_transparent_hugepage(page))
>> +		goto out_put;
>> +
>> +	if (is_huge_zero_page(page))
>> +		goto out_put;
> 
> is_huge_zero_page() check can be done in thp_util_scan() too?

Sounds good.
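
A minimal sketch of what that could look like in thp_util_scan() (illustrative
only, not part of the posted patch), with the is_huge_zero_page() check then
dropped from add_underutilized_thp():

		if (bucket < THP_UTIL_BUCKET_NR - 1 && !is_huge_zero_page(page))
			add_underutilized_thp(page);
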
> 
>> +
>> +	lock_page(page);
>> +
>> +	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
>> +		goto out_unlock;
>> +
>> +	list_lru_add_page(&huge_low_util_page_lru, page, page_underutilized_thp_list(page));
>> +
>> +out_unlock:
>> +	unlock_page(page);
>> +out_put:
>> +	put_page(page);
>> +}
>> +
>> static unsigned long deferred_split_count(struct shrinker *shrink,
>> 		struct shrink_control *sc)
>> {
>> @@ -3478,6 +3585,9 @@ static void thp_util_scan(unsigned long pfn_end)
>> 		if (bucket < 0)
>> 			continue;
>> 
>> +		if (bucket < THP_UTIL_BUCKET_NR - 1)
>> +			add_underutilized_thp(page);
>> +
>> 		thp_scan.buckets[bucket].nr_thps++;
>> 		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
>> 	}
>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>> index a05e5bef3b40..8cc56a84b554 100644
>> --- a/mm/list_lru.c
>> +++ b/mm/list_lru.c
>> @@ -140,6 +140,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>> }
>> EXPORT_SYMBOL_GPL(list_lru_add);
>> 
>> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
>> +{
>> +	int nid = page_to_nid(page);
>> +	struct list_lru_node *nlru = &lru->node[nid];
>> +	struct list_lru_one *l;
>> +	struct mem_cgroup *memcg;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&nlru->lock, flags);
>> +	if (list_empty(item)) {
>> +		memcg = page_memcg(page);
>> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>> +		list_add_tail(item, &l->list);
>> +		/* Set shrinker bit if the first element was added */
>> +		if (!l->nr_items++)
>> +			set_shrinker_bit(memcg, nid,
>> +					 lru_shrinker_id(lru));
>> +		nlru->nr_items++;
>> +		spin_unlock_irqrestore(&nlru->lock, flags);
>> +		return true;
>> +	}
>> +	spin_unlock_irqrestore(&nlru->lock, flags);
>> +	return false;
>> +}
>> +EXPORT_SYMBOL_GPL(list_lru_add_page);
>> +
> 
> It appears that only 2 lines are different from list_lru_add().  Is it
> possible for us to share code?  For example, add another flag for
> page_memcg() case?

I believe there are 4 lines: the page_to_nid(page) and the spin_lock_irqsave/restore.
It was implemented this way as we found we needed to take a page as a parameter and obtain the node id from
the page. This is because the THP is not necessarily a slab object, as the list_lru_add/delete code assumes. 

Also, there is a potential deadlock when split_huge_page is called from reclaim and when split_huge_page is called 
by the THP Shrinker, which is why we need irqsave/restore. 

I thought this would be cleaner than attempting to share code with list_lru_add/delete. Only the shrinker makes use of this.
All other use cases assume slab objects.
> 
>> bool list_lru_del(struct list_lru *lru, struct list_head *item)
>> {
>> 	int nid = page_to_nid(virt_to_page(item));
>> @@ -160,6 +186,29 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
>> }
>> EXPORT_SYMBOL_GPL(list_lru_del);
>> 
>> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item)
>> +{
>> +	int nid = page_to_nid(page);
>> +	struct list_lru_node *nlru = &lru->node[nid];
>> +	struct list_lru_one *l;
>> +	struct mem_cgroup *memcg;
>> +	unsigned long flags;
>> +
>> +	spin_lock_irqsave(&nlru->lock, flags);
>> +	if (!list_empty(item)) {
>> +		memcg = page_memcg(page);
>> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>> +		list_del_init(item);
>> +		l->nr_items--;
>> +		nlru->nr_items--;
>> +		spin_unlock_irqrestore(&nlru->lock, flags);
>> +		return true;
>> +	}
>> +	spin_unlock_irqrestore(&nlru->lock, flags);
>> +	return false;
>> +}
>> +EXPORT_SYMBOL_GPL(list_lru_del_page);
>> +
>> void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
>> {
>> 	list_del_init(item);
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index ac2c9f12a7b2..468eaaade7fe 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1335,6 +1335,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
>> 		 * deferred_list.next -- ignore value.
>> 		 */
>> 		break;
>> +	case 3:
>> +		/*
>> +		 * the third tail page: ->mapping is
>> +		 * underutilized_thp_list.next -- ignore value.
>> +		 */
>> +		break;
>> 	default:
>> 		if (page->mapping != TAIL_MAPPING) {
>> 			bad_page(page, "corrupted mapping in tail page");
> 
> Best Regards,
> Huang, Ying
Huang, Ying Oct. 20, 2022, 1:24 a.m. UTC | #7
"Alex Zhu (Kernel)" <alexlzhu@meta.com> writes:

>> On Oct 19, 2022, at 12:04 AM, Huang, Ying <ying.huang@intel.com> wrote:
>> 
>> > 
>> <alexlzhu@fb.com> writes:
>> 
>>> From: Alexander Zhu <alexlzhu@fb.com>
>>> 
>>> This patch introduces a shrinker that will remove THPs in the lowest
>>> utilization bucket. As previously mentioned, we have observed that
>>> almost all of the memory waste when THPs are always enabled
>>> is contained in the lowest utilization bucket. The shrinker will
>>> add these THPs to a list_lru and split anonymous THPs based off
>>> information from kswapd. It requires the changes from
>>> thp_utilization to identify the least utilized THPs, and the
>>> changes to split_huge_page to identify and free zero pages
>>> within THPs.
>>> 
>>> Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
>>> ---
>>> v2 to v3
>>> -put_page() after trylock_page in low_util_free_page. put() to be called after get() call 
>>> -removed spin_unlock_irq in low_util_free_page above LRU_SKIP. There was a double unlock.    
>>> -moved spin_unlock_irq() to below list_lru_isolate() in low_util_free_page. This is to shorten the critical section.
>>> -moved lock_page in add_underutilized_thp such that we only lock when allocating and adding to the list_lru  
>>> -removed list_lru_alloc in list_lru_add_page and list_lru_delete_page as these are no longer needed. 
>>> 
>>> v1 to v2
>>> -Changed lru_lock to be irq safe. Added irq_save and restore around list_lru adds/deletes.
>>> -Changed low_util_free_page() to trylock the page, and if it fails, unlock lru_lock and return LRU_SKIP. This is to avoid deadlock between reclaim, which calls split_huge_page() and the THP Shrinker
>>> -Changed low_util_free_page() to unlock lru_lock, split_huge_page, then lock lru_lock. This way split_huge_page is not called with the lru_lock held. That leads to deadlock as split_huge_page calls on_each_cpu_mask 
>>> -Changed list_lru_shrink_walk to list_lru_shrink_walk_irq. 
>>> 
>>> RFC to v1
>>> -Remove all THPs that are not in the top utilization bucket. This is what we have found to perform the best in production testing, we have found that there are an almost trivial number of THPs in the middle range of buckets that account for most of the memory waste. 
>>> -Added check for THP utilization prior to split_huge_page for the THP Shrinker. This is to account for THPs that move to the top bucket, but were underutilized at the time they were added to the list_lru. 
>>> -Multiply the shrink_count and scan_count by HPAGE_PMD_NR. This is because a THP is 512 pages, and should count as 512 objects in reclaim. This way reclaim is triggered at a more appropriate frequency than in the RFC. 
>>> 
>>> include/linux/huge_mm.h  |   7 +++
>>> include/linux/list_lru.h |  24 +++++++++
>>> include/linux/mm_types.h |   5 ++
>>> mm/huge_memory.c         | 114 ++++++++++++++++++++++++++++++++++++++-
>>> mm/list_lru.c            |  49 +++++++++++++++++
>>> mm/page_alloc.c          |   6 +++
>>> 6 files changed, 203 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 13ac7b2f29ae..75e4080256be 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -192,6 +192,8 @@ static inline int split_huge_page(struct page *page)
>>> }
>>> void deferred_split_huge_page(struct page *page);
>>> 
>>> +void add_underutilized_thp(struct page *page);
>>> +
>>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> 		unsigned long address, bool freeze, struct folio *folio);
>>> 
>>> @@ -305,6 +307,11 @@ static inline struct list_head *page_deferred_list(struct page *page)
>>> 	return &page[2].deferred_list;
>>> }
>>> 
>>> +static inline struct list_head *page_underutilized_thp_list(struct page *page)
>>> +{
>>> +	return &page[3].underutilized_thp_list;
>>> +}
>>> +
>>> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>>> #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>>> #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
>>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>>> index b35968ee9fb5..c2cf146ea880 100644
>>> --- a/include/linux/list_lru.h
>>> +++ b/include/linux/list_lru.h
>>> @@ -89,6 +89,18 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
>>>  */
>>> bool list_lru_add(struct list_lru *lru, struct list_head *item);
>>> 
>>> +/**
>>> + * list_lru_add_page: add an element to the lru list's tail
>>> + * @list_lru: the lru pointer
>>> + * @page: the page containing the item
>>> + * @item: the item to be deleted.
>>> + *
>>> + * This function works the same as list_lru_add in terms of list
>>> + * manipulation. Used for non slab objects contained in the page.
>>> + *
>>> + * Return value: true if the list was updated, false otherwise
>>> + */
>>> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item);
>>> /**
>>>  * list_lru_del: delete an element to the lru list
>>>  * @list_lru: the lru pointer
>>> @@ -102,6 +114,18 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
>>>  */
>>> bool list_lru_del(struct list_lru *lru, struct list_head *item);
>>> 
>>> +/**
>>> + * list_lru_del_page: delete an element to the lru list
>>> + * @list_lru: the lru pointer
>>> + * @page: the page containing the item
>>> + * @item: the item to be deleted.
>>> + *
>>> + * This function works the same as list_lru_del in terms of list
>>> + * manipulation. Used for non slab objects contained in the page.
>>> + *
>>> + * Return value: true if the list was updated, false otherwise
>>> + */
>>> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item);
>>> /**
>>>  * list_lru_count_one: return the number of objects currently held by @lru
>>>  * @lru: the lru pointer.
>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>> index 500e536796ca..da1d1cf42158 100644
>>> --- a/include/linux/mm_types.h
>>> +++ b/include/linux/mm_types.h
>>> @@ -152,6 +152,11 @@ struct page {
>>> 			/* For both global and memcg */
>>> 			struct list_head deferred_list;
>>> 		};
>>> +		struct { /* Third tail page of compound page */
>>> +			unsigned long _compound_pad_3; /* compound_head */
>>> +			unsigned long _compound_pad_4;
>>> +			struct list_head underutilized_thp_list;
>>> +		};
>>> 		struct {	/* Page table pages */
>>> 			unsigned long _pt_pad_1;	/* compound_head */
>>> 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index a08885228cb2..362df977cc73 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -81,6 +81,8 @@ static atomic_t huge_zero_refcount;
>>> struct page *huge_zero_page __read_mostly;
>>> unsigned long huge_zero_pfn __read_mostly = ~0UL;
>>> 
>>> +static struct list_lru huge_low_util_page_lru;
>>> +
>>> static void thp_utilization_workfn(struct work_struct *work);
>>> static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
>>> 
>>> @@ -263,6 +265,57 @@ static struct shrinker huge_zero_page_shrinker = {
>>> 	.seeks = DEFAULT_SEEKS,
>>> };
>>> 
>>> +static enum lru_status low_util_free_page(struct list_head *item,
>>> +					  struct list_lru_one *lru,
>>> +					  spinlock_t *lock,
>>> +					  void *cb_arg)
>>> +{
>>> +	int bucket, num_utilized_pages;
>>> +	struct page *head = compound_head(list_entry(item,
>>> +									struct page,
>>> +									underutilized_thp_list));
>>> +
>>> +	if (get_page_unless_zero(head)) {
>>> +		if (!trylock_page(head)) {
>>> +			put_page(head);
>>> +			return LRU_SKIP;
>>> +		}
>>> +		list_lru_isolate(lru, item);
>>> +		spin_unlock_irq(lock);
>>> +		num_utilized_pages = thp_number_utilized_pages(head);
>>> +		bucket = thp_utilization_bucket(num_utilized_pages);
>>> +		if (bucket < THP_UTIL_BUCKET_NR - 1) {
>> 
>> If my understanding were correct, the THP will be considered under
>> utilized if utilization percentage < 90%.  Right?  If so, I thought
>> something like utilization percentage < 50% appears more appropriate.
>
> Yes, < 90%. That is just the best number we have found so far from our
> experiments, as it seems almost all THPs are <10% utilized or >90%
> utilized . There are an almost trivial number of pages in between.

So, even if it's <= 10%, the results will be almost the best.  Yes?  If so,
why use 90%?  That appears too extreme.

>> 
>>> +			split_huge_page(head);
>>> +			spin_lock_irq(lock);
>>> +		}
>>> +		unlock_page(head);
>>> +		put_page(head);
>>> +	}
>>> +
>>> +	return LRU_REMOVED_RETRY;
>>> +}
>>> +
>>> +static unsigned long shrink_huge_low_util_page_count(struct shrinker *shrink,
>>> +						     struct shrink_control *sc)
>>> +{
>>> +	return HPAGE_PMD_NR * list_lru_shrink_count(&huge_low_util_page_lru, sc);
>>> +}
>>> +
>>> +static unsigned long shrink_huge_low_util_page_scan(struct shrinker *shrink,
>>> +						    struct shrink_control *sc)
>>> +{
>>> +	return HPAGE_PMD_NR * list_lru_shrink_walk_irq(&huge_low_util_page_lru,
>>> +							sc, low_util_free_page, NULL);
>>> +}
>>> +
>>> +static struct shrinker huge_low_util_page_shrinker = {
>>> +	.count_objects = shrink_huge_low_util_page_count,
>>> +	.scan_objects = shrink_huge_low_util_page_scan,
>>> +	.seeks = DEFAULT_SEEKS,
>>> +	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
>>> +		SHRINKER_NONSLAB,
>>> +};
>>> +
>>> #ifdef CONFIG_SYSFS
>>> static ssize_t enabled_show(struct kobject *kobj,
>>> 			    struct kobj_attribute *attr, char *buf)
>>> @@ -515,6 +568,9 @@ static int __init hugepage_init(void)
>>> 		goto err_slab;
>>> 
>>> 	schedule_delayed_work(&thp_utilization_work, HZ);
>>> +	err = register_shrinker(&huge_low_util_page_shrinker, "thp-low-util");
>>> +	if (err)
>>> +		goto err_low_util_shrinker;
>>> 	err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");
>>> 	if (err)
>>> 		goto err_hzp_shrinker;
>>> @@ -522,6 +578,9 @@ static int __init hugepage_init(void)
>>> 	if (err)
>>> 		goto err_split_shrinker;
>>> 
>>> +	err = list_lru_init_memcg(&huge_low_util_page_lru, &huge_low_util_page_shrinker);
>>> +	if (err)
>>> +		goto err_low_util_list_lru;
>>> 	/*
>>> 	 * By default disable transparent hugepages on smaller systems,
>>> 	 * where the extra memory used could hurt more than TLB overhead
>>> @@ -538,10 +597,14 @@ static int __init hugepage_init(void)
>>> 
>>> 	return 0;
>>> err_khugepaged:
>>> +	list_lru_destroy(&huge_low_util_page_lru);
>>> +err_low_util_list_lru:
>>> 	unregister_shrinker(&deferred_split_shrinker);
>>> err_split_shrinker:
>>> 	unregister_shrinker(&huge_zero_page_shrinker);
>>> err_hzp_shrinker:
>>> +	unregister_shrinker(&huge_low_util_page_shrinker);
>>> +err_low_util_shrinker:
>>> 	khugepaged_destroy();
>>> err_slab:
>>> 	hugepage_exit_sysfs(hugepage_kobj);
>>> @@ -616,6 +679,7 @@ void prep_transhuge_page(struct page *page)
>>> 	 */
>>> 
>>> 	INIT_LIST_HEAD(page_deferred_list(page));
>>> +	INIT_LIST_HEAD(page_underutilized_thp_list(page));
>>> 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>>> }
>>> 
>>> @@ -2529,8 +2593,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>> 			 LRU_GEN_MASK | LRU_REFS_MASK));
>>> 
>>> 	/* ->mapping in first tail page is compound_mapcount */
>>> -	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
>>> -			page_tail);
>>> +	VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail);
>>> 	page_tail->mapping = head->mapping;
>>> 	page_tail->index = head->index + tail;
>>> 	page_tail->private = 0;
>>> @@ -2737,6 +2800,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>> 	struct folio *folio = page_folio(page);
>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page);
>>> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page);
>>> 	struct anon_vma *anon_vma = NULL;
>>> 	struct address_space *mapping = NULL;
>>> 	int extra_pins, ret;
>>> @@ -2844,6 +2908,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>> 			list_del(page_deferred_list(&folio->page));
>>> 		}
>>> 		spin_unlock(&ds_queue->split_queue_lock);
>>> +		if (!list_empty(underutilized_thp_list))
>>> +			list_lru_del_page(&huge_low_util_page_lru, &folio->page,
>>> +					  underutilized_thp_list);
>>> 		if (mapping) {
>>> 			int nr = folio_nr_pages(folio);
>>> 
>>> @@ -2886,6 +2953,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>> void free_transhuge_page(struct page *page)
>>> {
>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
>>> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(page);
>>> 	unsigned long flags;
>>> 
>>> 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>> @@ -2894,6 +2962,12 @@ void free_transhuge_page(struct page *page)
>>> 		list_del(page_deferred_list(page));
>>> 	}
>>> 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>> +	if (!list_empty(underutilized_thp_list))
>>> +		list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list);
>>> +
>>> +	if (PageLRU(page))
>>> +		__folio_clear_lru_flags(page_folio(page));
>>> +
>>> 	free_compound_page(page);
>>> }
>>> 
>>> @@ -2934,6 +3008,39 @@ void deferred_split_huge_page(struct page *page)
>>> 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>> }
>>> 
>>> +void add_underutilized_thp(struct page *page)
>>> +{
>>> +	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>> 
>> Because we haven't get reference of the page, the page may be split or
>> free under us, so VM_BUG_ON_PAGE() here may be triggered wrongly?
>> 
>>> +
>>> +	if (PageSwapCache(page))
>>> +		return;
>>> +
>>> +	/*
>>> +	 * Need to take a reference on the page to prevent the page from getting free'd from
>>> +	 * under us while we are adding the THP to the shrinker.
>>> +	 */
>>> +	if (!get_page_unless_zero(page))
>>> +		return;
>>> +
>>> +	if (!is_anon_transparent_hugepage(page))
>>> +		goto out_put;
>>> +
>>> +	if (is_huge_zero_page(page))
>>> +		goto out_put;
>> 
>> is_huge_zero_page() check can be done in thp_util_scan() too?
>
> Sounds good.
>> 
>>> +
>>> +	lock_page(page);
>>> +
>>> +	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
>>> +		goto out_unlock;
>>> +
>>> +	list_lru_add_page(&huge_low_util_page_lru, page, page_underutilized_thp_list(page));
>>> +
>>> +out_unlock:
>>> +	unlock_page(page);
>>> +out_put:
>>> +	put_page(page);
>>> +}
>>> +
>>> static unsigned long deferred_split_count(struct shrinker *shrink,
>>> 		struct shrink_control *sc)
>>> {
>>> @@ -3478,6 +3585,9 @@ static void thp_util_scan(unsigned long pfn_end)
>>> 		if (bucket < 0)
>>> 			continue;
>>> 
>>> +		if (bucket < THP_UTIL_BUCKET_NR - 1)
>>> +			add_underutilized_thp(page);
>>> +
>>> 		thp_scan.buckets[bucket].nr_thps++;
>>> 		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
>>> 	}
>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>>> index a05e5bef3b40..8cc56a84b554 100644
>>> --- a/mm/list_lru.c
>>> +++ b/mm/list_lru.c
>>> @@ -140,6 +140,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>>> }
>>> EXPORT_SYMBOL_GPL(list_lru_add);
>>> 
>>> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
>>> +{
>>> +	int nid = page_to_nid(page);
>>> +	struct list_lru_node *nlru = &lru->node[nid];
>>> +	struct list_lru_one *l;
>>> +	struct mem_cgroup *memcg;
>>> +	unsigned long flags;
>>> +
>>> +	spin_lock_irqsave(&nlru->lock, flags);
>>> +	if (list_empty(item)) {
>>> +		memcg = page_memcg(page);
>>> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>> +		list_add_tail(item, &l->list);
>>> +		/* Set shrinker bit if the first element was added */
>>> +		if (!l->nr_items++)
>>> +			set_shrinker_bit(memcg, nid,
>>> +					 lru_shrinker_id(lru));
>>> +		nlru->nr_items++;
>>> +		spin_unlock_irqrestore(&nlru->lock, flags);
>>> +		return true;
>>> +	}
>>> +	spin_unlock_irqrestore(&nlru->lock, flags);
>>> +	return false;
>>> +}
>>> +EXPORT_SYMBOL_GPL(list_lru_add_page);
>>> +
>> 
>> It appears that only 2 lines are different from list_lru_add().  Is it
>> possible for us to share code?  For example, add another flag for
>> page_memcg() case?
>
> I believe there are 4 lines. The page_to_nid(page) and the spin_lock_irqsave/restore. 
> It was implemented this way as we found we needed to take a page as a parameter and obtain the node id from
> the page. This is because the THP is not necessarily a slab object, as the list_lru_add/delete code assumes. 
>
> Also, there is a potential deadlock when split_huge_page is called from reclaim and when split_huge_page is called 
> by the THP Shrinker, which is why we need irqsave/restore. 

Can you provide more details on this?  If it's necessary, I think it should be OK to use irqsave/restore.

> I though this would be cleaner than attempting to shared code with list_lru_add/delete. Only the shrinker makes use of this. 
> All other use cases assume slab objects.

Not only for slab objects now.  The original code works for slab and
normal pages, which is controlled by list_lru->memcg_aware.  You can add
another flag for your new use case, or refactor it to use a function
pointer.
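
One possible shape for that kind of sharing, sketched here only to make the
suggestion concrete (it factors the common list manipulation into an internal
helper instead of a flag; the helper name and calling convention are made up):

/*
 * Sketch: callers resolve nid/memcg themselves.  list_lru_add() would pass
 * page_to_nid(virt_to_page(item)) plus the slab object's memcg, while
 * list_lru_add_page() would pass page_to_nid(page) and page_memcg(page).
 * Whether the slab path should also switch to the irqsave variant is a
 * separate question.
 */
static bool __list_lru_add(struct list_lru *lru, struct list_head *item,
			   int nid, struct mem_cgroup *memcg)
{
	struct list_lru_node *nlru = &lru->node[nid];
	struct list_lru_one *l;
	unsigned long flags;
	bool added = false;

	spin_lock_irqsave(&nlru->lock, flags);
	if (list_empty(item)) {
		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
		list_add_tail(item, &l->list);
		/* Set shrinker bit if the first element was added */
		if (!l->nr_items++)
			set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
		nlru->nr_items++;
		added = true;
	}
	spin_unlock_irqrestore(&nlru->lock, flags);
	return added;
}
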

Best Regards,
Huang, Ying

>> 
>>> bool list_lru_del(struct list_lru *lru, struct list_head *item)
>>> {
>>> 	int nid = page_to_nid(virt_to_page(item));
>>> @@ -160,6 +186,29 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
>>> }
>>> EXPORT_SYMBOL_GPL(list_lru_del);
>>> 
>>> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item)
>>> +{
>>> +	int nid = page_to_nid(page);
>>> +	struct list_lru_node *nlru = &lru->node[nid];
>>> +	struct list_lru_one *l;
>>> +	struct mem_cgroup *memcg;
>>> +	unsigned long flags;
>>> +
>>> +	spin_lock_irqsave(&nlru->lock, flags);
>>> +	if (!list_empty(item)) {
>>> +		memcg = page_memcg(page);
>>> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>> +		list_del_init(item);
>>> +		l->nr_items--;
>>> +		nlru->nr_items--;
>>> +		spin_unlock_irqrestore(&nlru->lock, flags);
>>> +		return true;
>>> +	}
>>> +	spin_unlock_irqrestore(&nlru->lock, flags);
>>> +	return false;
>>> +}
>>> +EXPORT_SYMBOL_GPL(list_lru_del_page);
>>> +
>>> void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
>>> {
>>> 	list_del_init(item);
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index ac2c9f12a7b2..468eaaade7fe 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1335,6 +1335,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
>>> 		 * deferred_list.next -- ignore value.
>>> 		 */
>>> 		break;
>>> +	case 3:
>>> +		/*
>>> +		 * the third tail page: ->mapping is
>>> +		 * underutilized_thp_list.next -- ignore value.
>>> +		 */
>>> +		break;
>>> 	default:
>>> 		if (page->mapping != TAIL_MAPPING) {
>>> 			bad_page(page, "corrupted mapping in tail page");
>> 
>> Best Regards,
>> Huang, Ying
Alex Zhu (Kernel) Oct. 20, 2022, 3:29 a.m. UTC | #8
> On Oct 19, 2022, at 6:24 PM, Huang, Ying <ying.huang@intel.com> wrote:
> 
> "Alex Zhu (Kernel)" <alexlzhu@meta.com> writes:
> 
>>> On Oct 19, 2022, at 12:04 AM, Huang, Ying <ying.huang@intel.com> wrote:
>>> 
>>>> 
>>> <alexlzhu@fb.com> writes:
>>> 
>>>> From: Alexander Zhu <alexlzhu@fb.com>
>>>> 
>>>> This patch introduces a shrinker that will remove THPs in the lowest
>>>> utilization bucket. As previously mentioned, we have observed that
>>>> almost all of the memory waste when THPs are always enabled
>>>> is contained in the lowest utilization bucket. The shrinker will
>>>> add these THPs to a list_lru and split anonymous THPs based off
>>>> information from kswapd. It requires the changes from
>>>> thp_utilization to identify the least utilized THPs, and the
>>>> changes to split_huge_page to identify and free zero pages
>>>> within THPs.
>>>> 
>>>> Signed-off-by: Alexander Zhu <alexlzhu@fb.com>
>>>> ---
>>>> v2 to v3
>>>> -put_page() after trylock_page in low_util_free_page. put() to be called after get() call 
>>>> -removed spin_unlock_irq in low_util_free_page above LRU_SKIP. There was a double unlock.    
>>>> -moved spin_unlock_irq() to below list_lru_isolate() in low_util_free_page. This is to shorten the critical section.
>>>> -moved lock_page in add_underutilized_thp such that we only lock when allocating and adding to the list_lru  
>>>> -removed list_lru_alloc in list_lru_add_page and list_lru_delete_page as these are no longer needed. 
>>>> 
>>>> v1 to v2
>>>> -Changed lru_lock to be irq safe. Added irq_save and restore around list_lru adds/deletes.
>>>> -Changed low_util_free_page() to trylock the page, and if it fails, unlock lru_lock and return LRU_SKIP. This is to avoid deadlock between reclaim, which calls split_huge_page() and the THP Shrinker
>>>> -Changed low_util_free_page() to unlock lru_lock, split_huge_page, then lock lru_lock. This way split_huge_page is not called with the lru_lock held. That leads to deadlock as split_huge_page calls on_each_cpu_mask 
>>>> -Changed list_lru_shrink_walk to list_lru_shrink_walk_irq. 
>>>> 
>>>> RFC to v1
>>>> -Remove all THPs that are not in the top utilization bucket. This is what we have found to perform the best in production testing, we have found that there are an almost trivial number of THPs in the middle range of buckets that account for most of the memory waste. 
>>>> -Added check for THP utilization prior to split_huge_page for the THP Shrinker. This is to account for THPs that move to the top bucket, but were underutilized at the time they were added to the list_lru. 
>>>> -Multiply the shrink_count and scan_count by HPAGE_PMD_NR. This is because a THP is 512 pages, and should count as 512 objects in reclaim. This way reclaim is triggered at a more appropriate frequency than in the RFC. 
>>>> 
>>>> include/linux/huge_mm.h  |   7 +++
>>>> include/linux/list_lru.h |  24 +++++++++
>>>> include/linux/mm_types.h |   5 ++
>>>> mm/huge_memory.c         | 114 ++++++++++++++++++++++++++++++++++++++-
>>>> mm/list_lru.c            |  49 +++++++++++++++++
>>>> mm/page_alloc.c          |   6 +++
>>>> 6 files changed, 203 insertions(+), 2 deletions(-)
>>>> 
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 13ac7b2f29ae..75e4080256be 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -192,6 +192,8 @@ static inline int split_huge_page(struct page *page)
>>>> }
>>>> void deferred_split_huge_page(struct page *page);
>>>> 
>>>> +void add_underutilized_thp(struct page *page);
>>>> +
>>>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>>> 		unsigned long address, bool freeze, struct folio *folio);
>>>> 
>>>> @@ -305,6 +307,11 @@ static inline struct list_head *page_deferred_list(struct page *page)
>>>> 	return &page[2].deferred_list;
>>>> }
>>>> 
>>>> +static inline struct list_head *page_underutilized_thp_list(struct page *page)
>>>> +{
>>>> +	return &page[3].underutilized_thp_list;
>>>> +}
>>>> +
>>>> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
>>>> #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>>>> #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
>>>> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
>>>> index b35968ee9fb5..c2cf146ea880 100644
>>>> --- a/include/linux/list_lru.h
>>>> +++ b/include/linux/list_lru.h
>>>> @@ -89,6 +89,18 @@ void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
>>>> */
>>>> bool list_lru_add(struct list_lru *lru, struct list_head *item);
>>>> 
>>>> +/**
>>>> + * list_lru_add_page: add an element to the lru list's tail
>>>> + * @list_lru: the lru pointer
>>>> + * @page: the page containing the item
>>>> + * @item: the item to be deleted.
>>>> + *
>>>> + * This function works the same as list_lru_add in terms of list
>>>> + * manipulation. Used for non slab objects contained in the page.
>>>> + *
>>>> + * Return value: true if the list was updated, false otherwise
>>>> + */
>>>> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item);
>>>> /**
>>>> * list_lru_del: delete an element to the lru list
>>>> * @list_lru: the lru pointer
>>>> @@ -102,6 +114,18 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
>>>> */
>>>> bool list_lru_del(struct list_lru *lru, struct list_head *item);
>>>> 
>>>> +/**
>>>> + * list_lru_del_page: delete an element to the lru list
>>>> + * @list_lru: the lru pointer
>>>> + * @page: the page containing the item
>>>> + * @item: the item to be deleted.
>>>> + *
>>>> + * This function works the same as list_lru_del in terms of list
>>>> + * manipulation. Used for non slab objects contained in the page.
>>>> + *
>>>> + * Return value: true if the list was updated, false otherwise
>>>> + */
>>>> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item);
>>>> /**
>>>> * list_lru_count_one: return the number of objects currently held by @lru
>>>> * @lru: the lru pointer.
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 500e536796ca..da1d1cf42158 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -152,6 +152,11 @@ struct page {
>>>> 			/* For both global and memcg */
>>>> 			struct list_head deferred_list;
>>>> 		};
>>>> +		struct { /* Third tail page of compound page */
>>>> +			unsigned long _compound_pad_3; /* compound_head */
>>>> +			unsigned long _compound_pad_4;
>>>> +			struct list_head underutilized_thp_list;
>>>> +		};
>>>> 		struct {	/* Page table pages */
>>>> 			unsigned long _pt_pad_1;	/* compound_head */
>>>> 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index a08885228cb2..362df977cc73 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -81,6 +81,8 @@ static atomic_t huge_zero_refcount;
>>>> struct page *huge_zero_page __read_mostly;
>>>> unsigned long huge_zero_pfn __read_mostly = ~0UL;
>>>> 
>>>> +static struct list_lru huge_low_util_page_lru;
>>>> +
>>>> static void thp_utilization_workfn(struct work_struct *work);
>>>> static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
>>>> 
>>>> @@ -263,6 +265,57 @@ static struct shrinker huge_zero_page_shrinker = {
>>>> 	.seeks = DEFAULT_SEEKS,
>>>> };
>>>> 
>>>> +static enum lru_status low_util_free_page(struct list_head *item,
>>>> +					  struct list_lru_one *lru,
>>>> +					  spinlock_t *lock,
>>>> +					  void *cb_arg)
>>>> +{
>>>> +	int bucket, num_utilized_pages;
>>>> +	struct page *head = compound_head(list_entry(item,
>>>> +									struct page,
>>>> +									underutilized_thp_list));
>>>> +
>>>> +	if (get_page_unless_zero(head)) {
>>>> +		if (!trylock_page(head)) {
>>>> +			put_page(head);
>>>> +			return LRU_SKIP;
>>>> +		}
>>>> +		list_lru_isolate(lru, item);
>>>> +		spin_unlock_irq(lock);
>>>> +		num_utilized_pages = thp_number_utilized_pages(head);
>>>> +		bucket = thp_utilization_bucket(num_utilized_pages);
>>>> +		if (bucket < THP_UTIL_BUCKET_NR - 1) {
>>> 
>>> If my understanding were correct, the THP will be considered under
>>> utilized if utilization percentage < 90%.  Right?  If so, I thought
>>> something like utilization percentage < 50% appears more appropriate.
>> 
>> Yes, < 90%. That is just the best number we have found so far from our
>> experiments, as it seems almost all THPs are <10% utilized or >90%
>> utilized . There are an almost trivial number of pages in between.
> 
> So, even if it's <= 10%, the results will be almost best.  Yes?  If so,
> why use 90%?  Which is appears too extreme.
> 
>>> 
>>>> +			split_huge_page(head);
>>>> +			spin_lock_irq(lock);
>>>> +		}
>>>> +		unlock_page(head);
>>>> +		put_page(head);
>>>> +	}
>>>> +
>>>> +	return LRU_REMOVED_RETRY;
>>>> +}
>>>> +
>>>> +static unsigned long shrink_huge_low_util_page_count(struct shrinker *shrink,
>>>> +						     struct shrink_control *sc)
>>>> +{
>>>> +	return HPAGE_PMD_NR * list_lru_shrink_count(&huge_low_util_page_lru, sc);
>>>> +}
>>>> +
>>>> +static unsigned long shrink_huge_low_util_page_scan(struct shrinker *shrink,
>>>> +						    struct shrink_control *sc)
>>>> +{
>>>> +	return HPAGE_PMD_NR * list_lru_shrink_walk_irq(&huge_low_util_page_lru,
>>>> +							sc, low_util_free_page, NULL);
>>>> +}
>>>> +
>>>> +static struct shrinker huge_low_util_page_shrinker = {
>>>> +	.count_objects = shrink_huge_low_util_page_count,
>>>> +	.scan_objects = shrink_huge_low_util_page_scan,
>>>> +	.seeks = DEFAULT_SEEKS,
>>>> +	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
>>>> +		SHRINKER_NONSLAB,
>>>> +};
>>>> +
>>>> #ifdef CONFIG_SYSFS
>>>> static ssize_t enabled_show(struct kobject *kobj,
>>>> 			    struct kobj_attribute *attr, char *buf)
>>>> @@ -515,6 +568,9 @@ static int __init hugepage_init(void)
>>>> 		goto err_slab;
>>>> 
>>>> 	schedule_delayed_work(&thp_utilization_work, HZ);
>>>> +	err = register_shrinker(&huge_low_util_page_shrinker, "thp-low-util");
>>>> +	if (err)
>>>> +		goto err_low_util_shrinker;
>>>> 	err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");
>>>> 	if (err)
>>>> 		goto err_hzp_shrinker;
>>>> @@ -522,6 +578,9 @@ static int __init hugepage_init(void)
>>>> 	if (err)
>>>> 		goto err_split_shrinker;
>>>> 
>>>> +	err = list_lru_init_memcg(&huge_low_util_page_lru, &huge_low_util_page_shrinker);
>>>> +	if (err)
>>>> +		goto err_low_util_list_lru;
>>>> 	/*
>>>> 	 * By default disable transparent hugepages on smaller systems,
>>>> 	 * where the extra memory used could hurt more than TLB overhead
>>>> @@ -538,10 +597,14 @@ static int __init hugepage_init(void)
>>>> 
>>>> 	return 0;
>>>> err_khugepaged:
>>>> +	list_lru_destroy(&huge_low_util_page_lru);
>>>> +err_low_util_list_lru:
>>>> 	unregister_shrinker(&deferred_split_shrinker);
>>>> err_split_shrinker:
>>>> 	unregister_shrinker(&huge_zero_page_shrinker);
>>>> err_hzp_shrinker:
>>>> +	unregister_shrinker(&huge_low_util_page_shrinker);
>>>> +err_low_util_shrinker:
>>>> 	khugepaged_destroy();
>>>> err_slab:
>>>> 	hugepage_exit_sysfs(hugepage_kobj);
>>>> @@ -616,6 +679,7 @@ void prep_transhuge_page(struct page *page)
>>>> 	 */
>>>> 
>>>> 	INIT_LIST_HEAD(page_deferred_list(page));
>>>> +	INIT_LIST_HEAD(page_underutilized_thp_list(page));
>>>> 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
>>>> }
>>>> 
>>>> @@ -2529,8 +2593,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
>>>> 			 LRU_GEN_MASK | LRU_REFS_MASK));
>>>> 
>>>> 	/* ->mapping in first tail page is compound_mapcount */
>>>> -	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
>>>> -			page_tail);
>>>> +	VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail);
>>>> 	page_tail->mapping = head->mapping;
>>>> 	page_tail->index = head->index + tail;
>>>> 	page_tail->private = 0;
>>>> @@ -2737,6 +2800,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>>> 	struct folio *folio = page_folio(page);
>>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page);
>>>> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page);
>>>> 	struct anon_vma *anon_vma = NULL;
>>>> 	struct address_space *mapping = NULL;
>>>> 	int extra_pins, ret;
>>>> @@ -2844,6 +2908,9 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>>> 			list_del(page_deferred_list(&folio->page));
>>>> 		}
>>>> 		spin_unlock(&ds_queue->split_queue_lock);
>>>> +		if (!list_empty(underutilized_thp_list))
>>>> +			list_lru_del_page(&huge_low_util_page_lru, &folio->page,
>>>> +					  underutilized_thp_list);
>>>> 		if (mapping) {
>>>> 			int nr = folio_nr_pages(folio);
>>>> 
>>>> @@ -2886,6 +2953,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>>>> void free_transhuge_page(struct page *page)
>>>> {
>>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
>>>> +	struct list_head *underutilized_thp_list = page_underutilized_thp_list(page);
>>>> 	unsigned long flags;
>>>> 
>>>> 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>>> @@ -2894,6 +2962,12 @@ void free_transhuge_page(struct page *page)
>>>> 		list_del(page_deferred_list(page));
>>>> 	}
>>>> 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>>> +	if (!list_empty(underutilized_thp_list))
>>>> +		list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list);
>>>> +
>>>> +	if (PageLRU(page))
>>>> +		__folio_clear_lru_flags(page_folio(page));
>>>> +
>>>> 	free_compound_page(page);
>>>> }
>>>> 
>>>> @@ -2934,6 +3008,39 @@ void deferred_split_huge_page(struct page *page)
>>>> 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>>>> }
>>>> 
>>>> +void add_underutilized_thp(struct page *page)
>>>> +{
>>>> +	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
>>> 
>>> Because we haven't get reference of the page, the page may be split or
>>> free under us, so VM_BUG_ON_PAGE() here may be triggered wrongly?
>>> 
>>>> +
>>>> +	if (PageSwapCache(page))
>>>> +		return;
>>>> +
>>>> +	/*
>>>> +	 * Need to take a reference on the page to prevent the page from getting free'd from
>>>> +	 * under us while we are adding the THP to the shrinker.
>>>> +	 */
>>>> +	if (!get_page_unless_zero(page))
>>>> +		return;
>>>> +
>>>> +	if (!is_anon_transparent_hugepage(page))
>>>> +		goto out_put;
>>>> +
>>>> +	if (is_huge_zero_page(page))
>>>> +		goto out_put;
>>> 
>>> is_huge_zero_page() check can be done in thp_util_scan() too?
>> 
>> Sounds good.
>>> 
>>>> +
>>>> +	lock_page(page);
>>>> +
>>>> +	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
>>>> +		goto out_unlock;
>>>> +
>>>> +	list_lru_add_page(&huge_low_util_page_lru, page, page_underutilized_thp_list(page));
>>>> +
>>>> +out_unlock:
>>>> +	unlock_page(page);
>>>> +out_put:
>>>> +	put_page(page);
>>>> +}
>>>> +
>>>> static unsigned long deferred_split_count(struct shrinker *shrink,
>>>> 		struct shrink_control *sc)
>>>> {
>>>> @@ -3478,6 +3585,9 @@ static void thp_util_scan(unsigned long pfn_end)
>>>> 		if (bucket < 0)
>>>> 			continue;
>>>> 
>>>> +		if (bucket < THP_UTIL_BUCKET_NR - 1)
>>>> +			add_underutilized_thp(page);
>>>> +
>>>> 		thp_scan.buckets[bucket].nr_thps++;
>>>> 		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
>>>> 	}
>>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>>>> index a05e5bef3b40..8cc56a84b554 100644
>>>> --- a/mm/list_lru.c
>>>> +++ b/mm/list_lru.c
>>>> @@ -140,6 +140,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>>>> }
>>>> EXPORT_SYMBOL_GPL(list_lru_add);
>>>> 
>>>> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
>>>> +{
>>>> +	int nid = page_to_nid(page);
>>>> +	struct list_lru_node *nlru = &lru->node[nid];
>>>> +	struct list_lru_one *l;
>>>> +	struct mem_cgroup *memcg;
>>>> +	unsigned long flags;
>>>> +
>>>> +	spin_lock_irqsave(&nlru->lock, flags);
>>>> +	if (list_empty(item)) {
>>>> +		memcg = page_memcg(page);
>>>> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>>> +		list_add_tail(item, &l->list);
>>>> +		/* Set shrinker bit if the first element was added */
>>>> +		if (!l->nr_items++)
>>>> +			set_shrinker_bit(memcg, nid,
>>>> +					 lru_shrinker_id(lru));
>>>> +		nlru->nr_items++;
>>>> +		spin_unlock_irqrestore(&nlru->lock, flags);
>>>> +		return true;
>>>> +	}
>>>> +	spin_unlock_irqrestore(&nlru->lock, flags);
>>>> +	return false;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(list_lru_add_page);
>>>> +
>>> 
>>> It appears that only 2 lines are different from list_lru_add().  Is it
>>> possible for us to share code?  For example, add another flag for
>>> page_memcg() case?
>> 
>> I believe there are 4 lines. The page_to_nid(page) and the spin_lock_irqsave/restore. 
>> It was implemented this way as we found we needed to take a page as a parameter and obtain the node id from
>> the page. This is because the THP is not necessarily a slab object, as the list_lru_add/delete code assumes. 
>> 
>> Also, there is a potential deadlock when split_huge_page is called from reclaim and when split_huge_page is called 
>> by the THP Shrinker, which is why we need irqsave/restore. 
> 
> Can you provide more details on this?  If it's necessary, I think it should be OK to use irqsave/restore.

There is an ABBA deadlock between reclaim and the THP shrinker where one waits for the page lock and the other waits for the
list_lru lock. There is another deadlock where free_transhuge_page interrupts the THP shrinker while it is splitting THPs (and
holding the list_lru lock) and then acquires the same list_lru lock (self deadlock). These happened in our prod experiments.
I do believe irqsave/restore is necessary.
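
For reference, the two scenarios described above written out as a
lock-ordering diagram (reconstructed from this description, not taken from
the patch):

/*
 * ABBA between the shrinker walk and reclaim (what the trylock_page() in
 * low_util_free_page() avoids):
 *
 *   THP shrinker                         reclaim
 *   ------------                         -------
 *   spin_lock(&nlru->lock)               lock_page(head)
 *   lock_page(head)      <- blocks       split_huge_page_to_list()
 *                                          list_lru_del_page()
 *                                            spin_lock(&nlru->lock) <- blocks
 *
 * Self-deadlock (why plain spin_lock() is not enough): the shrinker walk
 * holds nlru->lock while splitting, free_transhuge_page() interrupts it on
 * the same CPU and calls list_lru_del_page(), which tries to take the same
 * nlru->lock again.
 */
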

> 
>> I though this would be cleaner than attempting to shared code with list_lru_add/delete. Only the shrinker makes use of this. 
>> All other use cases assume slab objects.
> 
> Not only for slab objects now.  The original code works for slab and
> normal page.  Which is controlled by list_lru->memcg_aware.  You can add
> another flag for your new use case, or refactor it to use a function
> pointer.
> 
> Best Regards,
> Huang, Ying

I’ll take a look at this tomorrow and get back to you. Thanks for the suggestion! 
> 
>>> 
>>>> bool list_lru_del(struct list_lru *lru, struct list_head *item)
>>>> {
>>>> 	int nid = page_to_nid(virt_to_page(item));
>>>> @@ -160,6 +186,29 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
>>>> }
>>>> EXPORT_SYMBOL_GPL(list_lru_del);
>>>> 
>>>> +bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item)
>>>> +{
>>>> +	int nid = page_to_nid(page);
>>>> +	struct list_lru_node *nlru = &lru->node[nid];
>>>> +	struct list_lru_one *l;
>>>> +	struct mem_cgroup *memcg;
>>>> +	unsigned long flags;
>>>> +
>>>> +	spin_lock_irqsave(&nlru->lock, flags);
>>>> +	if (!list_empty(item)) {
>>>> +		memcg = page_memcg(page);
>>>> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>>> +		list_del_init(item);
>>>> +		l->nr_items--;
>>>> +		nlru->nr_items--;
>>>> +		spin_unlock_irqrestore(&nlru->lock, flags);
>>>> +		return true;
>>>> +	}
>>>> +	spin_unlock_irqrestore(&nlru->lock, flags);
>>>> +	return false;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(list_lru_del_page);
>>>> +
>>>> void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
>>>> {
>>>> 	list_del_init(item);
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index ac2c9f12a7b2..468eaaade7fe 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -1335,6 +1335,12 @@ static int free_tail_pages_check(struct page *head_page, struct page *page)
>>>> 		 * deferred_list.next -- ignore value.
>>>> 		 */
>>>> 		break;
>>>> +	case 3:
>>>> +		/*
>>>> +		 * the third tail page: ->mapping is
>>>> +		 * underutilized_thp_list.next -- ignore value.
>>>> +		 */
>>>> +		break;
>>>> 	default:
>>>> 		if (page->mapping != TAIL_MAPPING) {
>>>> 			bad_page(page, "corrupted mapping in tail page");
>>> 
>>> Best Regards,
>>> Huang, Ying
Huang, Ying Oct. 20, 2022, 8:06 a.m. UTC | #9
"Alex Zhu (Kernel)" <alexlzhu@meta.com> writes:

>> On Oct 19, 2022, at 6:24 PM, Huang, Ying <ying.huang@intel.com> wrote:
>> 
>> "Alex Zhu (Kernel)" <alexlzhu@meta.com> writes:
>> 
>>>> On Oct 19, 2022, at 12:04 AM, Huang, Ying <ying.huang@intel.com> wrote:
>>>> 
>>>>> 
>>>> <alexlzhu@fb.com> writes:
>>>>

[snip]

>>>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>>>>> index a05e5bef3b40..8cc56a84b554 100644
>>>>> --- a/mm/list_lru.c
>>>>> +++ b/mm/list_lru.c
>>>>> @@ -140,6 +140,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(list_lru_add);
>>>>> 
>>>>> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
>>>>> +{
>>>>> +	int nid = page_to_nid(page);
>>>>> +	struct list_lru_node *nlru = &lru->node[nid];
>>>>> +	struct list_lru_one *l;
>>>>> +	struct mem_cgroup *memcg;
>>>>> +	unsigned long flags;
>>>>> +
>>>>> +	spin_lock_irqsave(&nlru->lock, flags);
>>>>> +	if (list_empty(item)) {
>>>>> +		memcg = page_memcg(page);
>>>>> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>>>> +		list_add_tail(item, &l->list);
>>>>> +		/* Set shrinker bit if the first element was added */
>>>>> +		if (!l->nr_items++)
>>>>> +			set_shrinker_bit(memcg, nid,
>>>>> +					 lru_shrinker_id(lru));
>>>>> +		nlru->nr_items++;
>>>>> +		spin_unlock_irqrestore(&nlru->lock, flags);
>>>>> +		return true;
>>>>> +	}
>>>>> +	spin_unlock_irqrestore(&nlru->lock, flags);
>>>>> +	return false;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(list_lru_add_page);
>>>>> +
>>>> 
>>>> It appears that only 2 lines are different from list_lru_add().  Is it
>>>> possible for us to share code?  For example, add another flag for
>>>> page_memcg() case?
>>> 
>>> I believe there are 4 lines. The page_to_nid(page) and the spin_lock_irqsave/restore. 
>>> It was implemented this way as we found we needed to take a page as a parameter and obtain the node id from
>>> the page. This is because the THP is not necessarily a slab object, as the list_lru_add/delete code assumes. 
>>> 
>>> Also, there is a potential deadlock when split_huge_page is called from reclaim and when split_huge_page is called 
>>> by the THP Shrinker, which is why we need irqsave/restore. 
>> 
>> Can you provide more details on this?  If it's necessary, I think it should be OK to use irqsave/restore.
>
> There is an ABBA deadlock between reclaim and the THP shrinker where
> one waits for the page lock and the other waits for the list_lru lock.

Per my understanding, when we hold the list_lru lock, we only use
trylock_page() and back off if we fail to acquire the page lock.  So it
appears that we are safe here?  If not, can you provide the detailed
race condition information, for example actions on different CPUs?

> There is another deadlock where free_transhuge_page interrupts the THP
> shrinker while it is splitting THPs (has the list_lru lock) and then
> acquires the same list_lru lock (self deadlock). These happened in our
> prod experiments.  I do believe irqsave/restore is necessary.

Got it.  We may use the list_lru lock inside an IRQ handler when freeing pages.  I
think this information is good for code comments or the patch description.

>> 
>>> I though this would be cleaner than attempting to shared code with list_lru_add/delete. Only the shrinker makes use of this. 
>>> All other use cases assume slab objects.
>> 
>> Not only for slab objects now.  The original code works for slab and
>> normal page.  Which is controlled by list_lru->memcg_aware.  You can add
>> another flag for your new use case, or refactor it to use a function
>> pointer.
>> 
>> Best Regards,
>> Huang, Ying
>
> I’ll take a look at this tomorrow and get back to you. Thanks for the suggestion! 
>> 
>>>> 

[snip]

Best Regards,
Huang, Ying
Alex Zhu (Kernel) Oct. 20, 2022, 5:04 p.m. UTC | #10
> On Oct 20, 2022, at 1:06 AM, Huang, Ying <ying.huang@intel.com> wrote:
> 
> "Alex Zhu (Kernel)" <alexlzhu@meta.com> writes:
> 
>>> On Oct 19, 2022, at 6:24 PM, Huang, Ying <ying.huang@intel.com> wrote:
>>> 
>>> "Alex Zhu (Kernel)" <alexlzhu@meta.com> writes:
>>> 
>>>>> On Oct 19, 2022, at 12:04 AM, Huang, Ying <ying.huang@intel.com> wrote:
>>>>> 
>>>>>> 
>>>>> <alexlzhu@fb.com> writes:
>>>>> 
> 
> [snip]
> 
>>>>>> diff --git a/mm/list_lru.c b/mm/list_lru.c
>>>>>> index a05e5bef3b40..8cc56a84b554 100644
>>>>>> --- a/mm/list_lru.c
>>>>>> +++ b/mm/list_lru.c
>>>>>> @@ -140,6 +140,32 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
>>>>>> }
>>>>>> EXPORT_SYMBOL_GPL(list_lru_add);
>>>>>> 
>>>>>> +bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
>>>>>> +{
>>>>>> +	int nid = page_to_nid(page);
>>>>>> +	struct list_lru_node *nlru = &lru->node[nid];
>>>>>> +	struct list_lru_one *l;
>>>>>> +	struct mem_cgroup *memcg;
>>>>>> +	unsigned long flags;
>>>>>> +
>>>>>> +	spin_lock_irqsave(&nlru->lock, flags);
>>>>>> +	if (list_empty(item)) {
>>>>>> +		memcg = page_memcg(page);
>>>>>> +		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
>>>>>> +		list_add_tail(item, &l->list);
>>>>>> +		/* Set shrinker bit if the first element was added */
>>>>>> +		if (!l->nr_items++)
>>>>>> +			set_shrinker_bit(memcg, nid,
>>>>>> +					 lru_shrinker_id(lru));
>>>>>> +		nlru->nr_items++;
>>>>>> +		spin_unlock_irqrestore(&nlru->lock, flags);
>>>>>> +		return true;
>>>>>> +	}
>>>>>> +	spin_unlock_irqrestore(&nlru->lock, flags);
>>>>>> +	return false;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(list_lru_add_page);
>>>>>> +
>>>>> 
>>>>> It appears that only 2 lines are different from list_lru_add().  Is it
>>>>> possible for us to share code?  For example, add another flag for
>>>>> page_memcg() case?
>>>> 
>>>> I believe there are 4 lines. The page_to_nid(page) and the spin_lock_irqsave/restore. 
>>>> It was implemented this way as we found we needed to take a page as a parameter and obtain the node id from
>>>> the page. This is because the THP is not necessarily a slab object, as the list_lru_add/delete code assumes. 
>>>> 
>>>> Also, there is a potential deadlock when split_huge_page is called from reclaim and when split_huge_page is called 
>>>> by the THP Shrinker, which is why we need irqsave/restore. 
>>> 
>>> Can you provide more details on this?  If it's necessary, I think it should be OK to use irqsave/restore.
>> 
>> There is an ABBA deadlock between reclaim and the THP shrinker where
>> one waits for the page lock and the other waits for the list_lru lock.
> 
> Per my understanding, when we hold list_lru lock, we only use
> trylock_page() and backoff if we fail to acquire the page lock.  So it
> appears that we are safe here?  If not, can you provide the detailed
> race condition information, for example actions on different CPUs?

The ABBA deadlock was the reason we added this trylock_page(). It is safe now; it was not safe in v3.
> 
>> There is another deadlock where free_transhuge_page interrupts the THP
>> shrinker while it is splitting THPs (has the list_lru lock) and then
>> acquires the same list_lru lock (self deadlock). These happened in our
>> prod experiments.  I do believe irqsave/restore is necessary.
> 
> Got it.  We may use list_lru lock inside IRQ handler when free pages.  I
> think this information is good for code comments or patch description.

Sounds good! I’ll add some comments to make that more clear. 
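
For example, a comment along these lines next to the irqsave in
list_lru_add_page()/list_lru_del_page() would capture it (wording is only a
suggestion):

	/*
	 * IRQs must be disabled here: free_transhuge_page() can call
	 * list_lru_del_page() from interrupt context while this CPU already
	 * holds nlru->lock in the shrinker walk.
	 */
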
> 
>>> 
>>>> I though this would be cleaner than attempting to shared code with list_lru_add/delete. Only the shrinker makes use of this. 
>>>> All other use cases assume slab objects.
>>> 
>>> Not only for slab objects now.  The original code works for slab and
>>> normal page.  Which is controlled by list_lru->memcg_aware.  You can add
>>> another flag for your new use case, or refactor it to use a function
>>> pointer.
>>> 
>>> Best Regards,
>>> Huang, Ying
>> 
>> I’ll take a look at this tomorrow and get back to you. Thanks for the suggestion! 
>>> 
>>>>> 
> 
> [snip]
> 
> Best Regards,
> Huang, Ying

Patch

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 13ac7b2f29ae..75e4080256be 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -192,6 +192,8 @@  static inline int split_huge_page(struct page *page)
 }
 void deferred_split_huge_page(struct page *page);
 
+void add_underutilized_thp(struct page *page);
+
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze, struct folio *folio);
 
@@ -305,6 +307,11 @@  static inline struct list_head *page_deferred_list(struct page *page)
 	return &page[2].deferred_list;
 }
 
+static inline struct list_head *page_underutilized_thp_list(struct page *page)
+{
+	return &page[3].underutilized_thp_list;
+}
+
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index b35968ee9fb5..c2cf146ea880 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -89,6 +89,18 @@  void memcg_reparent_list_lrus(struct mem_cgroup *memcg, struct mem_cgroup *paren
  */
 bool list_lru_add(struct list_lru *lru, struct list_head *item);
 
+/**
+ * list_lru_add_page: add an element to the lru list's tail
+ * @list_lru: the lru pointer
+ * @page: the page containing the item
+ * @item: the item to be deleted.
+ *
+ * This function works the same as list_lru_add in terms of list
+ * manipulation. Used for non slab objects contained in the page.
+ *
+ * Return value: true if the list was updated, false otherwise
+ */
+bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item);
 /**
  * list_lru_del: delete an element to the lru list
  * @list_lru: the lru pointer
@@ -102,6 +114,18 @@  bool list_lru_add(struct list_lru *lru, struct list_head *item);
  */
 bool list_lru_del(struct list_lru *lru, struct list_head *item);
 
+/**
+ * list_lru_del_page: delete an element to the lru list
+ * @list_lru: the lru pointer
+ * @page: the page containing the item
+ * @item: the item to be deleted.
+ *
+ * This function works the same as list_lru_del in terms of list
+ * manipulation. Used for non-slab objects contained in a page.
+ *
+ * Return value: true if the list was updated, false otherwise
+ */
+bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item);
 /**
  * list_lru_count_one: return the number of objects currently held by @lru
  * @lru: the lru pointer.
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..da1d1cf42158 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -152,6 +152,11 @@  struct page {
 			/* For both global and memcg */
 			struct list_head deferred_list;
 		};
+		struct { /* Third tail page of compound page */
+			unsigned long _compound_pad_3; /* compound_head */
+			unsigned long _compound_pad_4;
+			struct list_head underutilized_thp_list;
+		};
 		struct {	/* Page table pages */
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a08885228cb2..362df977cc73 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -81,6 +81,8 @@  static atomic_t huge_zero_refcount;
 struct page *huge_zero_page __read_mostly;
 unsigned long huge_zero_pfn __read_mostly = ~0UL;
 
+static struct list_lru huge_low_util_page_lru;
+
 static void thp_utilization_workfn(struct work_struct *work);
 static DECLARE_DELAYED_WORK(thp_utilization_work, thp_utilization_workfn);
 
@@ -263,6 +265,57 @@  static struct shrinker huge_zero_page_shrinker = {
 	.seeks = DEFAULT_SEEKS,
 };
 
+static enum lru_status low_util_free_page(struct list_head *item,
+					  struct list_lru_one *lru,
+					  spinlock_t *lock,
+					  void *cb_arg)
+{
+	int bucket, num_utilized_pages;
+	struct page *head = compound_head(list_entry(item,
+									struct page,
+									underutilized_thp_list));
+
+	if (get_page_unless_zero(head)) {
+		if (!trylock_page(head)) {
+			put_page(head);
+			return LRU_SKIP;
+		}
+		list_lru_isolate(lru, item);
+		spin_unlock_irq(lock);
+		num_utilized_pages = thp_number_utilized_pages(head);
+		bucket = thp_utilization_bucket(num_utilized_pages);
+		if (bucket < THP_UTIL_BUCKET_NR - 1) {
+			split_huge_page(head);
+			spin_lock_irq(lock);
+		}
+		unlock_page(head);
+		put_page(head);
+	}
+
+	return LRU_REMOVED_RETRY;
+}
+
+static unsigned long shrink_huge_low_util_page_count(struct shrinker *shrink,
+						     struct shrink_control *sc)
+{
+	return HPAGE_PMD_NR * list_lru_shrink_count(&huge_low_util_page_lru, sc);
+}
+
+static unsigned long shrink_huge_low_util_page_scan(struct shrinker *shrink,
+						    struct shrink_control *sc)
+{
+	return HPAGE_PMD_NR * list_lru_shrink_walk_irq(&huge_low_util_page_lru,
+							sc, low_util_free_page, NULL);
+}
+
+static struct shrinker huge_low_util_page_shrinker = {
+	.count_objects = shrink_huge_low_util_page_count,
+	.scan_objects = shrink_huge_low_util_page_scan,
+	.seeks = DEFAULT_SEEKS,
+	.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE |
+		SHRINKER_NONSLAB,
+};
+
 #ifdef CONFIG_SYSFS
 static ssize_t enabled_show(struct kobject *kobj,
 			    struct kobj_attribute *attr, char *buf)
@@ -515,6 +568,9 @@  static int __init hugepage_init(void)
 		goto err_slab;
 
 	schedule_delayed_work(&thp_utilization_work, HZ);
+	err = register_shrinker(&huge_low_util_page_shrinker, "thp-low-util");
+	if (err)
+		goto err_low_util_shrinker;
 	err = register_shrinker(&huge_zero_page_shrinker, "thp-zero");
 	if (err)
 		goto err_hzp_shrinker;
@@ -522,6 +578,9 @@  static int __init hugepage_init(void)
 	if (err)
 		goto err_split_shrinker;
 
+	err = list_lru_init_memcg(&huge_low_util_page_lru, &huge_low_util_page_shrinker);
+	if (err)
+		goto err_low_util_list_lru;
 	/*
 	 * By default disable transparent hugepages on smaller systems,
 	 * where the extra memory used could hurt more than TLB overhead
@@ -538,10 +597,14 @@  static int __init hugepage_init(void)
 
 	return 0;
 err_khugepaged:
+	list_lru_destroy(&huge_low_util_page_lru);
+err_low_util_list_lru:
 	unregister_shrinker(&deferred_split_shrinker);
 err_split_shrinker:
 	unregister_shrinker(&huge_zero_page_shrinker);
 err_hzp_shrinker:
+	unregister_shrinker(&huge_low_util_page_shrinker);
+err_low_util_shrinker:
 	khugepaged_destroy();
 err_slab:
 	hugepage_exit_sysfs(hugepage_kobj);
@@ -616,6 +679,7 @@  void prep_transhuge_page(struct page *page)
 	 */
 
 	INIT_LIST_HEAD(page_deferred_list(page));
+	INIT_LIST_HEAD(page_underutilized_thp_list(page));
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
 }
 
@@ -2529,8 +2593,7 @@  static void __split_huge_page_tail(struct page *head, int tail,
 			 LRU_GEN_MASK | LRU_REFS_MASK));
 
 	/* ->mapping in first tail page is compound_mapcount */
-	VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
-			page_tail);
+	VM_BUG_ON_PAGE(tail > 3 && page_tail->mapping != TAIL_MAPPING, page_tail);
 	page_tail->mapping = head->mapping;
 	page_tail->index = head->index + tail;
 	page_tail->private = 0;
@@ -2737,6 +2800,7 @@  int split_huge_page_to_list(struct page *page, struct list_head *list)
 	struct folio *folio = page_folio(page);
 	struct deferred_split *ds_queue = get_deferred_split_queue(&folio->page);
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
+	struct list_head *underutilized_thp_list = page_underutilized_thp_list(&folio->page);
 	struct anon_vma *anon_vma = NULL;
 	struct address_space *mapping = NULL;
 	int extra_pins, ret;
@@ -2844,6 +2908,9 @@  int split_huge_page_to_list(struct page *page, struct list_head *list)
 			list_del(page_deferred_list(&folio->page));
 		}
 		spin_unlock(&ds_queue->split_queue_lock);
+		if (!list_empty(underutilized_thp_list))
+			list_lru_del_page(&huge_low_util_page_lru, &folio->page,
+					  underutilized_thp_list);
 		if (mapping) {
 			int nr = folio_nr_pages(folio);
 
@@ -2886,6 +2953,7 @@  int split_huge_page_to_list(struct page *page, struct list_head *list)
 void free_transhuge_page(struct page *page)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
+	struct list_head *underutilized_thp_list = page_underutilized_thp_list(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
@@ -2894,6 +2962,12 @@  void free_transhuge_page(struct page *page)
 		list_del(page_deferred_list(page));
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	if (!list_empty(underutilized_thp_list))
+		list_lru_del_page(&huge_low_util_page_lru, page, underutilized_thp_list);
+
+	if (PageLRU(page))
+		__folio_clear_lru_flags(page_folio(page));
+
 	free_compound_page(page);
 }
 
@@ -2934,6 +3008,39 @@  void deferred_split_huge_page(struct page *page)
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 }
 
+void add_underutilized_thp(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+
+	if (PageSwapCache(page))
+		return;
+
+	/*
+	 * Take a reference on the page to prevent it from being freed from
+	 * under us while we are adding the THP to the shrinker's list_lru.
+	 */
+	if (!get_page_unless_zero(page))
+		return;
+
+	if (!is_anon_transparent_hugepage(page))
+		goto out_put;
+
+	if (is_huge_zero_page(page))
+		goto out_put;
+
+	lock_page(page);
+
+	if (memcg_list_lru_alloc(page_memcg(page), &huge_low_util_page_lru, GFP_KERNEL))
+		goto out_unlock;
+
+	list_lru_add_page(&huge_low_util_page_lru, page, page_underutilized_thp_list(page));
+
+out_unlock:
+	unlock_page(page);
+out_put:
+	put_page(page);
+}
+
 static unsigned long deferred_split_count(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
@@ -3478,6 +3585,9 @@  static void thp_util_scan(unsigned long pfn_end)
 		if (bucket < 0)
 			continue;
 
+		if (bucket < THP_UTIL_BUCKET_NR - 1)
+			add_underutilized_thp(page);
+
 		thp_scan.buckets[bucket].nr_thps++;
 		thp_scan.buckets[bucket].nr_zero_pages += (HPAGE_PMD_NR - num_utilized_pages);
 	}
diff --git a/mm/list_lru.c b/mm/list_lru.c
index a05e5bef3b40..8cc56a84b554 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -140,6 +140,32 @@  bool list_lru_add(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_add);
 
+bool list_lru_add_page(struct list_lru *lru, struct page *page, struct list_head *item)
+{
+	int nid = page_to_nid(page);
+	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_one *l;
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+
+	spin_lock_irqsave(&nlru->lock, flags);
+	if (list_empty(item)) {
+		memcg = page_memcg(page);
+		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
+		list_add_tail(item, &l->list);
+		/* Set shrinker bit if the first element was added */
+		if (!l->nr_items++)
+			set_shrinker_bit(memcg, nid,
+					 lru_shrinker_id(lru));
+		nlru->nr_items++;
+		spin_unlock_irqrestore(&nlru->lock, flags);
+		return true;
+	}
+	spin_unlock_irqrestore(&nlru->lock, flags);
+	return false;
+}
+EXPORT_SYMBOL_GPL(list_lru_add_page);
+
 bool list_lru_del(struct list_lru *lru, struct list_head *item)
 {
 	int nid = page_to_nid(virt_to_page(item));
@@ -160,6 +186,29 @@  bool list_lru_del(struct list_lru *lru, struct list_head *item)
 }
 EXPORT_SYMBOL_GPL(list_lru_del);
 
+bool list_lru_del_page(struct list_lru *lru, struct page *page, struct list_head *item)
+{
+	int nid = page_to_nid(page);
+	struct list_lru_node *nlru = &lru->node[nid];
+	struct list_lru_one *l;
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+
+	spin_lock_irqsave(&nlru->lock, flags);
+	if (!list_empty(item)) {
+		memcg = page_memcg(page);
+		l = list_lru_from_memcg_idx(lru, nid, memcg_kmem_id(memcg));
+		list_del_init(item);
+		l->nr_items--;
+		nlru->nr_items--;
+		spin_unlock_irqrestore(&nlru->lock, flags);
+		return true;
+	}
+	spin_unlock_irqrestore(&nlru->lock, flags);
+	return false;
+}
+EXPORT_SYMBOL_GPL(list_lru_del_page);
+
 void list_lru_isolate(struct list_lru_one *list, struct list_head *item)
 {
 	list_del_init(item);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ac2c9f12a7b2..468eaaade7fe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1335,6 +1335,12 @@  static int free_tail_pages_check(struct page *head_page, struct page *page)
 		 * deferred_list.next -- ignore value.
 		 */
 		break;
+	case 3:
+		/*
+		 * the third tail page: ->mapping is
+		 * underutilized_thp_list.next -- ignore value.
+		 */
+		break;
 	default:
 		if (page->mapping != TAIL_MAPPING) {
 			bad_page(page, "corrupted mapping in tail page");