[v4,08/13] mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse

Message ID 20220502181714.3483177-9-zokeefe@google.com (mailing list archive)
State New
Series mm: userspace hugepage collapse

Commit Message

Zach O'Keefe May 2, 2022, 6:17 p.m. UTC
This idea was introduced by David Rientjes[1].

Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request a
synchronous collapse of memory at their own expense.

The benefits of this approach are:

* CPU is charged to the process that wants to spend the cycles for the
  THP
* Avoid unpredictable timing of khugepaged collapse

Immediate users of this new functionality are malloc() implementations
that manage memory in hugepage-sized chunks, but sometimes subrelease
memory back to the system in native-sized chunks via MADV_DONTNEED,
zapping the pmd.  Later, when the memory is hot, the implementation
could madvise(MADV_COLLAPSE) to re-back the memory with THPs to regain
hugepage coverage and dTLB performance.  TCMalloc is one such
implementation that could benefit from this[2].
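
For illustration only (not part of this patch), a minimal userspace
sketch of that MADV_DONTNEED / MADV_COLLAPSE cycle might look like the
following.  It assumes the generic MADV_COLLAPSE value of 25 added by
this series and a 2MiB pmd-sized hugepage, and it omits most error
handling:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* asm-generic value added by this series */
#endif

#define HPAGE_SIZE (2UL << 20)  /* assumes a 2MiB pmd-sized hugepage */

int main(void)
{
        long page_size = sysconf(_SC_PAGESIZE);
        size_t len = 2 * HPAGE_SIZE;
        char *map, *chunk;

        /* Over-allocate so a hugepage-aligned chunk can be carved out. */
        map = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (map == MAP_FAILED)
                return 1;
        chunk = (char *)(((unsigned long)map + HPAGE_SIZE - 1) &
                         ~(HPAGE_SIZE - 1));

        memset(chunk, 1, HPAGE_SIZE);   /* fault the chunk in */

        /* Subrelease one native-sized page; this zaps the pmd. */
        madvise(chunk, page_size, MADV_DONTNEED);

        /* Memory is hot again: request a synchronous collapse to a THP. */
        if (madvise(chunk, HPAGE_SIZE, MADV_COLLAPSE)) {
                if (errno == EAGAIN)
                        fprintf(stderr, "transient failure, may retry\n");
                else
                        perror("madvise(MADV_COLLAPSE)");
        }

        munmap(map, len);
        return 0;
}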

Only privately-mapped anon memory is supported for now, but it is
expected that file and shmem support will be added later to support the
use-case of backing executable text by THPs.  The current support
provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
system, which might keep services from serving at their full rated load
after (re)starting.  Tricks like mremap(2)'ing text onto anonymous
memory to immediately realize iTLB performance prevent page sharing and
demand paging, both of which increase steady-state memory footprint.
With MADV_COLLAPSE, we get the best of both worlds: peak upfront
performance and a lower RAM footprint.

This call respects THP eligibility as determined by the system-wide
/sys/kernel/mm/transparent_hugepage/enabled sysfs setting and the VMA
flags of the memory range being collapsed.
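
As a rough illustrative check of that eligibility (not from this patch,
and assuming the usual semantics of the 'enabled' knob; per-VMA flags
such as MADV_NOHUGEPAGE are not covered here), userspace could do
something like:

#include <stdio.h>
#include <string.h>

/*
 * Sketch: returns 1 if the system-wide THP setting should allow
 * MADV_COLLAPSE on an anon VMA.  "always" is sufficient on its own;
 * "madvise" additionally requires the range to have been marked with
 * madvise(MADV_HUGEPAGE); "never" means the collapse fails with EINVAL.
 */
static int thp_eligible(int vma_marked_hugepage)
{
        FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
        char buf[128] = "";
        int ret = 0;

        if (!f)
                return 0;
        if (fgets(buf, sizeof(buf), f)) {
                if (strstr(buf, "[always]"))
                        ret = 1;
                else if (strstr(buf, "[madvise]"))
                        ret = vma_marked_hugepage;
        }
        fclose(f);
        return ret;
}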

THP allocation may enter direct reclaim and/or compaction.

[1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
[2] https://github.com/google/tcmalloc/tree/master/tcmalloc

Suggested-by: David Rientjes <rientjes@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
---
 arch/alpha/include/uapi/asm/mman.h     |   2 +
 arch/mips/include/uapi/asm/mman.h      |   2 +
 arch/parisc/include/uapi/asm/mman.h    |   2 +
 arch/xtensa/include/uapi/asm/mman.h    |   2 +
 include/linux/huge_mm.h                |  12 ++
 include/uapi/asm-generic/mman-common.h |   2 +
 mm/khugepaged.c                        | 166 +++++++++++++++++++++++--
 mm/madvise.c                           |   5 +
 8 files changed, 181 insertions(+), 12 deletions(-)

Comments

kernel test robot May 3, 2022, 7:21 a.m. UTC | #1
Hi Zach,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on next-20220502]
[cannot apply to hnaz-mm/master rostedt-trace/for-next deller-parisc/for-next arnd-asm-generic/master linus/master v5.18-rc5 v5.18-rc4 v5.18-rc3 v5.18-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-khugepaged-record-SCAN_PMD_MAPPED-when-scan_pmd-finds-THP/20220503-031727
base:    9f9b9a2972eb8dcaad09d826c5c6d7488eaca3e6
config: x86_64-randconfig-a011-20220502 (https://download.01.org/0day-ci/archive/20220503/202205031501.6qBJrsPn-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 09325d36061e42b495d1f4c7e933e260eac260ed)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/9f69946c58d8d53c271a4d75ac477b4a5164a511
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Zach-O-Keefe/mm-khugepaged-record-SCAN_PMD_MAPPED-when-scan_pmd-finds-THP/20220503-031727
        git checkout 9f69946c58d8d53c271a4d75ac477b4a5164a511
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   mm/khugepaged.c:1105:29: warning: incompatible integer to pointer conversion passing 'gfp_t' (aka 'unsigned int') to parameter of type 'struct page **' [-Wint-conversion]
           if (!khugepaged_alloc_page(gfp, node, cc))
                                      ^~~
   mm/khugepaged.c:963:49: note: passing argument to parameter 'hpage' here
   static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
                                                   ^
   mm/khugepaged.c:1105:40: warning: incompatible pointer to integer conversion passing 'struct collapse_control *' to parameter of type 'int' [-Wint-conversion]
           if (!khugepaged_alloc_page(gfp, node, cc))
                                                 ^~
   mm/khugepaged.c:963:71: note: passing argument to parameter 'node' here
   static bool khugepaged_alloc_page(struct page **hpage, gfp_t gfp, int node)
                                                                         ^
>> mm/khugepaged.c:2565:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
                   case SCAN_PMD_NULL:
                   ^
   mm/khugepaged.c:2565:3: note: insert 'break;' to avoid fall-through
                   case SCAN_PMD_NULL:
                   ^
                   break; 
   3 warnings generated.


vim +2565 mm/khugepaged.c

  2511	
  2512	int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
  2513			     unsigned long start, unsigned long end)
  2514	{
  2515		struct collapse_control cc = {
  2516			.enforce_pte_scan_limits = false,
  2517			.enforce_young = false,
  2518			.last_target_node = NUMA_NO_NODE,
  2519			.hpage = NULL,
  2520			.alloc_charge_hpage = &madvise_alloc_charge_hpage,
  2521		};
  2522		struct mm_struct *mm = vma->vm_mm;
  2523		unsigned long hstart, hend, addr;
  2524		int thps = 0, nr_hpages = 0, result = SCAN_FAIL;
  2525		bool mmap_locked = true;
  2526	
  2527		BUG_ON(vma->vm_start > start);
  2528		BUG_ON(vma->vm_end < end);
  2529	
  2530		*prev = vma;
  2531	
  2532		if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file)
  2533			return -EINVAL;
  2534	
  2535		hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
  2536		hend = end & HPAGE_PMD_MASK;
  2537		nr_hpages = (hend - hstart) >> HPAGE_PMD_SHIFT;
  2538	
  2539		if (hstart >= hend || !transparent_hugepage_active(vma))
  2540			return -EINVAL;
  2541	
  2542		mmgrab(mm);
  2543		lru_add_drain();
  2544	
  2545		for (addr = hstart; ; ) {
  2546			mmap_assert_locked(mm);
  2547			cond_resched();
  2548			result = SCAN_FAIL;
  2549	
  2550			if (unlikely(khugepaged_test_exit(mm))) {
  2551				result = SCAN_ANY_PROCESS;
  2552				break;
  2553			}
  2554	
  2555			memset(cc.node_load, 0, sizeof(cc.node_load));
  2556			result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
  2557			if (!mmap_locked)
  2558				*prev = NULL;  /* tell madvise we dropped mmap_lock */
  2559	
  2560			switch (result) {
  2561			/* Whitelisted set of results where continuing OK */
  2562			case SCAN_SUCCEED:
  2563			case SCAN_PMD_MAPPED:
  2564				++thps;
> 2565			case SCAN_PMD_NULL:
Zach O'Keefe May 4, 2022, 9:46 p.m. UTC | #2
Sorry again - fixed in v5

Patch

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 4aa996423b0d..763929e814e9 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -76,6 +76,8 @@ 
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 1be428663c10..c6e1fc77c996 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -103,6 +103,8 @@ 
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index a7ea3204a5fa..22133a6a506e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -70,6 +70,8 @@ 
 #define MADV_WIPEONFORK 71		/* Zero memory on fork, child only */
 #define MADV_KEEPONFORK 72		/* Undo MADV_WIPEONFORK */
 
+#define MADV_COLLAPSE	73		/* Synchronous hugepage collapse */
+
 #define MADV_HWPOISON     100		/* poison a page for testing */
 #define MADV_SOFT_OFFLINE 101		/* soft offline page for testing */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 7966a58af472..1ff0c858544f 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -111,6 +111,8 @@ 
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9a26bd10e083..4a2ea1b5437c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -222,6 +222,9 @@  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 
 int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
 		     int advice);
+int madvise_collapse(struct vm_area_struct *vma,
+		     struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end);
 void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
 			   unsigned long end, long adjust_next);
 spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
@@ -378,6 +381,15 @@  static inline int hugepage_madvise(struct vm_area_struct *vma,
 	BUG();
 	return 0;
 }
+
+static inline int madvise_collapse(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	BUG();
+	return 0;
+}
+
 static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
 					 unsigned long start,
 					 unsigned long end,
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 6c1aa92a92e4..6ce1f1ceb432 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -77,6 +77,8 @@ 
 
 #define MADV_DONTNEED_LOCKED	24	/* like DONTNEED, but drop locked pages too */
 
+#define MADV_COLLAPSE	25		/* Synchronous hugepage collapse */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b57a4a643053..3ba2c570da5e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -837,6 +837,22 @@  static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 	return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
 }
 
+static bool alloc_hpage(gfp_t gfp, int node, struct collapse_control *cc)
+{
+	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
+
+	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
+	if (unlikely(!cc->hpage)) {
+		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
+		cc->hpage = ERR_PTR(-ENOMEM);
+		return false;
+	}
+
+	prep_transhuge_page(cc->hpage);
+	count_vm_event(THP_COLLAPSE_ALLOC);
+	return true;
+}
+
 #ifdef CONFIG_NUMA
 static int khugepaged_find_target_node(struct collapse_control *cc)
 {
@@ -882,18 +898,7 @@  static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 static bool khugepaged_alloc_page(gfp_t gfp, int node,
 				  struct collapse_control *cc)
 {
-	VM_BUG_ON_PAGE(cc->hpage, cc->hpage);
-
-	cc->hpage = __alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);
-	if (unlikely(!cc->hpage)) {
-		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-		cc->hpage = ERR_PTR(-ENOMEM);
-		return false;
-	}
-
-	prep_transhuge_page(cc->hpage);
-	count_vm_event(THP_COLLAPSE_ALLOC);
-	return true;
+	return alloc_hpage(gfp, node, cc);
 }
 #else
 static int khugepaged_find_target_node(struct collapse_control *cc)
@@ -2462,3 +2467,140 @@  void khugepaged_min_free_kbytes_update(void)
 		set_recommended_min_free_kbytes();
 	mutex_unlock(&khugepaged_mutex);
 }
+
+static void madvise_collapse_cleanup_page(struct page **hpage)
+{
+	if (!IS_ERR(*hpage) && *hpage)
+		put_page(*hpage);
+	*hpage = NULL;
+}
+
+static int madvise_collapse_errno(enum scan_result r)
+{
+	switch (r) {
+	case SCAN_PMD_NULL:
+	case SCAN_ADDRESS_RANGE:
+	case SCAN_VMA_NULL:
+	case SCAN_PTE_NON_PRESENT:
+	case SCAN_PAGE_NULL:
+		/*
+		 * Addresses in the specified range are not currently mapped,
+		 * or are outside the AS of the process.
+		 */
+		return -ENOMEM;
+	case SCAN_ALLOC_HUGE_PAGE_FAIL:
+	case SCAN_CGROUP_CHARGE_FAIL:
+		/* A kernel resource was temporarily unavailable. */
+		return -EAGAIN;
+	default:
+		return -EINVAL;
+	}
+}
+
+static int madvise_alloc_charge_hpage(struct mm_struct *mm,
+				      struct collapse_control *cc)
+{
+	if (!alloc_hpage(GFP_TRANSHUGE, khugepaged_find_target_node(cc), cc))
+		return SCAN_ALLOC_HUGE_PAGE_FAIL;
+	if (unlikely(mem_cgroup_charge(page_folio(cc->hpage), mm,
+				       GFP_TRANSHUGE)))
+		return SCAN_CGROUP_CHARGE_FAIL;
+	count_memcg_page_event(cc->hpage, THP_COLLAPSE_ALLOC);
+	return SCAN_SUCCEED;
+}
+
+int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
+		     unsigned long start, unsigned long end)
+{
+	struct collapse_control cc = {
+		.enforce_pte_scan_limits = false,
+		.enforce_young = false,
+		.last_target_node = NUMA_NO_NODE,
+		.hpage = NULL,
+		.alloc_charge_hpage = &madvise_alloc_charge_hpage,
+	};
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long hstart, hend, addr;
+	int thps = 0, nr_hpages = 0, result = SCAN_FAIL;
+	bool mmap_locked = true;
+
+	BUG_ON(vma->vm_start > start);
+	BUG_ON(vma->vm_end < end);
+
+	*prev = vma;
+
+	if (IS_ENABLED(CONFIG_SHMEM) && vma->vm_file)
+		return -EINVAL;
+
+	hstart = (start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+	hend = end & HPAGE_PMD_MASK;
+	nr_hpages = (hend - hstart) >> HPAGE_PMD_SHIFT;
+
+	if (hstart >= hend || !transparent_hugepage_active(vma))
+		return -EINVAL;
+
+	mmgrab(mm);
+	lru_add_drain();
+
+	for (addr = hstart; ; ) {
+		mmap_assert_locked(mm);
+		cond_resched();
+		result = SCAN_FAIL;
+
+		if (unlikely(khugepaged_test_exit(mm))) {
+			result = SCAN_ANY_PROCESS;
+			break;
+		}
+
+		memset(cc.node_load, 0, sizeof(cc.node_load));
+		result = khugepaged_scan_pmd(mm, vma, addr, &mmap_locked, &cc);
+		if (!mmap_locked)
+			*prev = NULL;  /* tell madvise we dropped mmap_lock */
+
+		switch (result) {
+		/* Whitelisted set of results where continuing OK */
+		case SCAN_SUCCEED:
+		case SCAN_PMD_MAPPED:
+			++thps;
+		case SCAN_PMD_NULL:
+		case SCAN_PTE_NON_PRESENT:
+		case SCAN_PTE_UFFD_WP:
+		case SCAN_PAGE_RO:
+		case SCAN_LACK_REFERENCED_PAGE:
+		case SCAN_PAGE_NULL:
+		case SCAN_PAGE_COUNT:
+		case SCAN_PAGE_LOCK:
+		case SCAN_PAGE_COMPOUND:
+			break;
+		case SCAN_PAGE_LRU:
+			lru_add_drain_all();
+			goto retry;
+		default:
+			/* Other error, exit */
+			goto break_loop;
+		}
+		addr += HPAGE_PMD_SIZE;
+		if (addr >= hend)
+			break;
+retry:
+		if (!mmap_locked) {
+			mmap_read_lock(mm);
+			mmap_locked = true;
+			result = hugepage_vma_revalidate(mm, addr, &vma);
+			if (result)
+				goto out;
+		}
+		madvise_collapse_cleanup_page(&cc.hpage);
+	}
+
+break_loop:
+	/* madvise_walk_vmas() expects us to hold mmap_lock on return */
+	if (!mmap_locked)
+		mmap_read_lock(mm);
+out:
+	mmap_assert_locked(mm);
+	madvise_collapse_cleanup_page(&cc.hpage);
+	mmdrop(mm);
+
+	return thps == nr_hpages ? 0 : madvise_collapse_errno(result);
+}
diff --git a/mm/madvise.c b/mm/madvise.c
index 5f4537511532..638517952bd2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -59,6 +59,7 @@  static int madvise_need_mmap_write(int behavior)
 	case MADV_FREE:
 	case MADV_POPULATE_READ:
 	case MADV_POPULATE_WRITE:
+	case MADV_COLLAPSE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -1054,6 +1055,8 @@  static int madvise_vma_behavior(struct vm_area_struct *vma,
 		if (error)
 			goto out;
 		break;
+	case MADV_COLLAPSE:
+		return madvise_collapse(vma, prev, start, end);
 	}
 
 	anon_name = anon_vma_name(vma);
@@ -1147,6 +1150,7 @@  madvise_behavior_valid(int behavior)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
+	case MADV_COLLAPSE:
 #endif
 	case MADV_DONTDUMP:
 	case MADV_DODUMP:
@@ -1336,6 +1340,7 @@  int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
  *  MADV_NOHUGEPAGE - mark the given range as not worth being backed by
  *		transparent huge pages so the existing pages will not be
  *		coalesced into THP and new pages will not be allocated as THP.
+ *  MADV_COLLAPSE - synchronously coalesce pages into new THP.
  *  MADV_DONTDUMP - the application wants to prevent pages in the given range
  *		from being included in its core dump.
  *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.