[2/2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups

Message ID 20230125015738.912924-2-zokeefe@google.com (mailing list archive)
State New
Series [1/2] mm/MADV_COLLAPSE: set EAGAIN on unexpected page refcount

Commit Message

Zach O'Keefe Jan. 25, 2023, 1:57 a.m. UTC
In commit 34488399fa08 ("mm/madvise: add file and shmem support to
MADV_COLLAPSE") we make the following change to find_pmd_or_thp_or_none():

	-       if (!pmd_present(pmde))
	-               return SCAN_PMD_NULL;
	+       if (pmd_none(pmde))
	+               return SCAN_PMD_NONE;

This was for use by MADV_COLLAPSE file/shmem codepaths, where MADV_COLLAPSE
might identify a pte-mapped hugepage, only to have khugepaged race in, free
the pte table, and clear the pmd.  Such codepaths include:

A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER
   already in the pagecache.
B) In retract_page_tables(), if we fail to grab mmap_lock for the target
   mm/address.

In these cases, collapse_pte_mapped_thp() really does expect a none (not
just !present) pmd, and we want to suitably identify that case separate
from the case where no pmd is found, or it's a bad-pmd (of course, many
things could happen once we drop mmap_lock, and the pmd could plausibly
undergo multiple transitions due to intervening fault, split, etc).
Regardless, the code is prepared to install a huge-pmd only when the existing
pmd entry is either a genuine pte-table-mapping-pmd, or the none-pmd.

However, the commit introduces a logical hole; namely, that we've allowed
!none- && !huge- && !bad-pmds to be classified as genuine
pte-table-mapping-pmds.  Swap entries are one example that could leak
through.  The pmd values aren't checked again before use in
pte_offset_map_lock(), which is expecting nothing less than a genuine
pte-table-mapping-pmd.

We want to put back the !pmd_present() check (below the pmd_none() check),
but need to be careful to deal with subtleties in pmd transitions and
their treatment by the various architectures.
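
The hole and the intended fix can be illustrated with a small user-space
sketch.  Everything here is a toy model: struct model_pmd, the MODEL_*
results and both model_classify_*() helpers are hypothetical names, and the
boolean fields merely stand in for what pmd_none(), pmd_present(),
pmd_trans_huge() and pmd_bad() would report for one snapshot of the entry.
The devmap check is left out for brevity.  It assumes, as the message above
describes, that a swap-like entry is !none, !present, !huge and !bad.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; NOT the kernel's pmd_t or its arch helpers. */
enum model_scan {
	MODEL_SUCCEED,		/* treated as a pte-table-mapping pmd */
	MODEL_PMD_NULL,
	MODEL_PMD_NONE,
	MODEL_PMD_MAPPED,
};

struct model_pmd {
	bool none;	/* stands in for pmd_none() */
	bool present;	/* stands in for pmd_present() */
	bool huge;	/* stands in for pmd_trans_huge() */
	bool bad;	/* stands in for pmd_bad() */
};

/* Check order without the !present test, as after commit 34488399fa08. */
static enum model_scan model_classify_before(struct model_pmd e)
{
	if (e.none)
		return MODEL_PMD_NONE;
	if (e.huge)
		return MODEL_PMD_MAPPED;
	if (e.bad)
		return MODEL_PMD_NULL;
	return MODEL_SUCCEED;
}

/* Same order with the !present test put back below the none test. */
static enum model_scan model_classify_after(struct model_pmd e)
{
	if (e.none)
		return MODEL_PMD_NONE;
	if (!e.present)
		return MODEL_PMD_NULL;
	if (e.huge)
		return MODEL_PMD_MAPPED;
	if (e.bad)
		return MODEL_PMD_NULL;
	return MODEL_SUCCEED;
}
```

In this model a swap-like entry (all four fields false) is reported as
MODEL_SUCCEED by model_classify_before() but as MODEL_PMD_NULL by
model_classify_after(), while an ordinary present, non-huge, non-bad entry
is still reported as MODEL_SUCCEED.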

The issue is that __split_huge_pmd_locked() temporarily clears the present
bit (or otherwise marks the entry as invalid), but pmd_present()
and pmd_trans_huge() still need to return true while the pmd is in this
transitory state.  For example, x86's pmd_present() also checks the
_PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64's
also checks a PMD_PRESENT_INVALID bit.

Covering all 4 cases for x86 (all checks done on the same pmd value):

1) pmd_present() && pmd_trans_huge()
   All we actually know here is that the PSE bit is set. Either:
   a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE
      is set.
      => huge-pmd
   b) We are currently racing with __split_huge_page().  The danger here
      is that we proceed as-if we have a huge-pmd, but really we are
      looking at a pte-mapping-pmd.  So, what is the risk of this
      danger?

      The only relevant path is:

	madvise_collapse() -> collapse_pte_mapped_thp()

      Where we might just incorrectly report back "success", when really
      the memory isn't pmd-backed.  This is fine, since split could
      happen immediately after (actually) successful madvise_collapse().
      So, it should be safe to just assume huge-pmd here.

2) pmd_present() && !pmd_trans_huge()
   Either:
   a) PSE not set and either PRESENT or PROTNONE is.
      => pte-table-mapping pmd (or PROT_NONE)
   b) devmap.  This routine can be called immediately after
      unlocking/locking mmap_lock -- or called with no locks held (see
      khugepaged_scan_mm_slot()), so previous VMA checks have since been
      invalidated.

3) !pmd_present() && pmd_trans_huge()
   Not possible.

4) !pmd_present() && !pmd_trans_huge()
   Neither PRESENT nor PROTNONE set
   => not present

I've checked all archs that implement pmd_trans_huge() (arm64, riscv,
powerpc, loongarch, x86, mips, s390) and this logic roughly translates
(though devmap treatment is unique to x86 and powerpc, and (3) doesn't
necessarily hold in general -- but that doesn't matter since !pmd_present()
always takes failure path).
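
The x86 case analysis above can be checked mechanically with a small
user-space sketch.  This is a model under stated assumptions, not the
kernel's code: the MODEL_* bit values are made up, and the two predicates
only encode what the message describes (present if any of PRESENT, PROTNONE
or PSE is set; huge if PSE is set and the entry is not devmap).
model_case() and model_count_case() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the real x86 bit positions differ. */
#define MODEL_PRESENT	(1u << 0)
#define MODEL_PROTNONE	(1u << 1)
#define MODEL_PSE	(1u << 2)
#define MODEL_DEVMAP	(1u << 3)

/* Modelled pmd_present(): PRESENT, PROTNONE or PSE is set. */
static bool model_pmd_present(unsigned int v)
{
	return (v & (MODEL_PRESENT | MODEL_PROTNONE | MODEL_PSE)) != 0;
}

/* Modelled pmd_trans_huge(): PSE set and DEVMAP clear (assumption). */
static bool model_pmd_trans_huge(unsigned int v)
{
	return (v & (MODEL_PSE | MODEL_DEVMAP)) == MODEL_PSE;
}

/* Map one pmd value onto the numbered cases 1..4 from the list above. */
static int model_case(unsigned int v)
{
	bool present = model_pmd_present(v);
	bool huge = model_pmd_trans_huge(v);

	if (present && huge)
		return 1;
	if (present && !huge)
		return 2;
	if (!present && huge)
		return 3;
	return 4;
}

/* Count how many of the 16 possible bit combinations fall into a case. */
static int model_count_case(int which)
{
	int n = 0;

	for (unsigned int v = 0; v < 16; v++)
		if (model_case(v) == which)
			n++;
	return n;
}
```

In this model, assert(model_count_case(3) == 0) holds, matching "Not
possible" for case (3).  A huge entry whose PRESENT bit is temporarily
cleared during a split (only MODEL_PSE set) still lands in case (1), and a
devmap entry (MODEL_PRESENT | MODEL_PSE | MODEL_DEVMAP) lands in case (2).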

Also, add a comment above find_pmd_or_thp_or_none() to help future
travelers reason about the validity of the code; namely, the possible
mutations that might happen out from under us, depending on how
mmap_lock is held (if at all).

Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: stable@vger.kernel.org

---
Request that this be pulled into stable since it's theoretically
possible (though I have no reproducer) that while mmap_lock is dropped,
racing thp migration installs a pmd migration entry which then has a path to
be consumed, unchecked, by pte_offset_map().
---
 mm/khugepaged.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

kernel test robot Jan. 25, 2023, 12:54 p.m. UTC | #1
Hi Zach,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-MADV_COLLAPSE-catch-none-huge-bad-pmd-lookups/20230125-095954
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230125015738.912924-2-zokeefe%40google.com
patch subject: [PATCH 2/2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
config: x86_64-randconfig-r025-20230123 (https://download.01.org/0day-ci/archive/20230125/202301252033.HoFIRXm4-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/6001eb9a8f1687a1d0b72831d991886106cac37b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Zach-O-Keefe/mm-MADV_COLLAPSE-catch-none-huge-bad-pmd-lookups/20230125-095954
        git checkout 6001eb9a8f1687a1d0b72831d991886106cac37b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> mm/khugepaged.c:972:17: error: passing 'pmd_t **' to parameter of incompatible type 'pmd_t'
           if (pmd_devmap(pmd))
                          ^~~
   arch/x86/include/asm/pgtable.h:254:36: note: passing argument to parameter 'pmd' here
   static inline int pmd_devmap(pmd_t pmd)
                                      ^
   1 error generated.


vim +972 mm/khugepaged.c

   945	
   946	/*
   947	 * See pmd_trans_unstable() for how the result may change out from
   948	 * underneath us, even if we hold mmap_lock in read.
   949	 */
   950	static int find_pmd_or_thp_or_none(struct mm_struct *mm,
   951					   unsigned long address,
   952					   pmd_t **pmd)
   953	{
   954		pmd_t pmde;
   955	
   956		*pmd = mm_find_pmd(mm, address);
   957		if (!*pmd)
   958			return SCAN_PMD_NULL;
   959	
   960		pmde = pmdp_get_lockless(*pmd);
   961	
   962	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
   963		/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
   964		barrier();
   965	#endif
   966		if (pmd_none(pmde))
   967			return SCAN_PMD_NONE;
   968		if (!pmd_present(pmde))
   969			return SCAN_PMD_NULL;
   970		if (pmd_trans_huge(pmde))
   971			return SCAN_PMD_MAPPED;
 > 972		if (pmd_devmap(pmd))
   973			return SCAN_PMD_NULL;
   974		if (pmd_bad(pmde))
   975			return SCAN_PMD_NULL;
   976		return SCAN_SUCCEED;
   977	}
   978

kernel test robot Jan. 25, 2023, 1:38 p.m. UTC | #2
Hi Zach,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Zach-O-Keefe/mm-MADV_COLLAPSE-catch-none-huge-bad-pmd-lookups/20230125-095954
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230125015738.912924-2-zokeefe%40google.com
patch subject: [PATCH 2/2] mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
config: x86_64-rhel-8.3-kselftests (https://download.01.org/0day-ci/archive/20230125/202301252110.hFYRsbrm-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/6001eb9a8f1687a1d0b72831d991886106cac37b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Zach-O-Keefe/mm-MADV_COLLAPSE-catch-none-huge-bad-pmd-lookups/20230125-095954
        git checkout 6001eb9a8f1687a1d0b72831d991886106cac37b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=x86_64 olddefconfig
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   mm/khugepaged.c: In function 'find_pmd_or_thp_or_none':
>> mm/khugepaged.c:972:24: error: incompatible type for argument 1 of 'pmd_devmap'
     972 |         if (pmd_devmap(pmd))
         |                        ^~~
         |                        |
         |                        pmd_t **
   In file included from include/linux/pgtable.h:6,
                    from include/linux/mm.h:29,
                    from mm/khugepaged.c:4:
   arch/x86/include/asm/pgtable.h:254:36: note: expected 'pmd_t' but argument is of type 'pmd_t **'
     254 | static inline int pmd_devmap(pmd_t pmd)
         |                              ~~~~~~^~~


vim +/pmd_devmap +972 mm/khugepaged.c

   945	
   946	/*
   947	 * See pmd_trans_unstable() for how the result may change out from
   948	 * underneath us, even if we hold mmap_lock in read.
   949	 */
   950	static int find_pmd_or_thp_or_none(struct mm_struct *mm,
   951					   unsigned long address,
   952					   pmd_t **pmd)
   953	{
   954		pmd_t pmde;
   955	
   956		*pmd = mm_find_pmd(mm, address);
   957		if (!*pmd)
   958			return SCAN_PMD_NULL;
   959	
   960		pmde = pmdp_get_lockless(*pmd);
   961	
   962	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
   963		/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
   964		barrier();
   965	#endif
   966		if (pmd_none(pmde))
   967			return SCAN_PMD_NONE;
   968		if (!pmd_present(pmde))
   969			return SCAN_PMD_NULL;
   970		if (pmd_trans_huge(pmde))
   971			return SCAN_PMD_MAPPED;
 > 972		if (pmd_devmap(pmd))
   973			return SCAN_PMD_NULL;
   974		if (pmd_bad(pmde))
   975			return SCAN_PMD_NULL;
   976		return SCAN_SUCCEED;
   977	}
   978

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fa38cae240b9..7ea668bbea70 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -941,6 +941,10 @@  static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
 	return SCAN_SUCCEED;
 }
 
+/*
+ * See pmd_trans_unstable() for how the result may change out from
+ * underneath us, even if we hold mmap_lock in read.
+ */
 static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 				   unsigned long address,
 				   pmd_t **pmd)
@@ -959,8 +963,12 @@  static int find_pmd_or_thp_or_none(struct mm_struct *mm,
 #endif
 	if (pmd_none(pmde))
 		return SCAN_PMD_NONE;
+	if (!pmd_present(pmde))
+		return SCAN_PMD_NULL;
 	if (pmd_trans_huge(pmde))
 		return SCAN_PMD_MAPPED;
+	if (pmd_devmap(pmd))
+		return SCAN_PMD_NULL;
 	if (pmd_bad(pmde))
 		return SCAN_PMD_NULL;
 	return SCAN_SUCCEED;
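
The build error reported by the kernel test robot comes from handing the
pmd_t ** out-parameter to pmd_devmap(), which takes a pmd_t by value.  The
other checks in the function all test the local snapshot pmde, so
presumably that is the intended argument here as well.  The user-space
sketch below shows the pattern under that assumption; model_pmd_t,
MODEL_DEVMAP, model_pmd_devmap() and model_lookup_is_devmap() are
hypothetical stand-ins, and the plain dereference only imitates
pmdp_get_lockless().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for pmd_t; the bit value below is made up for illustration. */
typedef struct { uint64_t val; } model_pmd_t;
#define MODEL_DEVMAP	(1ULL << 3)

/* Like pmd_devmap(), this helper takes the entry by value. */
static bool model_pmd_devmap(model_pmd_t pmd)
{
	return (pmd.val & MODEL_DEVMAP) != 0;
}

/*
 * Mirrors the shape of find_pmd_or_thp_or_none(): 'pmd' is an
 * out-parameter that receives the slot address, while 'pmde' is the
 * one-time snapshot that every subsequent check should look at.
 */
static bool model_lookup_is_devmap(model_pmd_t *slot, model_pmd_t **pmd)
{
	model_pmd_t pmde;

	*pmd = slot;
	pmde = **pmd;	/* imitates pmdp_get_lockless(*pmd) */

	/* model_pmd_devmap(pmd) would not compile: pmd is model_pmd_t **. */
	return model_pmd_devmap(pmde);
}
```

Testing the snapshot rather than re-reading through the pointer also keeps
every check in the function operating on the same pmd value, which is what
the case analysis in the commit message relies on.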