
[v13,2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

Message ID 20230417125630.1146906-3-usama.anjum@collabora.com (mailing list archive)
State New

Commit Message

Muhammad Usama Anjum April 17, 2023, 12:56 p.m. UTC
The PAGEMAP_SCAN IOCTL on the pagemap file can be used to get and/or clear
information about page table entries. The following operations are
supported by this ioctl:
- Get information about whether pages have been written to
  (PAGE_IS_WRITTEN), are file mapped (PAGE_IS_FILE), present
  (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Find pages which have been written to and write protect them
  (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)

This IOCTL can be extended to get information about more PTE bits.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v13:
- Review updates
- mmap_read_lock_killable() instead of mmap_read_lock()
- Replace uffd_wp_range() with helpers, which increases performance
  drastically for OP_WP operations by reducing the number of TLB
  flushes etc.
- Add MMU_NOTIFY_PROTECTION_VMA notification for the memory range

Changes in v12:
- Add hugetlb support to cover all memory types
- Merge "userfaultfd: Define dummy uffd_wp_range()" with this patch
- Review updates to the code

Changes in v11:
- Find written pages in a better way
- Fix a corner case (thanks Paul)
- Improve the code/comments
- Remove the ENGAGE_WP + !GET operation
- Shorten the commit message in favour of moving documentation to
  pagemap.rst

Changes in v10:
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Update commit message

Changes in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some unneeded code

Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl

Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to __u64
- Improve code quality

Changes in v5:
- Remove tlb flushing even for clear operation

Changes in v4:
- Update the interface and implementation

Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more
  error checking

Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl

---
 fs/proc/task_mmu.c      | 478 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  53 +++++
 2 files changed, 531 insertions(+)

Comments

kernel test robot April 17, 2023, 3:28 p.m. UTC | #1
Hi Muhammad,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[cannot apply to linus/master v6.3-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230417-210005
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230417125630.1146906-3-usama.anjum%40collabora.com
patch subject: [PATCH v13 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
config: i386-randconfig-a003-20230417 (https://download.01.org/0day-ci/archive/20230417/202304172319.CjlksY4s-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/2e7e6f5bd3b9c28493f9871282fe8f14e2263962
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230417-210005
        git checkout 2e7e6f5bd3b9c28493f9871282fe8f14e2263962
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 olddefconfig
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash fs/proc/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304172319.CjlksY4s-lkp@intel.com/

All warnings (new ones prefixed by >>):

   fs/proc/task_mmu.c: In function 'pagemap_scan_pte_hole':
   fs/proc/task_mmu.c:2059:14: error: invalid use of undefined type 'struct pagemap_scan_private'
    2059 |         if (p->max_pages && p->found_pages + n_pages > p->max_pages)
         |              ^~
   fs/proc/task_mmu.c:2059:30: error: invalid use of undefined type 'struct pagemap_scan_private'
    2059 |         if (p->max_pages && p->found_pages + n_pages > p->max_pages)
         |                              ^~
   fs/proc/task_mmu.c:2059:57: error: invalid use of undefined type 'struct pagemap_scan_private'
    2059 |         if (p->max_pages && p->found_pages + n_pages > p->max_pages)
         |                                                         ^~
   fs/proc/task_mmu.c:2060:28: error: invalid use of undefined type 'struct pagemap_scan_private'
    2060 |                 n_pages = p->max_pages - p->found_pages;
         |                            ^~
   fs/proc/task_mmu.c:2060:43: error: invalid use of undefined type 'struct pagemap_scan_private'
    2060 |                 n_pages = p->max_pages - p->found_pages;
         |                                           ^~
   fs/proc/task_mmu.c:2062:15: error: implicit declaration of function 'pagemap_scan_output' [-Werror=implicit-function-declaration]
    2062 |         ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
         |               ^~~~~~~~~~~~~~~~~~~
   fs/proc/task_mmu.c: At top level:
   fs/proc/task_mmu.c:2068:22: error: 'pagemap_scan_test_walk' undeclared here (not in a function); did you mean 'pagemap_scan_pte_hole'?
    2068 |         .test_walk = pagemap_scan_test_walk,
         |                      ^~~~~~~~~~~~~~~~~~~~~~
         |                      pagemap_scan_pte_hole
   fs/proc/task_mmu.c:2069:22: error: 'pagemap_scan_pmd_entry' undeclared here (not in a function); did you mean 'pagemap_scan_pte_hole'?
    2069 |         .pmd_entry = pagemap_scan_pmd_entry,
         |                      ^~~~~~~~~~~~~~~~~~~~~~
         |                      pagemap_scan_pte_hole
   fs/proc/task_mmu.c: In function 'pagemap_scan_args_valid':
   fs/proc/task_mmu.c:2080:27: error: 'PM_SCAN_OPS' undeclared (first use in this function); did you mean 'PM_SCAN_OP_WP'?
    2080 |         if (arg->flags & ~PM_SCAN_OPS)
         |                           ^~~~~~~~~~~
         |                           PM_SCAN_OP_WP
   fs/proc/task_mmu.c:2080:27: note: each undeclared identifier is reported only once for each function it appears in
   fs/proc/task_mmu.c:2085:35: error: 'PM_SCAN_BITS_ALL' undeclared (first use in this function); did you mean 'CPU_BITS_ALL'?
    2085 |              arg->return_mask) & ~PM_SCAN_BITS_ALL)
         |                                   ^~~~~~~~~~~~~~~~
         |                                   CPU_BITS_ALL
   fs/proc/task_mmu.c:2106:14: error: implicit declaration of function 'PM_SCAN_DO_UFFD_WP'; did you mean 'PM_SCAN_OP_WP'? [-Werror=implicit-function-declaration]
    2106 |         if (!PM_SCAN_DO_UFFD_WP(arg))
         |              ^~~~~~~~~~~~~~~~~~
         |              PM_SCAN_OP_WP
   fs/proc/task_mmu.c: In function 'do_pagemap_scan':
   fs/proc/task_mmu.c:2123:37: error: storage size of 'p' isn't known
    2123 |         struct pagemap_scan_private p;
         |                                     ^
   fs/proc/task_mmu.c:2148:21: error: 'PAGEMAP_WALK_SIZE' undeclared (first use in this function); did you mean 'PAGE_TABLE_SIZE'?
    2148 |         p.vec_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
         |                     ^~~~~~~~~~~~~~~~~
         |                     PAGE_TABLE_SIZE
   In file included from include/linux/kernel.h:27,
                    from arch/x86/include/asm/percpu.h:27,
                    from arch/x86/include/asm/preempt.h:6,
                    from include/linux/preempt.h:78,
                    from include/linux/spinlock.h:56,
                    from include/linux/mmzone.h:8,
                    from include/linux/gfp.h:7,
                    from include/linux/mm.h:7,
                    from include/linux/pagewalk.h:5,
                    from fs/proc/task_mmu.c:2:
   include/linux/minmax.h:36:9: error: first argument to '__builtin_choose_expr' not a constant
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |         ^~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:25: note: in expansion of macro '__careful_cmp'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~
   fs/proc/task_mmu.c:2173:29: note: in expansion of macro 'min'
    2173 |                 p.vec_len = min(p.vec_len, empty_slots);
         |                             ^~~
   fs/proc/task_mmu.c:2175:63: error: 'PAGEMAP_WALK_MASK' undeclared (first use in this function)
    2175 |                 walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
         |                                                               ^~~~~~~~~~~~~~~~~
   fs/proc/task_mmu.c:2186:53: error: 'PM_SCAN_FOUND_MAX_PAGES' undeclared (first use in this function)
    2186 |                 if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
         |                                                     ^~~~~~~~~~~~~~~~~~~~~~~
>> fs/proc/task_mmu.c:2123:37: warning: unused variable 'p' [-Wunused-variable]
    2123 |         struct pagemap_scan_private p;
         |                                     ^
   fs/proc/task_mmu.c: At top level:
   fs/proc/task_mmu.c:2240:27: error: 'pagemap_read' undeclared here (not in a function); did you mean 'filemap_read'?
    2240 |         .read           = pagemap_read,
         |                           ^~~~~~~~~~~~
         |                           filemap_read
   fs/proc/task_mmu.c:2241:27: error: 'pagemap_open' undeclared here (not in a function); did you mean 'pid_maps_open'?
    2241 |         .open           = pagemap_open,
         |                           ^~~~~~~~~~~~
         |                           pid_maps_open
   fs/proc/task_mmu.c:2242:27: error: 'pagemap_release' undeclared here (not in a function); did you mean 'proc_map_release'?
    2242 |         .release        = pagemap_release,
         |                           ^~~~~~~~~~~~~~~
         |                           proc_map_release
   fs/proc/task_mmu.c:2246:2: error: #endif without #if
    2246 | #endif /* CONFIG_PROC_PAGE_MONITOR */
         |  ^~~~~
   cc1: some warnings being treated as errors


vim +/p +2123 fs/proc/task_mmu.c

  2115	
  2116	static long do_pagemap_scan(struct mm_struct *mm,
  2117				   struct pm_scan_arg __user *uarg)
  2118	{
  2119		unsigned long start, end, walk_start, walk_end;
  2120		unsigned long empty_slots, vec_index = 0;
  2121		struct mmu_notifier_range range;
  2122		struct page_region __user *vec;
> 2123		struct pagemap_scan_private p;
  2124		struct pm_scan_arg arg;
  2125		int ret = 0;
  2126	
  2127		if (copy_from_user(&arg, uarg, sizeof(arg)))
  2128			return -EFAULT;
  2129	
  2130		start = untagged_addr((unsigned long)arg.start);
  2131		vec = (struct page_region *)untagged_addr((unsigned long)arg.vec);
  2132	
  2133		ret = pagemap_scan_args_valid(&arg, start, vec);
  2134		if (ret)
  2135			return ret;
  2136	
  2137		end = start + arg.len;
  2138		p.max_pages = arg.max_pages;
  2139		p.found_pages = 0;
  2140		p.flags = arg.flags;
  2141		p.required_mask = arg.required_mask;
  2142		p.anyof_mask = arg.anyof_mask;
  2143		p.excluded_mask = arg.excluded_mask;
  2144		p.return_mask = arg.return_mask;
  2145		p.cur.len = 0;
  2146		p.cur.start = 0;
  2147		p.vec = NULL;
  2148		p.vec_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
  2149	
  2150		/*
  2151		 * Allocate smaller buffer to get output from inside the page walk
  2152		 * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
  2153		 * we want to return output to user in compact form where no two
  2154		 * consecutive regions should be continuous and have the same flags.
  2155		 * So store the latest element in p.cur between different walks and
  2156		 * store the p.cur at the end of the walk to the user buffer.
  2157		 */
  2158		p.vec = kmalloc_array(p.vec_len, sizeof(*p.vec), GFP_KERNEL);
  2159		if (!p.vec)
  2160			return -ENOMEM;
  2161	
  2162		if (p.flags & PM_SCAN_OP_WP) {
  2163			mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
  2164						vma->vm_mm, start, end);
  2165			mmu_notifier_invalidate_range_start(&range);
  2166		}
  2167	
  2168		walk_start = walk_end = start;
  2169		while (walk_end < end && !ret) {
  2170			p.vec_index = 0;
  2171	
  2172			empty_slots = arg.vec_len - vec_index;
  2173			p.vec_len = min(p.vec_len, empty_slots);
  2174	
  2175			walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
  2176			if (walk_end > end)
  2177				walk_end = end;
  2178	
  2179			ret = mmap_read_lock_killable(mm);
  2180			if (ret)
  2181				goto free_data;
  2182			ret = walk_page_range(mm, walk_start, walk_end,
  2183					      &pagemap_scan_ops, &p);
  2184			mmap_read_unlock(mm);
  2185	
  2186			if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
  2187				goto free_data;
  2188	
  2189			walk_start = walk_end;
  2190			if (p.vec_index) {
  2191				if (copy_to_user(&vec[vec_index], p.vec,
  2192						 p.vec_index * sizeof(*p.vec))) {
  2193					/*
  2194					 * Return error even though the OP succeeded
  2195					 */
  2196					ret = -EFAULT;
  2197					goto free_data;
  2198				}
  2199				vec_index += p.vec_index;
  2200			}
  2201		}
  2202	
  2203		if (p.flags & PM_SCAN_OP_WP) {
  2204			mmu_notifier_invalidate_range_end(&range);
  2205			flush_tlb_mm_range(mm, start, end, PAGE_SHIFT, false);
  2206		}
  2207	
  2208		if (p.cur.len) {
  2209			if (copy_to_user(&vec[vec_index], &p.cur, sizeof(*p.vec))) {
  2210				ret = -EFAULT;
  2211				goto free_data;
  2212			}
  2213			vec_index++;
  2214		}
  2215	
  2216		ret = vec_index;
  2217	
  2218	free_data:
  2219		kfree(p.vec);
  2220		return ret;
  2221	}
  2222
kernel test robot April 17, 2023, 4:40 p.m. UTC | #2
Hi Muhammad,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[cannot apply to linus/master v6.3-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230417-210005
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230417125630.1146906-3-usama.anjum%40collabora.com
patch subject: [PATCH v13 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
config: i386-randconfig-a003-20230417 (https://download.01.org/0day-ci/archive/20230418/202304180050.anuTGBfy-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
        # https://github.com/intel-lab-lkp/linux/commit/2e7e6f5bd3b9c28493f9871282fe8f14e2263962
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230417-210005
        git checkout 2e7e6f5bd3b9c28493f9871282fe8f14e2263962
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 O=build_dir ARCH=i386 olddefconfig
        make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash fs/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304180050.anuTGBfy-lkp@intel.com/

All errors (new ones prefixed by >>):

   fs/proc/task_mmu.c: In function 'pagemap_scan_pte_hole':
>> fs/proc/task_mmu.c:2059:14: error: invalid use of undefined type 'struct pagemap_scan_private'
    2059 |         if (p->max_pages && p->found_pages + n_pages > p->max_pages)
         |              ^~
   fs/proc/task_mmu.c:2059:30: error: invalid use of undefined type 'struct pagemap_scan_private'
    2059 |         if (p->max_pages && p->found_pages + n_pages > p->max_pages)
         |                              ^~
   fs/proc/task_mmu.c:2059:57: error: invalid use of undefined type 'struct pagemap_scan_private'
    2059 |         if (p->max_pages && p->found_pages + n_pages > p->max_pages)
         |                                                         ^~
   fs/proc/task_mmu.c:2060:28: error: invalid use of undefined type 'struct pagemap_scan_private'
    2060 |                 n_pages = p->max_pages - p->found_pages;
         |                            ^~
   fs/proc/task_mmu.c:2060:43: error: invalid use of undefined type 'struct pagemap_scan_private'
    2060 |                 n_pages = p->max_pages - p->found_pages;
         |                                           ^~
>> fs/proc/task_mmu.c:2062:15: error: implicit declaration of function 'pagemap_scan_output' [-Werror=implicit-function-declaration]
    2062 |         ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
         |               ^~~~~~~~~~~~~~~~~~~
   fs/proc/task_mmu.c: At top level:
>> fs/proc/task_mmu.c:2068:22: error: 'pagemap_scan_test_walk' undeclared here (not in a function); did you mean 'pagemap_scan_pte_hole'?
    2068 |         .test_walk = pagemap_scan_test_walk,
         |                      ^~~~~~~~~~~~~~~~~~~~~~
         |                      pagemap_scan_pte_hole
>> fs/proc/task_mmu.c:2069:22: error: 'pagemap_scan_pmd_entry' undeclared here (not in a function); did you mean 'pagemap_scan_pte_hole'?
    2069 |         .pmd_entry = pagemap_scan_pmd_entry,
         |                      ^~~~~~~~~~~~~~~~~~~~~~
         |                      pagemap_scan_pte_hole
   fs/proc/task_mmu.c: In function 'pagemap_scan_args_valid':
>> fs/proc/task_mmu.c:2080:27: error: 'PM_SCAN_OPS' undeclared (first use in this function); did you mean 'PM_SCAN_OP_WP'?
    2080 |         if (arg->flags & ~PM_SCAN_OPS)
         |                           ^~~~~~~~~~~
         |                           PM_SCAN_OP_WP
   fs/proc/task_mmu.c:2080:27: note: each undeclared identifier is reported only once for each function it appears in
>> fs/proc/task_mmu.c:2085:35: error: 'PM_SCAN_BITS_ALL' undeclared (first use in this function); did you mean 'CPU_BITS_ALL'?
    2085 |              arg->return_mask) & ~PM_SCAN_BITS_ALL)
         |                                   ^~~~~~~~~~~~~~~~
         |                                   CPU_BITS_ALL
>> fs/proc/task_mmu.c:2106:14: error: implicit declaration of function 'PM_SCAN_DO_UFFD_WP'; did you mean 'PM_SCAN_OP_WP'? [-Werror=implicit-function-declaration]
    2106 |         if (!PM_SCAN_DO_UFFD_WP(arg))
         |              ^~~~~~~~~~~~~~~~~~
         |              PM_SCAN_OP_WP
   fs/proc/task_mmu.c: In function 'do_pagemap_scan':
>> fs/proc/task_mmu.c:2123:37: error: storage size of 'p' isn't known
    2123 |         struct pagemap_scan_private p;
         |                                     ^
>> fs/proc/task_mmu.c:2148:21: error: 'PAGEMAP_WALK_SIZE' undeclared (first use in this function); did you mean 'PAGE_TABLE_SIZE'?
    2148 |         p.vec_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
         |                     ^~~~~~~~~~~~~~~~~
         |                     PAGE_TABLE_SIZE
   In file included from include/linux/kernel.h:27,
                    from arch/x86/include/asm/percpu.h:27,
                    from arch/x86/include/asm/preempt.h:6,
                    from include/linux/preempt.h:78,
                    from include/linux/spinlock.h:56,
                    from include/linux/mmzone.h:8,
                    from include/linux/gfp.h:7,
                    from include/linux/mm.h:7,
                    from include/linux/pagewalk.h:5,
                    from fs/proc/task_mmu.c:2:
>> include/linux/minmax.h:36:9: error: first argument to '__builtin_choose_expr' not a constant
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |         ^~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:25: note: in expansion of macro '__careful_cmp'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~
   fs/proc/task_mmu.c:2173:29: note: in expansion of macro 'min'
    2173 |                 p.vec_len = min(p.vec_len, empty_slots);
         |                             ^~~
>> fs/proc/task_mmu.c:2175:63: error: 'PAGEMAP_WALK_MASK' undeclared (first use in this function)
    2175 |                 walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
         |                                                               ^~~~~~~~~~~~~~~~~
>> fs/proc/task_mmu.c:2186:53: error: 'PM_SCAN_FOUND_MAX_PAGES' undeclared (first use in this function)
    2186 |                 if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
         |                                                     ^~~~~~~~~~~~~~~~~~~~~~~
   fs/proc/task_mmu.c:2123:37: warning: unused variable 'p' [-Wunused-variable]
    2123 |         struct pagemap_scan_private p;
         |                                     ^
   fs/proc/task_mmu.c: At top level:
>> fs/proc/task_mmu.c:2240:27: error: 'pagemap_read' undeclared here (not in a function); did you mean 'filemap_read'?
    2240 |         .read           = pagemap_read,
         |                           ^~~~~~~~~~~~
         |                           filemap_read
>> fs/proc/task_mmu.c:2241:27: error: 'pagemap_open' undeclared here (not in a function); did you mean 'pid_maps_open'?
    2241 |         .open           = pagemap_open,
         |                           ^~~~~~~~~~~~
         |                           pid_maps_open
>> fs/proc/task_mmu.c:2242:27: error: 'pagemap_release' undeclared here (not in a function); did you mean 'proc_map_release'?
    2242 |         .release        = pagemap_release,
         |                           ^~~~~~~~~~~~~~~
         |                           proc_map_release
   fs/proc/task_mmu.c:2246:2: error: #endif without #if
    2246 | #endif /* CONFIG_PROC_PAGE_MONITOR */
         |  ^~~~~
   cc1: some warnings being treated as errors


vim +2059 fs/proc/task_mmu.c

  2047	
  2048	static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
  2049					 int depth, struct mm_walk *walk)
  2050	{
  2051		unsigned long n_pages = (end - addr)/PAGE_SIZE;
  2052		struct pagemap_scan_private *p = walk->private;
  2053		struct vm_area_struct *vma = walk->vma;
  2054		int ret = 0;
  2055	
  2056		if (!vma)
  2057			return 0;
  2058	
> 2059		if (p->max_pages && p->found_pages + n_pages > p->max_pages)
> 2060			n_pages = p->max_pages - p->found_pages;
  2061	
> 2062		ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
  2063					  n_pages);
  2064		return ret;
  2065	}
  2066	
  2067	static const struct mm_walk_ops pagemap_scan_ops = {
> 2068		.test_walk = pagemap_scan_test_walk,
> 2069		.pmd_entry = pagemap_scan_pmd_entry,
  2070		.pte_hole = pagemap_scan_pte_hole,
  2071		.hugetlb_entry = pagemap_scan_hugetlb_entry,
  2072	};
  2073	
  2074	static int pagemap_scan_args_valid(struct pm_scan_arg *arg, unsigned long start,
  2075					   struct page_region __user *vec)
  2076	{
  2077		/* Detect illegal size, flags, len and masks */
  2078		if (arg->size != sizeof(struct pm_scan_arg))
  2079			return -EINVAL;
> 2080		if (arg->flags & ~PM_SCAN_OPS)
  2081			return -EINVAL;
  2082		if (!arg->len)
  2083			return -EINVAL;
  2084		if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
> 2085		     arg->return_mask) & ~PM_SCAN_BITS_ALL)
  2086			return -EINVAL;
  2087		if (!arg->required_mask && !arg->anyof_mask &&
  2088		    !arg->excluded_mask)
  2089			return -EINVAL;
  2090		if (!arg->return_mask)
  2091			return -EINVAL;
  2092	
  2093		/* Validate memory ranges */
  2094		if (!(arg->flags & PM_SCAN_OP_GET))
  2095			return -EINVAL;
  2096		if (!arg->vec)
  2097			return -EINVAL;
  2098		if (arg->vec_len == 0)
  2099			return -EINVAL;
  2100	
  2101		if (!IS_ALIGNED(start, PAGE_SIZE))
  2102			return -EINVAL;
  2103		if (!access_ok((void __user *)start, arg->len))
  2104			return -EFAULT;
  2105	
> 2106		if (!PM_SCAN_DO_UFFD_WP(arg))
  2107			return 0;
  2108	
  2109		if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask) &
  2110		    ~PAGE_IS_WRITTEN)
  2111			return -EINVAL;
  2112	
  2113		return 0;
  2114	}
  2115	
  2116	static long do_pagemap_scan(struct mm_struct *mm,
  2117				   struct pm_scan_arg __user *uarg)
  2118	{
  2119		unsigned long start, end, walk_start, walk_end;
  2120		unsigned long empty_slots, vec_index = 0;
  2121		struct mmu_notifier_range range;
  2122		struct page_region __user *vec;
> 2123		struct pagemap_scan_private p;
  2124		struct pm_scan_arg arg;
  2125		int ret = 0;
  2126	
  2127		if (copy_from_user(&arg, uarg, sizeof(arg)))
  2128			return -EFAULT;
  2129	
  2130		start = untagged_addr((unsigned long)arg.start);
  2131		vec = (struct page_region *)untagged_addr((unsigned long)arg.vec);
  2132	
  2133		ret = pagemap_scan_args_valid(&arg, start, vec);
  2134		if (ret)
  2135			return ret;
  2136	
  2137		end = start + arg.len;
  2138		p.max_pages = arg.max_pages;
  2139		p.found_pages = 0;
  2140		p.flags = arg.flags;
  2141		p.required_mask = arg.required_mask;
  2142		p.anyof_mask = arg.anyof_mask;
  2143		p.excluded_mask = arg.excluded_mask;
  2144		p.return_mask = arg.return_mask;
  2145		p.cur.len = 0;
  2146		p.cur.start = 0;
  2147		p.vec = NULL;
> 2148		p.vec_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
  2149	
  2150		/*
  2151		 * Allocate smaller buffer to get output from inside the page walk
  2152		 * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
  2153		 * we want to return output to user in compact form where no two
  2154		 * consecutive regions should be continuous and have the same flags.
  2155		 * So store the latest element in p.cur between different walks and
  2156		 * store the p.cur at the end of the walk to the user buffer.
  2157		 */
  2158		p.vec = kmalloc_array(p.vec_len, sizeof(*p.vec), GFP_KERNEL);
  2159		if (!p.vec)
  2160			return -ENOMEM;
  2161	
  2162		if (p.flags & PM_SCAN_OP_WP) {
  2163			mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
  2164						vma->vm_mm, start, end);
  2165			mmu_notifier_invalidate_range_start(&range);
  2166		}
  2167	
  2168		walk_start = walk_end = start;
  2169		while (walk_end < end && !ret) {
  2170			p.vec_index = 0;
  2171	
  2172			empty_slots = arg.vec_len - vec_index;
> 2173			p.vec_len = min(p.vec_len, empty_slots);
  2174	
> 2175			walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
  2176			if (walk_end > end)
  2177				walk_end = end;
  2178	
  2179			ret = mmap_read_lock_killable(mm);
  2180			if (ret)
  2181				goto free_data;
  2182			ret = walk_page_range(mm, walk_start, walk_end,
  2183					      &pagemap_scan_ops, &p);
  2184			mmap_read_unlock(mm);
  2185	
> 2186			if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
  2187				goto free_data;
  2188	
  2189			walk_start = walk_end;
  2190			if (p.vec_index) {
  2191				if (copy_to_user(&vec[vec_index], p.vec,
  2192						 p.vec_index * sizeof(*p.vec))) {
  2193					/*
  2194					 * Return error even though the OP succeeded
  2195					 */
  2196					ret = -EFAULT;
  2197					goto free_data;
  2198				}
  2199				vec_index += p.vec_index;
  2200			}
  2201		}
  2202	
  2203		if (p.flags & PM_SCAN_OP_WP) {
  2204			mmu_notifier_invalidate_range_end(&range);
  2205			flush_tlb_mm_range(mm, start, end, PAGE_SHIFT, false);
  2206		}
  2207	
  2208		if (p.cur.len) {
  2209			if (copy_to_user(&vec[vec_index], &p.cur, sizeof(*p.vec))) {
  2210				ret = -EFAULT;
  2211				goto free_data;
  2212			}
  2213			vec_index++;
  2214		}
  2215	
  2216		ret = vec_index;
  2217	
  2218	free_data:
  2219		kfree(p.vec);
  2220		return ret;
  2221	}
  2222	
  2223	static long do_pagemap_cmd(struct file *file, unsigned int cmd,
  2224				       unsigned long arg)
  2225	{
  2226		struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
  2227		struct mm_struct *mm = file->private_data;
  2228	
  2229		switch (cmd) {
  2230		case PAGEMAP_SCAN:
  2231			return do_pagemap_scan(mm, uarg);
  2232	
  2233		default:
  2234			return -EINVAL;
  2235		}
  2236	}
  2237	
  2238	const struct file_operations proc_pagemap_operations = {
  2239		.llseek		= mem_lseek, /* borrow this */
> 2240		.read		= pagemap_read,
> 2241		.open		= pagemap_open,
> 2242		.release	= pagemap_release,
  2243		.unlocked_ioctl = do_pagemap_cmd,
  2244		.compat_ioctl	= do_pagemap_cmd,
  2245	};
  2246	#endif /* CONFIG_PROC_PAGE_MONITOR */
  2247
Peter Xu April 17, 2023, 7:42 p.m. UTC | #3
Muhammad,

On Mon, Apr 17, 2023 at 05:56:27PM +0500, Muhammad Usama Anjum wrote:
> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +				  unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long addr = end;
> +	pte_t *pte, *orig_pte;
> +	spinlock_t *ptl;
> +	bool is_written;
> +	int ret = 0;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		unsigned long n_pages = (end - start)/PAGE_SIZE;
> +
> +		if (p->max_pages && n_pages > p->max_pages - p->found_pages)
> +			n_pages = p->max_pages - p->found_pages;
> +
> +		is_written = !is_pmd_uffd_wp(*pmd);
> +
> +		/*
> +		 * Break huge page into small pages if the WP operation need to
> +		 * be performed is on a portion of the huge page.
> +		 */
> +		if (is_written && PM_SCAN_DO_UFFD_WP(p) &&
> +		    n_pages < HPAGE_SIZE/PAGE_SIZE) {
> +			spin_unlock(ptl);
> +			split_huge_pmd(vma, pmd, start);
> +			goto process_smaller_pages;
> +		}
> +
> +		ret = pagemap_scan_output(is_written, vma->vm_file,
> +					  pmd_present(*pmd), is_swap_pmd(*pmd),
> +					  p, start, n_pages);
> +
> +		if (ret >= 0 && is_written && PM_SCAN_DO_UFFD_WP(p))
> +			make_uffd_wp_pmd(vma, addr, pmd);
> +
> +		spin_unlock(ptl);
> +		return ret;
> +	}
> +process_smaller_pages:
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +#endif
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
> +	for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
> +		is_written = !is_pte_uffd_wp(*pte);
> +
> +		ret = pagemap_scan_output(is_written, vma->vm_file,
> +					  pte_present(*pte), is_swap_pte(*pte),
> +					  p, addr, 1);
> +
> +		if (ret >= 0 && is_written && PM_SCAN_DO_UFFD_WP(p))
> +			make_uffd_wp_pte(vma, addr, pte);
> +	}
> +	pte_unmap_unlock(orig_pte, ptl);

IIUC, TLB flushes and MMU notifications are still missing here, am I right?

Thanks,

> +
> +	cond_resched();
> +	return ret;
> +}
Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 38b19a757281..3d091095d404 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@ 
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1767,11 +1768,488 @@  static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+#define PM_SCAN_FOUND_MAX_PAGES	(1)
+#define PM_SCAN_BITS_ALL	(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PM_SCAN_OPS		(PM_SCAN_OP_GET | PM_SCAN_OP_WP)
+#define PM_SCAN_DO_UFFD_WP(a)	((a)->flags & PM_SCAN_OP_WP)
+#define PM_SCAN_BITMAP(wt, file, present, swap)	\
+	((wt) | ((file) << 1) | ((present) << 2) | ((swap) << 3))
+
+struct pagemap_scan_private {
+	struct page_region *vec;
+	struct page_region cur;
+	unsigned long vec_len, vec_index;
+	unsigned int max_pages, found_pages, flags;
+	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+
+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	return (pte_present(pte) && pte_uffd_wp(pte)) ||
+	       pte_swp_uffd_wp_any(pte);
+}
+
+static inline void make_uffd_wp_pte(struct vm_area_struct *vma,
+				    unsigned long addr, pte_t *pte)
+{
+	pte_t ptent = *pte;
+
+	if (pte_present(ptent)) {
+		pte_t old_pte;
+
+		old_pte = ptep_modify_prot_start(vma, addr, pte);
+		ptent = pte_mkuffd_wp(ptent);
+		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
+	} else if (is_swap_pte(ptent)) {
+		ptent = pte_swp_mkuffd_wp(ptent);
+		set_pte_at(vma->vm_mm, addr, pte, ptent);
+	}
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	return (pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+	       (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd));
+}
+
+static inline void make_uffd_wp_pmd(struct vm_area_struct *vma,
+		unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t old, pmd = *pmdp;
+
+	if (pmd_present(pmd)) {
+		old = pmdp_invalidate_ad(vma, addr, pmdp);
+		pmd = pmd_mkuffd_wp(old);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+		pmd = pmd_swp_mkuffd_wp(pmd);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	}
+}
+#endif
+
+#ifdef CONFIG_HUGETLB_PAGE
+static inline bool is_huge_pte_uffd_wp(pte_t pte)
+{
+	return ((pte_present(pte) && huge_pte_uffd_wp(pte)) ||
+		pte_swp_uffd_wp_any(pte));
+}
+
+static inline void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t *ptep,
+					 pte_t ptent)
+{
+	pte_t old_pte;
+
+	if (!huge_pte_none(ptent)) {
+		old_pte = huge_ptep_modify_prot_start(vma, addr, ptep);
+		ptent = huge_pte_mkuffd_wp(old_pte);
+		ptep_modify_prot_commit(vma, addr, ptep, old_pte, ptent);
+	} else {
+		set_huge_pte_at(vma->vm_mm, addr, ptep,
+				make_pte_marker(PTE_MARKER_UFFD_WP));
+	}
+}
+#endif
+
+static inline bool pagemap_scan_check_page_written(struct pagemap_scan_private *p)
+{
+	return (p->required_mask | p->anyof_mask | p->excluded_mask) &
+	       PAGE_IS_WRITTEN;
+}
+
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
+				  struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (pagemap_scan_check_page_written(p) && (!userfaultfd_wp(vma) ||
+	    !userfaultfd_wp_async(vma)))
+		return -EPERM;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+
+	return 0;
+}
+
+static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
+			       struct pagemap_scan_private *p,
+			       unsigned long addr, unsigned int n_pages)
+{
+	unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
+	struct page_region *cur = &p->cur;
+
+	if (!n_pages)
+		return -EINVAL;
+
+	if ((p->required_mask & bitmap) != p->required_mask)
+		return 0;
+	if (p->anyof_mask && !(p->anyof_mask & bitmap))
+		return 0;
+	if (p->excluded_mask & bitmap)
+		return 0;
+
+	bitmap &= p->return_mask;
+	if (!bitmap)
+		return 0;
+
+	if (cur->bitmap == bitmap &&
+	    cur->start + cur->len * PAGE_SIZE == addr) {
+		cur->len += n_pages;
+		p->found_pages += n_pages;
+
+		if (p->max_pages && (p->found_pages == p->max_pages))
+			return PM_SCAN_FOUND_MAX_PAGES;
+
+		return 0;
+	}
+
+	/*
+	 * All data is copied to cur first. When more data is found, we push
+	 * cur to vec and copy new data to cur. The vec_index represents the
+	 * current index of vec array. We add 1 to the vec_index while
+	 * performing checks to account for data in cur.
+	 */
+	if (p->vec_index && (p->vec_index + 1) >= p->vec_len)
+		return -ENOSPC;
+
+	if (cur->len) {
+		memcpy(&p->vec[p->vec_index], cur, sizeof(*p->vec));
+		p->vec_index++;
+	}
+
+	cur->start = addr;
+	cur->len = n_pages;
+	cur->bitmap = bitmap;
+	p->found_pages += n_pages;
+
+	if (p->max_pages && (p->found_pages == p->max_pages))
+		return PM_SCAN_FOUND_MAX_PAGES;
+
+	return 0;
+}
+
+static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+				  unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = end;
+	pte_t *pte, *orig_pte;
+	spinlock_t *ptl;
+	bool is_written;
+	int ret = 0;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		unsigned long n_pages = (end - start)/PAGE_SIZE;
+
+		if (p->max_pages && n_pages > p->max_pages - p->found_pages)
+			n_pages = p->max_pages - p->found_pages;
+
+		is_written = !is_pmd_uffd_wp(*pmd);
+
+		/*
+		 * Split the huge page into small pages if the WP operation
+		 * needs to be performed on only a portion of the huge page.
+		 */
+		if (is_written && PM_SCAN_DO_UFFD_WP(p) &&
+		    n_pages < HPAGE_SIZE/PAGE_SIZE) {
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
+		}
+
+		ret = pagemap_scan_output(is_written, vma->vm_file,
+					  pmd_present(*pmd), is_swap_pmd(*pmd),
+					  p, start, n_pages);
+
+		if (ret >= 0 && is_written && PM_SCAN_DO_UFFD_WP(p))
+			make_uffd_wp_pmd(vma, start, pmd);
+
+		spin_unlock(ptl);
+		return ret;
+	}
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd))
+		return 0;
+#endif
+
+	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+	for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
+		is_written = !is_pte_uffd_wp(*pte);
+
+		ret = pagemap_scan_output(is_written, vma->vm_file,
+					  pte_present(*pte), is_swap_pte(*pte),
+					  p, addr, 1);
+
+		if (ret >= 0 && is_written && PM_SCAN_DO_UFFD_WP(p))
+			make_uffd_wp_pte(vma, addr, pte);
+	}
+	pte_unmap_unlock(orig_pte, ptl);
+
+	cond_resched();
+	return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
+				      unsigned long start, unsigned long end,
+				      struct mm_walk *walk)
+{
+	unsigned long n_pages = (end - start)/PAGE_SIZE;
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct hstate *h = hstate_vma(vma);
+	int ret = -EPERM;
+	spinlock_t *ptl;
+	bool is_written;
+	pte_t pte;
+
+	if (p->max_pages && n_pages > p->max_pages - p->found_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	if (PM_SCAN_DO_UFFD_WP(p)) {
+		i_mmap_lock_write(vma->vm_file->f_mapping);
+		ptl = huge_pte_lock(h, vma->vm_mm, ptep);
+	}
+
+	pte = huge_ptep_get(ptep);
+	is_written = !is_huge_pte_uffd_wp(pte);
+
+	/*
+	 * Partial hugetlb page clear isn't supported
+	 */
+	if (is_written && PM_SCAN_DO_UFFD_WP(p) &&
+	    n_pages < HPAGE_SIZE/PAGE_SIZE)
+		goto unlock_and_return;
+
+	ret = pagemap_scan_output(is_written, vma->vm_file, pte_present(pte),
+				  is_swap_pte(pte), p, start, n_pages);
+	if (ret < 0)
+		goto unlock_and_return;
+
+	if (is_written && PM_SCAN_DO_UFFD_WP(p)) {
+		make_uffd_wp_huge_pte(vma, start, ptep, pte);
+#ifdef __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
+		flush_hugetlb_tlb_range(vma, start, end);
+#endif
+	}
+
+unlock_and_return:
+	if (PM_SCAN_DO_UFFD_WP(p)) {
+		spin_unlock(ptl);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	}
+
+	return ret;
+}
+#else
+#define pagemap_scan_hugetlb_entry NULL
+#endif
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
+				 int depth, struct mm_walk *walk)
+{
+	unsigned long n_pages = (end - addr)/PAGE_SIZE;
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int ret = 0;
+
+	if (!vma)
+		return 0;
+
+	if (p->max_pages && p->found_pages + n_pages > p->max_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
+				  n_pages);
+	return ret;
+}
+
+static const struct mm_walk_ops pagemap_scan_ops = {
+	.test_walk = pagemap_scan_test_walk,
+	.pmd_entry = pagemap_scan_pmd_entry,
+	.pte_hole = pagemap_scan_pte_hole,
+	.hugetlb_entry = pagemap_scan_hugetlb_entry,
+};
+
+static int pagemap_scan_args_valid(struct pm_scan_arg *arg, unsigned long start,
+				   struct page_region __user *vec)
+{
+	/* Detect illegal size, flags, len and masks */
+	if (arg->size != sizeof(struct pm_scan_arg))
+		return -EINVAL;
+	if (arg->flags & ~PM_SCAN_OPS)
+		return -EINVAL;
+	if (!arg->len)
+		return -EINVAL;
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
+	     arg->return_mask) & ~PM_SCAN_BITS_ALL)
+		return -EINVAL;
+	if (!arg->required_mask && !arg->anyof_mask &&
+	    !arg->excluded_mask)
+		return -EINVAL;
+	if (!arg->return_mask)
+		return -EINVAL;
+
+	/* Validate memory ranges */
+	if (!(arg->flags & PM_SCAN_OP_GET))
+		return -EINVAL;
+	if (!arg->vec)
+		return -EINVAL;
+	if (arg->vec_len == 0)
+		return -EINVAL;
+
+	if (!IS_ALIGNED(start, PAGE_SIZE))
+		return -EINVAL;
+	if (!access_ok((void __user *)start, arg->len))
+		return -EFAULT;
+
+	if (!PM_SCAN_DO_UFFD_WP(arg))
+		return 0;
+
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask) &
+	    ~PAGE_IS_WRITTEN)
+		return -EINVAL;
+
+	return 0;
+}
+
+static long do_pagemap_scan(struct mm_struct *mm,
+			   struct pm_scan_arg __user *uarg)
+{
+	unsigned long start, end, walk_start, walk_end;
+	unsigned long empty_slots, vec_index = 0;
+	struct mmu_notifier_range range;
+	struct page_region __user *vec;
+	struct pagemap_scan_private p;
+	struct pm_scan_arg arg;
+	int ret = 0;
+
+	if (copy_from_user(&arg, uarg, sizeof(arg)))
+		return -EFAULT;
+
+	start = untagged_addr((unsigned long)arg.start);
+	vec = (struct page_region *)untagged_addr((unsigned long)arg.vec);
+
+	ret = pagemap_scan_args_valid(&arg, start, vec);
+	if (ret)
+		return ret;
+
+	end = start + arg.len;
+	p.max_pages = arg.max_pages;
+	p.found_pages = 0;
+	p.flags = arg.flags;
+	p.required_mask = arg.required_mask;
+	p.anyof_mask = arg.anyof_mask;
+	p.excluded_mask = arg.excluded_mask;
+	p.return_mask = arg.return_mask;
+	p.cur.len = 0;
+	p.cur.start = 0;
+	p.vec = NULL;
+	p.vec_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
+
+	/*
+	 * Allocate a smaller buffer to collect output from the page walk
+	 * functions and walk the range in PAGEMAP_WALK_SIZE sized chunks.
+	 * The output is returned to the user in compact form, where no two
+	 * consecutive regions are contiguous and have the same flags. So
+	 * keep the latest element in p.cur between walks and copy p.cur to
+	 * the user buffer once the last walk has finished.
+	 */
+	p.vec = kmalloc_array(p.vec_len, sizeof(*p.vec), GFP_KERNEL);
+	if (!p.vec)
+		return -ENOMEM;
+
+	if (p.flags & PM_SCAN_OP_WP) {
+		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
+					mm, start, end);
+		mmu_notifier_invalidate_range_start(&range);
+	}
+
+	walk_start = walk_end = start;
+	while (walk_end < end && !ret) {
+		p.vec_index = 0;
+
+		empty_slots = arg.vec_len - vec_index;
+		p.vec_len = min(p.vec_len, empty_slots);
+
+		walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
+		if (walk_end > end)
+			walk_end = end;
+
+		ret = mmap_read_lock_killable(mm);
+		if (ret)
+			goto free_data;
+		ret = walk_page_range(mm, walk_start, walk_end,
+				      &pagemap_scan_ops, &p);
+		mmap_read_unlock(mm);
+
+		if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
+			goto free_data;
+
+		walk_start = walk_end;
+		if (p.vec_index) {
+			if (copy_to_user(&vec[vec_index], p.vec,
+					 p.vec_index * sizeof(*p.vec))) {
+				/*
+				 * Return error even though the OP succeeded
+				 */
+				ret = -EFAULT;
+				goto free_data;
+			}
+			vec_index += p.vec_index;
+		}
+	}
+
+	if (p.flags & PM_SCAN_OP_WP) {
+		mmu_notifier_invalidate_range_end(&range);
+		flush_tlb_mm_range(mm, start, end, PAGE_SHIFT, false);
+	}
+
+	if (p.cur.len) {
+		if (copy_to_user(&vec[vec_index], &p.cur, sizeof(*p.vec))) {
+			ret = -EFAULT;
+			goto free_data;
+		}
+		vec_index++;
+	}
+
+	ret = vec_index;
+
+free_data:
+	kfree(p.vec);
+	return ret;
+}
+
+static long do_pagemap_cmd(struct file *file, unsigned int cmd,
+			       unsigned long arg)
+{
+	struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
+	struct mm_struct *mm = file->private_data;
+
+	switch (cmd) {
+	case PAGEMAP_SCAN:
+		return do_pagemap_scan(mm, uarg);
+
+	default:
+		return -EINVAL;
+	}
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = do_pagemap_cmd,
+	.compat_ioctl	= do_pagemap_cmd,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..47879c38ce2f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,57 @@  typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pm_scan_arg */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region in pages
+ * @bitmap:	Bits set for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pm_scan_arg - Pagemap ioctl argument
+ * @size:		Size of the structure
+ * @flags:		Flags for the IOCTL
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pm_scan_arg {
+	__u64 size;
+	__u64 flags;
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u64 max_pages;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Supported flags */
+#define PM_SCAN_OP_GET	(1 << 0)
+#define PM_SCAN_OP_WP	(1 << 1)
+
 #endif /* _UAPI_LINUX_FS_H */
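
The interaction of required_mask, anyof_mask, excluded_mask and return_mask documented in struct pm_scan_arg can be modelled outside the kernel. The sketch below mirrors the filter order used by pagemap_scan_output() in this patch. The PAGE_IS_* values are copied from the uapi additions above, while struct scan_masks and model_filter() are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Bit values copied from the uapi additions in this patch. */
#define PAGE_IS_WRITTEN		(1 << 0)
#define PAGE_IS_FILE		(1 << 1)
#define PAGE_IS_PRESENT		(1 << 2)
#define PAGE_IS_SWAPPED		(1 << 3)

struct scan_masks {		/* hypothetical container for the four masks */
	uint64_t required;	/* all of these bits must be set */
	uint64_t anyof;		/* if non-zero, at least one bit must be set */
	uint64_t excluded;	/* none of these bits may be set */
	uint64_t ret;		/* bits that are reported back */
};

/* Returns the bits to report, or 0 when the page is filtered out. */
static uint64_t model_filter(const struct scan_masks *m, uint64_t bitmap)
{
	if ((m->required & bitmap) != m->required)
		return 0;
	if (m->anyof && !(m->anyof & bitmap))
		return 0;
	if (m->excluded & bitmap)
		return 0;

	return bitmap & m->ret;
}
```

For example, with required = PAGE_IS_WRITTEN, excluded = PAGE_IS_FILE and ret = PAGE_IS_WRITTEN | PAGE_IS_PRESENT, a written and swapped anonymous page is reported with only PAGE_IS_WRITTEN set, and any file-backed page is skipped.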