diff mbox series

[v12,2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

Message ID 20230406074005.1784728-3-usama.anjum@collabora.com (mailing list archive)
State New
Headers show
Series Implement IOCTL to get and optionally clear info about PTEs | expand

Commit Message

Muhammad Usama Anjum April 6, 2023, 7:40 a.m. UTC
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are supported
in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
  file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
  (PAGE_IS_SWAPPED).
- Find pages which have been written-to and write protect the pages
  (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)

This IOCTL can be extended to get information about more PTE bits.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v12:
- Add hugetlb support to cover all memory types
- Merge "userfaultfd: Define dummy uffd_wp_range()" with this patch
- Review updates to the code

Changes in v11:
- Find written pages in a better way
- Fix a corner case (thanks Paul)
- Improve the code/comments
- remove ENGAGE_WP + ! GET operation
- shorten the commit message in favour of moving documentation to
  pagemap.rst

Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message

Change in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code

Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl

Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to _u64
- Improve code quality

Changes in v5:
- Remove tlb flushing even for clear operation

Changes in v4:
- Update the interface and implementation

Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more
  error checking

Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
---
 fs/proc/task_mmu.c            | 426 ++++++++++++++++++++++++++++++++++
 include/linux/userfaultfd_k.h |   8 +
 include/uapi/linux/fs.h       |  53 +++++
 3 files changed, 487 insertions(+)

Comments

kernel test robot April 6, 2023, 11:40 a.m. UTC | #1
Hi Muhammad,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20230406]
[cannot apply to linus/master v6.3-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230406074005.1784728-3-usama.anjum%40collabora.com
patch subject: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
config: arc-randconfig-r023-20230405 (https://download.01.org/0day-ci/archive/20230406/202304061914.N1Hmx12N-lkp@intel.com/config)
compiler: arceb-elf-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/f13abb36f64c77913509da8ca157512d2fb9f031
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
        git checkout f13abb36f64c77913509da8ca157512d2fb9f031
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash fs/proc/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304061914.N1Hmx12N-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   fs/proc/task_mmu.c: In function 'pagemap_scan_pmd_entry':
>> fs/proc/task_mmu.c:1921:37: error: 'HPAGE_SIZE' undeclared (first use in this function); did you mean 'PAGE_SIZE'?
    1921 |                     ((end - start < HPAGE_SIZE) ||
         |                                     ^~~~~~~~~~
         |                                     PAGE_SIZE
   fs/proc/task_mmu.c:1921:37: note: each undeclared identifier is reported only once for each function it appears in
>> fs/proc/task_mmu.c:1939:35: error: passing argument 1 of 'uffd_wp_range' from incompatible pointer type [-Werror=incompatible-pointer-types]
    1939 |                     uffd_wp_range(vma, start, HPAGE_SIZE, true) < 0)
         |                                   ^~~
         |                                   |
         |                                   struct vm_area_struct *
   In file included from include/linux/mm_inline.h:9,
                    from fs/proc/task_mmu.c:3:
   include/linux/userfaultfd_k.h:215:52: note: expected 'struct mm_struct *' but argument is of type 'struct vm_area_struct *'
     215 | static inline long uffd_wp_range(struct mm_struct *dst_mm,
         |                                  ~~~~~~~~~~~~~~~~~~^~~~~~
>> fs/proc/task_mmu.c:1939:40: warning: passing argument 2 of 'uffd_wp_range' makes pointer from integer without a cast [-Wint-conversion]
    1939 |                     uffd_wp_range(vma, start, HPAGE_SIZE, true) < 0)
         |                                        ^~~~~
         |                                        |
         |                                        long unsigned int
   include/linux/userfaultfd_k.h:216:57: note: expected 'struct vm_area_struct *' but argument is of type 'long unsigned int'
     216 |                                  struct vm_area_struct *vma,
         |                                  ~~~~~~~~~~~~~~~~~~~~~~~^~~
>> fs/proc/task_mmu.c:1939:21: error: too few arguments to function 'uffd_wp_range'
    1939 |                     uffd_wp_range(vma, start, HPAGE_SIZE, true) < 0)
         |                     ^~~~~~~~~~~~~
   include/linux/userfaultfd_k.h:215:20: note: declared here
     215 | static inline long uffd_wp_range(struct mm_struct *dst_mm,
         |                    ^~~~~~~~~~~~~
   fs/proc/task_mmu.c:1965:35: error: passing argument 1 of 'uffd_wp_range' from incompatible pointer type [-Werror=incompatible-pointer-types]
    1965 |                     uffd_wp_range(vma, addr, PAGE_SIZE, true) < 0)
         |                                   ^~~
         |                                   |
         |                                   struct vm_area_struct *
   include/linux/userfaultfd_k.h:215:52: note: expected 'struct mm_struct *' but argument is of type 'struct vm_area_struct *'
     215 | static inline long uffd_wp_range(struct mm_struct *dst_mm,
         |                                  ~~~~~~~~~~~~~~~~~~^~~~~~
   fs/proc/task_mmu.c:1965:40: warning: passing argument 2 of 'uffd_wp_range' makes pointer from integer without a cast [-Wint-conversion]
    1965 |                     uffd_wp_range(vma, addr, PAGE_SIZE, true) < 0)
         |                                        ^~~~
         |                                        |
         |                                        long unsigned int
   include/linux/userfaultfd_k.h:216:57: note: expected 'struct vm_area_struct *' but argument is of type 'long unsigned int'
     216 |                                  struct vm_area_struct *vma,
         |                                  ~~~~~~~~~~~~~~~~~~~~~~~^~~
   fs/proc/task_mmu.c:1965:21: error: too few arguments to function 'uffd_wp_range'
    1965 |                     uffd_wp_range(vma, addr, PAGE_SIZE, true) < 0)
         |                     ^~~~~~~~~~~~~
   include/linux/userfaultfd_k.h:215:20: note: declared here
     215 | static inline long uffd_wp_range(struct mm_struct *dst_mm,
         |                    ^~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +1921 fs/proc/task_mmu.c

  1891	
  1892	static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
  1893					  unsigned long end, struct mm_walk *walk)
  1894	{
  1895		struct pagemap_scan_private *p = walk->private;
  1896		bool is_written, is_file, is_present, is_swap;
  1897		struct vm_area_struct *vma = walk->vma;
  1898		unsigned long addr = end;
  1899		spinlock_t *ptl;
  1900		int ret = 0;
  1901		pte_t *pte;
  1902	
  1903	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
  1904		ptl = pmd_trans_huge_lock(pmd, vma);
  1905		if (ptl) {
  1906			unsigned long n_pages = (end - start)/PAGE_SIZE;
  1907	
  1908			is_written = !is_pmd_uffd_wp(*pmd);
  1909			is_file = vma->vm_file;
  1910			is_present = pmd_present(*pmd);
  1911			is_swap = is_swap_pmd(*pmd);
  1912	
  1913			spin_unlock(ptl);
  1914	
  1915			/*
  1916			 * Break huge page into small pages if the WP operation need to
  1917			 * be performed is on a portion of the huge page or if max_pages
  1918			 * pages limit would exceed.
  1919			 */
  1920			if (is_written && PM_SCAN_OP_IS_WP(p) &&
> 1921			    ((end - start < HPAGE_SIZE) ||
  1922			     (p->max_pages &&
  1923			      (p->max_pages - p->found_pages) < n_pages))) {
  1924	
  1925				split_huge_pmd(vma, pmd, start);
  1926				goto process_smaller_pages;
  1927			}
  1928	
  1929			if (p->max_pages &&
  1930			    p->found_pages + n_pages > p->max_pages)
  1931				n_pages = p->max_pages - p->found_pages;
  1932	
  1933			ret = pagemap_scan_output(is_written, is_file, is_present,
  1934						  is_swap, p, start, n_pages);
  1935			if (ret < 0)
  1936				return ret;
  1937	
  1938			if (is_written && PM_SCAN_OP_IS_WP(p) &&
> 1939			    uffd_wp_range(vma, start, HPAGE_SIZE, true) < 0)
  1940				ret = -EINVAL;
  1941	
  1942			return ret;
  1943		}
  1944	process_smaller_pages:
  1945		if (pmd_trans_unstable(pmd))
  1946			return 0;
  1947	#endif
  1948	
  1949		for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
  1950			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
  1951	
  1952			is_written = !is_pte_uffd_wp(*pte);
  1953			is_file = vma->vm_file;
  1954			is_present = pte_present(*pte);
  1955			is_swap = is_swap_pte(*pte);
  1956	
  1957			pte_unmap_unlock(pte, ptl);
  1958	
  1959			ret = pagemap_scan_output(is_written, is_file, is_present,
  1960						  is_swap, p, addr, 1);
  1961			if (ret < 0)
  1962				return ret;
  1963	
  1964			if (is_written && PM_SCAN_OP_IS_WP(p) &&
  1965			    uffd_wp_range(vma, addr, PAGE_SIZE, true) < 0)
  1966				return -EINVAL;
  1967		}
  1968	
  1969		cond_resched();
  1970		return ret;
  1971	}
  1972
kernel test robot April 6, 2023, noon UTC | #2
Hi Muhammad,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20230406]
[cannot apply to linus/master v6.3-rc5]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230406074005.1784728-3-usama.anjum%40collabora.com
patch subject: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
config: arm-randconfig-r021-20230405 (https://download.01.org/0day-ci/archive/20230406/202304061924.m0OL9fIj-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project 67409911353323ca5edf2049ef0df54132fa1ca7)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/intel-lab-lkp/linux/commit/f13abb36f64c77913509da8ca157512d2fb9f031
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
        git checkout f13abb36f64c77913509da8ca157512d2fb9f031
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash fs/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202304061924.m0OL9fIj-lkp@intel.com/

All errors (new ones prefixed by >>):

>> fs/proc/task_mmu.c:1965:47: error: too few arguments to function call, expected 5, have 4
                       uffd_wp_range(vma, addr, PAGE_SIZE, true) < 0)
                       ~~~~~~~~~~~~~                           ^
   include/linux/userfaultfd_k.h:215:20: note: 'uffd_wp_range' declared here
   static inline long uffd_wp_range(struct mm_struct *dst_mm,
                      ^
   1 error generated.


vim +1965 fs/proc/task_mmu.c

  1948	
  1949		for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
  1950			pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
  1951	
  1952			is_written = !is_pte_uffd_wp(*pte);
  1953			is_file = vma->vm_file;
  1954			is_present = pte_present(*pte);
  1955			is_swap = is_swap_pte(*pte);
  1956	
  1957			pte_unmap_unlock(pte, ptl);
  1958	
  1959			ret = pagemap_scan_output(is_written, is_file, is_present,
  1960						  is_swap, p, addr, 1);
  1961			if (ret < 0)
  1962				return ret;
  1963	
  1964			if (is_written && PM_SCAN_OP_IS_WP(p) &&
> 1965			    uffd_wp_range(vma, addr, PAGE_SIZE, true) < 0)
  1966				return -EINVAL;
  1967		}
  1968	
  1969		cond_resched();
  1970		return ret;
  1971	}
  1972
Muhammad Usama Anjum April 6, 2023, 12:56 p.m. UTC | #3
On 4/6/23 4:40 PM, kernel test robot wrote:
> Hi Muhammad,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on akpm-mm/mm-everything]
> [also build test ERROR on next-20230406]
> [cannot apply to linus/master v6.3-rc5]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20230406074005.1784728-3-usama.anjum%40collabora.com
> patch subject: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
> config: arc-randconfig-r023-20230405 (https://download.01.org/0day-ci/archive/20230406/202304061914.N1Hmx12N-lkp@intel.com/config)
> compiler: arceb-elf-gcc (GCC) 12.1.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/intel-lab-lkp/linux/commit/f13abb36f64c77913509da8ca157512d2fb9f031
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
>         git checkout f13abb36f64c77913509da8ca157512d2fb9f031
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc olddefconfig
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash fs/proc/
> 
> If you fix the issue, kindly add following tag where applicable
> | Reported-by: kernel test robot <lkp@intel.com>
> | Link: https://lore.kernel.org/oe-kbuild-all/202304061914.N1Hmx12N-lkp@intel.com/
> 
> All error/warnings (new ones prefixed by >>):
> 
>    fs/proc/task_mmu.c: In function 'pagemap_scan_pmd_entry':
>>> fs/proc/task_mmu.c:1921:37: error: 'HPAGE_SIZE' undeclared (first use in this function); did you mean 'PAGE_SIZE'?
>     1921 |                     ((end - start < HPAGE_SIZE) ||
>          |                                     ^~~~~~~~~~
>          |                                     PAGE_SIZE
It seems arc architecture supports CONFIG_TRANSPARENT_HUGEPAGE, but it
doesn't define HPAGE_SIZE. It only defines PAGE_SIZE in
include/uapi/asm/page.h? AFAIK HPAGE_SIZE must be defined when
CONFIG_TRANSPARENT_HUGEPAGE is enabled. What can be the solution here for arc?

The remaining build failures are because the wrong tree. I base my patches
on latest next, while the bot has based patches on mm-everything. I guess
today's next would have latest mm stuff, a rebase would make things correct
or I'll shift to mm-everything.
Michał Mirosław April 6, 2023, 3:52 p.m. UTC | #4
On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:>
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>   (PAGE_IS_SWAPPED).
> - Find pages which have been written-to and write protect the pages
>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>
> This IOCTL can be extended to get information about more PTE bits.
[...]
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
[...]
> +static inline bool is_pte_uffd_wp(pte_t pte)
> +{
> +       return ((pte_present(pte) && pte_uffd_wp(pte)) ||
> +               pte_swp_uffd_wp_any(pte));

Nit: outer parentheses are not needed for `return`ed value -- please
remove. (Same in other places.)

> @@ -1768,11 +1789,416 @@ static int pagemap_release(struct inode *inode, struct file *file)
>         return 0;
>  }
>
> +#define PM_SCAN_FOUND_MAX_PAGES        (1)
> +#define PM_SCAN_BITS_ALL       (PAGE_IS_WRITTEN | PAGE_IS_FILE |       \
> +                                PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define PM_SCAN_OPS            (PM_SCAN_OP_GET | PM_SCAN_OP_WP)
> +#define PM_SCAN_OP_IS_WP(a)    (a->flags & PM_SCAN_OP_WP)

Nit: PM_SCAN_DO_UFFD_WP()? It would shift the hint in the name from
what op is executed to what behaviour is requested.

> +#define PM_SCAN_BITMAP(wt, file, present, swap)        \
> +       (wt | file << 1 | present << 2 | swap << 3)

Please parenthesize macro arguments ("(wt)", "(file)", etc.) to not
have to worry about operator precedence when passed a complex
expression.

I

[...]
> +static inline bool pagemap_scan_is_written_set(struct pagemap_scan_private *p)

pagemap_scan_checks_page_written? or similar 'scan is written' doesn't
seem to convey the expected intention.
The function is used only once in ..._test_walk(), so maybe just
inline, possibly using a temporary `bool` to make the condition easier
to read?

[...]

> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,

Could you try out how the code would look when passing the bitmap
instead of separate booleans?

> +                              struct pagemap_scan_private *p,
> +                              unsigned long addr, unsigned int n_pages)
> +{
[...]
> +       if ((cur->start + cur->len * PAGE_SIZE == addr) &&
> +           (cur->bitmap == bitmap)) {

Nit: bitmap check is cheaper, so I'd put it first. BTW, inner
parentheses are not needed here.

> +               cur->len += n_pages;
> +               p->found_pages += n_pages;
> +
> +               if (p->max_pages && (p->found_pages == p->max_pages))
> +                       return PM_SCAN_FOUND_MAX_PAGES;
> +
> +               return 0;
> +       }
> +
> +       if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {

It looks that `if (p->vec_index < p->vec_len)` is enough here - if we
have vec_len == 0 here, then we'd not fit the entry in the userspace
buffer anyway. Am I missing something?

> +
> +               if (cur->len) {
> +                       memcpy(&p->vec[p->vec_index], cur,
> +                              sizeof(struct page_region));
> +                       p->vec_index++;
> +               }
> +               cur->start = addr;
> +               cur->len = n_pages;
> +               cur->bitmap = bitmap;
> +               p->found_pages += n_pages;
> +
> +               if (p->max_pages && (p->found_pages == p->max_pages))
> +                       return PM_SCAN_FOUND_MAX_PAGES;
> +
> +               return 0;
> +       }
> +
> +       return -ENOSPC;
> +}
> +
> +static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
> +                                      struct page_region __user *vec,
> +                                      unsigned long *vec_index)

..._deposit() is used only in single place - please inline.

[...]
> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +                                 unsigned long end, struct mm_walk *walk)
> +{
> +       struct pagemap_scan_private *p = walk->private;
> +       bool is_written, is_file, is_present, is_swap;
> +       struct vm_area_struct *vma = walk->vma;
> +       unsigned long addr = end;
> +       spinlock_t *ptl;
> +       int ret = 0;
> +       pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +       ptl = pmd_trans_huge_lock(pmd, vma);
> +       if (ptl) {

Nit: `page_lock` or `pt_lock` to make it easier to guess the purpose?

> +               unsigned long n_pages = (end - start)/PAGE_SIZE;
> +
> +               is_written = !is_pmd_uffd_wp(*pmd);
> +               is_file = vma->vm_file;
> +               is_present = pmd_present(*pmd);
> +               is_swap = is_swap_pmd(*pmd);
> +
> +               spin_unlock(ptl);
> +
> +               /*
> +                * Break huge page into small pages if the WP operation need to
> +                * be performed is on a portion of the huge page or if max_pages
> +                * pages limit would exceed.

BTW, could the `max_pages` limit be relaxed a bit (in that it would be
possible to return more pages if they all merge into the last vector
entry) so that it would not need to split otherwise-matching huge
page? It would remove the need for this special handling in the kernel
and splitting the page by this read-only-appearing ioctl?

> +                */
> +               if (is_written && PM_SCAN_OP_IS_WP(p) &&
> +                   ((end - start < HPAGE_SIZE) ||
> +                    (p->max_pages &&
> +                     (p->max_pages - p->found_pages) < n_pages))) {
> +
> +                       split_huge_pmd(vma, pmd, start);
> +                       goto process_smaller_pages;
> +               }
> +
> +               if (p->max_pages &&
> +                   p->found_pages + n_pages > p->max_pages)
> +                       n_pages = p->max_pages - p->found_pages;
> +
> +               ret = pagemap_scan_output(is_written, is_file, is_present,
> +                                         is_swap, p, start, n_pages);
> +               if (ret < 0)
> +                       return ret;
> +
> +               if (is_written && PM_SCAN_OP_IS_WP(p) &&
> +                   uffd_wp_range(vma, start, HPAGE_SIZE, true) < 0)
> +                       ret = -EINVAL;

Why not propagate the error from uffd_wp_range()?

[...]
> +static long do_pagemap_cmd(struct mm_struct *mm,
> +                          struct pm_scan_arg __user *uarg)

Please rename the function to `do_pagemap_scan` as it implements just
this single ioctl now.

> +{
[...]
> +       start = (unsigned long)untagged_addr(arg.start);
> +       vec = (struct page_region *)(unsigned long)untagged_addr(arg.vec);

Is the inner cast needed?

> +       ret = pagemap_scan_args_valid(&arg, start, vec);
> +       if (ret)
> +               return ret;
> +
> +       end = start + arg.len;
> +       p.max_pages = arg.max_pages;
> +       p.found_pages = 0;
> +       p.flags = arg.flags;
> +       p.required_mask = arg.required_mask;
> +       p.anyof_mask = arg.anyof_mask;
> +       p.excluded_mask = arg.excluded_mask;
> +       p.return_mask = arg.return_mask;
> +       p.cur.len = 0;
> +       p.cur.start = 0;
> +       p.vec = NULL;
> +       p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);

Nit: parentheses are not needed here, please remove.

> +
> +       /*
> +        * Allocate smaller buffer to get output from inside the page walk
> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> +        * we want to return output to user in compact form where no two
> +        * consecutive regions should be continuous and have the same flags.
> +        * So store the latest element in p.cur between different walks and
> +        * store the p.cur at the end of the walk to the user buffer.
> +        */
> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> +                             GFP_KERNEL);
> +       if (!p.vec)
> +               return -ENOMEM;
> +
> +       walk_start = walk_end = start;
> +       while (walk_end < end && !ret) {

The loop will stop if a previous iteration returned ENOSPC (and the
error will be lost) - is it intended?

> +               p.vec_index = 0;
> +
> +               empty_slots = arg.vec_len - vec_index;
> +               p.vec_len = min(p.vec_len, empty_slots);
> +
> +               walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
> +               if (walk_end > end)
> +                       walk_end = end;
> +
> +               mmap_read_lock(mm);
> +               ret = walk_page_range(mm, walk_start, walk_end,
> +                                     &pagemap_scan_ops, &p);
> +               mmap_read_unlock(mm);
> +
> +               if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
> +                       goto free_data;
> +
> +               walk_start = walk_end;
> +               if (p.vec_index) {
> +                       if (copy_to_user(&vec[vec_index], p.vec,
> +                                        p.vec_index *
> +                                        sizeof(struct page_region))) {

sizeof(*p.vec) ?

> +                               ret = -EFAULT;
> +                               goto free_data;
> +                       }
> +                       vec_index += p.vec_index;
> +               }
> +       }
> +       ret = pagemap_scan_deposit(&p, vec, &vec_index);
> +       if (!ret)
> +               ret = vec_index;
> +free_data:
> +       kfree(p.vec);
> +
> +       return ret;
> +}
> +
> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
> +                              unsigned long arg)
> +{
> +       struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;

This is specific to PAGEMAP_SCAN ioctl, so should go into do_pagemap_cmd().

> +       struct mm_struct *mm = file->private_data;
> +
> +       switch (cmd) {
> +       case PAGEMAP_SCAN:
> +               return do_pagemap_cmd(mm, uarg);
[...]
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -210,6 +210,14 @@ extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
>
>  #else /* CONFIG_USERFAULTFD */
>
> +static inline long uffd_wp_range(struct mm_struct *dst_mm,
> +                                struct vm_area_struct *vma,
> +                                unsigned long start, unsigned long len,
> +                                bool enable_wp)
> +{
> +       return 0;
> +}
> +
>  /* mm helpers */
>  static inline vm_fault_t handle_userfault(struct vm_fault *vmf,
>                                 unsigned long reason)

Shouldn't this part be in the patch introducing uffd_wp_range()?

Best Regards
Michał Mirosław
Muhammad Usama Anjum April 6, 2023, 5:58 p.m. UTC | #5
Hello,

Thank you so much for the review. Do you have any thoughts on the build
error on arc architecture?
https://lore.kernel.org/all/e3c82373-256a-6297-bcb4-5e1179a2cbe2@collabora.com

On 4/6/23 8:52 PM, Michał Mirosław wrote:
> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:>
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>>   (PAGE_IS_SWAPPED).
>> - Find pages which have been written-to and write protect the pages
>>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>
>> This IOCTL can be extended to get information about more PTE bits.
> [...]
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
> [...]
>> +static inline bool is_pte_uffd_wp(pte_t pte)
>> +{
>> +       return ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> +               pte_swp_uffd_wp_any(pte));
> 
> Nit: outer parentheses are not needed for `return`ed value -- please
> remove. (Same in other places.)
Will remove.

> 
>> @@ -1768,11 +1789,416 @@ static int pagemap_release(struct inode *inode, struct file *file)
>>         return 0;
>>  }
>>
>> +#define PM_SCAN_FOUND_MAX_PAGES        (1)
>> +#define PM_SCAN_BITS_ALL       (PAGE_IS_WRITTEN | PAGE_IS_FILE |       \
>> +                                PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>> +#define PM_SCAN_OPS            (PM_SCAN_OP_GET | PM_SCAN_OP_WP)
>> +#define PM_SCAN_OP_IS_WP(a)    (a->flags & PM_SCAN_OP_WP)
> 
> Nit: PM_SCAN_DO_UFFD_WP()? It would shift the hint in the name from
> what op is executed to what behaviour is requested.
Will do.

> 
>> +#define PM_SCAN_BITMAP(wt, file, present, swap)        \
>> +       (wt | file << 1 | present << 2 | swap << 3)
> 
> Please parenthesize macro arguments ("(wt)", "(file)", etc.) to not
> have to worry about operator precedence when passed a complex
> expression.
Like this?
#define PM_SCAN_BITMAP(wt, file, present, swap)	\
	((wt) | (file << 1) | (present << 2) | (swap << 3))

> 
> I
> 
> [...]
>> +static inline bool pagemap_scan_is_written_set(struct pagemap_scan_private *p)
> 
> pagemap_scan_checks_page_written? or similar 'scan is written' doesn't
> seem to convey the expected intention.
> The function is used only once in ..._test_walk(), so maybe just
> inline, possibly using a temporary `bool` to make the condition easier
> to read?
I'll update the name of function. Using bool doesn't help much. Lets keep
the same implementation.

> 
> [...]
> 
>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> 
> Could you try out how the code would look when passing the bitmap
> instead of separate booleans?
It doesn't look better. Right now we have less duplicate code.

> 
>> +                              struct pagemap_scan_private *p,
>> +                              unsigned long addr, unsigned int n_pages)
>> +{
> [...]
>> +       if ((cur->start + cur->len * PAGE_SIZE == addr) &&
>> +           (cur->bitmap == bitmap)) {
> 
> Nit: bitmap check is cheaper, so I'd put it first. BTW, inner
> parentheses are not needed here.
Will do.

> 
>> +               cur->len += n_pages;
>> +               p->found_pages += n_pages;
>> +
>> +               if (p->max_pages && (p->found_pages == p->max_pages))
>> +                       return PM_SCAN_FOUND_MAX_PAGES;
>> +
>> +               return 0;
>> +       }
>> +
>> +       if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
> 
> It looks that `if (p->vec_index < p->vec_len)` is enough here - if we
> have vec_len == 0 here, then we'd not fit the entry in the userspace
> buffer anyway. Am I missing something?
No. I'd explained it with diagram last time:
https://lore.kernel.org/all/3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.com

I'll add a concise comment here.

> 
>> +
>> +               if (cur->len) {
>> +                       memcpy(&p->vec[p->vec_index], cur,
>> +                              sizeof(struct page_region));
>> +                       p->vec_index++;
>> +               }
>> +               cur->start = addr;
>> +               cur->len = n_pages;
>> +               cur->bitmap = bitmap;
>> +               p->found_pages += n_pages;
>> +
>> +               if (p->max_pages && (p->found_pages == p->max_pages))
>> +                       return PM_SCAN_FOUND_MAX_PAGES;
>> +
>> +               return 0;
>> +       }
>> +
>> +       return -ENOSPC;
>> +}
>> +
>> +static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
>> +                                      struct page_region __user *vec,
>> +                                      unsigned long *vec_index)
> 
> ..._deposit() is used only in single place - please inline.
It is already inline.

> 
> [...]
>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> +                                 unsigned long end, struct mm_walk *walk)
>> +{
>> +       struct pagemap_scan_private *p = walk->private;
>> +       bool is_written, is_file, is_present, is_swap;
>> +       struct vm_area_struct *vma = walk->vma;
>> +       unsigned long addr = end;
>> +       spinlock_t *ptl;
>> +       int ret = 0;
>> +       pte_t *pte;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>> +       if (ptl) {
> 
> Nit: `page_lock` or `pt_lock` to make it easier to guess the purpose?
No, in this file ptl is used religiously for this lock. So I'll keep it ptl
to keep consistency.

> 
>> +               unsigned long n_pages = (end - start)/PAGE_SIZE;
>> +
>> +               is_written = !is_pmd_uffd_wp(*pmd);
>> +               is_file = vma->vm_file;
>> +               is_present = pmd_present(*pmd);
>> +               is_swap = is_swap_pmd(*pmd);
>> +
>> +               spin_unlock(ptl);
>> +
>> +               /*
>> +                * Break huge page into small pages if the WP operation need to
>> +                * be performed is on a portion of the huge page or if max_pages
>> +                * pages limit would exceed.
> 
> BTW, could the `max_pages` limit be relaxed a bit (in that it would be
> possible to return more pages if they all merge into the last vector
> entry) so that it would not need to split otherwise-matching huge
> page? It would remove the need for this special handling in the kernel
> and splitting the page by this read-only-appearing ioctl?
No, this cannot be done. Otherwise we'll not be able to emulate Windows
syscall GetWriteWatch() which specifies the exact number of pages. Usually
in most of cases, either user will not use THP or not perform the operation
on partial huge page. So this part is only there to keep things correct for
those users who do use THP and partial pagemap_scan operations.

> 
>> +                */
>> +               if (is_written && PM_SCAN_OP_IS_WP(p) &&
>> +                   ((end - start < HPAGE_SIZE) ||
>> +                    (p->max_pages &&
>> +                     (p->max_pages - p->found_pages) < n_pages))) {
>> +
>> +                       split_huge_pmd(vma, pmd, start);
>> +                       goto process_smaller_pages;
>> +               }
>> +
>> +               if (p->max_pages &&
>> +                   p->found_pages + n_pages > p->max_pages)
>> +                       n_pages = p->max_pages - p->found_pages;
>> +
>> +               ret = pagemap_scan_output(is_written, is_file, is_present,
>> +                                         is_swap, p, start, n_pages);
>> +               if (ret < 0)
>> +                       return ret;
>> +
>> +               if (is_written && PM_SCAN_OP_IS_WP(p) &&
>> +                   uffd_wp_range(vma, start, HPAGE_SIZE, true) < 0)
>> +                       ret = -EINVAL;
> 
> Why not propagate the error from uffd_wp_range()?
uffd_wp_range() returns status in long variable. We cannot return long in
this function. So intead of type casting long to int and then return I've
used -EINVAL. Would following be more suitable?

long ret2 = uffd_wp_range(vma, start, HPAGE_SIZE, true);
if (ret2 < 0)
	return (int)ret2;

> 
> [...]
>> +static long do_pagemap_cmd(struct mm_struct *mm,
>> +                          struct pm_scan_arg __user *uarg)
> 
> Please rename the function to `do_pagemap_scan` as it implements just
> this single ioctl now.
Will do.

> 
>> +{
> [...]
>> +       start = (unsigned long)untagged_addr(arg.start);
>> +       vec = (struct page_region *)(unsigned long)untagged_addr(arg.vec);
> 
> Is the inner cast needed?
arg.vec remains 64-bit on 32-bit systems. So casting 64bit value directly
to struct page_region pointer errors out. So I've added specific casting to
unsigned long first before casting to pointers.

> 
>> +       ret = pagemap_scan_args_valid(&arg, start, vec);
>> +       if (ret)
>> +               return ret;
>> +
>> +       end = start + arg.len;
>> +       p.max_pages = arg.max_pages;
>> +       p.found_pages = 0;
>> +       p.flags = arg.flags;
>> +       p.required_mask = arg.required_mask;
>> +       p.anyof_mask = arg.anyof_mask;
>> +       p.excluded_mask = arg.excluded_mask;
>> +       p.return_mask = arg.return_mask;
>> +       p.cur.len = 0;
>> +       p.cur.start = 0;
>> +       p.vec = NULL;
>> +       p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> 
> Nit: parentheses are not needed here, please remove.
Will do.

> 
>> +
>> +       /*
>> +        * Allocate smaller buffer to get output from inside the page walk
>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>> +        * we want to return output to user in compact form where no two
>> +        * consecutive regions should be continuous and have the same flags.
>> +        * So store the latest element in p.cur between different walks and
>> +        * store the p.cur at the end of the walk to the user buffer.
>> +        */
>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>> +                             GFP_KERNEL);
>> +       if (!p.vec)
>> +               return -ENOMEM;
>> +
>> +       walk_start = walk_end = start;
>> +       while (walk_end < end && !ret) {
> 
> The loop will stop if a previous iteration returned ENOSPC (and the
> error will be lost) - is it intended?
It is intentional. -ENOSPC means that the user buffer is full even though
there was more memory to walk over. We don't treat this error. So when
buffer gets full, we stop walking over further as user buffer has gotten
full and return as success.

> 
>> +               p.vec_index = 0;
>> +
>> +               empty_slots = arg.vec_len - vec_index;
>> +               p.vec_len = min(p.vec_len, empty_slots);
>> +
>> +               walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
>> +               if (walk_end > end)
>> +                       walk_end = end;
>> +
>> +               mmap_read_lock(mm);
>> +               ret = walk_page_range(mm, walk_start, walk_end,
>> +                                     &pagemap_scan_ops, &p);
>> +               mmap_read_unlock(mm);
>> +
>> +               if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
>> +                       goto free_data;
>> +
>> +               walk_start = walk_end;
>> +               if (p.vec_index) {
>> +                       if (copy_to_user(&vec[vec_index], p.vec,
>> +                                        p.vec_index *
>> +                                        sizeof(struct page_region))) {
> 
> sizeof(*p.vec) ?
Sure.

> 
>> +                               ret = -EFAULT;
>> +                               goto free_data;
>> +                       }
>> +                       vec_index += p.vec_index;
>> +               }
>> +       }
>> +       ret = pagemap_scan_deposit(&p, vec, &vec_index);
>> +       if (!ret)
>> +               ret = vec_index;
>> +free_data:
>> +       kfree(p.vec);
>> +
>> +       return ret;
>> +}
>> +
>> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
>> +                              unsigned long arg)
>> +{
>> +       struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
> 
> This is specific to PAGEMAP_SCAN ioctl, so should go into do_pagemap_cmd().
Sure.

> 
>> +       struct mm_struct *mm = file->private_data;
>> +
>> +       switch (cmd) {
>> +       case PAGEMAP_SCAN:
>> +               return do_pagemap_cmd(mm, uarg);
> [...]
>> --- a/include/linux/userfaultfd_k.h
>> +++ b/include/linux/userfaultfd_k.h
>> @@ -210,6 +210,14 @@ extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
>>
>>  #else /* CONFIG_USERFAULTFD */
>>
>> +static inline long uffd_wp_range(struct mm_struct *dst_mm,
>> +                                struct vm_area_struct *vma,
>> +                                unsigned long start, unsigned long len,
>> +                                bool enable_wp)
>> +{
>> +       return 0;
>> +}
>> +
>>  /* mm helpers */
>>  static inline vm_fault_t handle_userfault(struct vm_fault *vmf,
>>                                 unsigned long reason)
> 
> Shouldn't this part be in the patch introducing uffd_wp_range()?
We have not added uffd_wp_range() in previous patches. We just need this
stub for this patch for the case when CONFIG_USERFAULTFD isn't enabled.

I'd this as separate patch before this patch. Mike asked me to merge it
with this patch in order not to break bisectability.
https://lore.kernel.org/all/ZBK+86eMMazwfhdx@kernel.org


> 
> Best Regards
> Michał Mirosław
Michał Mirosław April 6, 2023, 8 p.m. UTC | #6
On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> Hello,
>
> Thank you so much for the review. Do you have any thoughts on the build
> error on arc architecture?
> https://lore.kernel.org/all/e3c82373-256a-6297-bcb4-5e1179a2cbe2@collabora.com

Maybe copy HPAGE_* defines from x86 and key on CONFIG_PGTABLE_LEVELS >
2? I don't know much about arc arch, though.

> On 4/6/23 8:52 PM, Michał Mirosław wrote:
> > On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:>
[...]
> >> +#define PM_SCAN_BITMAP(wt, file, present, swap)        \
> >> +       (wt | file << 1 | present << 2 | swap << 3)
> > Please parenthesize macro arguments ("(wt)", "(file)", etc.) to not
> > have to worry about operator precedence when passed a complex
> > expression.
> Like this?
> #define PM_SCAN_BITMAP(wt, file, present, swap) \
>         ((wt) | (file << 1) | (present << 2) | (swap << 3))

The value would be:
 ( (wt) | ((file) << 1) | ... )
IOW, each parameter should have parentheses around its name.

[...]
> >> +               cur->len += n_pages;
> >> +               p->found_pages += n_pages;
> >> +
> >> +               if (p->max_pages && (p->found_pages == p->max_pages))
> >> +                       return PM_SCAN_FOUND_MAX_PAGES;
> >> +
> >> +               return 0;
> >> +       }
> >> +
> >> +       if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
> >
> > It looks that `if (p->vec_index < p->vec_len)` is enough here - if we
> > have vec_len == 0 here, then we'd not fit the entry in the userspace
> > buffer anyway. Am I missing something?
> No. I'd explained it with diagram last time:
> https://lore.kernel.org/all/3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.com
>
> I'll add a concise comment here.

So it seems, but I think the code changed a bit and maybe could be
simplified now? Since p->vec_len == 0 is currently not valid, the
field could count only the entries available in p->vec[] -- IOW: not
include p->cur in the count.

BTW, `if (no space) return -ENOSPC` will avoid additional indentation
for the non-merging case.

[...]
> >> +static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
> >> +                                      struct page_region __user *vec,
> >> +                                      unsigned long *vec_index)
> >
> > ..._deposit() is used only in single place - please inline.
> It is already inline.

Sorry. I mean: please paste the code in place of the single call.

[...]
> >> +               /*
> >> +                * Break huge page into small pages if the WP operation need to
> >> +                * be performed is on a portion of the huge page or if max_pages
> >> +                * pages limit would exceed.
> >
> > BTW, could the `max_pages` limit be relaxed a bit (in that it would be
> > possible to return more pages if they all merge into the last vector
> > entry) so that it would not need to split otherwise-matching huge
> > page? It would remove the need for this special handling in the kernel
> > and splitting the page by this read-only-appearing ioctl?
> No, this cannot be done. Otherwise we'll not be able to emulate Windows
> syscall GetWriteWatch() which specifies the exact number of pages. Usually
> in most of cases, either user will not use THP or not perform the operation
> on partial huge page. So this part is only there to keep things correct for
> those users who do use THP and partial pagemap_scan operations.

I see that `GetWriteWatch` returns a list of pages not ranges of
pages. That makes sense (more or less). (BTW, It could be emulated in
userspace by caching the last not-fully-consumed range.)

> >> +                */
> >> +               if (is_written && PM_SCAN_OP_IS_WP(p) &&
> >> +                   ((end - start < HPAGE_SIZE) ||
> >> +                    (p->max_pages &&
> >> +                     (p->max_pages - p->found_pages) < n_pages))) {
> >> +
> >> +                       split_huge_pmd(vma, pmd, start);
> >> +                       goto process_smaller_pages;
> >> +               }
> >> +
> >> +               if (p->max_pages &&
> >> +                   p->found_pages + n_pages > p->max_pages)
> >> +                       n_pages = p->max_pages - p->found_pages;
> >> +
> >> +               ret = pagemap_scan_output(is_written, is_file, is_present,
> >> +                                         is_swap, p, start, n_pages);
> >> +               if (ret < 0)
> >> +                       return ret;

So let's simplify this:

if (p->max_pages && n_pages > max_pages - found_pages)
  n_pages = max_pages - found_pages;

if (is_written && DO_WP && n_pages != HPAGE_SIZE / PAGE_SIZE) {
  split_thp();
  goto process_smaller_pages;
}

BTW, THP handling could be extracted to a function that would return
-EAGAIN if it has split the page or it wasn't a THP -- and that would
mean `goto process_smaller_pages`.

> > Why not propagate the error from uffd_wp_range()?
> uffd_wp_range() returns status in long variable. We cannot return long in
> this function. So intead of type casting long to int and then return I've
> used -EINVAL. Would following be more suitable?
>
> long ret2 = uffd_wp_range(vma, start, HPAGE_SIZE, true);
> if (ret2 < 0)
>         return (int)ret2;

I think it's ok, since negative values are expected to be error codes.
And since you can't overflow int with HPAGE_SIZE pages, then I
wouldn't use `ret2` but cast the return and add a comment why it's
safe.

[...]
> >> +       start = (unsigned long)untagged_addr(arg.start);
> >> +       vec = (struct page_region *)(unsigned long)untagged_addr(arg.vec);
> >
> > Is the inner cast needed?
> arg.vec remains 64-bit on 32-bit systems. So casting 64bit value directly
> to struct page_region pointer errors out. So I've added specific casting to
> unsigned long first before casting to pointers.

I see. So to convey the intention, the `arg.start` and `arg.vec`
should be casted to unsigned long, not the `untagged_addr()` return
values.

> >> +       ret = pagemap_scan_args_valid(&arg, start, vec);
> >> +       if (ret)
> >> +               return ret;
> >> +
> >> +       end = start + arg.len;
> >> +       p.max_pages = arg.max_pages;
> >> +       p.found_pages = 0;
> >> +       p.flags = arg.flags;
> >> +       p.required_mask = arg.required_mask;
> >> +       p.anyof_mask = arg.anyof_mask;
> >> +       p.excluded_mask = arg.excluded_mask;
> >> +       p.return_mask = arg.return_mask;
> >> +       p.cur.len = 0;
> >> +       p.cur.start = 0;
> >> +       p.vec = NULL;
> >> +       p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> >
> > Nit: parentheses are not needed here, please remove.
> Will do.
>
> >
> >> +
> >> +       /*
> >> +        * Allocate smaller buffer to get output from inside the page walk
> >> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> >> +        * we want to return output to user in compact form where no two
> >> +        * consecutive regions should be continuous and have the same flags.
> >> +        * So store the latest element in p.cur between different walks and
> >> +        * store the p.cur at the end of the walk to the user buffer.
> >> +        */
> >> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> >> +                             GFP_KERNEL);
> >> +       if (!p.vec)
> >> +               return -ENOMEM;
> >> +
> >> +       walk_start = walk_end = start;
> >> +       while (walk_end < end && !ret) {
> >
> > The loop will stop if a previous iteration returned ENOSPC (and the
> > error will be lost) - is it intended?
> It is intentional. -ENOSPC means that the user buffer is full even though
> there was more memory to walk over. We don't treat this error. So when
> buffer gets full, we stop walking over further as user buffer has gotten
> full and return as success.

Thanks. What's the difference between -ENOSPC and
PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
flow).

[...]
> >> --- a/include/linux/userfaultfd_k.h
> >> +++ b/include/linux/userfaultfd_k.h
> >> @@ -210,6 +210,14 @@ extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
> >>
> >>  #else /* CONFIG_USERFAULTFD */
> >>
> >> +static inline long uffd_wp_range(struct mm_struct *dst_mm,
> >> +                                struct vm_area_struct *vma,
> >> +                                unsigned long start, unsigned long len,
> >> +                                bool enable_wp)
> >> +{
> >> +       return 0;
> >> +}
[...]
> > Shouldn't this part be in the patch introducing uffd_wp_range()?
> We have not added uffd_wp_range() in previous patches. We just need this
> stub for this patch for the case when CONFIG_USERFAULTFD isn't enabled.
>
> I'd this as separate patch before this patch. Mike asked me to merge it
> with this patch in order not to break bisectability.
>[1] https://lore.kernel.org/all/ZBK+86eMMazwfhdx@kernel.org

I would understand the reply [1] to mean that the uffd_wp_range() stub
should go in the same patch where uffd_wp_range() is implemented. But
uffd_wp_range() is already in (since f369b07c86140) so I don't see how
having the stub in a separate commit sequenced before this one could
break bisect?

Best Regards
Michał Mirosław
Michał Mirosław April 6, 2023, 8:12 p.m. UTC | #7
On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
[...]
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
[...]
> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +                                 unsigned long end, struct mm_walk *walk)
> +{
> +       struct pagemap_scan_private *p = walk->private;
> +       bool is_written, is_file, is_present, is_swap;
> +       struct vm_area_struct *vma = walk->vma;
> +       unsigned long addr = end;
> +       spinlock_t *ptl;
> +       int ret = 0;
> +       pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +       ptl = pmd_trans_huge_lock(pmd, vma);
> +       if (ptl) {
[...]
> +               return ret;
> +       }
> +process_smaller_pages:
> +       if (pmd_trans_unstable(pmd))
> +               return 0;

Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?

Best regards
Michał Mirosław
Muhammad Usama Anjum April 6, 2023, 9:03 p.m. UTC | #8
On 4/7/23 1:00 AM, Michał Mirosław wrote:
> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> Hello,
>>
>> Thank you so much for the review. Do you have any thoughts on the build
>> error on arc architecture?
>> https://lore.kernel.org/all/e3c82373-256a-6297-bcb4-5e1179a2cbe2@collabora.com
> 
> Maybe copy HPAGE_* defines from x86 and key on CONFIG_PGTABLE_LEVELS >
> 2? I don't know much about arc arch, though.
> 
>> On 4/6/23 8:52 PM, Michał Mirosław wrote:
>>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:>
> [...]
>>>> +#define PM_SCAN_BITMAP(wt, file, present, swap)        \
>>>> +       (wt | file << 1 | present << 2 | swap << 3)
>>> Please parenthesize macro arguments ("(wt)", "(file)", etc.) to not
>>> have to worry about operator precedence when passed a complex
>>> expression.
>> Like this?
>> #define PM_SCAN_BITMAP(wt, file, present, swap) \
>>         ((wt) | (file << 1) | (present << 2) | (swap << 3))
> 
> The value would be:
>  ( (wt) | ((file) << 1) | ... )
> IOW, each parameter should have parentheses around its name.
Will do.

> 
> [...]
>>>> +               cur->len += n_pages;
>>>> +               p->found_pages += n_pages;
>>>> +
>>>> +               if (p->max_pages && (p->found_pages == p->max_pages))
>>>> +                       return PM_SCAN_FOUND_MAX_PAGES;
>>>> +
>>>> +               return 0;
>>>> +       }
>>>> +
>>>> +       if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
>>>
>>> It looks that `if (p->vec_index < p->vec_len)` is enough here - if we
>>> have vec_len == 0 here, then we'd not fit the entry in the userspace
>>> buffer anyway. Am I missing something?
>> No. I'd explained it with diagram last time:
>> https://lore.kernel.org/all/3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.com
>>
>> I'll add a concise comment here.
> 
> So it seems, but I think the code changed a bit and maybe could be
> simplified now? Since p->vec_len == 0 is currently not valid, the
> field could count only the entries available in p->vec[] -- IOW: not
> include p->cur in the count.
I see. But this'll not work as we need to count p->cur to don't go above
the maximum count, p->vec_size.

> 
> BTW, `if (no space) return -ENOSPC` will avoid additional indentation
> for the non-merging case.
I'll update.

> 
> [...]
>>>> +static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
>>>> +                                      struct page_region __user *vec,
>>>> +                                      unsigned long *vec_index)
>>>
>>> ..._deposit() is used only in single place - please inline.
>> It is already inline.
> 
> Sorry. I mean: please paste the code in place of the single call.
I've made it a separate function to make the code look better in the caller
function and logically easier to understand. This function is ugly.
do_pagemap_scan() is also already very long function with lots of things
happening. If you still insist, I'll remove this function.

> 
> [...]
>>>> +               /*
>>>> +                * Break huge page into small pages if the WP operation need to
>>>> +                * be performed is on a portion of the huge page or if max_pages
>>>> +                * pages limit would exceed.
>>>
>>> BTW, could the `max_pages` limit be relaxed a bit (in that it would be
>>> possible to return more pages if they all merge into the last vector
>>> entry) so that it would not need to split otherwise-matching huge
>>> page? It would remove the need for this special handling in the kernel
>>> and splitting the page by this read-only-appearing ioctl?
>> No, this cannot be done. Otherwise we'll not be able to emulate Windows
>> syscall GetWriteWatch() which specifies the exact number of pages. Usually
>> in most of cases, either user will not use THP or not perform the operation
>> on partial huge page. So this part is only there to keep things correct for
>> those users who do use THP and partial pagemap_scan operations.
> 
> I see that `GetWriteWatch` returns a list of pages not ranges of
> pages. That makes sense (more or less). (BTW, It could be emulated in
> userspace by caching the last not-fully-consumed range.)
First of all, caching is avoided as then state maintained is needed. This
is probably not accepted in Wine upstream later. Secondly, even if we have
cache, Get + WP operation would not be accurate when we ask only N pages,
but it gets N + X pages where X pages will not be consumed by the
application at this time.

> 
>>>> +                */
>>>> +               if (is_written && PM_SCAN_OP_IS_WP(p) &&
>>>> +                   ((end - start < HPAGE_SIZE) ||
>>>> +                    (p->max_pages &&
>>>> +                     (p->max_pages - p->found_pages) < n_pages))) {
>>>> +
>>>> +                       split_huge_pmd(vma, pmd, start);
>>>> +                       goto process_smaller_pages;
>>>> +               }
>>>> +
>>>> +               if (p->max_pages &&
>>>> +                   p->found_pages + n_pages > p->max_pages)
>>>> +                       n_pages = p->max_pages - p->found_pages;
>>>> +
>>>> +               ret = pagemap_scan_output(is_written, is_file, is_present,
>>>> +                                         is_swap, p, start, n_pages);
>>>> +               if (ret < 0)
>>>> +                       return ret;
> 
> So let's simplify this:
> 
> if (p->max_pages && n_pages > max_pages - found_pages)
>   n_pages = max_pages - found_pages;
> 
> if (is_written && DO_WP && n_pages != HPAGE_SIZE / PAGE_SIZE) {
>   split_thp();
>   goto process_smaller_pages;
> }
Clever!! This looks very sleek.

> 
> BTW, THP handling could be extracted to a function that would return
> -EAGAIN if it has split the page or it wasn't a THP -- and that would
> mean `goto process_smaller_pages`.
Other functions in this file handle the THP in this same way. So it feels
like more intuitive that we follow to same pattern in this file.

> 
>>> Why not propagate the error from uffd_wp_range()?
>> uffd_wp_range() returns status in long variable. We cannot return long in
>> this function. So intead of type casting long to int and then return I've
>> used -EINVAL. Would following be more suitable?
>>
>> long ret2 = uffd_wp_range(vma, start, HPAGE_SIZE, true);
>> if (ret2 < 0)
>>         return (int)ret2;
> 
> I think it's ok, since negative values are expected to be error codes.
> And since you can't overflow int with HPAGE_SIZE pages, then I
> wouldn't use `ret2` but cast the return and add a comment why it's
> safe.
I'll update.

> 
> [...]
>>>> +       start = (unsigned long)untagged_addr(arg.start);
>>>> +       vec = (struct page_region *)(unsigned long)untagged_addr(arg.vec);
>>>
>>> Is the inner cast needed?
>> arg.vec remains 64-bit on 32-bit systems. So casting 64bit value directly
>> to struct page_region pointer errors out. So I've added specific casting to
>> unsigned long first before casting to pointers.
> 
> I see. So to convey the intention, the `arg.start` and `arg.vec`
> should be casted to unsigned long, not the `untagged_addr()` return
> values.
I'll update.

> 
>>>> +       ret = pagemap_scan_args_valid(&arg, start, vec);
>>>> +       if (ret)
>>>> +               return ret;
>>>> +
>>>> +       end = start + arg.len;
>>>> +       p.max_pages = arg.max_pages;
>>>> +       p.found_pages = 0;
>>>> +       p.flags = arg.flags;
>>>> +       p.required_mask = arg.required_mask;
>>>> +       p.anyof_mask = arg.anyof_mask;
>>>> +       p.excluded_mask = arg.excluded_mask;
>>>> +       p.return_mask = arg.return_mask;
>>>> +       p.cur.len = 0;
>>>> +       p.cur.start = 0;
>>>> +       p.vec = NULL;
>>>> +       p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>>>
>>> Nit: parentheses are not needed here, please remove.
>> Will do.
>>
>>>
>>>> +
>>>> +       /*
>>>> +        * Allocate smaller buffer to get output from inside the page walk
>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>>>> +        * we want to return output to user in compact form where no two
>>>> +        * consecutive regions should be continuous and have the same flags.
>>>> +        * So store the latest element in p.cur between different walks and
>>>> +        * store the p.cur at the end of the walk to the user buffer.
>>>> +        */
>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>>>> +                             GFP_KERNEL);
>>>> +       if (!p.vec)
>>>> +               return -ENOMEM;
>>>> +
>>>> +       walk_start = walk_end = start;
>>>> +       while (walk_end < end && !ret) {
>>>
>>> The loop will stop if a previous iteration returned ENOSPC (and the
>>> error will be lost) - is it intended?
>> It is intentional. -ENOSPC means that the user buffer is full even though
>> there was more memory to walk over. We don't treat this error. So when
>> buffer gets full, we stop walking over further as user buffer has gotten
>> full and return as success.
> 
> Thanks. What's the difference between -ENOSPC and
> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
> flow).
-ENOSPC --> user buffer has been filled completely
PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
			    still have more space

> 
> [...]
>>>> --- a/include/linux/userfaultfd_k.h
>>>> +++ b/include/linux/userfaultfd_k.h
>>>> @@ -210,6 +210,14 @@ extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
>>>>
>>>>  #else /* CONFIG_USERFAULTFD */
>>>>
>>>> +static inline long uffd_wp_range(struct mm_struct *dst_mm,
>>>> +                                struct vm_area_struct *vma,
>>>> +                                unsigned long start, unsigned long len,
>>>> +                                bool enable_wp)
>>>> +{
>>>> +       return 0;
>>>> +}
> [...]
>>> Shouldn't this part be in the patch introducing uffd_wp_range()?
>> We have not added uffd_wp_range() in previous patches. We just need this
>> stub for this patch for the case when CONFIG_USERFAULTFD isn't enabled.
>>
>> I'd this as separate patch before this patch. Mike asked me to merge it
>> with this patch in order not to break bisectability.
>> [1] https://lore.kernel.org/all/ZBK+86eMMazwfhdx@kernel.org
> 
> I would understand the reply [1] to mean that the uffd_wp_range() stub
> should go in the same patch where uffd_wp_range() is implemented. But
> uffd_wp_range() is already in (since f369b07c86140) so I don't see how
> having the stub in a separate commit sequenced before this one could
> break bisect?
Sorry, I mis-interpreted it. I'll make it a separate patch.

> 
> Best Regards
> Michał Mirosław
Muhammad Usama Anjum April 6, 2023, 9:12 p.m. UTC | #9
On 4/7/23 1:12 AM, Michał Mirosław wrote:
> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
> [...]
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
> [...]
>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> +                                 unsigned long end, struct mm_walk *walk)
>> +{
>> +       struct pagemap_scan_private *p = walk->private;
>> +       bool is_written, is_file, is_present, is_swap;
>> +       struct vm_area_struct *vma = walk->vma;
>> +       unsigned long addr = end;
>> +       spinlock_t *ptl;
>> +       int ret = 0;
>> +       pte_t *pte;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>> +       if (ptl) {
> [...]
>> +               return ret;
>> +       }
>> +process_smaller_pages:
>> +       if (pmd_trans_unstable(pmd))
>> +               return 0;
> 
> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
I'm not entirely sure. But the idea is if THP is unstable, we should
return. As it doesn't seem like after splitting THP can be unstable, we
should not check it. Do you agree with the following?


--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1957,11 +1957,11 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd,
unsigned long start,

 		return ret;
 	}
-process_smaller_pages:
+
 	if (pmd_trans_unstable(pmd))
 		return 0;
 #endif
-
+process_smaller_pages:
 	for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE)
Michał Mirosław April 7, 2023, 7:23 a.m. UTC | #10
On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 4/7/23 1:12 AM, Michał Mirosław wrote:
> > On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> > [...]
> >> --- a/fs/proc/task_mmu.c
> >> +++ b/fs/proc/task_mmu.c
> > [...]
> >> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> >> +                                 unsigned long end, struct mm_walk *walk)
> >> +{
[...]
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> +       ptl = pmd_trans_huge_lock(pmd, vma);
> >> +       if (ptl) {
> > [...]
> >> +               return ret;
> >> +       }
> >> +process_smaller_pages:
> >> +       if (pmd_trans_unstable(pmd))
> >> +               return 0;
> >
> > Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
> I'm not entirely sure. But the idea is if THP is unstable, we should
> return. As it doesn't seem like after splitting THP can be unstable, we
> should not check it. Do you agree with the following?

The description of pmd_trans_unstable() [1] seems to indicate that it
is needed only after split_huge_pmd().

[1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394

Best Regards
Michał Mirosław
Michał Mirosław April 7, 2023, 7:34 a.m. UTC | #11
On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 4/7/23 1:00 AM, Michał Mirosław wrote:
> > On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
[...]
> >>>> +               cur->len += n_pages;
> >>>> +               p->found_pages += n_pages;
> >>>> +
> >>>> +               if (p->max_pages && (p->found_pages == p->max_pages))
> >>>> +                       return PM_SCAN_FOUND_MAX_PAGES;
> >>>> +
> >>>> +               return 0;
> >>>> +       }
> >>>> +
> >>>> +       if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
> >>>
> >>> It looks that `if (p->vec_index < p->vec_len)` is enough here - if we
> >>> have vec_len == 0 here, then we'd not fit the entry in the userspace
> >>> buffer anyway. Am I missing something?
> >> No. I'd explained it with diagram last time:
> >> https://lore.kernel.org/all/3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.com
> >>
> >> I'll add a concise comment here.
> >
> > So it seems, but I think the code changed a bit and maybe could be
> > simplified now? Since p->vec_len == 0 is currently not valid, the
> > field could count only the entries available in p->vec[] -- IOW: not
> > include p->cur in the count.
> I see. But this'll not work as we need to count p->cur to don't go above
> the maximum count, p->vec_size.

You can subtract 1 from p->vec_size before the page walk to account
for the buffer in `cur`.

[...]
> >>>> +static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
> >>>> +                                      struct page_region __user *vec,
> >>>> +                                      unsigned long *vec_index)
> >>>
> >>> ..._deposit() is used only in single place - please inline.
> >> It is already inline.
> >
> > Sorry. I mean: please paste the code in place of the single call.
> I've made it a separate function to make the code look better in the caller
> function and logically easier to understand. This function is ugly.
> do_pagemap_scan() is also already very long function with lots of things
> happening. If you still insist, I'll remove this function.

Please do remove - it will make the copy to userspace code all neatly together.

[...]
> >>>> +                */
> >>>> +               if (is_written && PM_SCAN_OP_IS_WP(p) &&
> >>>> +                   ((end - start < HPAGE_SIZE) ||
> >>>> +                    (p->max_pages &&
> >>>> +                     (p->max_pages - p->found_pages) < n_pages))) {
> >>>> +
> >>>> +                       split_huge_pmd(vma, pmd, start);
> >>>> +                       goto process_smaller_pages;
> >>>> +               }
> >>>> +
> >>>> +               if (p->max_pages &&
> >>>> +                   p->found_pages + n_pages > p->max_pages)
> >>>> +                       n_pages = p->max_pages - p->found_pages;
> >>>> +
> >>>> +               ret = pagemap_scan_output(is_written, is_file, is_present,
> >>>> +                                         is_swap, p, start, n_pages);
> >>>> +               if (ret < 0)
> >>>> +                       return ret;
> >
> > So let's simplify this:
> >
> > if (p->max_pages && n_pages > max_pages - found_pages)
> >   n_pages = max_pages - found_pages;
> >
> > if (is_written && DO_WP && n_pages != HPAGE_SIZE / PAGE_SIZE) {
> >   split_thp();
> >   goto process_smaller_pages;
> > }
> Clever!! This looks very sleek.
>
> >
> > BTW, THP handling could be extracted to a function that would return
> > -EAGAIN if it has split the page or it wasn't a THP -- and that would
> > mean `goto process_smaller_pages`.
> Other functions in this file handle the THP in this same way. So it feels
> like more intuitive that we follow to same pattern in this file.

I'll leave it to you. Extracting THP support would avoid a goto and
#ifdef inside a function, though (and make the function smaller).

> >>>> +       /*
> >>>> +        * Allocate smaller buffer to get output from inside the page walk
> >>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> >>>> +        * we want to return output to user in compact form where no two
> >>>> +        * consecutive regions should be continuous and have the same flags.
> >>>> +        * So store the latest element in p.cur between different walks and
> >>>> +        * store the p.cur at the end of the walk to the user buffer.
> >>>> +        */
> >>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> >>>> +                             GFP_KERNEL);
> >>>> +       if (!p.vec)
> >>>> +               return -ENOMEM;
> >>>> +
> >>>> +       walk_start = walk_end = start;
> >>>> +       while (walk_end < end && !ret) {
> >>>
> >>> The loop will stop if a previous iteration returned ENOSPC (and the
> >>> error will be lost) - is it intended?
> >> It is intentional. -ENOSPC means that the user buffer is full even though
> >> there was more memory to walk over. We don't treat this error. So when
> >> buffer gets full, we stop walking over further as user buffer has gotten
> >> full and return as success.
> >
> > Thanks. What's the difference between -ENOSPC and
> > PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
> > flow).
> -ENOSPC --> user buffer has been filled completely
> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
>                             still have more space

What is the difference in code behaviour when those two cases are
compared? (I'd expect none.)

Best Regards
Michał Mirosław
Muhammad Usama Anjum April 7, 2023, 9:35 a.m. UTC | #12
On 4/7/23 12:23 PM, Michał Mirosław wrote:
> On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 1:12 AM, Michał Mirosław wrote:
>>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>> [...]
>>>> --- a/fs/proc/task_mmu.c
>>>> +++ b/fs/proc/task_mmu.c
>>> [...]
>>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>>> +                                 unsigned long end, struct mm_walk *walk)
>>>> +{
> [...]
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>>>> +       if (ptl) {
>>> [...]
>>>> +               return ret;
>>>> +       }
>>>> +process_smaller_pages:
>>>> +       if (pmd_trans_unstable(pmd))
>>>> +               return 0;
>>>
>>> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
>> I'm not entirely sure. But the idea is if THP is unstable, we should
>> return. As it doesn't seem like after splitting THP can be unstable, we
>> should not check it. Do you agree with the following?
> 
> The description of pmd_trans_unstable() [1] seems to indicate that it
> is needed only after split_huge_pmd().
> 
> [1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394
Sorry, yeah pmd_trans_unstable() is need after split. But it is also needed
in normal case when ptl is NULL to rule out the case if pmd is unstable
before performing operation on normal pages:

ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
...
}
if (pmd_trans_unstable(pmd))
	return 0;

This file has usage examples of pmd_trans_unstable():

https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L634
https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1195
https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1543
https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1887

So we are good with what we have in this patch.

> 
> Best Regards
> Michał Mirosław
Michał Mirosław April 7, 2023, 10:04 a.m. UTC | #13
On Fri, 7 Apr 2023 at 11:35, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 4/7/23 12:23 PM, Michał Mirosław wrote:
> > On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >> On 4/7/23 1:12 AM, Michał Mirosław wrote:
> >>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
> >>> [...]
> >>>> --- a/fs/proc/task_mmu.c
> >>>> +++ b/fs/proc/task_mmu.c
> >>> [...]
> >>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> >>>> +                                 unsigned long end, struct mm_walk *walk)
> >>>> +{
> > [...]
> >>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>> +       ptl = pmd_trans_huge_lock(pmd, vma);
> >>>> +       if (ptl) {
> >>> [...]
> >>>> +               return ret;
> >>>> +       }
> >>>> +process_smaller_pages:
> >>>> +       if (pmd_trans_unstable(pmd))
> >>>> +               return 0;
> >>>
> >>> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
> >> I'm not entirely sure. But the idea is if THP is unstable, we should
> >> return. As it doesn't seem like after splitting THP can be unstable, we
> >> should not check it. Do you agree with the following?
> >
> > The description of pmd_trans_unstable() [1] seems to indicate that it
> > is needed only after split_huge_pmd().
> >
> > [1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394
> Sorry, yeah pmd_trans_unstable() is need after split. But it is also needed
> in normal case when ptl is NULL to rule out the case if pmd is unstable
> before performing operation on normal pages:
>
> ptl = pmd_trans_huge_lock(pmd, vma);
> if (ptl) {
> ...
> }
> if (pmd_trans_unstable(pmd))
>         return 0;
>
> This file has usage examples of pmd_trans_unstable():
>
> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L634
> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1195
> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1543
> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1887
>
> So we are good with what we have in this patch.

Shouldn't we signal ACTION_AGAIN then in order to call .pte_hole?

Best Regards
Michał Mirosław
Muhammad Usama Anjum April 7, 2023, 10:04 a.m. UTC | #14
On 4/7/23 12:34 PM, Michał Mirosław wrote:
> On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 1:00 AM, Michał Mirosław wrote:
>>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
> [...]
>>>>>> +               cur->len += n_pages;
>>>>>> +               p->found_pages += n_pages;
>>>>>> +
>>>>>> +               if (p->max_pages && (p->found_pages == p->max_pages))
>>>>>> +                       return PM_SCAN_FOUND_MAX_PAGES;
>>>>>> +
>>>>>> +               return 0;
>>>>>> +       }
>>>>>> +
>>>>>> +       if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
>>>>>
>>>>> It looks that `if (p->vec_index < p->vec_len)` is enough here - if we
>>>>> have vec_len == 0 here, then we'd not fit the entry in the userspace
>>>>> buffer anyway. Am I missing something?
>>>> No. I'd explained it with diagram last time:
>>>> https://lore.kernel.org/all/3c8d9ea0-1382-be0c-8dd2-d490eedd3b55@collabora.com
>>>>
>>>> I'll add a concise comment here.
>>>
>>> So it seems, but I think the code changed a bit and maybe could be
>>> simplified now? Since p->vec_len == 0 is currently not valid, the
>>> field could count only the entries available in p->vec[] -- IOW: not
>>> include p->cur in the count.
>> I see. But this'll not work as we need to count p->cur to don't go above
>> the maximum count, p->vec_size.
> 
> You can subtract 1 from p->vec_size before the page walk to account
> for the buffer in `cur`.
Yeah, it can be done. But I've thought about it. It would look more uglier.
I'll have to subtract 1 from vec_len in the start and then add 1 in the
end. The current algorithm seems simpler.

> 
> [...]
>>>>>> +static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
>>>>>> +                                      struct page_region __user *vec,
>>>>>> +                                      unsigned long *vec_index)
>>>>>
>>>>> ..._deposit() is used only in single place - please inline.
>>>> It is already inline.
>>>
>>> Sorry. I mean: please paste the code in place of the single call.
>> I've made it a separate function to make the code look better in the caller
>> function and logically easier to understand. This function is ugly.
>> do_pagemap_scan() is also already very long function with lots of things
>> happening. If you still insist, I'll remove this function.
> 
> Please do remove - it will make the copy to userspace code all neatly together.
Will remove it.

> 
> [...]
>>>>>> +                */
>>>>>> +               if (is_written && PM_SCAN_OP_IS_WP(p) &&
>>>>>> +                   ((end - start < HPAGE_SIZE) ||
>>>>>> +                    (p->max_pages &&
>>>>>> +                     (p->max_pages - p->found_pages) < n_pages))) {
>>>>>> +
>>>>>> +                       split_huge_pmd(vma, pmd, start);
>>>>>> +                       goto process_smaller_pages;
>>>>>> +               }
>>>>>> +
>>>>>> +               if (p->max_pages &&
>>>>>> +                   p->found_pages + n_pages > p->max_pages)
>>>>>> +                       n_pages = p->max_pages - p->found_pages;
>>>>>> +
>>>>>> +               ret = pagemap_scan_output(is_written, is_file, is_present,
>>>>>> +                                         is_swap, p, start, n_pages);
>>>>>> +               if (ret < 0)
>>>>>> +                       return ret;
>>>
>>> So let's simplify this:
>>>
>>> if (p->max_pages && n_pages > max_pages - found_pages)
>>>   n_pages = max_pages - found_pages;
>>>
>>> if (is_written && DO_WP && n_pages != HPAGE_SIZE / PAGE_SIZE) {
>>>   split_thp();
>>>   goto process_smaller_pages;
>>> }
>> Clever!! This looks very sleek.
>>
>>>
>>> BTW, THP handling could be extracted to a function that would return
>>> -EAGAIN if it has split the page or it wasn't a THP -- and that would
>>> mean `goto process_smaller_pages`.
>> Other functions in this file handle the THP in this same way. So it feels
>> like more intuitive that we follow to same pattern in this file.
> 
> I'll leave it to you. Extracting THP support would avoid a goto and
> #ifdef inside a function, though (and make the function smaller).
> 
>>>>>> +       /*
>>>>>> +        * Allocate smaller buffer to get output from inside the page walk
>>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>>>>>> +        * we want to return output to user in compact form where no two
>>>>>> +        * consecutive regions should be continuous and have the same flags.
>>>>>> +        * So store the latest element in p.cur between different walks and
>>>>>> +        * store the p.cur at the end of the walk to the user buffer.
>>>>>> +        */
>>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>>>>>> +                             GFP_KERNEL);
>>>>>> +       if (!p.vec)
>>>>>> +               return -ENOMEM;
>>>>>> +
>>>>>> +       walk_start = walk_end = start;
>>>>>> +       while (walk_end < end && !ret) {
>>>>>
>>>>> The loop will stop if a previous iteration returned ENOSPC (and the
>>>>> error will be lost) - is it intended?
>>>> It is intentional. -ENOSPC means that the user buffer is full even though
>>>> there was more memory to walk over. We don't treat this error. So when
>>>> buffer gets full, we stop walking over further as user buffer has gotten
>>>> full and return as success.
>>>
>>> Thanks. What's the difference between -ENOSPC and
>>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
>>> flow).
>> -ENOSPC --> user buffer has been filled completely
>> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
>>                             still have more space
> 
> What is the difference in code behaviour when those two cases are
> compared? (I'd expect none.)
There is difference:
We add data to user buffer. If it succeeds with return code 0, we engage
the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
to the user buffer.

> 
> Best Regards
> Michał Mirosław
Michał Mirosław April 7, 2023, 10:14 a.m. UTC | #15
On Fri, 7 Apr 2023 at 12:04, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 4/7/23 12:34 PM, Michał Mirosław wrote:
> > On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >> On 4/7/23 1:00 AM, Michał Mirosław wrote:
> >>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
[...]
> >>>>>> +       /*
> >>>>>> +        * Allocate smaller buffer to get output from inside the page walk
> >>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> >>>>>> +        * we want to return output to user in compact form where no two
> >>>>>> +        * consecutive regions should be continuous and have the same flags.
> >>>>>> +        * So store the latest element in p.cur between different walks and
> >>>>>> +        * store the p.cur at the end of the walk to the user buffer.
> >>>>>> +        */
> >>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> >>>>>> +                             GFP_KERNEL);
> >>>>>> +       if (!p.vec)
> >>>>>> +               return -ENOMEM;
> >>>>>> +
> >>>>>> +       walk_start = walk_end = start;
> >>>>>> +       while (walk_end < end && !ret) {
> >>>>>
> >>>>> The loop will stop if a previous iteration returned ENOSPC (and the
> >>>>> error will be lost) - is it intended?
> >>>> It is intentional. -ENOSPC means that the user buffer is full even though
> >>>> there was more memory to walk over. We don't treat this error. So when
> >>>> buffer gets full, we stop walking over further as user buffer has gotten
> >>>> full and return as success.
> >>>
> >>> Thanks. What's the difference between -ENOSPC and
> >>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
> >>> flow).
> >> -ENOSPC --> user buffer has been filled completely
> >> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
> >>                             still have more space
> >
> > What is the difference in code behaviour when those two cases are
> > compared? (I'd expect none.)
> There is difference:
> We add data to user buffer. If it succeeds with return code 0, we engage
> the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
> WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
> to the user buffer.

Thanks! I see it now. I see a few more corner cases here:
1. If we did engage WP but fail to copy the vector we return -EFAULT
but the WP is already engaged. I'm not sure this is something worth
guarding against, but documenting that would be helpful I think.
2. If uffd_wp_range() fails, but we have already processed pages
earlier, we should treat the error like ENOSPC and back out the failed
range (the earier changes would be lost otherwise).

Best Regards
Michał Mirosław
Muhammad Usama Anjum April 7, 2023, 10:14 a.m. UTC | #16
On 4/7/23 3:04 PM, Michał Mirosław wrote:
> On Fri, 7 Apr 2023 at 11:35, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 12:23 PM, Michał Mirosław wrote:
>>> On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 4/7/23 1:12 AM, Michał Mirosław wrote:
>>>>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
>>>>> [...]
>>>>>> --- a/fs/proc/task_mmu.c
>>>>>> +++ b/fs/proc/task_mmu.c
>>>>> [...]
>>>>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>>>>> +                                 unsigned long end, struct mm_walk *walk)
>>>>>> +{
>>> [...]
>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>>>>>> +       if (ptl) {
>>>>> [...]
>>>>>> +               return ret;
>>>>>> +       }
>>>>>> +process_smaller_pages:
>>>>>> +       if (pmd_trans_unstable(pmd))
>>>>>> +               return 0;
>>>>>
>>>>> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
>>>> I'm not entirely sure. But the idea is if THP is unstable, we should
>>>> return. As it doesn't seem like after splitting THP can be unstable, we
>>>> should not check it. Do you agree with the following?
>>>
>>> The description of pmd_trans_unstable() [1] seems to indicate that it
>>> is needed only after split_huge_pmd().
>>>
>>> [1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394
>> Sorry, yeah pmd_trans_unstable() is need after split. But it is also needed
>> in normal case when ptl is NULL to rule out the case if pmd is unstable
>> before performing operation on normal pages:
>>
>> ptl = pmd_trans_huge_lock(pmd, vma);
>> if (ptl) {
>> ...
>> }
>> if (pmd_trans_unstable(pmd))
>>         return 0;
>>
>> This file has usage examples of pmd_trans_unstable():
>>
>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L634
>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1195
>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1543
>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1887
>>
>> So we are good with what we have in this patch.
> 
> Shouldn't we signal ACTION_AGAIN then in order to call .pte_hole?
I'm not sure. I've not done research on it if we need to signal
ACTION_AGAIN as this function pagemap_scan_pmd_entry() mimics how
pagemap_pmd_range() handles reads to the pagemap file. pagemap_pmd_range()
isn't doing anything if pmd is unstable. Hence we also not doing anything.

> 
> Best Regards
> Michał Mirosław
Michał Mirosław April 7, 2023, 10:21 a.m. UTC | #17
On Fri, 7 Apr 2023 at 12:15, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 4/7/23 3:04 PM, Michał Mirosław wrote:
> > On Fri, 7 Apr 2023 at 11:35, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >> On 4/7/23 12:23 PM, Michał Mirosław wrote:
> >>> On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
> >>>> On 4/7/23 1:12 AM, Michał Mirosław wrote:
> >>>>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
> >>>>> <usama.anjum@collabora.com> wrote:
> >>>>> [...]
> >>>>>> --- a/fs/proc/task_mmu.c
> >>>>>> +++ b/fs/proc/task_mmu.c
> >>>>> [...]
> >>>>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> >>>>>> +                                 unsigned long end, struct mm_walk *walk)
> >>>>>> +{
> >>> [...]
> >>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>>>>> +       ptl = pmd_trans_huge_lock(pmd, vma);
> >>>>>> +       if (ptl) {
> >>>>> [...]
> >>>>>> +               return ret;
> >>>>>> +       }
> >>>>>> +process_smaller_pages:
> >>>>>> +       if (pmd_trans_unstable(pmd))
> >>>>>> +               return 0;
> >>>>>
> >>>>> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
> >>>> I'm not entirely sure. But the idea is if THP is unstable, we should
> >>>> return. As it doesn't seem like after splitting THP can be unstable, we
> >>>> should not check it. Do you agree with the following?
> >>>
> >>> The description of pmd_trans_unstable() [1] seems to indicate that it
> >>> is needed only after split_huge_pmd().
> >>>
> >>> [1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394
> >> Sorry, yeah pmd_trans_unstable() is need after split. But it is also needed
> >> in normal case when ptl is NULL to rule out the case if pmd is unstable
> >> before performing operation on normal pages:
> >>
> >> ptl = pmd_trans_huge_lock(pmd, vma);
> >> if (ptl) {
> >> ...
> >> }
> >> if (pmd_trans_unstable(pmd))
> >>         return 0;
> >>
> >> This file has usage examples of pmd_trans_unstable():
> >>
> >> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L634
> >> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1195
> >> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1543
> >> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1887
> >>
> >> So we are good with what we have in this patch.
> >
> > Shouldn't we signal ACTION_AGAIN then in order to call .pte_hole?
> I'm not sure. I've not done research on it if we need to signal
> ACTION_AGAIN as this function pagemap_scan_pmd_entry() mimics how
> pagemap_pmd_range() handles reads to the pagemap file. pagemap_pmd_range()
> isn't doing anything if pmd is unstable. Hence we also not doing anything.

Doesn't this mean that if we scan a file-backed vma we would miss
non-present parts of the mapping in the output?

Best Regards
Michał Mirosław
Muhammad Usama Anjum April 7, 2023, 10:50 a.m. UTC | #18
On 4/7/23 3:21 PM, Michał Mirosław wrote:
> On Fri, 7 Apr 2023 at 12:15, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 3:04 PM, Michał Mirosław wrote:
>>> On Fri, 7 Apr 2023 at 11:35, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 4/7/23 12:23 PM, Michał Mirosław wrote:
>>>>> On Thu, 6 Apr 2023 at 23:12, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
>>>>>> On 4/7/23 1:12 AM, Michał Mirosław wrote:
>>>>>>> On Thu, 6 Apr 2023 at 09:40, Muhammad Usama Anjum
>>>>>>> <usama.anjum@collabora.com> wrote:
>>>>>>> [...]
>>>>>>>> --- a/fs/proc/task_mmu.c
>>>>>>>> +++ b/fs/proc/task_mmu.c
>>>>>>> [...]
>>>>>>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>>>>>>> +                                 unsigned long end, struct mm_walk *walk)
>>>>>>>> +{
>>>>> [...]
>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>>>>>>>> +       if (ptl) {
>>>>>>> [...]
>>>>>>>> +               return ret;
>>>>>>>> +       }
>>>>>>>> +process_smaller_pages:
>>>>>>>> +       if (pmd_trans_unstable(pmd))
>>>>>>>> +               return 0;
>>>>>>>
>>>>>>> Why pmd_trans_unstable() is needed here and not only after split_huge_pmd()?
>>>>>> I'm not entirely sure. But the idea is if THP is unstable, we should
>>>>>> return. As it doesn't seem like after splitting THP can be unstable, we
>>>>>> should not check it. Do you agree with the following?
>>>>>
>>>>> The description of pmd_trans_unstable() [1] seems to indicate that it
>>>>> is needed only after split_huge_pmd().
>>>>>
>>>>> [1] https://elixir.bootlin.com/linux/v6.3-rc5/source/include/linux/pgtable.h#L1394
>>>> Sorry, yeah pmd_trans_unstable() is need after split. But it is also needed
>>>> in normal case when ptl is NULL to rule out the case if pmd is unstable
>>>> before performing operation on normal pages:
>>>>
>>>> ptl = pmd_trans_huge_lock(pmd, vma);
>>>> if (ptl) {
>>>> ...
>>>> }
>>>> if (pmd_trans_unstable(pmd))
>>>>         return 0;
>>>>
>>>> This file has usage examples of pmd_trans_unstable():
>>>>
>>>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L634
>>>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1195
>>>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1543
>>>> https://elixir.bootlin.com/linux/v6.3-rc5/source/fs/proc/task_mmu.c#L1887
>>>>
>>>> So we are good with what we have in this patch.
>>>
>>> Shouldn't we signal ACTION_AGAIN then in order to call .pte_hole?
>> I'm not sure. I've not done research on it if we need to signal
>> ACTION_AGAIN as this function pagemap_scan_pmd_entry() mimics how
>> pagemap_pmd_range() handles reads to the pagemap file. pagemap_pmd_range()
>> isn't doing anything if pmd is unstable. Hence we also not doing anything.
> 
> Doesn't this mean that if we scan a file-backed vma we would miss
> non-present parts of the mapping in the output?
I'm trying to mimic the same information through ioctl which is attained by
reading the file. We'll only miss the unstable VMA here. I'm don't know
about how often the PMD is unstable and its effects.


> 
> Best Regards
> Michał Mirosław
Muhammad Usama Anjum April 7, 2023, 11:10 a.m. UTC | #19
On 4/7/23 3:14 PM, Michał Mirosław wrote:
> On Fri, 7 Apr 2023 at 12:04, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 12:34 PM, Michał Mirosław wrote:
>>> On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 4/7/23 1:00 AM, Michał Mirosław wrote:
>>>>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
> [...]
>>>>>>>> +       /*
>>>>>>>> +        * Allocate smaller buffer to get output from inside the page walk
>>>>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>>>>>>>> +        * we want to return output to user in compact form where no two
>>>>>>>> +        * consecutive regions should be continuous and have the same flags.
>>>>>>>> +        * So store the latest element in p.cur between different walks and
>>>>>>>> +        * store the p.cur at the end of the walk to the user buffer.
>>>>>>>> +        */
>>>>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>>>>>>>> +                             GFP_KERNEL);
>>>>>>>> +       if (!p.vec)
>>>>>>>> +               return -ENOMEM;
>>>>>>>> +
>>>>>>>> +       walk_start = walk_end = start;
>>>>>>>> +       while (walk_end < end && !ret) {
>>>>>>>
>>>>>>> The loop will stop if a previous iteration returned ENOSPC (and the
>>>>>>> error will be lost) - is it intended?
>>>>>> It is intentional. -ENOSPC means that the user buffer is full even though
>>>>>> there was more memory to walk over. We don't treat this error. So when
>>>>>> buffer gets full, we stop walking over further as user buffer has gotten
>>>>>> full and return as success.
>>>>>
>>>>> Thanks. What's the difference between -ENOSPC and
>>>>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
>>>>> flow).
>>>> -ENOSPC --> user buffer has been filled completely
>>>> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
>>>>                             still have more space
>>>
>>> What is the difference in code behaviour when those two cases are
>>> compared? (I'd expect none.)
>> There is difference:
>> We add data to user buffer. If it succeeds with return code 0, we engage
>> the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
>> WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
>> to the user buffer.
> 
> Thanks! I see it now. I see a few more corner cases here:
> 1. If we did engage WP but fail to copy the vector we return -EFAULT
> but the WP is already engaged. I'm not sure this is something worth
> guarding against, but documenting that would be helpful I think.
Sure.

> 2. If uffd_wp_range() fails, but we have already processed pages
> earlier, we should treat the error like ENOSPC and back out the failed
> range (the earier changes would be lost otherwise).
Backing out is easier to do for hugepages. But for normal pages, we'll have
to write some code to find where the current data was added (in cur or in
vec) and back out from that. I'll have to write some more code to avoid the
side-effects as well.

But aren't we going over-engineering here? Error occurred and we are trying
to keep the previously generated correct data and returning successfully
still to the user? I don't think we should do this. An error is error. We
should return the error simply even if the memory flags would get lost. We
don't know what caused the error in uffd_wp_range(). Under normal
situation, we there shouldn't have had error.


> 
> Best Regards
> Michał Mirosław
Michał Mirosław April 11, 2023, 9:29 a.m. UTC | #20
On Fri, 7 Apr 2023 at 13:11, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 4/7/23 3:14 PM, Michał Mirosław wrote:
> > On Fri, 7 Apr 2023 at 12:04, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >> On 4/7/23 12:34 PM, Michał Mirosław wrote:
> >>> On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
> >>>> On 4/7/23 1:00 AM, Michał Mirosław wrote:
> >>>>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
> >>>>> <usama.anjum@collabora.com> wrote:
> > [...]
> >>>>>>>> +       /*
> >>>>>>>> +        * Allocate smaller buffer to get output from inside the page walk
> >>>>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> >>>>>>>> +        * we want to return output to user in compact form where no two
> >>>>>>>> +        * consecutive regions should be continuous and have the same flags.
> >>>>>>>> +        * So store the latest element in p.cur between different walks and
> >>>>>>>> +        * store the p.cur at the end of the walk to the user buffer.
> >>>>>>>> +        */
> >>>>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> >>>>>>>> +                             GFP_KERNEL);
> >>>>>>>> +       if (!p.vec)
> >>>>>>>> +               return -ENOMEM;
> >>>>>>>> +
> >>>>>>>> +       walk_start = walk_end = start;
> >>>>>>>> +       while (walk_end < end && !ret) {
> >>>>>>>
> >>>>>>> The loop will stop if a previous iteration returned ENOSPC (and the
> >>>>>>> error will be lost) - is it intended?
> >>>>>> It is intentional. -ENOSPC means that the user buffer is full even though
> >>>>>> there was more memory to walk over. We don't treat this error. So when
> >>>>>> buffer gets full, we stop walking over further as user buffer has gotten
> >>>>>> full and return as success.
> >>>>>
> >>>>> Thanks. What's the difference between -ENOSPC and
> >>>>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
> >>>>> flow).
> >>>> -ENOSPC --> user buffer has been filled completely
> >>>> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
> >>>>                             still have more space
> >>>
> >>> What is the difference in code behaviour when those two cases are
> >>> compared? (I'd expect none.)
> >> There is difference:
> >> We add data to user buffer. If it succeeds with return code 0, we engage
> >> the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
> >> WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
> >> to the user buffer.
> >
> > Thanks! I see it now. I see a few more corner cases here:
> > 1. If we did engage WP but fail to copy the vector we return -EFAULT
> > but the WP is already engaged. I'm not sure this is something worth
> > guarding against, but documenting that would be helpful I think.
> Sure.
>
> > 2. If uffd_wp_range() fails, but we have already processed pages
> > earlier, we should treat the error like ENOSPC and back out the failed
> > range (the earier changes would be lost otherwise).
> Backing out is easier to do for hugepages. But for normal pages, we'll have
> to write some code to find where the current data was added (in cur or in
> vec) and back out from that. I'll have to write some more code to avoid the
> side-effects as well.

If I read the code correctly, the last page should always be in `cur`
and on failure only a single page is needed to be backed out. Did I
miss something?

> But aren't we going over-engineering here? Error occurred and we are trying
> to keep the previously generated correct data and returning successfully
> still to the user? I don't think we should do this. An error is error. We
> should return the error simply even if the memory flags would get lost. We
> don't know what caused the error in uffd_wp_range(). Under normal
> situation, we there shouldn't have had error.

In this case it means that on (intermittent) allocation error we get
inconsistent or non-deterministic results. I wouldn't want to be the
one debugging this later - I'd prefer either the syscall be
"exception-safe" (give consistent and predictable output) or kill the
process.

Best Regards
Michał Mirosław
Muhammad Usama Anjum April 17, 2023, 11:11 a.m. UTC | #21
On 4/11/23 2:29 PM, Michał Mirosław wrote:
> On Fri, 7 Apr 2023 at 13:11, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 4/7/23 3:14 PM, Michał Mirosław wrote:
>>> On Fri, 7 Apr 2023 at 12:04, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 4/7/23 12:34 PM, Michał Mirosław wrote:
>>>>> On Thu, 6 Apr 2023 at 23:04, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
>>>>>> On 4/7/23 1:00 AM, Michał Mirosław wrote:
>>>>>>> On Thu, 6 Apr 2023 at 19:58, Muhammad Usama Anjum
>>>>>>> <usama.anjum@collabora.com> wrote:
>>> [...]
>>>>>>>>>> +       /*
>>>>>>>>>> +        * Allocate smaller buffer to get output from inside the page walk
>>>>>>>>>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>>>>>>>>>> +        * we want to return output to user in compact form where no two
>>>>>>>>>> +        * consecutive regions should be continuous and have the same flags.
>>>>>>>>>> +        * So store the latest element in p.cur between different walks and
>>>>>>>>>> +        * store the p.cur at the end of the walk to the user buffer.
>>>>>>>>>> +        */
>>>>>>>>>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>>>>>>>>>> +                             GFP_KERNEL);
>>>>>>>>>> +       if (!p.vec)
>>>>>>>>>> +               return -ENOMEM;
>>>>>>>>>> +
>>>>>>>>>> +       walk_start = walk_end = start;
>>>>>>>>>> +       while (walk_end < end && !ret) {
>>>>>>>>>
>>>>>>>>> The loop will stop if a previous iteration returned ENOSPC (and the
>>>>>>>>> error will be lost) - is it intended?
>>>>>>>> It is intentional. -ENOSPC means that the user buffer is full even though
>>>>>>>> there was more memory to walk over. We don't treat this error. So when
>>>>>>>> buffer gets full, we stop walking over further as user buffer has gotten
>>>>>>>> full and return as success.
>>>>>>>
>>>>>>> Thanks. What's the difference between -ENOSPC and
>>>>>>> PM_SCAN_FOUND_MAX_PAGES? They seem to result in the same effect (code
>>>>>>> flow).
>>>>>> -ENOSPC --> user buffer has been filled completely
>>>>>> PM_SCAN_FOUND_MAX_PAGES --> max_pages have been found, user buffer may
>>>>>>                             still have more space
>>>>>
>>>>> What is the difference in code behaviour when those two cases are
>>>>> compared? (I'd expect none.)
>>>> There is difference:
>>>> We add data to user buffer. If it succeeds with return code 0, we engage
>>>> the WP. If it succeeds with PM_SCAN_FOUND_MAX_PAGES, we still engage the
>>>> WP. But if we get -ENOSPC, we don't perform engage as the data wasn't added
>>>> to the user buffer.
>>>
>>> Thanks! I see it now. I see a few more corner cases here:
>>> 1. If we did engage WP but fail to copy the vector we return -EFAULT
>>> but the WP is already engaged. I'm not sure this is something worth
>>> guarding against, but documenting that would be helpful I think.
>> Sure.
>>
>>> 2. If uffd_wp_range() fails, but we have already processed pages
>>> earlier, we should treat the error like ENOSPC and back out the failed
>>> range (the earier changes would be lost otherwise).
>> Backing out is easier to do for hugepages. But for normal pages, we'll have
>> to write some code to find where the current data was added (in cur or in
>> vec) and back out from that. I'll have to write some more code to avoid the
>> side-effects as well.
> 
> If I read the code correctly, the last page should always be in `cur`
> and on failure only a single page is needed to be backed out. Did I
> miss something?
I'm leaving using uffd_wp_range() in next revision as it is costing
performance. This will not be needed in next revision.

> 
>> But aren't we going over-engineering here? Error occurred and we are trying
>> to keep the previously generated correct data and returning successfully
>> still to the user? I don't think we should do this. An error is error. We
>> should return the error simply even if the memory flags would get lost. We
>> don't know what caused the error in uffd_wp_range(). Under normal
>> situation, we there shouldn't have had error.
> 
> In this case it means that on (intermittent) allocation error we get
> inconsistent or non-deterministic results. I wouldn't want to be the
> one debugging this later - I'd prefer either the syscall be
> "exception-safe" (give consistent and predictable output) or kill the
> process.
> 
> Best Regards
> Michał Mirosław
Muhammad Usama Anjum June 7, 2023, 5:45 a.m. UTC | #22
On 4/6/23 5:56 PM, Muhammad Usama Anjum wrote:
> On 4/6/23 4:40 PM, kernel test robot wrote:
>> Hi Muhammad,
>>
>> kernel test robot noticed the following build errors:
>>
>> [auto build test ERROR on akpm-mm/mm-everything]
>> [also build test ERROR on next-20230406]
>> [cannot apply to linus/master v6.3-rc5]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use '--base' as documented in
>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>
>> url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
>> patch link:    https://lore.kernel.org/r/20230406074005.1784728-3-usama.anjum%40collabora.com
>> patch subject: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
>> config: arc-randconfig-r023-20230405 (https://download.01.org/0day-ci/archive/20230406/202304061914.N1Hmx12N-lkp@intel.com/config)
>> compiler: arceb-elf-gcc (GCC) 12.1.0
>> reproduce (this is a W=1 build):
>>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>>         chmod +x ~/bin/make.cross
>>         # https://github.com/intel-lab-lkp/linux/commit/f13abb36f64c77913509da8ca157512d2fb9f031
>>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>>         git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
>>         git checkout f13abb36f64c77913509da8ca157512d2fb9f031
>>         # save the config file
>>         mkdir build_dir && cp config build_dir/.config
>>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc olddefconfig
>>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash fs/proc/
>>
>> If you fix the issue, kindly add following tag where applicable
>> | Reported-by: kernel test robot <lkp@intel.com>
>> | Link: https://lore.kernel.org/oe-kbuild-all/202304061914.N1Hmx12N-lkp@intel.com/
>>
>> All error/warnings (new ones prefixed by >>):
>>
>>    fs/proc/task_mmu.c: In function 'pagemap_scan_pmd_entry':
>>>> fs/proc/task_mmu.c:1921:37: error: 'HPAGE_SIZE' undeclared (first use in this function); did you mean 'PAGE_SIZE'?
>>     1921 |                     ((end - start < HPAGE_SIZE) ||
>>          |                                     ^~~~~~~~~~
>>          |                                     PAGE_SIZE
> It seems arc architecture supports CONFIG_TRANSPARENT_HUGEPAGE, but it
> doesn't define HPAGE_SIZE. It only defines PAGE_SIZE in
> include/uapi/asm/page.h? AFAIK HPAGE_SIZE must be defined when
> CONFIG_TRANSPARENT_HUGEPAGE is enabled. What can be the solution here for arc?
I'm still looking for solution. Vineet do you have some thoughts?

> 
> The remaining build failures are because the wrong tree. I base my patches
> on latest next, while the bot has based patches on mm-everything. I guess
> today's next would have latest mm stuff, a rebase would make things correct
> or I'll shift to mm-everything.
> 
>
Muhammad Usama Anjum June 13, 2023, 10:35 a.m. UTC | #23
Hi Vineet,

It seems arc architecture supports CONFIG_TRANSPARENT_HUGEPAGE, but it
doesn't define HPAGE_SIZE. It only defines PAGE_SIZE in
include/uapi/asm/page.h? AFAIK HPAGE_SIZE must be defined when
CONFIG_TRANSPARENT_HUGEPAGE is enabled. What can be the solution here for arc?

Should I just compile out this code for arc architecture specifically?

Thanks,
Usama


On 6/7/23 10:45 AM, Muhammad Usama Anjum wrote:
> On 4/6/23 5:56 PM, Muhammad Usama Anjum wrote:
>> On 4/6/23 4:40 PM, kernel test robot wrote:
>>> Hi Muhammad,
>>>
>>> kernel test robot noticed the following build errors:
>>>
>>> [auto build test ERROR on akpm-mm/mm-everything]
>>> [also build test ERROR on next-20230406]
>>> [cannot apply to linus/master v6.3-rc5]
>>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>>> And when submitting patch, we suggest to use '--base' as documented in
>>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>>
>>> url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
>>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
>>> patch link:    https://lore.kernel.org/r/20230406074005.1784728-3-usama.anjum%40collabora.com
>>> patch subject: [PATCH v12 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
>>> config: arc-randconfig-r023-20230405 (https://download.01.org/0day-ci/archive/20230406/202304061914.N1Hmx12N-lkp@intel.com/config)
>>> compiler: arceb-elf-gcc (GCC) 12.1.0
>>> reproduce (this is a W=1 build):
>>>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>>>         chmod +x ~/bin/make.cross
>>>         # https://github.com/intel-lab-lkp/linux/commit/f13abb36f64c77913509da8ca157512d2fb9f031
>>>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>>>         git fetch --no-tags linux-review Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230406-154314
>>>         git checkout f13abb36f64c77913509da8ca157512d2fb9f031
>>>         # save the config file
>>>         mkdir build_dir && cp config build_dir/.config
>>>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc olddefconfig
>>>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=arc SHELL=/bin/bash fs/proc/
>>>
>>> If you fix the issue, kindly add following tag where applicable
>>> | Reported-by: kernel test robot <lkp@intel.com>
>>> | Link: https://lore.kernel.org/oe-kbuild-all/202304061914.N1Hmx12N-lkp@intel.com/
>>>
>>> All error/warnings (new ones prefixed by >>):
>>>
>>>    fs/proc/task_mmu.c: In function 'pagemap_scan_pmd_entry':
>>>>> fs/proc/task_mmu.c:1921:37: error: 'HPAGE_SIZE' undeclared (first use in this function); did you mean 'PAGE_SIZE'?
>>>     1921 |                     ((end - start < HPAGE_SIZE) ||
>>>          |                                     ^~~~~~~~~~
>>>          |                                     PAGE_SIZE
>> It seems arc architecture supports CONFIG_TRANSPARENT_HUGEPAGE, but it
>> doesn't define HPAGE_SIZE. It only defines PAGE_SIZE in
>> include/uapi/asm/page.h? AFAIK HPAGE_SIZE must be defined when
>> CONFIG_TRANSPARENT_HUGEPAGE is enabled. What can be the solution here for arc?
> I'm still looking for solution. Vineet do you have some thoughts?
> 
>>
>> The remaining build failures are because the wrong tree. I base my patches
>> on latest next, while the bot has based patches on mm-everything. I guess
>> today's next would have latest mm stuff, a rebase would make things correct
>> or I'll shift to mm-everything.
>>
>>
>
diff mbox series

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6134f0df4164..36531b04bcd6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@ 
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1135,6 +1136,26 @@  static inline void clear_soft_dirty(struct vm_area_struct *vma,
 }
 #endif
 
+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	return ((pte_present(pte) && pte_uffd_wp(pte)) ||
+		pte_swp_uffd_wp_any(pte));
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static inline bool is_huge_pte_uffd_wp(pte_t pte)
+{
+	return ((pte_present(pte) && huge_pte_uffd_wp(pte)) ||
+		pte_swp_uffd_wp_any(pte));
+}
+#endif
+
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	return ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+		(is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)));
+}
+
 #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
@@ -1768,11 +1789,416 @@  static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+#define PM_SCAN_FOUND_MAX_PAGES	(1)
+#define PM_SCAN_BITS_ALL	(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PM_SCAN_OPS		(PM_SCAN_OP_GET | PM_SCAN_OP_WP)
+#define PM_SCAN_OP_IS_WP(a)	(a->flags & PM_SCAN_OP_WP)
+#define PM_SCAN_BITMAP(wt, file, present, swap)	\
+	(wt | file << 1 | present << 2 | swap << 3)
+
+struct pagemap_scan_private {
+	struct page_region *vec;
+	struct page_region cur;
+	unsigned long vec_len, vec_index;
+	unsigned int max_pages, found_pages, flags;
+	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+
+static inline bool pagemap_scan_is_written_set(struct pagemap_scan_private *p)
+{
+	return	((p->required_mask | p->anyof_mask | p->excluded_mask) &
+		 PAGE_IS_WRITTEN);
+}
+
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
+				  struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (pagemap_scan_is_written_set(p) && (!userfaultfd_wp(vma) ||
+	    !userfaultfd_wp_async(vma)))
+		return -EPERM;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+
+	return 0;
+}
+
+static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
+			       struct pagemap_scan_private *p,
+			       unsigned long addr, unsigned int n_pages)
+{
+	unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
+	struct page_region *cur = &p->cur;
+
+	if (!n_pages)
+		return -EINVAL;
+
+	if ((p->required_mask & bitmap) != p->required_mask)
+		return 0;
+	if (p->anyof_mask && !(p->anyof_mask & bitmap))
+		return 0;
+	if (p->excluded_mask & bitmap)
+		return 0;
+
+	bitmap &= p->return_mask;
+	if (!bitmap)
+		return 0;
+
+	if ((cur->start + cur->len * PAGE_SIZE == addr) &&
+	    (cur->bitmap == bitmap)) {
+		cur->len += n_pages;
+		p->found_pages += n_pages;
+
+		if (p->max_pages && (p->found_pages == p->max_pages))
+			return PM_SCAN_FOUND_MAX_PAGES;
+
+		return 0;
+	}
+
+	if (!p->vec_index || ((p->vec_index + 1) < p->vec_len)) {
+
+		if (cur->len) {
+			memcpy(&p->vec[p->vec_index], cur,
+			       sizeof(struct page_region));
+			p->vec_index++;
+		}
+		cur->start = addr;
+		cur->len = n_pages;
+		cur->bitmap = bitmap;
+		p->found_pages += n_pages;
+
+		if (p->max_pages && (p->found_pages == p->max_pages))
+			return PM_SCAN_FOUND_MAX_PAGES;
+
+		return 0;
+	}
+
+	return -ENOSPC;
+}
+
+static inline int pagemap_scan_deposit(struct pagemap_scan_private *p,
+				       struct page_region __user *vec,
+				       unsigned long *vec_index)
+{
+	struct page_region *cur = &p->cur;
+
+	if (!cur->len)
+		return 0;
+
+	if (copy_to_user(&vec[*vec_index], cur, sizeof(struct page_region)))
+		return -EFAULT;
+
+	p->vec_index++;
+	(*vec_index)++;
+
+	return 0;
+}
+
+static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+				  unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	bool is_written, is_file, is_present, is_swap;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = end;
+	spinlock_t *ptl;
+	int ret = 0;
+	pte_t *pte;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		unsigned long n_pages = (end - start)/PAGE_SIZE;
+
+		is_written = !is_pmd_uffd_wp(*pmd);
+		is_file = vma->vm_file;
+		is_present = pmd_present(*pmd);
+		is_swap = is_swap_pmd(*pmd);
+
+		spin_unlock(ptl);
+
+		/*
+		 * Break huge page into small pages if the WP operation need to
+		 * be performed is on a portion of the huge page or if max_pages
+		 * pages limit would exceed.
+		 */
+		if (is_written && PM_SCAN_OP_IS_WP(p) &&
+		    ((end - start < HPAGE_SIZE) ||
+		     (p->max_pages &&
+		      (p->max_pages - p->found_pages) < n_pages))) {
+
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
+		}
+
+		if (p->max_pages &&
+		    p->found_pages + n_pages > p->max_pages)
+			n_pages = p->max_pages - p->found_pages;
+
+		ret = pagemap_scan_output(is_written, is_file, is_present,
+					  is_swap, p, start, n_pages);
+		if (ret < 0)
+			return ret;
+
+		if (is_written && PM_SCAN_OP_IS_WP(p) &&
+		    uffd_wp_range(vma, start, HPAGE_SIZE, true) < 0)
+			ret = -EINVAL;
+
+		return ret;
+	}
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd))
+		return 0;
+#endif
+
+	for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
+		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+		is_written = !is_pte_uffd_wp(*pte);
+		is_file = vma->vm_file;
+		is_present = pte_present(*pte);
+		is_swap = is_swap_pte(*pte);
+
+		pte_unmap_unlock(pte, ptl);
+
+		ret = pagemap_scan_output(is_written, is_file, is_present,
+					  is_swap, p, addr, 1);
+		if (ret < 0)
+			return ret;
+
+		if (is_written && PM_SCAN_OP_IS_WP(p) &&
+		    uffd_wp_range(vma, addr, PAGE_SIZE, true) < 0)
+			return -EINVAL;
+	}
+
+	cond_resched();
+	return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
+				      unsigned long start, unsigned long end,
+				      struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	bool is_written;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long n_pages = (end - start)/PAGE_SIZE;
+	pte_t pte;
+	int ret;
+
+	pte = huge_ptep_get(ptep);
+	is_written = !is_huge_pte_uffd_wp(pte);
+
+	/*
+	 * Partial hugetlb page clear isn't supported
+	 */
+	if (is_written && PM_SCAN_OP_IS_WP(p) &&
+	    ((end - start < HPAGE_SIZE) ||
+	     (p->max_pages &&
+	      (p->max_pages - p->found_pages) < n_pages)))
+		return -EPERM;
+
+	if (p->max_pages &&
+	    p->found_pages + n_pages > p->max_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	ret = pagemap_scan_output(is_written, vma->vm_file, pte_present(pte),
+				  is_swap_pte(pte), p, start, n_pages);
+	if (ret < 0)
+		return ret;
+
+	if (is_written && PM_SCAN_OP_IS_WP(p) &&
+	    uffd_wp_range(vma, start, HPAGE_SIZE, true) < 0)
+		ret = -EINVAL;
+
+	return ret;
+}
+#else
+#define pagemap_scan_hugetlb_entry NULL
+#endif
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
+				 int depth, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long n_pages;
+	int ret = 0;
+
+	if (!vma)
+		return 0;
+
+	n_pages = (end - addr)/PAGE_SIZE;
+	if (p->max_pages && p->found_pages + n_pages > p->max_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
+				  n_pages);
+
+	return ret;
+}
+
+static const struct mm_walk_ops pagemap_scan_ops = {
+	.test_walk = pagemap_scan_test_walk,
+	.pmd_entry = pagemap_scan_pmd_entry,
+	.pte_hole = pagemap_scan_pte_hole,
+	.hugetlb_entry = pagemap_scan_hugetlb_entry,
+};
+
+static int pagemap_scan_args_valid(struct pm_scan_arg *arg, unsigned long start,
+				   struct page_region __user *vec)
+{
+	/* Detect illegal size, flags, len and masks */
+	if (arg->size != sizeof(struct pm_scan_arg))
+		return -EINVAL;
+	if (arg->flags & ~PM_SCAN_OPS)
+		return -EINVAL;
+	if (!arg->len)
+		return -EINVAL;
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
+	     arg->return_mask) & ~PM_SCAN_BITS_ALL)
+		return -EINVAL;
+	if (!arg->required_mask && !arg->anyof_mask &&
+	    !arg->excluded_mask)
+		return -EINVAL;
+	if (!arg->return_mask)
+		return -EINVAL;
+
+	/* Validate memory ranges */
+	if (!(arg->flags & PM_SCAN_OP_GET))
+		return -EINVAL;
+	if (!arg->vec)
+		return -EINVAL;
+	if (arg->vec_len == 0)
+		return -EINVAL;
+
+	if (!IS_ALIGNED(start, PAGE_SIZE))
+		return -EINVAL;
+	if (!access_ok((void __user *)start, arg->len))
+		return -EFAULT;
+
+	if (!PM_SCAN_OP_IS_WP(arg))
+		return 0;
+
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask) &
+	    ~PAGE_IS_WRITTEN)
+		return -EINVAL;
+
+	return 0;
+}
+
+static long do_pagemap_cmd(struct mm_struct *mm,
+			   struct pm_scan_arg __user *uarg)
+{
+	unsigned long start, end, walk_start, walk_end;
+	unsigned long empty_slots, vec_index = 0;
+	struct page_region __user *vec;
+	struct pagemap_scan_private p;
+	struct pm_scan_arg arg;
+	int ret = 0;
+
+	if (copy_from_user(&arg, uarg, sizeof(arg)))
+		return -EFAULT;
+
+	start = (unsigned long)untagged_addr(arg.start);
+	vec = (struct page_region *)(unsigned long)untagged_addr(arg.vec);
+
+	ret = pagemap_scan_args_valid(&arg, start, vec);
+	if (ret)
+		return ret;
+
+	end = start + arg.len;
+	p.max_pages = arg.max_pages;
+	p.found_pages = 0;
+	p.flags = arg.flags;
+	p.required_mask = arg.required_mask;
+	p.anyof_mask = arg.anyof_mask;
+	p.excluded_mask = arg.excluded_mask;
+	p.return_mask = arg.return_mask;
+	p.cur.len = 0;
+	p.cur.start = 0;
+	p.vec = NULL;
+	p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+
+	/*
+	 * Allocate smaller buffer to get output from inside the page walk
+	 * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
+	 * we want to return output to user in compact form where no two
+	 * consecutive regions should be continuous and have the same flags.
+	 * So store the latest element in p.cur between different walks and
+	 * store the p.cur at the end of the walk to the user buffer.
+	 */
+	p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
+			      GFP_KERNEL);
+	if (!p.vec)
+		return -ENOMEM;
+
+	walk_start = walk_end = start;
+	while (walk_end < end && !ret) {
+		p.vec_index = 0;
+
+		empty_slots = arg.vec_len - vec_index;
+		p.vec_len = min(p.vec_len, empty_slots);
+
+		walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
+		if (walk_end > end)
+			walk_end = end;
+
+		mmap_read_lock(mm);
+		ret = walk_page_range(mm, walk_start, walk_end,
+				      &pagemap_scan_ops, &p);
+		mmap_read_unlock(mm);
+
+		if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
+			goto free_data;
+
+		walk_start = walk_end;
+		if (p.vec_index) {
+			if (copy_to_user(&vec[vec_index], p.vec,
+					 p.vec_index *
+					 sizeof(struct page_region))) {
+				ret = -EFAULT;
+				goto free_data;
+			}
+			vec_index += p.vec_index;
+		}
+	}
+	ret = pagemap_scan_deposit(&p, vec, &vec_index);
+	if (!ret)
+		ret = vec_index;
+free_data:
+	kfree(p.vec);
+
+	return ret;
+}
+
+static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
+			       unsigned long arg)
+{
+	struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
+	struct mm_struct *mm = file->private_data;
+
+	switch (cmd) {
+	case PAGEMAP_SCAN:
+		return do_pagemap_cmd(mm, uarg);
+
+	default:
+		return -EINVAL;
+	}
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = pagemap_scan_ioctl,
+	.compat_ioctl	= pagemap_scan_ioctl,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ce4f9d18de3e..325672cd6d6f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -210,6 +210,14 @@  extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
 
 #else /* CONFIG_USERFAULTFD */
 
+static inline long uffd_wp_range(struct mm_struct *dst_mm,
+				 struct vm_area_struct *vma,
+				 unsigned long start, unsigned long len,
+				 bool enable_wp)
+{
+	return 0;
+}
+
 /* mm helpers */
 static inline vm_fault_t handle_userfault(struct vm_fault *vmf,
 				unsigned long reason)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..47879c38ce2f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,57 @@  typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pm_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region in pages
+ * bitmap:	Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pm_scan_arg - Pagemap ioctl argument
+ * @size:		Size of the structure
+ * @flags:		Flags for the IOCTL
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pm_scan_arg {
+	__u64 size;
+	__u64 flags;
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u64 max_pages;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Supported flags */
+#define PM_SCAN_OP_GET	(1 << 0)
+#define PM_SCAN_OP_WP	(1 << 1)
+
 #endif /* _UAPI_LINUX_FS_H */