
[v11,4/7] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

Message ID 20230309135718.1490461-5-usama.anjum@collabora.com (mailing list archive)
State New
Series Implement IOCTL to get and optionally clear info about PTEs

Commit Message

Muhammad Usama Anjum March 9, 2023, 1:57 p.m. UTC
This IOCTL, PAGEMAP_SCAN, on the pagemap file can be used to get and/or
clear the info about page table entries. The following operations are
supported in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
  file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
  (PAGE_IS_SWAPPED).
- Find pages which have been written-to and write protect the pages
  (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)

This IOCTL can be extended to get information about more PTE bits. This
IOCTL doesn't support hugetlbs at the moment. No information about
hugetlb can be obtained. This patch has evolved from a basic patch from
Gabriel Krisman Bertazi.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v11:
- Find written pages in a better way
- Fix a corner case (thanks Paul)
- Improve the code/comments
- remove ENGAGE_WP + ! GET operation
- shorten the commit message in favour of moving documentation to
  pagemap.rst

Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message

Change in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code

Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl

Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to _u64
- Improve code quality

Changes in v5:
- Remove tlb flushing even for clear operation

Changes in v4:
- Update the interface and implementation

Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more
  error checking

Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
---
 fs/proc/task_mmu.c      | 366 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  53 ++++++
 2 files changed, 419 insertions(+)
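
For reference, a minimal userspace sketch of the intended usage (illustrative
only: the structure layout and the PAGE_IS_*/PM_SCAN_OP_GET names are taken
from this patch, the write-protect op flag is assumed to be PM_SCAN_OP_WP --
the text above calls the operation PAGEMAP_WP_ENGAGE -- and the exact UAPI
may still change in later revisions):

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/fs.h>	/* PAGEMAP_SCAN, struct pm_scan_arg, struct page_region */

	/*
	 * Report written-to pages in [addr, addr + len) and write-protect them.
	 * pagemap_fd is an open /proc/<pid>/pagemap file descriptor.
	 */
	static long scan_and_wp(int pagemap_fd, void *addr, unsigned long len,
				struct page_region *vec, unsigned long vec_len)
	{
		struct pm_scan_arg arg;

		memset(&arg, 0, sizeof(arg));
		arg.size = sizeof(arg);
		arg.flags = PM_SCAN_OP_GET | PM_SCAN_OP_WP;	/* WP flag name assumed */
		arg.start = (unsigned long)addr;
		arg.len = len;
		arg.vec = (unsigned long)vec;
		arg.vec_len = vec_len;
		arg.required_mask = PAGE_IS_WRITTEN;	/* selection criterion */
		arg.return_mask = PAGE_IS_WRITTEN;	/* bits copied into vec[] */

		/* Returns the number of filled page_region entries, or -1 with errno set. */
		return ioctl(pagemap_fd, PAGEMAP_SCAN, &arg);
	}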

Comments

Michał Mirosław March 13, 2023, 4:02 p.m. UTC | #1
On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
>
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>   (PAGE_IS_SWAPPED).
> - Find pages which have been written-to and write protect the pages
>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
[...]
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,7 @@
>  #include <linux/shmem_fs.h>
>  #include <linux/uaccess.h>
>  #include <linux/pkeys.h>
> +#include <linux/minmax.h>
>
>  #include <asm/elf.h>
>  #include <asm/tlb.h>
> @@ -1132,6 +1133,18 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>  }
>  #endif
>
> +static inline bool is_pte_uffd_wp(pte_t pte)
> +{
> +       return ((pte_present(pte) && pte_uffd_wp(pte)) ||
> +               (pte_swp_uffd_wp_any(pte)));

Parentheses around pte_swp_uffd_wp_any() are redundant. Please remove
here and in all following if()s. (Nit: those extra parentheses are
used inconsistently in the patch anyway.)

[...]
> +static inline bool pagemap_scan_is_wt_required(struct pagemap_scan_private *p)

This seems to check if the PAGE_IS_WRITTEN flag is tested, so
"pagemap_scan_needs_wp_checks()"? Or maybe document/expand the "wt"
acronym, as it seems to be used also in the following code.

> +{
> +       return  ((p->required_mask & PAGE_IS_WRITTEN) ||
> +                (p->anyof_mask & PAGE_IS_WRITTEN) ||
> +                (p->excluded_mask & PAGE_IS_WRITTEN));

Nit: It looks like it should answer "do any of the masks contain
PAGE_IS_WRITTEN?" so maybe:

return (p->required_mask | p->anyof_mask | p->excluded_mask) & PAGE_IS_WRITTEN;

[...]

> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> +                              struct pagemap_scan_private *p,
> +                              unsigned long addr, unsigned int n_pages)
> +{
> +       unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
> +       struct page_region *cur = &p->cur;
> +       bool cpy = true;
> +
> +       if (p->max_pages && (p->found_pages == p->max_pages))
> +               return -ENOSPC;
> +
> +       if (!n_pages)
> +               return -EINVAL;
> +
> +       if (p->required_mask)
> +               cpy = ((p->required_mask & bitmap) == p->required_mask);
> +       if (cpy && p->anyof_mask)
> +               cpy = (p->anyof_mask & bitmap);
> +       if (cpy && p->excluded_mask)
> +               cpy = !(p->excluded_mask & bitmap);

Since the rest of the code is executed only when `cpy` is true, this
could just return early for easier understanding.

BTW, some of the tests are redundant. Eg: if required_mask == 0, then
`required_mask & x == required_mask` will always hold. Same for
`excluded_mask & x == 0`.

> +
> +       bitmap = bitmap & p->return_mask;

Nit: bitmap &= p->return_mask;

> +       if (cpy && bitmap) {

Assuming early returns on `!cpy` are done earlier:

if (!bitmap)
  return 0;

> +               if ((cur->len) && (cur->bitmap == bitmap) &&
> +                   (cur->start + cur->len * PAGE_SIZE == addr)) {

I'd recommend removing the extra parentheses as they make the code
less readable for me (too many parentheses to match visually).
The `cur->len` test seems redundant: is it possible to have
`cur->start == addr` in that case (I guess it would have to get
`n_pages == 0` in an earlier invocation)?

> +
> +                       cur->len += n_pages;
> +                       p->found_pages += n_pages;

Please add an early return so that 'else' chaining won't be necessary.

> +               } else if ((!p->vec_index) ||
> +                          ((p->vec_index + 1) < p->vec_len)) {

Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:

if (vec_index >= p->vec_len)
    return -ENOSPC;

> +                       if (cur->len) {
> +                               memcpy(&p->vec[p->vec_index], cur,
> +                                      sizeof(struct page_region));
> +                               p->vec_index++;
> +                       }
> +
> +                       cur->start = addr;
> +                       cur->len = n_pages;
> +                       cur->bitmap = bitmap;
> +                       p->found_pages += n_pages;
> +               } else {
> +                       return -ENOSPC;
> +               }
> +       }
> +
> +       return 0;
> +}
[...]

> +static int pagemap_scan_deposit(struct pagemap_scan_private *p,
> +                               struct page_region __user *vec,
> +                               unsigned long *vec_index)
> +{
> +       struct page_region *cur = &p->cur;
> +
> +       if (cur->len) {

if (!cur->len)
  return 0;

> +               if (copy_to_user(&vec[*vec_index], cur,
> +                                sizeof(struct page_region)))
> +                       return -EFAULT;
> +
> +               p->vec_index++;
> +               (*vec_index)++;
> +       }
> +
> +       return 0;
> +}

> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +                                 unsigned long end, struct mm_walk *walk)
> +{
> +       struct pagemap_scan_private *p = walk->private;
> +       struct vm_area_struct *vma = walk->vma;
> +       bool is_writ, is_file, is_pres, is_swap;
> +       unsigned long addr = end;
> +       spinlock_t *ptl;
> +       int ret = 0;
> +       pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE

Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return
NULL if transparent hugepages are not compiled in. OTOH I see
BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this
case), so that would need to be worked around first.

> +       ptl = pmd_trans_huge_lock(pmd, vma);
> +       if (ptl) {
> +               unsigned long n_pages;
> +
> +               is_writ = !is_pmd_uffd_wp(*pmd);

`is_written`?

> +               /*
> +                * Break huge page into small pages if operation needs to be
> +                * performed is on a portion of the huge page.
> +                */
> +               if (is_writ && PM_SCAN_OP_IS_WP(p) &&
> +                   (end - start < HPAGE_SIZE)) {
> +                       spin_unlock(ptl);
> +
> +                       split_huge_pmd(vma, pmd, start);
> +                       goto process_smaller_pages;
> +               }
> +
> +               n_pages = (end - start)/PAGE_SIZE;
> +               if (p->max_pages &&
> +                   p->found_pages + n_pages >= p->max_pages)

Nit: greater-than is also correct and avoids no-op assignment.

> +                       n_pages = p->max_pages - p->found_pages;
> +
> +               ret = pagemap_scan_output(is_writ, vma->vm_file,
> +                                         pmd_present(*pmd), is_swap_pmd(*pmd),
> +                                         p, start, n_pages);
> +               spin_unlock(ptl);

if (ret || !is_written)
  return ret;

This will avoid those tests in the following if().

> +
> +               if (!ret && is_writ && PM_SCAN_OP_IS_WP(p) &&
> +                   uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true) < 0)
> +                       ret = -EINVAL;
> +
> +               return ret;

After the above early returns, this will always be `return 0;`.

> +       }
> +process_smaller_pages:
> +       if (pmd_trans_unstable(pmd))
> +               return 0;
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +       for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {

The `!ret` can be removed if the EINVAL case was to `break` by itself.

> +               pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +
> +               is_writ = !is_pte_uffd_wp(*pte);
> +               is_file = vma->vm_file;
> +               is_pres = pte_present(*pte);
> +               is_swap = is_swap_pte(*pte);
> +
> +               pte_unmap_unlock(pte, ptl);
> +
> +               ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
> +                                         p, addr, 1);
> +               if (ret)
> +                       break;
> +
> +               if (PM_SCAN_OP_IS_WP(p) && is_writ &&
> +                   uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
> +                       ret = -EINVAL;
> +       }
> +
> +       cond_resched();
> +       return ret;
> +}
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
> +                                int depth, struct mm_walk *walk)
> +{
> +       struct pagemap_scan_private *p = walk->private;
> +       struct vm_area_struct *vma = walk->vma;
> +       unsigned long n_pages;
> +       int ret = 0;
> +
> +       if (vma) {

if (!vma) return 0;

> +               n_pages = (end - addr)/PAGE_SIZE;
> +               if (p->max_pages &&
> +                   p->found_pages + n_pages >= p->max_pages)
> +                       n_pages = p->max_pages - p->found_pages;
> +
> +               ret = pagemap_scan_output(false, vma->vm_file, false, false, p,
> +                                         addr, n_pages);
> +       }
> +
> +       return ret;
> +}


> +/* No hugetlb support is present. */

"FIXME: hugetlb support is not implemented."? (There seems to be no
#ifdef CONFIG_HUGETLB or similar, so I guess the comment is about the
current implementation.)

> +static const struct mm_walk_ops pagemap_scan_ops = {
> +       .test_walk = pagemap_scan_test_walk,
> +       .pmd_entry = pagemap_scan_pmd_entry,
> +       .pte_hole = pagemap_scan_pte_hole,
> +};
> +
> +static bool pagemap_scan_args_valid(struct pm_scan_arg *arg,
> +                                   struct page_region __user *vec,
> +                                   unsigned long start)
> +{
> +       /* Detect illegal size, flags and masks */
> +       if (arg->size != sizeof(struct pm_scan_arg))
> +               return false;
> +       if (arg->flags & ~PM_SCAN_OPS)
> +               return false;
> +       if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
> +            arg->return_mask) & ~PM_SCAN_BITS_ALL)
> +               return false;

> +       if (!arg->required_mask && !arg->anyof_mask &&
> +           !arg->excluded_mask)
> +               return false;

Is there an assumption in the code that those checks are needed? I'd
expect that no selection criteria makes a valid page set?

> +       if (!arg->return_mask)
> +               return false;
> +
> +       /* Validate memory ranges */
> +       if (!(arg->flags & PM_SCAN_OP_GET))
> +               return false;
> +       if (!arg->vec)
> +               return false;
> +       if (arg->vec_len == 0)
> +               return false;

> +       if (!access_ok((void __user *)vec,
> +                      arg->vec_len * sizeof(struct page_region)))
> +               return false;

Is there a provision that userspace threads are all blocked from
manipulating mmaps during this ioctl()? If not, this is a TOCTOU bug
and the writes should be checked each time as another userspace thread
could remap the memory while the ioctl() is working. Anyway, the
return should be EFAULT for this case.

> +       if (!IS_ALIGNED(start, PAGE_SIZE))
> +               return false;
> +       if (!access_ok((void __user *)start, arg->len))
> +               return false;

This I guess wants to check if the range to be scanned is mapped -
but isn't this what the ioctl() should do during the scan? (But, also
see above.)

> +       if (PM_SCAN_OP_IS_WP(arg)) {

if (!...IS_WP) return true;

> +               if (arg->required_mask & PM_SCAN_NON_WT_BITS)
> +                       return false;
> +               if (arg->anyof_mask & PM_SCAN_NON_WT_BITS)
> +                       return false;
> +               if (arg->excluded_mask & PM_SCAN_NON_WT_BITS)
> +                       return false;

Please see: pagemap_scan_is_wt_required comment. Also, it seems this
constant is used only here, so ~PAGE_IS_WRITTEN might be enough?

[...]
> +static long do_pagemap_cmd(struct mm_struct *mm, struct pm_scan_arg *arg)
> +{
> +       unsigned long start, end, walk_start, walk_end;
> +       unsigned long empty_slots, vec_index = 0;
> +       struct page_region __user *vec;
> +       struct pagemap_scan_private p;
> +       int ret = 0;
> +
> +       start = (unsigned long)untagged_addr(arg->start);
> +       vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
> +
> +       if (!pagemap_scan_args_valid(arg, vec, start))
> +               return -EINVAL;
> +
> +       end = start + arg->len;
> +       p.max_pages = arg->max_pages;
> +       p.found_pages = 0;
> +       p.flags = arg->flags;
> +       p.required_mask = arg->required_mask;
> +       p.anyof_mask = arg->anyof_mask;
> +       p.excluded_mask = arg->excluded_mask;
> +       p.return_mask = arg->return_mask;
> +       p.cur.len = 0;
> +       p.vec = NULL;
> +       p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> +
> +       /*
> +        * Allocate smaller buffer to get output from inside the page walk
> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> +        * we want to return output to user in compact form where no two
> +        * consecutive regions should be continuous and have the same flags.
> +        * So store the latest element in p.cur between different walks and
> +        * store the p.cur at the end of the walk to the user buffer.
> +        */
> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> +                             GFP_KERNEL);
> +       if (!p.vec)
> +               return -ENOMEM;
> +
> +       walk_start = walk_end = start;
> +       while (walk_end < end) {
> +               p.vec_index = 0;
> +
> +               empty_slots = arg->vec_len - vec_index;
> +               p.vec_len = min(p.vec_len, empty_slots);
> +
> +               walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
> +               if (walk_end > end)
> +                       walk_end = end;
> +
> +               mmap_read_lock(mm);
> +               ret = walk_page_range(mm, walk_start, walk_end,
> +                                     &pagemap_scan_ops, &p);
> +               mmap_read_unlock(mm);
> +
> +               if (!(!ret || ret == -ENOSPC))

if (ret && ret != -ENOSPC)

> +                       goto free_data;
> +
> +               walk_start = walk_end;
> +               if (p.vec_index) {
> +                       if (copy_to_user(&vec[vec_index], p.vec,
> +                                        p.vec_index *
> +                                        sizeof(struct page_region))) {
> +                               ret = -EFAULT;
> +                               goto free_data;
> +                       }
> +                       vec_index += p.vec_index;
> +               }
> +       }
> +       ret = pagemap_scan_deposit(&p, vec, &vec_index);
> +       if (!ret)
> +               ret = vec_index;
> +free_data:
> +       kfree(p.vec);
> +
> +       return ret;
> +}
> +
> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
> +                              unsigned long arg)
> +{
> +       struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
> +       struct mm_struct *mm = file->private_data;
> +       struct pm_scan_arg argument;
> +
> +       if (cmd == PAGEMAP_SCAN) {

switch() for easier expansion later?

> +               if (copy_from_user(&argument, uarg,
> +                                  sizeof(struct pm_scan_arg)))

sizeof(*argument);

Could you push this to do_pagemap_cmd()? In case this file gets more
ioctl() commands there won't be a need to add more command-specific
structures in this function.
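
Combining the two suggestions, the dispatcher could then look roughly like
this (just a sketch; do_pagemap_scan() is a hypothetical variant of
do_pagemap_cmd() that takes the user pointer and does the copy_from_user()
itself):

static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
			       unsigned long arg)
{
	struct mm_struct *mm = file->private_data;

	switch (cmd) {
	case PAGEMAP_SCAN:
		/* copy_from_user() of struct pm_scan_arg happens inside */
		return do_pagemap_scan(mm, arg);
	default:
		return -EINVAL;
	}
}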

> +                       return -EFAULT;
> +               return do_pagemap_cmd(mm, &argument);
> +       }
> +
> +       return -EINVAL;
> +}
> +
>  const struct file_operations proc_pagemap_operations = {
>         .llseek         = mem_lseek, /* borrow this */
>         .read           = pagemap_read,
>         .open           = pagemap_open,
>         .release        = pagemap_release,
> +       .unlocked_ioctl = pagemap_scan_ioctl,
> +       .compat_ioctl   = pagemap_scan_ioctl,

Is this correct? Would the code need different userspace pointer
handling for 32-bit userspace on a 64-bit kernel?

>  };
>  #endif /* CONFIG_PROC_PAGE_MONITOR */
Peter Xu March 15, 2023, 3:55 p.m. UTC | #2
On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +
> +		is_writ = !is_pte_uffd_wp(*pte);
> +		is_file = vma->vm_file;
> +		is_pres = pte_present(*pte);
> +		is_swap = is_swap_pte(*pte);
> +
> +		pte_unmap_unlock(pte, ptl);
> +
> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
> +					  p, addr, 1);
> +		if (ret)
> +			break;
> +
> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
> +			ret = -EINVAL;
> +	}

This is not really atomic..

Taking the spinlock for each pte is not only overkill but wrong in
atomicity, because the pte can change right after the spinlock is unlocked.

Unfortunately you also cannot reuse uffd_wp_range() because that's not
atomic either, my fault here.  Probably I was thinking mostly from
soft-dirty pov on batching the collect+reset.

You need to take the spin lock, collect whatever bits, set/clear whatever
bits, only until then release the spin lock.
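
For illustration, a rough sketch of that shape (hypothetical and heavily
simplified: it only handles present PTEs and ignores TLB flushing and
batching):

	/* Hypothetical sketch only, not the patch code. */
	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
	is_written = !is_pte_uffd_wp(*pte);
	/* ... collect the other bits while still holding ptl ... */
	if (is_written && PM_SCAN_OP_IS_WP(p))
		set_pte_at(vma->vm_mm, addr, pte,
			   pte_mkuffd_wp(pte_wrprotect(*pte)));
	pte_unmap_unlock(pte, ptl);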

"Not atomic" means you can have some page got dirtied but you could miss
it.  Depending on how strict you want, I think it'll break apps like CRIU
if strict atomicity needed for migrating a process.  If we want to have a
new interface anyway, IMHO we'd better do that in the strict way.

Same comment applies to the THP handling (where I cut from the context).
Muhammad Usama Anjum March 15, 2023, 4:54 p.m. UTC | #3
On 3/15/23 8:55 PM, Peter Xu wrote:
> On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
>> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
>> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> +
>> +		is_writ = !is_pte_uffd_wp(*pte);
>> +		is_file = vma->vm_file;
>> +		is_pres = pte_present(*pte);
>> +		is_swap = is_swap_pte(*pte);
>> +
>> +		pte_unmap_unlock(pte, ptl);
>> +
>> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
>> +					  p, addr, 1);
>> +		if (ret)
>> +			break;
>> +
>> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
>> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
>> +			ret = -EINVAL;
>> +	}
> 
> This is not real atomic..
> 
> Taking the spinlock for each pte is not only overkill but wrong in
> atomicity because the pte can change right after spinlock unlocked.
Let me explain. It seems wrong, but it isn't. In my rigorous testing,
it didn't show any side-effect.  Here we are finding out if a page is
written. If a page is written, only then do we clear it. Let's look at the
different possibilities here:
- If a page isn't written, we'll not clear it.
- If a page is written and there isn't any race, we'll clear the written-to
flag by write-protecting it.
- If a page is written but, before clearing it, data is written again to the
page, the page would remain written and we'll clear it.
- If a page is written but, before clearing it, it gets write-protected,
we'll still write-protect it. There is double write protection here, but
no side-effect.

Let's turn this into a truth table for easier understanding. Here the first
and third columns represent the above code. The 2nd column represents
any other thread interacting with the page.

If page is written/dirty	some other task interacts	wp_page
no				does nothing			no
no				writes to page			no
no				wp the page			no
yes				does nothing			yes
yes				write to page			yes
yes				wp the page			yes

As you can see, there isn't any side-effect happening. We aren't over-doing
or under-doing the write protection.

Even if we were doing something wrong here and I brought the lock over all of
this, the pages could become written or wp'ed just after unlocking. That is
expected. The current implementation doesn't seem to be breaking this.

Is my understanding wrong somewhere here? Can you point out?

The designs previous to this locking design were either buggy or slower when
multiple threads were working on the same pages. The current implementation
removes those limitations:
- The memcpy inside pagemap_scan_output is happening with pte unlocked.
- We only wp a page if we have noted this page to be dirty
- No mm write lock is required. The read lock works fine, just like
userfaultfd_writeprotect() takes only the read lock.

There is only one con here: we are locking and unlocking the pte lock
again and again.

Please have a look at my explanation and let me know what you think.

> 
> Unfortunately you also cannot reuse uffd_wp_range() because that's not
> atomic either, my fault here.  Probably I was thinking mostly from
> soft-dirty pov on batching the collect+reset.
> 
> You need to take the spin lock, collect whatever bits, set/clear whatever
> bits, only until then release the spin lock.
> 
> "Not atomic" means you can have some page got dirtied but you could miss
> it.  Depending on how strict you want, I think it'll break apps like CRIU
> if strict atomicity needed for migrating a process.  If we want to have a
> new interface anyway, IMHO we'd better do that in the strict way.
In my rigorous multi-threaded testing, where lots of threads are working
on the same set of pages, we aren't losing even a single update. I can share
the test if you want.

> 
> Same comment applies to the THP handling (where I cut from the context).
>
Peter Xu March 15, 2023, 7:53 p.m. UTC | #4
On Wed, Mar 15, 2023 at 09:54:40PM +0500, Muhammad Usama Anjum wrote:
> On 3/15/23 8:55 PM, Peter Xu wrote:
> > On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
> >> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
> >> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >> +
> >> +		is_writ = !is_pte_uffd_wp(*pte);
> >> +		is_file = vma->vm_file;
> >> +		is_pres = pte_present(*pte);
> >> +		is_swap = is_swap_pte(*pte);
> >> +
> >> +		pte_unmap_unlock(pte, ptl);
> >> +
> >> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
> >> +					  p, addr, 1);
> >> +		if (ret)
> >> +			break;
> >> +
> >> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
> >> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
> >> +			ret = -EINVAL;
> >> +	}
> > 
> > This is not real atomic..
> > 
> > Taking the spinlock for each pte is not only overkill but wrong in
> > atomicity because the pte can change right after spinlock unlocked.
> Let me explain. It seems like wrong, but it isn't. In my rigorous testing,
> it didn't show any side-effect.  Here we are finding out if a page is
> written. If page is written, only then we clear it. Lets look at the
> different possibilities here:
> - If a page isn't written, we'll not clear it.
> - If a page is written and there isn't any race, we'll clear written-to
> flag by write protecting it.
> - If a page is written but before clearing it, data is written again to the
> page. The page would remain written and we'll clear it.
> - If a page is written but before clearing it, it gets write protected,
> we'll still write protected it. There is double write protection here, but
> no side-effect.
> 
> Lets turn this into a truth table for easier understanding. Here first
> column and third column represents this above code. 2nd column represents
> any other thread interacting with the page.
> 
> If page is written/dirty	some other task interacts	wp_page
> no				does nothing			no
> no				writes to page			no
> no				wp the page			no
> yes				does nothing			yes
> yes				write to page			yes
> yes				wp the page			yes
> 
> As you can see there isn't any side-effect happening. We aren't over doing
> the wp or under-doing the write-protect.
> 
> Even if we were doing something wrong here and I bring the lock over all of
> this, the pages become written or wp just after unlocking. It is
> expected. This current implementation doesn't seem to be breaking this.
> 
> Is my understanding wrong somewhere here? Can you point out?

Yes you're right.  With is_writ check it looks all fine.

> 
> Previous to this current locking design were either buggy or slower when
> multiple threads were working on same pages. Current implementation removes
> the limitations:
> - The memcpy inside pagemap_scan_output is happening with pte unlocked.

Why is this anything to worry about?  Doesn't that memcpy only apply to a
page_region struct?

> - We are only wp a page if we have noted this page to be dirty
> - No mm write lock is required. Only read lock works fine just like
> userfaultfd_writeprotect() takes only read lock.

I didn't even notice you used to use write lock.  Yes I think read lock is
sufficient here.

> 
> There is only one con here that we are locking and unlocking the pte lock
> again and again.
> 
> Please have a look at my explanation and let me know what do you think.

I think this is fine as long as the semantics is correct, which I believe
is the case.  The spinlock can be optimized, but it can be done on top if
it needs more involved changes.

> 
> > 
> > Unfortunately you also cannot reuse uffd_wp_range() because that's not
> > atomic either, my fault here.  Probably I was thinking mostly from
> > soft-dirty pov on batching the collect+reset.
> > 
> > You need to take the spin lock, collect whatever bits, set/clear whatever
> > bits, only until then release the spin lock.
> > 
> > "Not atomic" means you can have some page got dirtied but you could miss
> > it.  Depending on how strict you want, I think it'll break apps like CRIU
> > if strict atomicity needed for migrating a process.  If we want to have a
> > new interface anyway, IMHO we'd better do that in the strict way.
> In my rigorous multi-threaded testing where lots of threads are working
> on same set of pages, we aren't losing even a single update. I can share
> the test if you want.

Good to have tests covering that.  I'd say you can add the test into
selftests along with the series when you repost if it's convenient.  It can
be part of an existing test or it can be a new one under mm/.

Thanks,
Muhammad Usama Anjum March 16, 2023, 5:17 a.m. UTC | #5
On 3/16/23 12:53 AM, Peter Xu wrote:
> On Wed, Mar 15, 2023 at 09:54:40PM +0500, Muhammad Usama Anjum wrote:
>> On 3/15/23 8:55 PM, Peter Xu wrote:
>>> On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
>>>> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
>>>> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>>>> +
>>>> +		is_writ = !is_pte_uffd_wp(*pte);
>>>> +		is_file = vma->vm_file;
>>>> +		is_pres = pte_present(*pte);
>>>> +		is_swap = is_swap_pte(*pte);
>>>> +
>>>> +		pte_unmap_unlock(pte, ptl);
>>>> +
>>>> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
>>>> +					  p, addr, 1);
>>>> +		if (ret)
>>>> +			break;
>>>> +
>>>> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
>>>> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
>>>> +			ret = -EINVAL;
>>>> +	}
>>>
>>> This is not real atomic..
>>>
>>> Taking the spinlock for each pte is not only overkill but wrong in
>>> atomicity because the pte can change right after spinlock unlocked.
>> Let me explain. It seems like wrong, but it isn't. In my rigorous testing,
>> it didn't show any side-effect.  Here we are finding out if a page is
>> written. If page is written, only then we clear it. Lets look at the
>> different possibilities here:
>> - If a page isn't written, we'll not clear it.
>> - If a page is written and there isn't any race, we'll clear written-to
>> flag by write protecting it.
>> - If a page is written but before clearing it, data is written again to the
>> page. The page would remain written and we'll clear it.
>> - If a page is written but before clearing it, it gets write protected,
>> we'll still write protected it. There is double write protection here, but
>> no side-effect.
>>
>> Lets turn this into a truth table for easier understanding. Here first
>> column and third column represents this above code. 2nd column represents
>> any other thread interacting with the page.
>>
>> If page is written/dirty	some other task interacts	wp_page
>> no				does nothing			no
>> no				writes to page			no
>> no				wp the page			no
>> yes				does nothing			yes
>> yes				write to page			yes
>> yes				wp the page			yes
>>
>> As you can see there isn't any side-effect happening. We aren't over doing
>> the wp or under-doing the write-protect.
>>
>> Even if we were doing something wrong here and I bring the lock over all of
>> this, the pages become written or wp just after unlocking. It is
>> expected. This current implementation doesn't seem to be breaking this.
>>
>> Is my understanding wrong somewhere here? Can you point out?
> 
> Yes you're right.  With is_writ check it looks all fine.
> 
>>
>> Previous to this current locking design were either buggy or slower when
>> multiple threads were working on same pages. Current implementation removes
>> the limitations:
>> - The memcpy inside pagemap_scan_output is happening with pte unlocked.
> 
> Why this has anything to worry?  Isn't that memcpy only applies to a
> page_region struct?
Yeah, correct. I'm just saying that memcpy without pte lock is better than
memcpy with pte locked. :)

> 
>> - We are only wp a page if we have noted this page to be dirty
>> - No mm write lock is required. Only read lock works fine just like
>> userfaultfd_writeprotect() takes only read lock.
> 
> I didn't even notice you used to use write lock.  Yes I think read lock is
> sufficient here.
> 
>>
>> There is only one con here that we are locking and unlocking the pte lock
>> again and again.
>>
>> Please have a look at my explanation and let me know what do you think.
> 
> I think this is fine as long as the semantics is correct, which I believe
> is the case.  The spinlock can be optimized, but it can be done on top if
> needs more involved changes.
> 
>>
>>>
>>> Unfortunately you also cannot reuse uffd_wp_range() because that's not
>>> atomic either, my fault here.  Probably I was thinking mostly from
>>> soft-dirty pov on batching the collect+reset.
>>>
>>> You need to take the spin lock, collect whatever bits, set/clear whatever
>>> bits, only until then release the spin lock.
>>>
>>> "Not atomic" means you can have some page got dirtied but you could miss
>>> it.  Depending on how strict you want, I think it'll break apps like CRIU
>>> if strict atomicity needed for migrating a process.  If we want to have a
>>> new interface anyway, IMHO we'd better do that in the strict way.
>> In my rigorous multi-threaded testing where lots of threads are working
>> on same set of pages, we aren't losing even a single update. I can share
>> the test if you want.
> 
> Good to have tests covering that.  I'd say you can add the test into
> selftests along with the series when you repost if it's convenient.  It can
> be part of an existing test or it can be a new one under mm/.
Sure, I'll add it to the selftests.

Thank you for reviewing and asking the questions.

> 
> Thanks,
>
Muhammad Usama Anjum March 16, 2023, 5:53 p.m. UTC | #6
Hi,

Thank you so much for reviewing.

On 3/13/23 9:02 PM, Michał Mirosław wrote:
> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>>
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>>   (PAGE_IS_SWAPPED).
>> - Find pages which have been written-to and write protect the pages
>>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
> [...]
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/shmem_fs.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/pkeys.h>
>> +#include <linux/minmax.h>
>>
>>  #include <asm/elf.h>
>>  #include <asm/tlb.h>
>> @@ -1132,6 +1133,18 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>>  }
>>  #endif
>>
>> +static inline bool is_pte_uffd_wp(pte_t pte)
>> +{
>> +       return ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> +               (pte_swp_uffd_wp_any(pte)));
> 
> Parentheses around pte_swp_uffd_wp_any() are redundant. Please remove
> here and in all following if()s. (Nit: those extra parentheses are
> used inconsistently in the patch anyway.)
I'll remove these in next version.

> 
> [...]
>> +static inline bool pagemap_scan_is_wt_required(struct pagemap_scan_private *p)
> 
> This seems to check if the PAGE_IS_WRITTEN flag is tested, so
> "pagemap_scan_needs_wp_checks()"? Or maybe document/expand the "wt"
> acronym as it seems used also on following code.
I'll expand wt.

> 
>> +{
>> +       return  ((p->required_mask & PAGE_IS_WRITTEN) ||
>> +                (p->anyof_mask & PAGE_IS_WRITTEN) ||
>> +                (p->excluded_mask & PAGE_IS_WRITTEN));
> 
> Nit: It looks like it should answer "do any of the masks contain
> PAGE_IS_WRITTEN?" so maybe:
> 
> return (p->required_mask | p->anyof_mask | p->excluded_mask) & PAGE_IS_WRITTEN;
I'll update.

> 
> [...]
> 
>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>> +                              struct pagemap_scan_private *p,
>> +                              unsigned long addr, unsigned int n_pages)
>> +{
>> +       unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
>> +       struct page_region *cur = &p->cur;
>> +       bool cpy = true;
>> +
>> +       if (p->max_pages && (p->found_pages == p->max_pages))
>> +               return -ENOSPC;
>> +
>> +       if (!n_pages)
>> +               return -EINVAL;
>> +
>> +       if (p->required_mask)
>> +               cpy = ((p->required_mask & bitmap) == p->required_mask);
>> +       if (cpy && p->anyof_mask)
>> +               cpy = (p->anyof_mask & bitmap);
>> +       if (cpy && p->excluded_mask)
>> +               cpy = !(p->excluded_mask & bitmap);
> 
> Since the rest of the code is executed only when `cpy` is true, this
> could just return early for easier understanding.
Hmm... I'll do the following:
	if (!cpy || !bitmap)
		return 0;

> 
> BTW, some of the tests are redundant. Eg: if required_mask == 0, then
> `required_mask & x == required_mask` will always hold. Same for
> `excluded_mask & x == 0`.
Correct. This is why I'm checking if required_mask is set and then
comparing bitmap with it. required_mask may be 0 if not set. This if will
ignore the subsequent check.

	if (p->required_mask)
		cpy = ((p->required_mask & bitmap) == p->required_mask);

I don't see any redundancy here. Please let me know otherwise?

> 
>> +
>> +       bitmap = bitmap & p->return_mask;
> 
> Nit: bitmap &= p->return_mask;
Sure. Will do.

Just for my knowledge, what does "Nit" signify if a comment is marked with it?

> 
>> +       if (cpy && bitmap) {
> 
> Assuming early returns on `!cpy` are done earlier:
> 
> if (!bitmap)
>   return 0;
I've posted a condition above which would suit better here.

> 
>> +               if ((cur->len) && (cur->bitmap == bitmap) &&
>> +                   (cur->start + cur->len * PAGE_SIZE == addr)) {
> 
> I'd recommend removing the extra parentheses as they make the code
> less readable for me (too many parentheses to match visually).
I'll remove the parentheses.

> The `cur->len` test seems redundant: is it possible to have
> `cur->start == addr` in that case (I guess it would have to get
> `n_pages == 0` in an earlier invocation)?
No, both wouldn't work. cur->len == 0 means that it holds only garbage. It is
essential to check validity via cur->len before performing the other
checks. Also, cur->start can never be equal to addr, as we are walking over
the page addresses in a serial manner. We want to see here if the current
address continues the previous data by computing the ending address of the
last stored data (cur->start + cur->len * PAGE_SIZE).

> 
>> +
>> +                       cur->len += n_pages;
>> +                       p->found_pages += n_pages;
> 
> Please add an early return so that 'else' chaining won't be necessary.
I'll do it.

> 
>> +               } else if ((!p->vec_index) ||
>> +                          ((p->vec_index + 1) < p->vec_len)) {
> 
> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
> 
> if (vec_index >= p->vec_len)
>     return -ENOSPC;

No, it'll not work. Let's leave it as it is. :)

It has gotten somewhat complex, but I don't have any other way to make it
simpler which still works. First note the following points:
1) We walk over 512 pages or 1 THP at a time to not over-allocate memory in
the kernel (p->vec).
2) We also want to merge consecutive pages with the same flags into one
struct page_region. The p->vec of the current walk may merge with the next
walk, so we cannot write to user memory until we have the results of the
next walk.

So the most recent data is put into p->cur. When data that cannot be merged
with it is found, we move p->cur to p->vec[p->vec_index] inside the page walk.
After the page walk, p->vec[0 to p->vec_index] is moved to arg->vec. After all
the walks are over, we move p->cur to arg->vec. This completes the data
transfer to the user buffer.

	--------------
	|   p->cur   |
	--------------
	      |
	      |
	      V
	--------------
	|	     |
	|	     |
	|   p->vec   |
	|	     |
	|	     |
	--------------
	      |
	      |
	      V
	--------------
	|	     |
	|	     |
	|	     |
	|  arg->vec  |
	|	     |
	|	     |
	|	     |
	--------------


I'm sorry that it has gotten this complex. It was way simpler when
we were walking over all the memory in one go, but then we needed an
unbounded amount of memory from the kernel, which we don't want.
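
Roughly, the flow described above is (just a restatement of the patch code,
not new code):

	do_pagemap_cmd():
	    for each PAGEMAP_WALK_SIZE chunk:
	        walk_page_range()
	            pagemap_scan_output():          /* per PTE / per THP range */
	                if the new range merges with p->cur: extend p->cur
	                else: flush p->cur into p->vec[p->vec_index++], restart p->cur
	        copy_to_user(&vec[vec_index], p->vec, p->vec_index entries)
	    pagemap_scan_deposit():                 /* after the last chunk */
	        copy the still-pending p->cur into vec[], return vec_index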

> 
>> +                       if (cur->len) {
>> +                               memcpy(&p->vec[p->vec_index], cur,
>> +                                      sizeof(struct page_region));
>> +                               p->vec_index++;
>> +                       }
>> +
>> +                       cur->start = addr;
>> +                       cur->len = n_pages;
>> +                       cur->bitmap = bitmap;
>> +                       p->found_pages += n_pages;
>> +               } else {
>> +                       return -ENOSPC;
>> +               }
>> +       }
>> +
>> +       return 0;
>> +}
> [...]
> 
>> +static int pagemap_scan_deposit(struct pagemap_scan_private *p,
>> +                               struct page_region __user *vec,
>> +                               unsigned long *vec_index)
>> +{
>> +       struct page_region *cur = &p->cur;
>> +
>> +       if (cur->len) {
> 
> if (!cur->len)
>   return 0;
Sure.

> 
>> +               if (copy_to_user(&vec[*vec_index], cur,
>> +                                sizeof(struct page_region)))
>> +                       return -EFAULT;
>> +
>> +               p->vec_index++;
>> +               (*vec_index)++;
>> +       }
>> +
>> +       return 0;
>> +}
> 
>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> +                                 unsigned long end, struct mm_walk *walk)
>> +{
>> +       struct pagemap_scan_private *p = walk->private;
>> +       struct vm_area_struct *vma = walk->vma;
>> +       bool is_writ, is_file, is_pres, is_swap;
>> +       unsigned long addr = end;
>> +       spinlock_t *ptl;
>> +       int ret = 0;
>> +       pte_t *pte;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> 
> Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return
> NULL if transparent hugepages are not compiled in. OTOH I see
> BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this
> case), so that would need to be worked around first.
I'd gotten the build error [1], so I'd added these. I've tested it again with
the same config; we don't need these #ifdefs now. I'll remove them.

[1] https://lore.kernel.org/all/202211120107.cYLiq2cH-lkp@intel.com

> 
>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>> +       if (ptl) {
>> +               unsigned long n_pages;
>> +
>> +               is_writ = !is_pmd_uffd_wp(*pmd);
> 
> `is_written`?
I'd kept it is_writ to match the pattern of is_file, is_pres and is_swap.
I'll update it to is_written and is_pres to is_present.

> 
>> +               /*
>> +                * Break huge page into small pages if operation needs to be
>> +                * performed is on a portion of the huge page.
>> +                */
>> +               if (is_writ && PM_SCAN_OP_IS_WP(p) &&
>> +                   (end - start < HPAGE_SIZE)) {
>> +                       spin_unlock(ptl);
>> +
>> +                       split_huge_pmd(vma, pmd, start);
>> +                       goto process_smaller_pages;
>> +               }
>> +
>> +               n_pages = (end - start)/PAGE_SIZE;
>> +               if (p->max_pages &&
>> +                   p->found_pages + n_pages >= p->max_pages)
> 
> Nit: greater-than is also correct and avoids no-op assignment.
Ohh... I'll update.

> 
>> +                       n_pages = p->max_pages - p->found_pages;
>> +
>> +               ret = pagemap_scan_output(is_writ, vma->vm_file,
>> +                                         pmd_present(*pmd), is_swap_pmd(*pmd),
>> +                                         p, start, n_pages);
>> +               spin_unlock(ptl);
> 
> if (ret || !is_written)
>   return ret;
> 
> This will avoid those tests in the following if().
Done.

> 
>> +
>> +               if (!ret && is_writ && PM_SCAN_OP_IS_WP(p) &&
>> +                   uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true) < 0)
>> +                       ret = -EINVAL;
>> +
>> +               return ret;
> 
> After above early returns, this will be always `return 0;`.
Sure.

> 
>> +       }
>> +process_smaller_pages:
>> +       if (pmd_trans_unstable(pmd))
>> +               return 0;
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>> +       for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
> 
> The `!ret` can be removed if the EINVAL case was to `break` by itself.
Sure. Will do.

> 
>> +               pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> +
>> +               is_writ = !is_pte_uffd_wp(*pte);
>> +               is_file = vma->vm_file;
>> +               is_pres = pte_present(*pte);
>> +               is_swap = is_swap_pte(*pte);
>> +
>> +               pte_unmap_unlock(pte, ptl);
>> +
>> +               ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
>> +                                         p, addr, 1);
>> +               if (ret)
>> +                       break;
>> +
>> +               if (PM_SCAN_OP_IS_WP(p) && is_writ &&
>> +                   uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
>> +                       ret = -EINVAL;
>> +       }
>> +
>> +       cond_resched();
>> +       return ret;
>> +}
>> +
>> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
>> +                                int depth, struct mm_walk *walk)
>> +{
>> +       struct pagemap_scan_private *p = walk->private;
>> +       struct vm_area_struct *vma = walk->vma;
>> +       unsigned long n_pages;
>> +       int ret = 0;
>> +
>> +       if (vma) {
> 
> if (!vma) return 0;
Will do.

> 
>> +               n_pages = (end - addr)/PAGE_SIZE;
>> +               if (p->max_pages &&
>> +                   p->found_pages + n_pages >= p->max_pages)
>> +                       n_pages = p->max_pages - p->found_pages;
>> +
>> +               ret = pagemap_scan_output(false, vma->vm_file, false, false, p,
>> +                                         addr, n_pages);
>> +       }
>> +
>> +       return ret;
>> +}
> 
> 
>> +/* No hugetlb support is present. */
> 
> "FIXME: hugetlb support is not implemented."? (There seems to be no
> #ifdef CONFIG_HUGETLB or similar, so I guess the comment is about the
> current implementation.)
I'm working on adding hugetlb support. I'll remove this comment.

> 
>> +static const struct mm_walk_ops pagemap_scan_ops = {
>> +       .test_walk = pagemap_scan_test_walk,
>> +       .pmd_entry = pagemap_scan_pmd_entry,
>> +       .pte_hole = pagemap_scan_pte_hole,
>> +};
>> +
>> +static bool pagemap_scan_args_valid(struct pm_scan_arg *arg,
>> +                                   struct page_region __user *vec,
>> +                                   unsigned long start)
>> +{
>> +       /* Detect illegal size, flags and masks */
>> +       if (arg->size != sizeof(struct pm_scan_arg))
>> +               return false;
>> +       if (arg->flags & ~PM_SCAN_OPS)
>> +               return false;
>> +       if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
>> +            arg->return_mask) & ~PM_SCAN_BITS_ALL)
>> +               return false;
> 
>> +       if (!arg->required_mask && !arg->anyof_mask &&
>> +           !arg->excluded_mask)
>> +               return false;
> 
> Is there an assumption in the code that those checks are needed? I'd
> expect that no selection criteria makes a valid page set?
In my view, a selection criterion must be specified for the ioctl to work. If
there is no criterion, the user should go and read the pagemap file directly.
So the assumption is that at least one selection criterion must be specified.

> 
>> +       if (!arg->return_mask)
>> +               return false;
>> +
>> +       /* Validate memory ranges */
>> +       if (!(arg->flags & PM_SCAN_OP_GET))
>> +               return false;
>> +       if (!arg->vec)
>> +               return false;
>> +       if (arg->vec_len == 0)
>> +               return false;
> 
>> +       if (!access_ok((void __user *)vec,
>> +                      arg->vec_len * sizeof(struct page_region)))
>> +               return false;
> 
> Is there a provision that userspace threads are all blocked from
> manipulating mmaps during this ioctl()? If not, this is a TOCTOU bug
> and the writes should be checked each time as another userspace thread
> could remap the memory while the ioctl() is working.
The mincore() syscall does it the same way: it checks the validity only at
the start. What provision should I add? Isn't it obvious that the user
should not remap such memory?

> Anyway, the
> return should be EFAULT for this case.
I'll update.

> 
>> +       if (!IS_ALIGNED(start, PAGE_SIZE))
>> +               return false;
>> +       if (!access_ok((void __user *)start, arg->len))
>> +               return false;
> 
> This I guess wants to check if the range to be scanned is mapped -
> but isn't this what the ioctl() should do during the scan? (But, also
> see above.)
No, start represents the memory which the user wants to watch. The user must
allocate this memory first and then pass its address to this ioctl to find
out the per-page flags.

> 
>> +       if (PM_SCAN_OP_IS_WP(arg)) {
> 
> if (!...IS_WP) return true;
I liked it this way. Anyway, I'll update.

> 
>> +               if (arg->required_mask & PM_SCAN_NON_WT_BITS)
>> +                       return false;
>> +               if (arg->anyof_mask & PM_SCAN_NON_WT_BITS)
>> +                       return false;
>> +               if (arg->excluded_mask & PM_SCAN_NON_WT_BITS)
>> +                       return false;
> 
> Please see: pagemap_scan_is_wt_required comment. Also, it seems this
> constant is used only here, so ~PAGE_IS_WRITTEN might be enough?
Yup, I'll update.

> 
> [...]
>> +static long do_pagemap_cmd(struct mm_struct *mm, struct pm_scan_arg *arg)
>> +{
>> +       unsigned long start, end, walk_start, walk_end;
>> +       unsigned long empty_slots, vec_index = 0;
>> +       struct page_region __user *vec;
>> +       struct pagemap_scan_private p;
>> +       int ret = 0;
>> +
>> +       start = (unsigned long)untagged_addr(arg->start);
>> +       vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
>> +
>> +       if (!pagemap_scan_args_valid(arg, vec, start))
>> +               return -EINVAL;
>> +
>> +       end = start + arg->len;
>> +       p.max_pages = arg->max_pages;
>> +       p.found_pages = 0;
>> +       p.flags = arg->flags;
>> +       p.required_mask = arg->required_mask;
>> +       p.anyof_mask = arg->anyof_mask;
>> +       p.excluded_mask = arg->excluded_mask;
>> +       p.return_mask = arg->return_mask;
>> +       p.cur.len = 0;
>> +       p.vec = NULL;
>> +       p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>> +
>> +       /*
>> +        * Allocate smaller buffer to get output from inside the page walk
>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>> +        * we want to return output to user in compact form where no two
>> +        * consecutive regions should be continuous and have the same flags.
>> +        * So store the latest element in p.cur between different walks and
>> +        * store the p.cur at the end of the walk to the user buffer.
>> +        */
>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>> +                             GFP_KERNEL);
>> +       if (!p.vec)
>> +               return -ENOMEM;
>> +
>> +       walk_start = walk_end = start;
>> +       while (walk_end < end) {
>> +               p.vec_index = 0;
>> +
>> +               empty_slots = arg->vec_len - vec_index;
>> +               p.vec_len = min(p.vec_len, empty_slots);
>> +
>> +               walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
>> +               if (walk_end > end)
>> +                       walk_end = end;
>> +
>> +               mmap_read_lock(mm);
>> +               ret = walk_page_range(mm, walk_start, walk_end,
>> +                                     &pagemap_scan_ops, &p);
>> +               mmap_read_unlock(mm);
>> +
>> +               if (!(!ret || ret == -ENOSPC))
> 
> if (ret && ret != -ENOSPC)
Sorry, I should have thought of this one. Thanks.

> 
>> +                       goto free_data;
>> +
>> +               walk_start = walk_end;
>> +               if (p.vec_index) {
>> +                       if (copy_to_user(&vec[vec_index], p.vec,
>> +                                        p.vec_index *
>> +                                        sizeof(struct page_region))) {
>> +                               ret = -EFAULT;
>> +                               goto free_data;
>> +                       }
>> +                       vec_index += p.vec_index;
>> +               }
>> +       }
>> +       ret = pagemap_scan_deposit(&p, vec, &vec_index);
>> +       if (!ret)
>> +               ret = vec_index;
>> +free_data:
>> +       kfree(p.vec);
>> +
>> +       return ret;
>> +}
>> +
>> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
>> +                              unsigned long arg)
>> +{
>> +       struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
>> +       struct mm_struct *mm = file->private_data;
>> +       struct pm_scan_arg argument;
>> +
>> +       if (cmd == PAGEMAP_SCAN) {
> 
> switch() for easier expansion later?
I had a switch here once. I'll add it back.

> 
>> +               if (copy_from_user(&argument, uarg,
>> +                                  sizeof(struct pm_scan_arg)))
> 
> sizeof(*argument);
> 
> Could you push this to do_pagemap_cmd()? In case this file gets more
> ioctl() commands there won't be need to add more command-specific
> structures in this function.
Sure, I'll update.

> 
>> +                       return -EFAULT;
>> +               return do_pagemap_cmd(mm, &argument);
>> +       }
>> +
>> +       return -EINVAL;
>> +}
>> +
>>  const struct file_operations proc_pagemap_operations = {
>>         .llseek         = mem_lseek, /* borrow this */
>>         .read           = pagemap_read,
>>         .open           = pagemap_open,
>>         .release        = pagemap_release,
>> +       .unlocked_ioctl = pagemap_scan_ioctl,
>> +       .compat_ioctl   = pagemap_scan_ioctl,
> 
> Is this correct? Would the code need a different userspace pointer
> handling for 32-bit userspace on 64-bit kernel?
Yeah, it is needed for 32-bit applications to run on a 64-bit kernel.

> 
>>  };
>>  #endif /* CONFIG_PROC_PAGE_MONITOR */
Michał Mirosław March 16, 2023, 9:28 p.m. UTC | #7
On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
>
> Hi,
>
> Thank you so much for reviewing.
>
> On 3/13/23 9:02 PM, Michał Mirosław wrote:
> > On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
[...]
> >> --- a/fs/proc/task_mmu.c
> >> +++ b/fs/proc/task_mmu.c
[...]
> >> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> >> +                              struct pagemap_scan_private *p,
> >> +                              unsigned long addr, unsigned int n_pages)
> >> +{
> >> +       unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
> >> +       struct page_region *cur = &p->cur;
> >> +       bool cpy = true;
> >> +
> >> +       if (p->max_pages && (p->found_pages == p->max_pages))
> >> +               return -ENOSPC;
> >> +
> >> +       if (!n_pages)
> >> +               return -EINVAL;
> >> +
> >> +       if (p->required_mask)
> >> +               cpy = ((p->required_mask & bitmap) == p->required_mask);
> >> +       if (cpy && p->anyof_mask)
> >> +               cpy = (p->anyof_mask & bitmap);
> >> +       if (cpy && p->excluded_mask)
> >> +               cpy = !(p->excluded_mask & bitmap);
> >
> > Since the rest of the code is executed only when `cpy` is true, this
> > could just return early for easier understanding.
> Hmm... I'll do the following:
>         if (!cpy || !bitmap)
>                 return 0;
> > BTW, some of the tests are redundant. Eg: if required_mask == 0, then
> > `required_mask & x == required_mask` will always hold. Same for
> > `excluded_mask & x == 0`.
> Correct. This is why I'm checking if required_mask is set and then
> comparing bitmap with it. required_mask may be 0 if not set. This if will
> ignore the subsequent check.
>
>         if (p->required_mask)
>                 cpy = ((p->required_mask & bitmap) == p->required_mask);
>
> I don't see any redundancy here. Please let me know otherwise?
[...]
> >> +       if (cpy && bitmap) {
> >
> > Assuming early returns on `!cpy` are done earlier:
> >
> > if (!bitmap)
> >   return 0;
> I've posted condition above which would better suit here.
[...]

Since the `cpy` condition is updated and passed to each new branch
(IOW: after setting cpy = 0 for whatever reason all the further code
is skipped) you can drop the variable and do early returns everywhere.
E.g.:

if ((bitmap & p->required_mask) != p->required_mask)
  return 0;
if (p->anyof_mask && !(bitmap & p->anyof_mask))
  return 0;
if (bitmap & p->excluded_mask)
  return 0;
if (!bitmap)
  return 0;

Also you can take the "special" effect of masking with zero to be
always zero (and in C - false) to avoid testing for an empty mask
separately in most cases.

> Just for my knowledge, what does "Nit" signify if a comment is marked with it?

A low priority / cosmetic item that you might consider ignoring if a
fix is too expensive or controversial.

>>> +               if ((cur->len) && (cur->bitmap == bitmap) &&
>>> +                   (cur->start + cur->len * PAGE_SIZE == addr)) {
>>
>> I'd recommend removing the extra parentheses as they make the code
>> less readable for me (too many parentheses to match visually).
> I'll remove the parentheses.

[...]
>> The `cur->len` test seems redundant: is it possible to have
>> `cur->start == addr` in that case (I guess it would have to get
>> `n_pages == 0` in an earlier invocation)?
> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is
> essential to check the validity from cur->len before performing other
> checks. Also cur->start can never be equal to addr as we are walking over
> page addressing in serial manner. We want to see here if the current
> address matches the previous data by finding the ending address of last
> stored data (cur->start + cur->len * PAGE_SIZE).

If cur->len == 0, then it doesn't matter if it gets merged or not - it
can be filtered out during the flush (see below).

[...]
> >> +               } else if ((!p->vec_index) ||
> >> +                          ((p->vec_index + 1) < p->vec_len)) {
> >
> > Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
> >
> > if (vec_index >= p->vec_len)
> >     return -ENOSPC;
>
> No, it'll not work. Lets leave it as it is. :)
>
> It has gotten somewhat complex, but I don't have any other way to make it
> simpler which works. First note the following points:
> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in
> kernel (p->vec).
> 2) We also want to merge the consecutive pages with the same flags into one
> struct page_region. p->vec of current walk may merge with next walk. So we
> cannot write to user memory until we find the results of the next walk.
>
> So most recent data is put into p->cur. When non-intersecting or mergeable
> data is found, we move p->cur to p->vec[p->index] inside the page walk.
> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all
> the walks are over. We move the p->cur to arg->vec. It completes the data
> transfer to user buffer.
[...]
> I'm so sorry that it has gotten this much complex. It was way simpler when
> we were walking over all the memory in one go. But then we needed an
> unbounded memory from the kernel which we don't want.
[...]

I've gone through and hopefully understood the code. I'm not sure this
needs to be so complicated: when traversing a single PMD you can
always copy p->cur to p->vec[p->vec_index++] because you can have at
most pages_per_PMD non-merges (in the worst case the last page always
is left in p->cur and whole p->vec is used). After each PMD p->vec
needs a flush if p->vec_index > 0, skipping the dummy entry at front
(len == 0; if present). (This is mostly how it is implemented now, but
I propose to remove the "overflow" check and do the starting guard
removal only every PMD.)
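
To make that concrete, a rough, untested sketch of the per-page step
(pagemap_scan_push() is a name I just made up; filtering and the
max_pages limit are left out):

static void pagemap_scan_push(struct pagemap_scan_private *p,
                              unsigned long addr, unsigned long bitmap,
                              unsigned int n_pages)
{
        struct page_region *cur = &p->cur;

        if (cur->bitmap == bitmap &&
            cur->start + cur->len * PAGE_SIZE == addr) {
                /* extend the current run */
                cur->len += n_pages;
                return;
        }

        /*
         * Start a new run. Pushing unconditionally is fine inside one
         * PMD: there are at most pages_per_PMD non-merges, so p->vec
         * (sized for one PMD worth of pages) cannot overflow. A len == 0
         * guard entry may end up at p->vec[0]; drop it at flush time.
         */
        p->vec[p->vec_index++] = *cur;
        cur->start = addr;
        cur->len = n_pages;
        cur->bitmap = bitmap;
}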

BTW, the pagemap_scan_deposit() got me a bit confused: it seems that
it is just a copy of the p->vec flush to userspace. Please either use
it for both p->vec and p->cur flushing or inline.

BTW#2, I think the ENOSPC return in pagemap_scan_output() should
happen later - only if the pages would match and that caused the count
to exceed the limit. For THP n_pages should be truncated to the limit
(and ENOSPC returned right away) only after the pages were verified to
match.
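
Roughly like this (untested; pagemap_scan_emit() here is just a
placeholder for the existing merge-into-p->cur logic):

static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
                               struct pagemap_scan_private *p,
                               unsigned long addr, unsigned int n_pages)
{
        unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
        int ret = 0;

        /* filter first ... */
        if ((bitmap & p->required_mask) != p->required_mask)
                return 0;
        if (p->anyof_mask && !(bitmap & p->anyof_mask))
                return 0;
        if (bitmap & p->excluded_mask)
                return 0;
        bitmap &= p->return_mask;
        if (!bitmap)
                return 0;

        /* ... then apply the page limit; this also truncates a THP */
        if (p->max_pages && p->found_pages + n_pages >= p->max_pages) {
                n_pages = p->max_pages - p->found_pages;
                ret = -ENOSPC;  /* emit what matched, then stop the walk */
        }
        if (n_pages) {
                pagemap_scan_emit(p, addr, bitmap, n_pages);
                p->found_pages += n_pages;
        }

        return ret;
}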

[...]
> >> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> >> +                                 unsigned long end, struct mm_walk *walk)
> >> +{
> >> +       struct pagemap_scan_private *p = walk->private;
> >> +       struct vm_area_struct *vma = walk->vma;
> >> +       bool is_writ, is_file, is_pres, is_swap;
> >> +       unsigned long addr = end;
> >> +       spinlock_t *ptl;
> >> +       int ret = 0;
> >> +       pte_t *pte;
> >> +
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >
> > Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return
> > NULL if transparent hugepages are not compiled in. OTOH I see
> > BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this
> > case), so that would need to be worked around first.
> I'd got the build error [1]. So I'd added these. I've tested it again with
> the same config. We don't need these #ifdef now. I'll remove these.

I mean that there are cases like [1] that actually need the #ifdef at
least to wrap HPAGE_SIZE usage. But maybe just this constant can be
wrapped so that we keep the code always compile-tested?

[1] https://elixir.bootlin.com/linux/v6.3-rc2/source/arch/mips/include/asm/page.h#L66

[...]
> >> +       if (!arg->required_mask && !arg->anyof_mask &&
> >> +           !arg->excluded_mask)
> >> +               return false;
> >
> > Is there an assumption in the code that those checks are needed? I'd
> > expect that no selection criteria makes a valid page set?
> In my view, selection criterion must be specified for the ioctl to work. If
> there is no criterio, user should go and read pagemap file directly. So the
> assumption is that at least one selection criterion must be specified.

Yes. I'm not sure we need to prevent multiple ways of doing the same
thing. But doesn't pagemap reading lack the range aggregation feature?

[...]
> >> +       if (!access_ok((void __user *)vec,
> >> +                      arg->vec_len * sizeof(struct page_region)))
> >> +               return false;
> >
> > Is there a provision that userspace threads are all blocked from
> > manipulating mmaps during this ioctl()? If not, this is a TOCTOU bug
> > and the writes should be checked each time as another userspace thread
> > could remap the memory while the ioctl() is working.
> mincore() syscall is doing in the same way. It checks the validity in the
> start only. What provision should I add? Isn't it obvious that the user
> should not remap such memory?

On the second look, I think the code already checks that while doing
copy_to_user(), so this check is redundant and can be removed.

> >
> >> +       if (!IS_ALIGNED(start, PAGE_SIZE))
> >> +               return false;
> >> +       if (!access_ok((void __user *)start, arg->len))
> >> +               return false;
> >
> > This I guess wants to check if the range to be scanned is mapped -
> > but isn't this what the ioctl() should do during the scan? (But, also
> > see above.)
> No, start represents the memory which the user wants to watch. User must
> allocate this memory first and then pass the address to this ioctl to find
> out the flags per page.

From:
+ * struct pm_scan_arg - Pagemap ioctl argument
+ * @size:              Size of the structure
+ * @flags:             Flags for the IOCTL
+ * @start:             Starting address of the region
+ * @len:               Length of the region (All the pages in this
length are included)
...

I'd expect the `start` field to just be a virtual address to start
scanning from. Does it need to be mapped? For CRIU usecase I'd start
with "start = 0" to find out all mappings, but 0 is (always) not
mapped. Is this supposed to only work on already discovered page
ranges? Anyway, I'd expect the code to be tolerant of another
thread changing the mappings while this ioctl() is walking the page
tables - is it so? If yes, then this check serves at most as an
optimization used only for an invalid call.

> >>  const struct file_operations proc_pagemap_operations = {
> >>         .llseek         = mem_lseek, /* borrow this */
> >>         .read           = pagemap_read,
> >>         .open           = pagemap_open,
> >>         .release        = pagemap_release,
> >> +       .unlocked_ioctl = pagemap_scan_ioctl,
> >> +       .compat_ioctl   = pagemap_scan_ioctl,
> >
> > Is this correct? Would the code need a different userspace pointer
> > handling for 32-bit userspace on 64-bit kernel?
> Yeah, it is needed for 32-bit application to run on 64-bit kernel.

I mean is using the same function for both entry points correct? Don't
the pointers to userspace memory (e.g. arg->vec) need to be mapped for
32-bit process?

Best Regards

Michał Mirosław
Muhammad Usama Anjum March 17, 2023, 12:43 p.m. UTC | #8
On 3/17/23 2:28 AM, Michał Mirosław wrote:
> On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>>
>> Hi,
>>
>> Thank you so much for reviewing.
>>
>> On 3/13/23 9:02 PM, Michał Mirosław wrote:
>>> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
> [...]
>>>> --- a/fs/proc/task_mmu.c
>>>> +++ b/fs/proc/task_mmu.c
> [...]
>>>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>>>> +                              struct pagemap_scan_private *p,
>>>> +                              unsigned long addr, unsigned int n_pages)
>>>> +{
>>>> +       unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
>>>> +       struct page_region *cur = &p->cur;
>>>> +       bool cpy = true;
>>>> +
>>>> +       if (p->max_pages && (p->found_pages == p->max_pages))
>>>> +               return -ENOSPC;
>>>> +
>>>> +       if (!n_pages)
>>>> +               return -EINVAL;
>>>> +
>>>> +       if (p->required_mask)
>>>> +               cpy = ((p->required_mask & bitmap) == p->required_mask);
>>>> +       if (cpy && p->anyof_mask)
>>>> +               cpy = (p->anyof_mask & bitmap);
>>>> +       if (cpy && p->excluded_mask)
>>>> +               cpy = !(p->excluded_mask & bitmap);
>>>
>>> Since the rest of the code is executed only when `cpy` is true, this
>>> could just return early for easier understanding.
>> Hmm... I'll do the following:
>>         if (!cpy || !bitmap)
>>                 return 0;
>>> BTW, some of the tests are redundant. Eg: if required_mask == 0, then
>>> `required_mask & x == required_mask` will always hold. Same for
>>> `excluded_mask & x == 0`.
>> Correct. This is why I'm checking if required_mask is set and then
>> comparing bitmap with it. required_mask may be 0 if not set. This if will
>> ignore the subsequent check.
>>
>>         if (p->required_mask)
>>                 cpy = ((p->required_mask & bitmap) == p->required_mask);
>>
>> I don't see any redundancy here. Please let me know otherwise?
> [...]
>>>> +       if (cpy && bitmap) {
>>>
>>> Assuming early returns on `!cpy` are done earlier:
>>>
>>> if (!bitmap)
>>>   return 0;
>> I've posted condition above which would better suit here.
> [...]
> 
> Since the `cpy` condition is updated and passed to each new branch
> (IOW: after setting cpy = 0 for whatever reason all the further code
> is skipped) you can drop the variable and do early returns everywhere.
> E.g.:
> 
> if ((bitmap & p->required_mask) != p->required_mask)
>   return 0;
> if (p->anyof_mask && !(bitmap & p->anyof_mask))
>   return 0;
> if (bitmap & p->excluded_mask)
>   return 0;
> if (!bitmap)
>   return 0;
Clever. Will do.

> 
> Also you can take the "special" effect of masking with zero to be
> always zero (and in C - false) to avoid testing for an empty mask
> separately in most cases.
Done.

> 
>> Just for my knowledge, what does "Nit" signify if a comment is marked with it?
> 
> A low priority / cosmetic item that you might consider ignoring if a
> fix is too expensive or controversial.
> 
>>>> +               if ((cur->len) && (cur->bitmap == bitmap) &&
>>>> +                   (cur->start + cur->len * PAGE_SIZE == addr)) {
>>>
>>> I'd recommend removing the extra parentheses as they make the code
>>> less readable for me (too many parentheses to match visually).
>> I'll remove the parentheses.
> 
> [...]
>>> The `cur->len` test seems redundant: is it possible to have
>>> `cur->start == addr` in that case (I guess it would have to get
>>> `n_pages == 0` in an earlier invocation)?
>> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is
>> essential to check the validity from cur->len before performing other
>> checks. Also cur->start can never be equal to addr as we are walking over
>> page addressing in serial manner. We want to see here if the current
>> address matches the previous data by finding the ending address of last
>> stored data (cur->start + cur->len * PAGE_SIZE).
> 
> If cur->len == 0, then it doesn't matter if it gets merged or not - it
> can be filtered out during the flush (see below).

> 
> [...]
>>>> +               } else if ((!p->vec_index) ||
>>>> +                          ((p->vec_index + 1) < p->vec_len)) {
>>>
>>> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
>>>
>>> if (vec_index >= p->vec_len)
>>>     return -ENOSPC;
>>
>> No, it'll not work. Lets leave it as it is. :)
>>
>> It has gotten somewhat complex, but I don't have any other way to make it
>> simpler which works. First note the following points:
>> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in
>> kernel (p->vec).
>> 2) We also want to merge the consecutive pages with the same flags into one
>> struct page_region. p->vec of current walk may merge with next walk. So we
>> cannot write to user memory until we find the results of the next walk.
>>
>> So most recent data is put into p->cur. When non-intersecting or mergeable
>> data is found, we move p->cur to p->vec[p->index] inside the page walk.
>> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all
>> the walks are over. We move the p->cur to arg->vec. It completes the data
>> transfer to user buffer.
> [...]
>> I'm so sorry that it has gotten this much complex. It was way simpler when
>> we were walking over all the memory in one go. But then we needed an
>> unbounded memory from the kernel which we don't want.
> [...]
> 
> I've gone through and hopefully understood the code. I'm not sure this
> needs to be so complicated: when traversing a single PMD you can
> always copy p->cur to p->vec[p->vec_index++] because you can have at
> most pages_per_PMD non-merges (in the worst case the last page always
> is left in p->cur and whole p->vec is used). After each PMD p->vec
> needs a flush if p->vec_index > 0, skipping the dummy entry at front
> (len == 0; if present). (This is mostly how it is implemented now, but
> I propose to remove the "overflow" check and do the starting guard
> removal only every PMD.)
Sorry, I'm unable to understand where the guard should be removed.

> 
> BTW, the pagemap_scan_deposit() got me a bit confused: it seems that
> it is just a copy of the p->vec flush to userspace. Please either use
> it for both p->vec and p->cur flushing or inline.
I can inline this function if you say so, now that you understand all the
logic. I don't see what else can be done here.

> 
> BTW#2, I think the ENOSPC return in pagemap_scan_output() should
> happen later - only if the pages would match and that caused the count
> to exceed the limit. For THP n_pages should be truncated to the limit
> (and ENOSPC returned right away) only after the pages were verified to
> match.
We have 2 counters here:
* p->max_pages can optionally be set to find only N pages of interest;
p->found_pages counts them, and we need to return early once this page
limit has been reached.
* p->vec_index keeps track of the output buffer array size.

> 
> [...]
>>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>>> +                                 unsigned long end, struct mm_walk *walk)
>>>> +{
>>>> +       struct pagemap_scan_private *p = walk->private;
>>>> +       struct vm_area_struct *vma = walk->vma;
>>>> +       bool is_writ, is_file, is_pres, is_swap;
>>>> +       unsigned long addr = end;
>>>> +       spinlock_t *ptl;
>>>> +       int ret = 0;
>>>> +       pte_t *pte;
>>>> +
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>
>>> Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return
>>> NULL if transparent hugepages are not compiled in. OTOH I see
>>> BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this
>>> case), so that would need to be worked around first.
>> I'd got the build error [1]. So I'd added these. I've tested it again with
>> the same config. We don't need these #ifdef now. I'll remove these.
> 
> I mean that there are cases like [1] that actually need the #ifdef at
> least to wrap HPAGE_SIZE usage. But maybe just this constant can be
> wrapped so that we keep the code always compile-tested?
Some architectures define HPAGE_SIZE even if the huge page config isn't
enabled and some don't. Let's just keep the #ifdef
CONFIG_TRANSPARENT_HUGEPAGE, as similar code in this same file uses the
same #ifdef.

> 
> [1] https://elixir.bootlin.com/linux/v6.3-rc2/source/arch/mips/include/asm/page.h#L66
> 
> [...]
>>>> +       if (!arg->required_mask && !arg->anyof_mask &&
>>>> +           !arg->excluded_mask)
>>>> +               return false;
>>>
>>> Is there an assumption in the code that those checks are needed? I'd
>>> expect that no selection criteria makes a valid page set?
>> In my view, selection criterion must be specified for the ioctl to work. If
>> there is no criterio, user should go and read pagemap file directly. So the
>> assumption is that at least one selection criterion must be specified.
> 
> Yes. I'm not sure we need to prevent multiple ways of doing the same
> thing. But doesn't pagemap reading lack the range aggregation feature?
Yeah, correct. But note that we are supporting only 4 selected flags in
this ioctl, not all pagemap flags. So it is useful only for users who
depend on these 4 flags. Our pagemap_ioctl interface is not generic
enough to cater to everyone; it is specific, and we are only adding the
cases which are of interest to us. So if someone wants range aggregation
from pagemap_ioctl, he'll need to add that flag to the IOCTL first. Once
that support is added, he can specify the selection criteria etc.

> 
> [...]
>>>> +       if (!access_ok((void __user *)vec,
>>>> +                      arg->vec_len * sizeof(struct page_region)))
>>>> +               return false;
>>>
>>> Is there a provision that userspace threads are all blocked from
>>> manipulating mmaps during this ioctl()? If not, this is a TOCTOU bug
>>> and the writes should be checked each time as another userspace thread
>>> could remap the memory while the ioctl() is working.
>> mincore() syscall is doing in the same way. It checks the validity in the
>> start only. What provision should I add? Isn't it obvious that the user
>> should not remap such memory?
> 
> On the second look, I think the code already checks that while doing
> copy_to_user(), so this check is redundant and can be removed.
I'll remove.

> 
>>>
>>>> +       if (!IS_ALIGNED(start, PAGE_SIZE))
>>>> +               return false;
>>>> +       if (!access_ok((void __user *)start, arg->len))
>>>> +               return false;
>>>
>>> This I guess wants to check if the range to be scanned is mapped -
>>> but isn't this what the ioctl() should do during the scan? (But, also
>>> see above.)
>> No, start represents the memory which the user wants to watch. User must
>> allocate this memory first and then pass the address to this ioctl to find
>> out the flags per page.
> 
> From:
> + * struct pm_scan_arg - Pagemap ioctl argument
> + * @size:              Size of the structure
> + * @flags:             Flags for the IOCTL
> + * @start:             Starting address of the region
> + * @len:               Length of the region (All the pages in this
> length are included)
> ...
> 
> I'd expect the `start` field to just be a virtual address to start
> scanning from. Does it need to be mapped? For CRIU usecase I'd start
> with "start = 0" to find out all mappings, but 0 is (always) not
> mapped. Is this supposed to only work on already discovered page
> ranges? Anyway, I'd expect the code should be tolerant of another
> thread changing the mappings while this ioctl() is walking the page
> tables - is it so? If yes, then this check serves at most as an
> optimization used only for an invalid call.
Ohh... Ignore my previous comment. Yeah, any valid memory range can be
passed to view the page flags. This check just verifies if the memory range
is valid.

> 
>>>>  const struct file_operations proc_pagemap_operations = {
>>>>         .llseek         = mem_lseek, /* borrow this */
>>>>         .read           = pagemap_read,
>>>>         .open           = pagemap_open,
>>>>         .release        = pagemap_release,
>>>> +       .unlocked_ioctl = pagemap_scan_ioctl,
>>>> +       .compat_ioctl   = pagemap_scan_ioctl,
>>>
>>> Is this correct? Would the code need a different userspace pointer
>>> handling for 32-bit userspace on 64-bit kernel?
>> Yeah, it is needed for 32-bit application to run on 64-bit kernel.
> 
> I mean is using the same function for both entry points correct? Don't
> the pointers to userspace memory (e.g. arg->vec) need to be mapped for
> 32-bit process?
No, every member in our argument structure is 64-bit, which keeps the
memory layout the same. So we don't need any specific conversion here.
(Even if we had a 32-bit variable, we would just need to make sure that
the layout remains the same in memory.)
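
For illustration only (not part of the patch), this property could even
be asserted at compile time:

/* Layout is identical for 32- and 64-bit userspace: every field is a
 * fixed-width __u64, so there are no pointers, longs or padding. */
static_assert(sizeof(struct pm_scan_arg) == 11 * sizeof(__u64));
static_assert(sizeof(struct page_region) == 3 * sizeof(__u64));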

Thanks,
Usama

> 
> Best Regards
> 
> Michał Mirosław
Michał Mirosław March 17, 2023, 2:15 p.m. UTC | #9
On Fri, 17 Mar 2023 at 13:44, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 3/17/23 2:28 AM, Michał Mirosław wrote:
> > On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >> On 3/13/23 9:02 PM, Michał Mirosław wrote:
> >>> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
> > [...]
> >>>> --- a/fs/proc/task_mmu.c
> >>>> +++ b/fs/proc/task_mmu.c
> > [...]
> >>>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> > [...]
> >>> The `cur->len` test seems redundant: is it possible to have
> >>> `cur->start == addr` in that case (I guess it would have to get
> >>> `n_pages == 0` in an earlier invocation)?
> >> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is
> >> essential to check the validity from cur->len before performing other
> >> checks. Also cur->start can never be equal to addr as we are walking over
> >> page addressing in serial manner. We want to see here if the current
> >> address matches the previous data by finding the ending address of last
> >> stored data (cur->start + cur->len * PAGE_SIZE).
> >
> > If cur->len == 0, then it doesn't matter if it gets merged or not - it
> > can be filtered out during the flush (see below).
> > [...]
> >>>> +               } else if ((!p->vec_index) ||
> >>>> +                          ((p->vec_index + 1) < p->vec_len)) {
> >>>
> >>> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
> >>>
> >>> if (vec_index >= p->vec_len)
> >>>     return -ENOSPC;
> >>
> >> No, it'll not work. Lets leave it as it is. :)
> >>
> >> It has gotten somewhat complex, but I don't have any other way to make it
> >> simpler which works. First note the following points:
> >> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in
> >> kernel (p->vec).
> >> 2) We also want to merge the consecutive pages with the same flags into one
> >> struct page_region. p->vec of current walk may merge with next walk. So we
> >> cannot write to user memory until we find the results of the next walk.
> >>
> >> So most recent data is put into p->cur. When non-intersecting or mergeable
> >> data is found, we move p->cur to p->vec[p->index] inside the page walk.
> >> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all
> >> the walks are over. We move the p->cur to arg->vec. It completes the data
> >> transfer to user buffer.
> > [...]
> >> I'm so sorry that it has gotten this much complex. It was way simpler when
> >> we were walking over all the memory in one go. But then we needed an
> >> unbounded memory from the kernel which we don't want.
> > [...]
> >
> > I've gone through and hopefully understood the code. I'm not sure this
> > needs to be so complicated: when traversing a single PMD you can
> > always copy p->cur to p->vec[p->vec_index++] because you can have at
> > most pages_per_PMD non-merges (in the worst case the last page always
> > is left in p->cur and whole p->vec is used). After each PMD p->vec
> > needs a flush if p->vec_index > 0, skipping the dummy entry at front
> > (len == 0; if present). (This is mostly how it is implemented now, but
> > I propose to remove the "overflow" check and do the starting guard
> > removal only every PMD.)
> Sorry, unable to understand where to remove the guard?

Instead of checking for it in pagemap_scan_output() for each page you
can skip it in do_pagemap_cmd() when doing the flush.
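
E.g. something like this in the per-chunk flush (untested sketch,
reusing the patch's locals):

		/* drop a leading len == 0 guard entry, once per chunk */
		unsigned long first = (p.vec_index && !p.vec[0].len) ? 1 : 0;
		unsigned long n = p.vec_index - first;

		if (n) {
			if (copy_to_user(&vec[vec_index], &p.vec[first],
					 n * sizeof(struct page_region))) {
				ret = -EFAULT;
				goto free_data;
			}
			vec_index += n;
		}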

> > BTW#2, I think the ENOSPC return in pagemap_scan_output() should
> > happen later - only if the pages would match and that caused the count
> > to exceed the limit. For THP n_pages should be truncated to the limit
> > (and ENOSPC returned right away) only after the pages were verified to
> > match.
> We have 2 counters here:
> * the p->max_pages optionally can be set to find out only N pages of
> interest. So p->found_pages is counting this. We need to return early if
> the page limit is complete.
> * the p->vec_index keeps track of output buffer array size

I think I get how the limits are supposed to work, but I also think
the implementation is not optimal. An example (assuming max_pages = 1
and vec_len = 1):
 - a matching page has been found
 - a second - non-matching - is tried but results in immediate -ENOSPC.
-> In this case I'd expect the early return to happen just after the
first page is found, so that non-matching pages are not tried afterwards.
A similar problem occurs for hugepages: when the limit is hit (we found
>= max_pages, n_pages is possibly truncated), the scan still continues
until the next page / PMD.

[...]
> >>>> +       if (!arg->required_mask && !arg->anyof_mask &&
> >>>> +           !arg->excluded_mask)
> >>>> +               return false;
> >>>
> >>> Is there an assumption in the code that those checks are needed? I'd
> >>> expect that no selection criteria makes a valid page set?
> >> In my view, selection criterion must be specified for the ioctl to work. If
> >> there is no criterio, user should go and read pagemap file directly. So the
> >> assumption is that at least one selection criterion must be specified.
> >
> > Yes. I'm not sure we need to prevent multiple ways of doing the same
> > thing. But doesn't pagemap reading lack the range aggregation feature?
> Yeah, correct. But note that we are supporting only selective 4 flags in
> this ioctl, not all pagemap flags. So it is useful for only those users who
> depend only on these 4 flags. Out pagemap_ioctl interface is not so much
> generic that we can cater anyone. Its interface is specific and we are
> adding only those cases which are of our interest. So if someone wants
> range aggregation from pagemap_ioctl, he'll need to add that flag in the
> IOCTL first. When IOCTL support is added, he can specify the selection
> criterion etc.

The available flag set is not a problem. An example usecase: dumping
the memory state for debugging: ioctl(return_mask=ALL) returns a
conveniently compact vector of ranges of pages that are actually used
by the process (not only having reserved the virtual space). This is
something that actually helps when dumping processes that use tools like
AddressSanitizer, which create huge sparse mappings.

Best Regards
Michał Mirosław
Muhammad Usama Anjum March 20, 2023, 6:08 a.m. UTC | #10
On 3/17/23 7:15 PM, Michał Mirosław wrote:
> On Fri, 17 Mar 2023 at 13:44, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 3/17/23 2:28 AM, Michał Mirosław wrote:
>>> On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 3/13/23 9:02 PM, Michał Mirosław wrote:
>>>>> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
>>> [...]
>>>>>> --- a/fs/proc/task_mmu.c
>>>>>> +++ b/fs/proc/task_mmu.c
>>> [...]
>>>>>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>>> [...]
>>>>> The `cur->len` test seems redundant: is it possible to have
>>>>> `cur->start == addr` in that case (I guess it would have to get
>>>>> `n_pages == 0` in an earlier invocation)?
>>>> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is
>>>> essential to check the validity from cur->len before performing other
>>>> checks. Also cur->start can never be equal to addr as we are walking over
>>>> page addressing in serial manner. We want to see here if the current
>>>> address matches the previous data by finding the ending address of last
>>>> stored data (cur->start + cur->len * PAGE_SIZE).
>>>
>>> If cur->len == 0, then it doesn't matter if it gets merged or not - it
>>> can be filtered out during the flush (see below).
>>> [...]
>>>>>> +               } else if ((!p->vec_index) ||
>>>>>> +                          ((p->vec_index + 1) < p->vec_len)) {
>>>>>
>>>>> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
>>>>>
>>>>> if (vec_index >= p->vec_len)
>>>>>     return -ENOSPC;
>>>>
>>>> No, it'll not work. Lets leave it as it is. :)
>>>>
>>>> It has gotten somewhat complex, but I don't have any other way to make it
>>>> simpler which works. First note the following points:
>>>> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in
>>>> kernel (p->vec).
>>>> 2) We also want to merge the consecutive pages with the same flags into one
>>>> struct page_region. p->vec of current walk may merge with next walk. So we
>>>> cannot write to user memory until we find the results of the next walk.
>>>>
>>>> So most recent data is put into p->cur. When non-intersecting or mergeable
>>>> data is found, we move p->cur to p->vec[p->index] inside the page walk.
>>>> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all
>>>> the walks are over. We move the p->cur to arg->vec. It completes the data
>>>> transfer to user buffer.
>>> [...]
>>>> I'm so sorry that it has gotten this much complex. It was way simpler when
>>>> we were walking over all the memory in one go. But then we needed an
>>>> unbounded memory from the kernel which we don't want.
>>> [...]
>>>
>>> I've gone through and hopefully understood the code. I'm not sure this
>>> needs to be so complicated: when traversing a single PMD you can
>>> always copy p->cur to p->vec[p->vec_index++] because you can have at
>>> most pages_per_PMD non-merges (in the worst case the last page always
>>> is left in p->cur and whole p->vec is used). After each PMD p->vec
>>> needs a flush if p->vec_index > 0, skipping the dummy entry at front
>>> (len == 0; if present). (This is mostly how it is implemented now, but
>>> I propose to remove the "overflow" check and do the starting guard
>>> removal only every PMD.)
>> Sorry, unable to understand where to remove the guard?
> 
> Instead of checking for it in pagemap_scan_output() for each page you
> can skip it in do_pagemap_cmd() when doing the flush.
No, this cannot be done because in do_pagemap_cmd() we don't know
whether we have space for new pages in the output buffer, as the next
page may be aggregated into already-present data.

> 
>>> BTW#2, I think the ENOSPC return in pagemap_scan_output() should
>>> happen later - only if the pages would match and that caused the count
>>> to exceed the limit. For THP n_pages should be truncated to the limit
>>> (and ENOSPC returned right away) only after the pages were verified to
>>> match.
>> We have 2 counters here:
>> * the p->max_pages optionally can be set to find out only N pages of
>> interest. So p->found_pages is counting this. We need to return early if
>> the page limit is complete.
>> * the p->vec_index keeps track of output buffer array size
> 
> I think I get how the limits are supposed to work, but I also think
> the implementation is not optimal. An example (assuming max_pages = 1
> and vec_len = 1):
>  - a matching page has been found
>  - a second - non-matching - is tried but results in immediate -ENOSPC.
> -> In this case I'd expect the early return to happen just after the
> first page is found, so that non-matching pages are not tried afterwards.
> A similar problem occurs for hugepages: when the limit is hit (we found
> >= max_pages, n_pages is possibly truncated), the scan still continues
> until the next page / PMD.
I'll check if I can optimize it. It seems like I should be able to
update this pretty easily by returning a negative status/error which
signifies that we have found max_pages, and then just abort in a sane
way.
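
Maybe something like this in the chunk loop of do_pagemap_cmd()
(untested sketch):

		mmap_read_lock(mm);
		ret = walk_page_range(mm, walk_start, walk_end,
				      &pagemap_scan_ops, &p);
		mmap_read_unlock(mm);

		/* real errors abort; -ENOSPC only means the limit was hit */
		if (ret && ret != -ENOSPC)
			goto free_data;

		/* ... flush p.vec for this chunk to the user buffer ... */

		if (ret == -ENOSPC)
			break;	/* page limit reached, stop scanning */

		walk_start = walk_end;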

> 
> [...]
>>>>>> +       if (!arg->required_mask && !arg->anyof_mask &&
>>>>>> +           !arg->excluded_mask)
>>>>>> +               return false;
>>>>>
>>>>> Is there an assumption in the code that those checks are needed? I'd
>>>>> expect that no selection criteria makes a valid page set?
>>>> In my view, selection criterion must be specified for the ioctl to work. If
>>>> there is no criterio, user should go and read pagemap file directly. So the
>>>> assumption is that at least one selection criterion must be specified.
>>>
>>> Yes. I'm not sure we need to prevent multiple ways of doing the same
>>> thing. But doesn't pagemap reading lack the range aggregation feature?
>> Yeah, correct. But note that we are supporting only selective 4 flags in
>> this ioctl, not all pagemap flags. So it is useful for only those users who
>> depend only on these 4 flags. Out pagemap_ioctl interface is not so much
>> generic that we can cater anyone. Its interface is specific and we are
>> adding only those cases which are of our interest. So if someone wants
>> range aggregation from pagemap_ioctl, he'll need to add that flag in the
>> IOCTL first. When IOCTL support is added, he can specify the selection
>> criterion etc.
> 
> The available flag set is not a problem. An example usecase: dumping
> the memory state for debugging: ioctl(return_mask=ALL) returns a
> conveniently compact vector of ranges of pages that are actually used
> by the process (not only having reserved the virtual space). This is
> actually something that helps dumping processes with using tools like
> AddressSanitizer that create huge sparse mappings.
I don't know; we keep adding more use cases as people notice this ioctl.
I hadn't thought about this use case, so I need to understand it better:
how should I identify "which pages are used"? Does "used" mean both
present and swapped? If we want to find present or swapped pages, in
other words !pte_none pages, and return them in compact form, it can
already be done with ioctl(anyof_mask=PRESENT | SWAPPED,
return_mask=ALL).
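
For example, from userspace it would look roughly like this (untested
sketch against the uapi header added by this patch; dump_populated() and
the 256-entry buffer are just for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

/* Print the populated (present or swapped) ranges of a page-aligned
 * region of our own address space in compact form. */
static int dump_populated(unsigned long start, unsigned long len)
{
	struct page_region vec[256];
	struct pm_scan_arg arg = {
		.size = sizeof(arg),
		.flags = PM_SCAN_OP_GET,
		.start = start,
		.len = len,
		.vec = (unsigned long)vec,
		.vec_len = 256,
		.anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAPPED,
		.return_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE |
			       PAGE_IS_PRESENT | PAGE_IS_SWAPPED,
	};
	int fd = open("/proc/self/pagemap", O_RDONLY);
	long n, i;

	if (fd < 0)
		return -1;
	n = ioctl(fd, PAGEMAP_SCAN, &arg);
	for (i = 0; i < n; i++)
		printf("%llx: %llu pages, flags %llx\n",
		       (unsigned long long)vec[i].start,
		       (unsigned long long)vec[i].len,
		       (unsigned long long)vec[i].bitmap);
	close(fd);
	return n < 0 ? -1 : 0;
}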

> 
> Best Regards
> Michał Mirosław

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6a96e1713fd5..f8f796cf3439 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@ 
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1132,6 +1133,18 @@  static inline void clear_soft_dirty(struct vm_area_struct *vma,
 }
 #endif
 
+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	return ((pte_present(pte) && pte_uffd_wp(pte)) ||
+		(pte_swp_uffd_wp_any(pte)));
+}
+
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	return ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+		(is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)));
+}
+
 #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
@@ -1760,11 +1773,364 @@  static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+#define PM_SCAN_BITS_ALL	(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PM_SCAN_NON_WT_BITS	(PAGE_IS_FILE |	PAGE_IS_PRESENT |	\
+				 PAGE_IS_SWAPPED)
+#define PM_SCAN_OPS		(PM_SCAN_OP_GET | PM_SCAN_OP_WP)
+#define PM_SCAN_OP_IS_WP(a)	(a->flags & PM_SCAN_OP_WP)
+#define PM_SCAN_BITMAP(wt, file, present, swap)	\
+	(wt | file << 1 | present << 2 | swap << 3)
+
+struct pagemap_scan_private {
+	struct page_region *vec;
+	struct page_region cur;
+	unsigned long vec_len, vec_index;
+	unsigned int max_pages, found_pages, flags;
+	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+
+static inline bool pagemap_scan_is_wt_required(struct pagemap_scan_private *p)
+{
+	return	((p->required_mask & PAGE_IS_WRITTEN) ||
+		 (p->anyof_mask & PAGE_IS_WRITTEN) ||
+		 (p->excluded_mask & PAGE_IS_WRITTEN));
+}
+
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
+				  struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (pagemap_scan_is_wt_required(p) && (!userfaultfd_wp(vma) ||
+	    !userfaultfd_wp_async(vma)))
+		return -EPERM;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+
+	return 0;
+}
+
+static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
+			       struct pagemap_scan_private *p,
+			       unsigned long addr, unsigned int n_pages)
+{
+	unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
+	struct page_region *cur = &p->cur;
+	bool cpy = true;
+
+	if (p->max_pages && (p->found_pages == p->max_pages))
+		return -ENOSPC;
+
+	if (!n_pages)
+		return -EINVAL;
+
+	if (p->required_mask)
+		cpy = ((p->required_mask & bitmap) == p->required_mask);
+	if (cpy && p->anyof_mask)
+		cpy = (p->anyof_mask & bitmap);
+	if (cpy && p->excluded_mask)
+		cpy = !(p->excluded_mask & bitmap);
+
+	bitmap = bitmap & p->return_mask;
+
+	if (cpy && bitmap) {
+		if ((cur->len) && (cur->bitmap == bitmap) &&
+		    (cur->start + cur->len * PAGE_SIZE == addr)) {
+
+			cur->len += n_pages;
+			p->found_pages += n_pages;
+		} else if ((!p->vec_index) ||
+			   ((p->vec_index + 1) < p->vec_len)) {
+
+			if (cur->len) {
+				memcpy(&p->vec[p->vec_index], cur,
+				       sizeof(struct page_region));
+				p->vec_index++;
+			}
+
+			cur->start = addr;
+			cur->len = n_pages;
+			cur->bitmap = bitmap;
+			p->found_pages += n_pages;
+		} else {
+			return -ENOSPC;
+		}
+	}
+
+	return 0;
+}
+
+static int pagemap_scan_deposit(struct pagemap_scan_private *p,
+				struct page_region __user *vec,
+				unsigned long *vec_index)
+{
+	struct page_region *cur = &p->cur;
+
+	if (cur->len) {
+		if (copy_to_user(&vec[*vec_index], cur,
+				 sizeof(struct page_region)))
+			return -EFAULT;
+
+		p->vec_index++;
+		(*vec_index)++;
+	}
+
+	return 0;
+}
+
+static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+				  unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	bool is_writ, is_file, is_pres, is_swap;
+	unsigned long addr = end;
+	spinlock_t *ptl;
+	int ret = 0;
+	pte_t *pte;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		unsigned long n_pages;
+
+		is_writ = !is_pmd_uffd_wp(*pmd);
+		/*
+		 * Break huge page into small pages if operation needs to be
+		 * performed is on a portion of the huge page.
+		 */
+		if (is_writ && PM_SCAN_OP_IS_WP(p) &&
+		    (end - start < HPAGE_SIZE)) {
+			spin_unlock(ptl);
+
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
+		}
+
+		n_pages = (end - start)/PAGE_SIZE;
+		if (p->max_pages &&
+		    p->found_pages + n_pages >= p->max_pages)
+			n_pages = p->max_pages - p->found_pages;
+
+		ret = pagemap_scan_output(is_writ, vma->vm_file,
+					  pmd_present(*pmd), is_swap_pmd(*pmd),
+					  p, start, n_pages);
+		spin_unlock(ptl);
+
+		if (!ret && is_writ && PM_SCAN_OP_IS_WP(p) &&
+		    uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true) < 0)
+			ret = -EINVAL;
+
+		return ret;
+	}
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd))
+		return 0;
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
+		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+		is_writ = !is_pte_uffd_wp(*pte);
+		is_file = vma->vm_file;
+		is_pres = pte_present(*pte);
+		is_swap = is_swap_pte(*pte);
+
+		pte_unmap_unlock(pte, ptl);
+
+		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
+					  p, addr, 1);
+		if (ret)
+			break;
+
+		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
+		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
+			ret = -EINVAL;
+	}
+
+	cond_resched();
+	return ret;
+}
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
+				 int depth, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long n_pages;
+	int ret = 0;
+
+	if (vma) {
+		n_pages = (end - addr)/PAGE_SIZE;
+		if (p->max_pages &&
+		    p->found_pages + n_pages >= p->max_pages)
+			n_pages = p->max_pages - p->found_pages;
+
+		ret = pagemap_scan_output(false, vma->vm_file, false, false, p,
+					  addr, n_pages);
+	}
+
+	return ret;
+}
+
+/* No hugetlb support is present. */
+static const struct mm_walk_ops pagemap_scan_ops = {
+	.test_walk = pagemap_scan_test_walk,
+	.pmd_entry = pagemap_scan_pmd_entry,
+	.pte_hole = pagemap_scan_pte_hole,
+};
+
+static bool pagemap_scan_args_valid(struct pm_scan_arg *arg,
+				    struct page_region __user *vec,
+				    unsigned long start)
+{
+	/* Detect illegal size, flags and masks */
+	if (arg->size != sizeof(struct pm_scan_arg))
+		return false;
+	if (arg->flags & ~PM_SCAN_OPS)
+		return false;
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
+	     arg->return_mask) & ~PM_SCAN_BITS_ALL)
+		return false;
+	if (!arg->required_mask && !arg->anyof_mask &&
+	    !arg->excluded_mask)
+		return false;
+	if (!arg->return_mask)
+		return false;
+
+	/* Validate memory ranges */
+	if (!(arg->flags & PM_SCAN_OP_GET))
+		return false;
+	if (!arg->vec)
+		return false;
+	if (arg->vec_len == 0)
+		return false;
+	if (!access_ok((void __user *)vec,
+		       arg->vec_len * sizeof(struct page_region)))
+		return false;
+
+	if (!IS_ALIGNED(start, PAGE_SIZE))
+		return false;
+	if (!access_ok((void __user *)start, arg->len))
+		return false;
+
+	if (PM_SCAN_OP_IS_WP(arg)) {
+		if (arg->required_mask & PM_SCAN_NON_WT_BITS)
+			return false;
+		if (arg->anyof_mask & PM_SCAN_NON_WT_BITS)
+			return false;
+		if (arg->excluded_mask & PM_SCAN_NON_WT_BITS)
+			return false;
+	}
+
+	return true;
+}
+
+static long do_pagemap_cmd(struct mm_struct *mm, struct pm_scan_arg *arg)
+{
+	unsigned long start, end, walk_start, walk_end;
+	unsigned long empty_slots, vec_index = 0;
+	struct page_region __user *vec;
+	struct pagemap_scan_private p;
+	int ret = 0;
+
+	start = (unsigned long)untagged_addr(arg->start);
+	vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
+
+	if (!pagemap_scan_args_valid(arg, vec, start))
+		return -EINVAL;
+
+	end = start + arg->len;
+	p.max_pages = arg->max_pages;
+	p.found_pages = 0;
+	p.flags = arg->flags;
+	p.required_mask = arg->required_mask;
+	p.anyof_mask = arg->anyof_mask;
+	p.excluded_mask = arg->excluded_mask;
+	p.return_mask = arg->return_mask;
+	p.cur.len = 0;
+	p.vec = NULL;
+	p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+
+	/*
+	 * Allocate smaller buffer to get output from inside the page walk
+	 * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
+	 * we want to return output to user in compact form where no two
+	 * consecutive regions should be continuous and have the same flags.
+	 * So store the latest element in p.cur between different walks and
+	 * store the p.cur at the end of the walk to the user buffer.
+	 */
+	p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
+			      GFP_KERNEL);
+	if (!p.vec)
+		return -ENOMEM;
+
+	walk_start = walk_end = start;
+	while (walk_end < end) {
+		p.vec_index = 0;
+
+		empty_slots = arg->vec_len - vec_index;
+		p.vec_len = min(p.vec_len, empty_slots);
+
+		walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
+		if (walk_end > end)
+			walk_end = end;
+
+		mmap_read_lock(mm);
+		ret = walk_page_range(mm, walk_start, walk_end,
+				      &pagemap_scan_ops, &p);
+		mmap_read_unlock(mm);
+
+		if (!(!ret || ret == -ENOSPC))
+			goto free_data;
+
+		walk_start = walk_end;
+		if (p.vec_index) {
+			if (copy_to_user(&vec[vec_index], p.vec,
+					 p.vec_index *
+					 sizeof(struct page_region))) {
+				ret = -EFAULT;
+				goto free_data;
+			}
+			vec_index += p.vec_index;
+		}
+	}
+	ret = pagemap_scan_deposit(&p, vec, &vec_index);
+	if (!ret)
+		ret = vec_index;
+free_data:
+	kfree(p.vec);
+
+	return ret;
+}
+
+static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
+			       unsigned long arg)
+{
+	struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
+	struct mm_struct *mm = file->private_data;
+	struct pm_scan_arg argument;
+
+	if (cmd == PAGEMAP_SCAN) {
+		if (copy_from_user(&argument, uarg,
+				   sizeof(struct pm_scan_arg)))
+			return -EFAULT;
+		return do_pagemap_cmd(mm, &argument);
+	}
+
+	return -EINVAL;
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = pagemap_scan_ioctl,
+	.compat_ioctl	= pagemap_scan_ioctl,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..47879c38ce2f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,57 @@  typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pm_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region in pages
+ * bitmap:	Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pm_scan_arg - Pagemap ioctl argument
+ * @size:		Size of the structure
+ * @flags:		Flags for the IOCTL
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pm_scan_arg {
+	__u64 size;
+	__u64 flags;
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u64 max_pages;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Supported flags */
+#define PM_SCAN_OP_GET	(1 << 0)
+#define PM_SCAN_OP_WP	(1 << 1)
+
 #endif /* _UAPI_LINUX_FS_H */