diff mbox series

[v22,2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

Message ID 20230628095426.1886064-3-usama.anjum@collabora.com (mailing list archive)
State New
Headers show
Series Implement IOCTL to get and optionally clear info about PTEs | expand

Commit Message

Muhammad Usama Anjum June 28, 2023, 9:54 a.m. UTC
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are supported
in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
  file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT), swapped
  (PAGE_IS_SWAPPED) or page has pfn zero (PAGE_IS_PFNZERO).
- Find pages which have been written-to and/or write protect the pages
  (atomic PM_SCAN_OP_GET + PM_SCAN_OP_WP)

This IOCTL can be extended to get information about more PTE bits. The
entire address range passed by user [start, end) is scanned until either
the user provided buffer is full or max_pages have been found.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v22:
- Interface change to return walk ending address to user:
  - Replace [start start + len) with [start, end)
  - Return the ending address of the address walk in start

Changes in v21:
- Abort walk instead of returning error if WP is to be performed on
  partial hugetlb
- Changed the data types of some variables in pagemap_scan_private to
  long

Changes in v20:
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO

Changes in v19:
- Interface changes such as renaming, return mask and WP can be used
  with any flags specified in masks
- Internal code changes

Changes in v18:
- Rebased on top of next-20230613
  - ptep_get() updates
  - remove pmd_trans_unstable() and add ACTION_AGAIN
- Review updates (Micheal)

Changes in v17:
- Rebased on next-20230606
- Made make_uffd_wp_*_pte() better and minor changes

Changes in v16:
- Fixed a corner case where kernel writes beyond user buffer by one
  element
- Bring back exclusive PM_SCAN_OP_WP
- Cosmetic changes

Changes in v15:
- Build fix:
  - Use generic tlb flush function in pagemap_scan_pmd_entry() instead of
    using x86 specific flush function in do_pagemap_scan()
  - Remove #ifdef from pagemap_scan_hugetlb_entry()
  - Use mm instead of undefined vma->vm_mm

Changes in v14:
- Fix build error caused by #ifdef added at last minute in some configs

Changes in v13:
- Review updates
- mmap_read_lock_killable() instead of mmap_read_lock()
- Replace uffd_wp_range() with helpers which increases performance
  drastically for OP_WP operations by reducing the number of tlb
  flushing etc
- Add MMU_NOTIFY_PROTECTION_VMA notification for the memory range

Changes in v12:
- Add hugetlb support to cover all memory types
- Merge "userfaultfd: Define dummy uffd_wp_range()" with this patch
- Review updates to the code

Changes in v11:
- Find written pages in a better way
- Fix a corner case (thanks Paul)
- Improve the code/comments
- remove ENGAGE_WP + ! GET operation
- shorten the commit message in favour of moving documentation to
  pagemap.rst

Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message

Change in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code

Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl
---
 fs/proc/task_mmu.c      | 565 ++++++++++++++++++++++++++++++++++++++++
 include/linux/hugetlb.h |   1 +
 include/uapi/linux/fs.h |  55 ++++
 mm/hugetlb.c            |   2 +-
 4 files changed, 622 insertions(+), 1 deletion(-)

Comments

Michał Mirosław June 28, 2023, 12:11 p.m. UTC | #1
Hi Muhammad,

I'd really like to reduce the number of conditionals in this code by a
third. I believe it is possible if we agree to rework the API a bit
(making it always a GET operation with optional WP) and relax some of
the restrictions on the input values.

Also please include *all* changes you make in a patch in the changelog
- I just noticed e.g. a `mmu_notifier` code that I haven't seen before
in the review.

A general request from me would be to move all the page classification
code to one place. That would mean replacing:

> +               bitmap = PM_SCAN_FLAGS(is_written, false,
> +                                      pmd_present(*pmd), is_swap_pmd(*pmd),
> +                                      pmd_present(*pmd) && is_zero_pfn(pmd_pfn(*pmd)));

With:

       category_flags = pmd_page_category_flags(*pmd);

... and adding a pmd_page_category_flags() and pte/hugepage versions
together at the top. The idea here is to separate concerns: split
classification of pages from the code acting on the classification
results and be able to easily understand (and review for correctness)
how corresponding classes are derived from PTE and PMD values.

I'd prefer the naming of the functions and variables to follow that it
is a classification result, not a generic "bitmap" or "flags", so that
it's harder to confuse them with page flags as used elsewhere in the
kernel.

(inline review follows)

BTW, thanks for doing this!

On Wed, 28 Jun 2023 at 12:00, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
[...]
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 507cd4e59d07..be747fb89827 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
[...]
> +#define PM_SCAN_REQUIRE_UFFD   (1ULL << 63)
> +
> +#define PM_SCAN_FOUND_MAX_PAGES        (1)
> +#define PM_SCAN_END_WALK       (-256)
[...]
> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
> +                                 struct mm_walk *walk)
> +{
> +       struct pagemap_scan_private *p = walk->private;
> +       struct vm_area_struct *vma = walk->vma;
> +
> +       if ((p->flags & PM_SCAN_REQUIRE_UFFD) && (!userfaultfd_wp_async(vma) ||
> +           !userfaultfd_wp_use_markers(vma)))
> +               return -EPERM;

p->vma_can_wp = userfaultfd_wp_async(vma) && userfaultfd_wp_use_markers(vma);

And then use that in classification functions and for skipping WP for
pages not prepared for that. The PM_SCAN_REQUIRE_UFFD won't be needed
then.

> +
> +       if (vma->vm_flags & VM_PFNMAP)
> +               return 1;

Why do we skip VM_PFNMAP vmas? This will skip over at least VDSO and
VVAR pages. Those two are not that big a problem, but it should be at
least documented what ranges are skipped and why.

> +
> +       return 0;
> +}
> +
> +static int pagemap_scan_output(unsigned long bitmap,
> +                              struct pagemap_scan_private *p,
> +                              unsigned long addr, unsigned int n_pages)
> +{
> +       struct page_region *cur_buf = &p->cur_buf;
> +
> +       if (!n_pages)
> +               return -EINVAL;

How can this happen?

> +       bitmap &= p->return_mask;
> +
> +       if (cur_buf->flags == bitmap &&
> +           cur_buf->start + cur_buf->len * PAGE_SIZE == addr) {

BTW, maybe the ranges returned to the user could also use start .. end
form, and then this would be simplified to `cur->flags ==
categories_to_report && cur->end == addr`.

> +               cur_buf->len += n_pages;
> +               p->found_pages += n_pages;
> +       } else {
> +               if (cur_buf->len) {
> +                       if (p->vec_buf_index >= p->vec_buf_len)
> +                               return PM_SCAN_END_WALK;
> +
> +                       memcpy(&p->vec_buf[p->vec_buf_index], cur_buf,
> +                              sizeof(*p->vec_buf));
> +                       p->vec_buf_index++;
> +               }
> +
> +               cur_buf->start = addr;
> +               cur_buf->len = n_pages;
> +               cur_buf->flags = bitmap;
> +               p->found_pages += n_pages;
> +       }
> +
> +       if (p->found_pages == p->max_pages)
> +               return PM_SCAN_FOUND_MAX_PAGES;

Since we now return the address the walk ended at, what is the
difference for PM_SCAN_END_WALK and PM_SCAN_FOUND_MAX_PAGES, and do we
still need any of those instead of -ENOSPC or `n_pages !=
scan_output(...)` check?

> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +                                 unsigned long end, struct mm_walk *walk)
> +{
> +       bool is_written, flush = false, interesting = true;
> +       struct pagemap_scan_private *p = walk->private;
> +       struct vm_area_struct *vma = walk->vma;
> +       unsigned long bitmap, addr = end;
> +       pte_t *pte, *orig_pte, ptent;
> +       spinlock_t *ptl;
> +       int ret = 0;
> +
> +       arch_enter_lazy_mmu_mode();
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +       ptl = pmd_trans_huge_lock(pmd, vma);
> +       if (ptl) {
> +               unsigned long n_pages = (end - start)/PAGE_SIZE;
> +
> +               if (n_pages > p->max_pages - p->found_pages)
> +                       n_pages = p->max_pages - p->found_pages;
> +
> +               is_written = !is_pmd_uffd_wp(*pmd);
> +
> +               bitmap = PM_SCAN_FLAGS(is_written, false,
> +                                      pmd_present(*pmd), is_swap_pmd(*pmd),
> +                                      pmd_present(*pmd) && is_zero_pfn(pmd_pfn(*pmd)));
> +
> +               if (IS_PM_SCAN_GET(p->flags))
> +                       interesting = pagemap_scan_is_interesting_page(bitmap, p);
> +
> +               if (interesting) {
> +                       /*
> +                        * Break huge page into small pages if the WP operation
> +                        * need to be performed is on a portion of the huge page.
> +                        */
> +                       if (is_written && IS_PM_SCAN_WP(p->flags) &&
> +                           n_pages < HPAGE_SIZE/PAGE_SIZE) {
> +                               spin_unlock(ptl);
> +
> +                               split_huge_pmd(vma, pmd, start);
> +                               goto process_smaller_pages;
> +                       }
> +
> +                       if (IS_PM_SCAN_GET(p->flags))
> +                               ret = pagemap_scan_output(bitmap, p, start, n_pages);
> +
> +                       if (IS_PM_SCAN_WP(p->flags) && is_written && ret >= 0) {
> +                               make_uffd_wp_pmd(vma, start, pmd);
> +                               flush_tlb_range(vma, start, end);
> +                       }
> +               }
> +
> +               spin_unlock(ptl);
> +               arch_leave_lazy_mmu_mode();
> +
> +               return ret;
> +       }

Could you split the THP code and use it here like:

enter_lazy_mmu();

ret = scan_thp(...);
if (ret != -ENOTTY) {
  leave_lazy_mmu();
  return ret;
}

To avoid having #ifdef here, it can be moved to the scan_thp() only
(either in the body or having a dummy `return -ENOTTY;` version in an
#else block).

BTW, there's no `cond_resched()` in the THP case - is that intentional?

> +
> +process_smaller_pages:
> +#endif
> +
> +       orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
> +       if (!pte) {
> +               walk->action = ACTION_AGAIN;
> +               return 0;
> +       }
> +
> +       for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
> +               ptent = ptep_get(pte);
> +               is_written = !is_pte_uffd_wp(ptent);
> +
> +               bitmap = PM_SCAN_FLAGS(is_written, pagemap_scan_is_file(vma, ptent, addr),
> +                                      pte_present(ptent), is_swap_pte(ptent),
> +                                      pte_present(ptent) && is_zero_pfn(pte_pfn(ptent)));
> +
> +               if (IS_PM_SCAN_GET(p->flags)) {
> +                       interesting = pagemap_scan_is_interesting_page(bitmap, p);

If we consider GET as always done, this can be:

if (!is_interesting) continue;

And that would simplify the code greatly.

> +                       if (interesting)
> +                               ret = pagemap_scan_output(bitmap, p, addr, 1);
> +               }
> +
> +               if (IS_PM_SCAN_WP(p->flags) && is_written && interesting &&
> +                   ret >= 0) {
> +                       make_uffd_wp_pte(vma, addr, pte);
> +                       flush = true;
> +               }
> +       }
> +
> +       if (flush)
> +               flush_tlb_range(vma, start, addr);

Would optimizing the TLB flush range be beneficial here? If yes, the
loop above could do:

flush_start = end;
flush_end = end;
...
for (...) {
  if (mark_wp) {
    make_wp();
    if (flush_start > addr)
      flush_start = addr;
    flush_end = addr;
  }
}

to reduce the flushed range.

> +
> +       pte_unmap_unlock(orig_pte, ptl);
> +       arch_leave_lazy_mmu_mode();
> +
> +       cond_resched();
> +       return ret;
> +}
> +
> +#ifdef CONFIG_HUGETLB_PAGE
> +static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
> +                                     unsigned long start, unsigned long end,
> +                                     struct mm_walk *walk)
> +{
> +       unsigned long n_pages = (end - start)/PAGE_SIZE;
> +       struct pagemap_scan_private *p = walk->private;
> +       struct vm_area_struct *vma = walk->vma;
> +       bool is_written, interesting = true;
> +       struct hstate *h = hstate_vma(vma);
> +       unsigned long bitmap;
> +       spinlock_t *ptl;
> +       int ret = 0;
> +       pte_t ptent;
> +
> +       if (IS_PM_SCAN_WP(p->flags) && n_pages < HPAGE_SIZE/PAGE_SIZE)
> +               return -EINVAL;

Shouldn't this be checked after the `n_pages` is reduced? BTW, maybe
check it only if the page `is_written` to avoid the conditional for
all walks where WP is not used?

> +
> +       if (n_pages > p->max_pages - p->found_pages)
> +               n_pages = p->max_pages - p->found_pages;
[...]

Proposing the relaxed API (and, while at it, less generic constant and
field names) below.

> +static int pagemap_scan_args_valid(struct pm_scan_arg *arg, unsigned long start,
> +                                  unsigned long end, struct page_region __user *vec)
> +{
> +       /* Detect illegal size, flags, len and masks */
> +       if (arg->size != sizeof(struct pm_scan_arg))
> +               return -EINVAL;

> +       if (!arg->flags)
> +               return -EINVAL;
> +       if (arg->flags & ~PM_SCAN_OPS)
> +               return -EINVAL;

if (arg->flags & ~PM_SCAN_FLAGS)
  return -EINVAL;

> +       if (!(end - start))
> +               return -EINVAL;

Remove. Nothing bad will happen when trying to scan an empty range.

> +       if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
> +            arg->return_mask) & ~PM_SCAN_BITS_ALL)
> +               return -EINVAL;

if ((...) & ~PM_SCAN_PAGE_CATEGORIES)
  return -EINVAL;

> +       if (!arg->required_mask && !arg->anyof_mask &&
> +           !arg->excluded_mask)
> +               return -EINVAL;

Remove. Will inspect all pages then.

> +       if (!arg->return_mask)
> +               return -EINVAL;

Remove. Will not return any ranges then. (But will WP, if requested.)

> +       /* Validate memory range */
> +       if (!IS_ALIGNED(start, PAGE_SIZE))
> +               return -EINVAL;
> +       if (!access_ok((void __user *)start, end - start))
> +               return -EFAULT;

> +       if (IS_PM_SCAN_GET(arg->flags)) {

Remove. Do GET always.

> +               if (arg->vec_len == 0)
> +                       return -EINVAL;

Remove. Will end the walk at the first matching page (and return its
address in `start`).

> +               if (!vec)
> +                       return -EINVAL;
> +               if (!access_ok((void __user *)vec,
> +                              arg->vec_len * sizeof(struct page_region)))
> +                       return -EFAULT;

Check only if vec_len != 0. BTW, return EFAULT for `!vec`.

> +       }
> +
> +       if (IS_PM_SCAN_WP(arg->flags) && !IS_PM_SCAN_GET(arg->flags) &&
> +           arg->max_pages)
> +               return -EINVAL;

With return_mask == 0, arg->max_pages will be ignored anyway. We can
just document that this limits the pages reported in the output
vector. (And from that follows that if not returning anything, the
value doesn't make a difference.)

[...]
> +       p.max_pages = (arg.max_pages) ? arg.max_pages : ULONG_MAX;
> +       p.found_pages = 0;
> +       p.required_mask = arg.required_mask;
> +       p.anyof_mask = arg.anyof_mask;
> +       p.excluded_mask = arg.excluded_mask;
> +       p.return_mask = arg.return_mask;
> +       p.flags = arg.flags;
> +       p.flags |= ((p.required_mask | p.anyof_mask | p.excluded_mask) &
> +                   PAGE_IS_WRITTEN) ? PM_SCAN_REQUIRE_UFFD : 0;
> +       p.cur_buf.start = p.cur_buf.len = p.cur_buf.flags = 0;
> +       p.vec_buf = NULL;
> +       p.vec_buf_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;

This needs to be less by 1 to account for the entry in p.cur_buf.

> +       /*
> +        * Allocate smaller buffer to get output from inside the page walk
> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> +        * we want to return output to user in compact form where no two
> +        * consecutive regions should be continuous and have the same flags.
> +        * So store the latest element in p.cur_buf between different walks and
> +        * store the p.cur_buf at the end of the walk to the user buffer.
> +        */
> +       if (IS_PM_SCAN_GET(p.flags)) {

if (p.vec_len != 0)

> +               p.vec_buf = kmalloc_array(p.vec_buf_len, sizeof(*p.vec_buf),
> +                                         GFP_KERNEL);
> +               if (!p.vec_buf)
> +                       return -ENOMEM;
> +       }
> +
> +       if (IS_PM_SCAN_WP(p.flags)) {
> +               mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
> +                                       mm, start, end);
> +               mmu_notifier_invalidate_range_start(&range);
> +       }

Please add a comment why this is needed.

> +       walk_start = walk_end = start;
> +       while (walk_end < end && !ret) {
> +               if (IS_PM_SCAN_GET(p.flags)) {

if (p.vec_len)

> +                       p.vec_buf_index = 0;
> +
> +                       /*
> +                        * All data is copied to cur_buf first. When more data
> +                        * is found, we push cur_buf to vec_buf and copy new
> +                        * data to cur_buf. Subtract 1 from length as the
> +                        * index of cur_buf isn't counted in length.
> +                        */
> +                       empty_slots = arg.vec_len - vec_index;
> +                       p.vec_buf_len = min(p.vec_buf_len, empty_slots - 1);
> +               }
> +
> +               ret = mmap_read_lock_killable(mm);
> +               if (ret)
> +                       goto return_status;

This could be _interruptible() now as we can return early overwriting
`start` so that userspace can detect that the walk was interrupted by
a signal (start != end).

> +               walk_end = min((walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK, end);
> +
> +               ret = walk_page_range(mm, walk_start, walk_end,
> +                                     &pagemap_scan_ops, &p);
> +               mmap_read_unlock(mm);
> +
> +               if (ret && ret != PM_SCAN_FOUND_MAX_PAGES &&
> +                   ret != PM_SCAN_END_WALK)
> +                       goto return_status;

So now there is no difference in the two special error values.

> +               walk_start = walk_end;
> +               if (IS_PM_SCAN_GET(p.flags) && p.vec_buf_index) {

`if (p.vec_buf_index)` should be enough, as if not reporting the
ranges, the index will never increase.

> +                       if (copy_to_user(&vec[vec_index], p.vec_buf,
> +                                        p.vec_buf_index * sizeof(*p.vec_buf))) {
> +                               /*
> +                                * Return error even though the OP succeeded
> +                                */
> +                               ret = -EFAULT;
> +                               goto return_status;
> +                       }
> +                       vec_index += p.vec_buf_index;

p.vec_buf_index = 0;

... so that there is no need to do that at every start of the loop iteration.

> +               }
> +       }
> +
> +       if (p.cur_buf.len) {
> +               if (copy_to_user(&vec[vec_index], &p.cur_buf, sizeof(p.cur_buf))) {
> +                       ret = -EFAULT;
> +                       goto return_status;
> +               }
> +               vec_index++;
> +       }
> +
> +       ret = vec_index;
> +
> +return_status:

The label name looks too narrowing, considering what is being done
here now. Maybe just 'out', as there is no other non-trivial exit path
from the function?

> +       arg.start = (unsigned long)walk_end;
> +       if (copy_to_user(&uarg->start, &arg.start, sizeof(arg.start)))
> +               ret = -EFAULT;
> +
> +       if (IS_PM_SCAN_WP(p.flags))
> +               mmu_notifier_invalidate_range_end(&range);

With removal of the OP_GET, there is only a single flag left. I don't
think it is useful then to hide it behind a macro. It should be
readable as `p.flags & PM_SCAN_DO_WP`.

[...]
> +/* Bits are set in flags of the page_region and masks in pm_scan_args */
> +#define PAGE_IS_WRITTEN                (1 << 0)
> +#define PAGE_IS_FILE           (1 << 1)
> +#define PAGE_IS_PRESENT                (1 << 2)
> +#define PAGE_IS_SWAPPED                (1 << 3)
> +#define PAGE_IS_PFNZERO                (1 << 4)

Please add PAGE_IS_UFFD_WP_ENABLED (or a shorter name) - this would be
populated from `p->vma_can_wp` in the `test_walk` implementation
above.

Best Regards
Michał Mirosław
Andrei Vagin June 30, 2023, 3:01 p.m. UTC | #2
On Wed, Jun 28, 2023 at 02:54:23PM +0500, Muhammad Usama Anjum wrote:
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT), swapped
>   (PAGE_IS_SWAPPED) or page has pfn zero (PAGE_IS_PFNZERO).
> - Find pages which have been written-to and/or write protect the pages
>   (atomic PM_SCAN_OP_GET + PM_SCAN_OP_WP)
> 
> This IOCTL can be extended to get information about more PTE bits. The
> entire address range passed by user [start, end) is scanned until either
> the user provided buffer is full or max_pages have been found.
> 
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>

<snip>

> +
> +static long do_pagemap_scan(struct mm_struct *mm, unsigned long __arg)
> +{
> +	struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)__arg;
> +	unsigned long long start, end, walk_start, walk_end;
> +	unsigned long empty_slots, vec_index = 0;
> +	struct mmu_notifier_range range;
> +	struct page_region __user *vec;
> +	struct pagemap_scan_private p;
> +	struct pm_scan_arg arg;
> +	int ret = 0;
> +
> +	if (copy_from_user(&arg, uarg, sizeof(arg)))
> +		return -EFAULT;
> +
> +	start = untagged_addr((unsigned long)arg.start);
> +	end = untagged_addr((unsigned long)arg.end);
> +	vec = (struct page_region __user *)untagged_addr((unsigned long)arg.vec);
> +
> +	ret = pagemap_scan_args_valid(&arg, start, end, vec);
> +	if (ret)
> +		return ret;
> +
> +	p.max_pages = (arg.max_pages) ? arg.max_pages : ULONG_MAX;
> +	p.found_pages = 0;
> +	p.required_mask = arg.required_mask;
> +	p.anyof_mask = arg.anyof_mask;
> +	p.excluded_mask = arg.excluded_mask;
> +	p.return_mask = arg.return_mask;
> +	p.flags = arg.flags;
> +	p.flags |= ((p.required_mask | p.anyof_mask | p.excluded_mask) &
> +		    PAGE_IS_WRITTEN) ? PM_SCAN_REQUIRE_UFFD : 0;
> +	p.cur_buf.start = p.cur_buf.len = p.cur_buf.flags = 0;
> +	p.vec_buf = NULL;
> +	p.vec_buf_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
> +
> +	/*
> +	 * Allocate smaller buffer to get output from inside the page walk
> +	 * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> +	 * we want to return output to user in compact form where no two
> +	 * consecutive regions should be continuous and have the same flags.
> +	 * So store the latest element in p.cur_buf between different walks and
> +	 * store the p.cur_buf at the end of the walk to the user buffer.
> +	 */
> +	if (IS_PM_SCAN_GET(p.flags)) {
> +		p.vec_buf = kmalloc_array(p.vec_buf_len, sizeof(*p.vec_buf),
> +					  GFP_KERNEL);
> +		if (!p.vec_buf)
> +			return -ENOMEM;
> +	}
> +
> +	if (IS_PM_SCAN_WP(p.flags)) {
> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
> +					mm, start, end);
> +		mmu_notifier_invalidate_range_start(&range);
> +	}
> +
> +	walk_start = walk_end = start;
> +	while (walk_end < end && !ret) {
> +		if (IS_PM_SCAN_GET(p.flags)) {
> +			p.vec_buf_index = 0;
> +
> +			/*
> +			 * All data is copied to cur_buf first. When more data
> +			 * is found, we push cur_buf to vec_buf and copy new
> +			 * data to cur_buf. Subtract 1 from length as the
> +			 * index of cur_buf isn't counted in length.
> +			 */
> +			empty_slots = arg.vec_len - vec_index;
> +			p.vec_buf_len = min(p.vec_buf_len, empty_slots - 1);
> +		}
> +
> +		ret = mmap_read_lock_killable(mm);
> +		if (ret)
> +			goto return_status;
> +
> +		walk_end = min((walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK, end);
> +
> +		ret = walk_page_range(mm, walk_start, walk_end,
> +				      &pagemap_scan_ops, &p);
> +		mmap_read_unlock(mm);
> +
> +		if (ret && ret != PM_SCAN_FOUND_MAX_PAGES &&
> +		    ret != PM_SCAN_END_WALK)
> +			goto return_status;
> +
> +		walk_start = walk_end;
> +		if (IS_PM_SCAN_GET(p.flags) && p.vec_buf_index) {
> +			if (copy_to_user(&vec[vec_index], p.vec_buf,
> +					 p.vec_buf_index * sizeof(*p.vec_buf))) {
> +				/*
> +				 * Return error even though the OP succeeded
> +				 */
> +				ret = -EFAULT;
> +				goto return_status;
> +			}
> +			vec_index += p.vec_buf_index;
> +		}
> +	}
> +
> +	if (p.cur_buf.len) {
> +		if (copy_to_user(&vec[vec_index], &p.cur_buf, sizeof(p.cur_buf))) {
> +			ret = -EFAULT;
> +			goto return_status;
> +		}
> +		vec_index++;
> +	}
> +
> +	ret = vec_index;
> +
> +return_status:
> +	arg.start = (unsigned long)walk_end;

This doesn't look right. pagemap_scan_pmd_entry can stop early. For
example, it can happen when it hits the max_pages limit. Do I miss
something?

> +	if (copy_to_user(&uarg->start, &arg.start, sizeof(arg.start)))
> +		ret = -EFAULT;
> +
> +	if (IS_PM_SCAN_WP(p.flags))
> +		mmu_notifier_invalidate_range_end(&range);
> +
> +	kfree(p.vec_buf);
> +	return ret;
> +}
> +
Muhammad Usama Anjum July 3, 2023, 6:47 a.m. UTC | #3
On 6/30/23 8:01 PM, Andrei Vagin wrote:
> On Wed, Jun 28, 2023 at 02:54:23PM +0500, Muhammad Usama Anjum wrote:
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT), swapped
>>   (PAGE_IS_SWAPPED) or page has pfn zero (PAGE_IS_PFNZERO).
>> - Find pages which have been written-to and/or write protect the pages
>>   (atomic PM_SCAN_OP_GET + PM_SCAN_OP_WP)
>>
>> This IOCTL can be extended to get information about more PTE bits. The
>> entire address range passed by user [start, end) is scanned until either
>> the user provided buffer is full or max_pages have been found.
>>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> 
> <snip>
> 
>> +
>> +static long do_pagemap_scan(struct mm_struct *mm, unsigned long __arg)
>> +{
>> +	struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)__arg;
>> +	unsigned long long start, end, walk_start, walk_end;
>> +	unsigned long empty_slots, vec_index = 0;
>> +	struct mmu_notifier_range range;
>> +	struct page_region __user *vec;
>> +	struct pagemap_scan_private p;
>> +	struct pm_scan_arg arg;
>> +	int ret = 0;
>> +
>> +	if (copy_from_user(&arg, uarg, sizeof(arg)))
>> +		return -EFAULT;
>> +
>> +	start = untagged_addr((unsigned long)arg.start);
>> +	end = untagged_addr((unsigned long)arg.end);
>> +	vec = (struct page_region __user *)untagged_addr((unsigned long)arg.vec);
>> +
>> +	ret = pagemap_scan_args_valid(&arg, start, end, vec);
>> +	if (ret)
>> +		return ret;
>> +
>> +	p.max_pages = (arg.max_pages) ? arg.max_pages : ULONG_MAX;
>> +	p.found_pages = 0;
>> +	p.required_mask = arg.required_mask;
>> +	p.anyof_mask = arg.anyof_mask;
>> +	p.excluded_mask = arg.excluded_mask;
>> +	p.return_mask = arg.return_mask;
>> +	p.flags = arg.flags;
>> +	p.flags |= ((p.required_mask | p.anyof_mask | p.excluded_mask) &
>> +		    PAGE_IS_WRITTEN) ? PM_SCAN_REQUIRE_UFFD : 0;
>> +	p.cur_buf.start = p.cur_buf.len = p.cur_buf.flags = 0;
>> +	p.vec_buf = NULL;
>> +	p.vec_buf_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
>> +
>> +	/*
>> +	 * Allocate smaller buffer to get output from inside the page walk
>> +	 * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>> +	 * we want to return output to user in compact form where no two
>> +	 * consecutive regions should be continuous and have the same flags.
>> +	 * So store the latest element in p.cur_buf between different walks and
>> +	 * store the p.cur_buf at the end of the walk to the user buffer.
>> +	 */
>> +	if (IS_PM_SCAN_GET(p.flags)) {
>> +		p.vec_buf = kmalloc_array(p.vec_buf_len, sizeof(*p.vec_buf),
>> +					  GFP_KERNEL);
>> +		if (!p.vec_buf)
>> +			return -ENOMEM;
>> +	}
>> +
>> +	if (IS_PM_SCAN_WP(p.flags)) {
>> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
>> +					mm, start, end);
>> +		mmu_notifier_invalidate_range_start(&range);
>> +	}
>> +
>> +	walk_start = walk_end = start;
>> +	while (walk_end < end && !ret) {
>> +		if (IS_PM_SCAN_GET(p.flags)) {
>> +			p.vec_buf_index = 0;
>> +
>> +			/*
>> +			 * All data is copied to cur_buf first. When more data
>> +			 * is found, we push cur_buf to vec_buf and copy new
>> +			 * data to cur_buf. Subtract 1 from length as the
>> +			 * index of cur_buf isn't counted in length.
>> +			 */
>> +			empty_slots = arg.vec_len - vec_index;
>> +			p.vec_buf_len = min(p.vec_buf_len, empty_slots - 1);
>> +		}
>> +
>> +		ret = mmap_read_lock_killable(mm);
>> +		if (ret)
>> +			goto return_status;
>> +
>> +		walk_end = min((walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK, end);
>> +
>> +		ret = walk_page_range(mm, walk_start, walk_end,
>> +				      &pagemap_scan_ops, &p);
>> +		mmap_read_unlock(mm);
>> +
>> +		if (ret && ret != PM_SCAN_FOUND_MAX_PAGES &&
>> +		    ret != PM_SCAN_END_WALK)
>> +			goto return_status;
>> +
>> +		walk_start = walk_end;
>> +		if (IS_PM_SCAN_GET(p.flags) && p.vec_buf_index) {
>> +			if (copy_to_user(&vec[vec_index], p.vec_buf,
>> +					 p.vec_buf_index * sizeof(*p.vec_buf))) {
>> +				/*
>> +				 * Return error even though the OP succeeded
>> +				 */
>> +				ret = -EFAULT;
>> +				goto return_status;
>> +			}
>> +			vec_index += p.vec_buf_index;
>> +		}
>> +	}
>> +
>> +	if (p.cur_buf.len) {
>> +		if (copy_to_user(&vec[vec_index], &p.cur_buf, sizeof(p.cur_buf))) {
>> +			ret = -EFAULT;
>> +			goto return_status;
>> +		}
>> +		vec_index++;
>> +	}
>> +
>> +	ret = vec_index;
>> +
>> +return_status:
>> +	arg.start = (unsigned long)walk_end;
> 
> This doesn't look right. pagemap_scan_pmd_entry can stop early. For
> example, it can happen when it hits the max_pages limit. Do I miss
> something?
The walk_page_range() calls pagemap_scan_pmd_entry(). So whatever status is
returned from pagemap_scan_pmd_entry(), walk_page_range() returns to this
function where we are handling the status code. After while loop starts,
there is only 1 return path. Hence there isn't any path missing where we'll
miss setting arg.start.

> 
>> +	if (copy_to_user(&uarg->start, &arg.start, sizeof(arg.start)))
>> +		ret = -EFAULT;
>> +
>> +	if (IS_PM_SCAN_WP(p.flags))
>> +		mmu_notifier_invalidate_range_end(&range);
>> +
>> +	kfree(p.vec_buf);
>> +	return ret;
>> +}
>> +
Muhammad Usama Anjum July 3, 2023, 9:30 a.m. UTC | #4
On 6/28/23 5:11 PM, Michał Mirosław wrote:
> Hi Muhammad,
> 
> I'd really like to reduce the number of conditionals in this code by a
> third. I believe it is possible if we agree to rework the API a bit
> (making it always a GET operation with optional WP) and relax some of
> the restrictions on the input values.
I have discussed this multiple times in revisions that why we need the GET,
GET+WP and WP operations. We cannot remove only WP operation. Please
consult previous revisions on why we need exclusive WP operation without GET.

> 
> Also please include *all* changes you make in a patch in the changelog
> - I just noticed e.g. a `mmu_notifier` code that I haven't seen before
> in the review.
mmu_notifier code was added v16. I'll try to write more longer specific
changelog.

> 
> A general request from me would be to move all the page classification
> code to one place. That would mean replacing:
> 
>> +               bitmap = PM_SCAN_FLAGS(is_written, false,
>> +                                      pmd_present(*pmd), is_swap_pmd(*pmd),
>> +                                      pmd_present(*pmd) && is_zero_pfn(pmd_pfn(*pmd)));
> 
> With:
> 
>        category_flags = pmd_page_category_flags(*pmd);
> 
> ... and adding a pmd_page_category_flags() and pte/hugepage versions
> together at the top. The idea here is to separate concerns: split
> classification of pages from the code acting on the classification
> results and be able to easily understand (and review for correctness)
> how corresponding classes are derived from PTE and PMD values.
This is just a cosmetic change. I don't think we should keep on doing
cosmetic change when code is already readable and obvious. At this rate I
would keep doing cosmetic changes on the argument that we are making is
more easier to read. I've been working on this patch and particularly
task_mmu.c file from last 1 year. The current code is in same flavor as the
rest of the `task_mmu.c` file. All the functions above are using #ifdef THP.

> 
> I'd prefer the naming of the functions and variables to follow that it
> is a classification result, not a generic "bitmap" or "flags", so that
> it's harder to confuse them with page flags as used elsewhere in the
> kernel.
> 
> (inline review follows)
> 
> BTW, thanks for doing this!
> 
> On Wed, 28 Jun 2023 at 12:00, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
> [...]
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index 507cd4e59d07..be747fb89827 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
> [...]
>> +#define PM_SCAN_REQUIRE_UFFD   (1ULL << 63)
>> +
>> +#define PM_SCAN_FOUND_MAX_PAGES        (1)
>> +#define PM_SCAN_END_WALK       (-256)
> [...]
>> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
>> +                                 struct mm_walk *walk)
>> +{
>> +       struct pagemap_scan_private *p = walk->private;
>> +       struct vm_area_struct *vma = walk->vma;
>> +
>> +       if ((p->flags & PM_SCAN_REQUIRE_UFFD) && (!userfaultfd_wp_async(vma) ||
>> +           !userfaultfd_wp_use_markers(vma)))
>> +               return -EPERM;
> 
> p->vma_can_wp = userfaultfd_wp_async(vma) && userfaultfd_wp_use_markers(vma);
> 
> And then use that in classification functions and for skipping WP for
> pages not prepared for that. The PM_SCAN_REQUIRE_UFFD won't be needed
> then.
No we don't want to skip, but return error if a VMA isn't setup to use WP
Async. Even after adding this new `vma_can_wp` variable. We'll have to
check in every test_walk() if UFFD is initialized on the VMA for every VMA.
At max, `vma_can_wp` can be setup to hold the true or false for
PM_SCAN_REQUIRE_UFFD which makes it cosmetic change again.

> 
>> +
>> +       if (vma->vm_flags & VM_PFNMAP)
>> +               return 1;
> 
> Why do we skip VM_PFNMAP vmas? This will skip over at least VDSO and
> VVAR pages. Those two are not that big a problem, but it should be at
> least documented what ranges are skipped and why.
We are following pagemap_read(). We don't want to expose any additional
information which pagemap_read() doesn't show.

> 
>> +
>> +       return 0;
>> +}
>> +
>> +static int pagemap_scan_output(unsigned long bitmap,
>> +                              struct pagemap_scan_private *p,
>> +                              unsigned long addr, unsigned int n_pages)
>> +{
>> +       struct page_region *cur_buf = &p->cur_buf;
>> +
>> +       if (!n_pages)
>> +               return -EINVAL;
> 
> How can this happen?
This was there to check validity of n_pages before proceeding. By doing
static analysis, I can see that it isn't needed anymore and can be removed.

> 
>> +       bitmap &= p->return_mask;
>> +
>> +       if (cur_buf->flags == bitmap &&
>> +           cur_buf->start + cur_buf->len * PAGE_SIZE == addr) {
> 
> BTW, maybe the ranges returned to the user could also use start .. end
> form, and then this would be simplified to `cur->flags ==
> categories_to_report && cur->end == addr`.
> 
>> +               cur_buf->len += n_pages;
>> +               p->found_pages += n_pages;
>> +       } else {
>> +               if (cur_buf->len) {
>> +                       if (p->vec_buf_index >= p->vec_buf_len)
>> +                               return PM_SCAN_END_WALK;
>> +
>> +                       memcpy(&p->vec_buf[p->vec_buf_index], cur_buf,
>> +                              sizeof(*p->vec_buf));
>> +                       p->vec_buf_index++;
>> +               }
>> +
>> +               cur_buf->start = addr;
>> +               cur_buf->len = n_pages;
>> +               cur_buf->flags = bitmap;
>> +               p->found_pages += n_pages;
>> +       }
>> +
>> +       if (p->found_pages == p->max_pages)
>> +               return PM_SCAN_FOUND_MAX_PAGES;
> 
> Since we now return the address the walk ended at, what is the
> difference for PM_SCAN_END_WALK and PM_SCAN_FOUND_MAX_PAGES, and do we
> still need any of those instead of -ENOSPC or `n_pages !=
> scan_output(...)` check?
Yes, we need two different return codes from here to judge if we need to wp
the current range or not. When PM_SCAN_FOUND_MAX_PAGES is returned we need
to wp the current range. But when END_WALK is returned we don't need to
perform wp and return.

> 
>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> +                                 unsigned long end, struct mm_walk *walk)
>> +{
>> +       bool is_written, flush = false, interesting = true;
>> +       struct pagemap_scan_private *p = walk->private;
>> +       struct vm_area_struct *vma = walk->vma;
>> +       unsigned long bitmap, addr = end;
>> +       pte_t *pte, *orig_pte, ptent;
>> +       spinlock_t *ptl;
>> +       int ret = 0;
>> +
>> +       arch_enter_lazy_mmu_mode();
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>> +       if (ptl) {
>> +               unsigned long n_pages = (end - start)/PAGE_SIZE;
>> +
>> +               if (n_pages > p->max_pages - p->found_pages)
>> +                       n_pages = p->max_pages - p->found_pages;
>> +
>> +               is_written = !is_pmd_uffd_wp(*pmd);
>> +
>> +               bitmap = PM_SCAN_FLAGS(is_written, false,
>> +                                      pmd_present(*pmd), is_swap_pmd(*pmd),
>> +                                      pmd_present(*pmd) && is_zero_pfn(pmd_pfn(*pmd)));
>> +
>> +               if (IS_PM_SCAN_GET(p->flags))
>> +                       interesting = pagemap_scan_is_interesting_page(bitmap, p);
>> +
>> +               if (interesting) {
>> +                       /*
>> +                        * Break huge page into small pages if the WP operation
>> +                        * need to be performed is on a portion of the huge page.
>> +                        */
>> +                       if (is_written && IS_PM_SCAN_WP(p->flags) &&
>> +                           n_pages < HPAGE_SIZE/PAGE_SIZE) {
>> +                               spin_unlock(ptl);
>> +
>> +                               split_huge_pmd(vma, pmd, start);
>> +                               goto process_smaller_pages;
>> +                       }
>> +
>> +                       if (IS_PM_SCAN_GET(p->flags))
>> +                               ret = pagemap_scan_output(bitmap, p, start, n_pages);
>> +
>> +                       if (IS_PM_SCAN_WP(p->flags) && is_written && ret >= 0) {
>> +                               make_uffd_wp_pmd(vma, start, pmd);
>> +                               flush_tlb_range(vma, start, end);
>> +                       }
>> +               }
>> +
>> +               spin_unlock(ptl);
>> +               arch_leave_lazy_mmu_mode();
>> +
>> +               return ret;
>> +       }
> 
> Could you split the THP code and use it here like:
> 
> enter_lazy_mmu();
> 
> ret = scan_thp(...);
> if (ret != -ENOTTY) {
>   leave_lazy_mmu();
>   return ret;
> }
> 
> To avoid having #ifdef here, it can be moved to the scan_thp() only
> (either in the body or having a dummy `return -ENOTTY;` version in an
> #else block).
I'm following the flavour of code in this file as explained above. This
isn't needed.

> 
> BTW, there's no `cond_resched()` in the THP case - is that intentional?
AFAIU, processing of THP here is fast as compared to 512 iterations over
smaller pages. So cond_resched() isn't present here. But overall, I've not
seen cond_resched() being used after THPs.

> 
>> +
>> +process_smaller_pages:
>> +#endif
>> +
>> +       orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
>> +       if (!pte) {
>> +               walk->action = ACTION_AGAIN;
>> +               return 0;
>> +       }
>> +
>> +       for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
>> +               ptent = ptep_get(pte);
>> +               is_written = !is_pte_uffd_wp(ptent);
>> +
>> +               bitmap = PM_SCAN_FLAGS(is_written, pagemap_scan_is_file(vma, ptent, addr),
>> +                                      pte_present(ptent), is_swap_pte(ptent),
>> +                                      pte_present(ptent) && is_zero_pfn(pte_pfn(ptent)));
>> +
>> +               if (IS_PM_SCAN_GET(p->flags)) {
>> +                       interesting = pagemap_scan_is_interesting_page(bitmap, p);
> 
> If we consider GET as always done, this can be:
> 
> if (!is_interesting) continue;
> 
> And that would simplify the code greatly.
> 
>> +                       if (interesting)
>> +                               ret = pagemap_scan_output(bitmap, p, addr, 1);
>> +               }
>> +
>> +               if (IS_PM_SCAN_WP(p->flags) && is_written && interesting &&
>> +                   ret >= 0) {
>> +                       make_uffd_wp_pte(vma, addr, pte);
>> +                       flush = true;
>> +               }
>> +       }
>> +
>> +       if (flush)
>> +               flush_tlb_range(vma, start, addr);
> 
> Would optimizing the TLB flush range be beneficial here? If yes, the
> loop above could do:
> 
> flush_start = end;
> flush_end = end;
> ...
> for (...) {
>   if (mark_wp) {
>     make_wp();
>     if (flush_start > addr)
>       flush_start = addr;
>     flush_end = addr;
>   }
> }
> 
> to reduce the flushed range.
Just tried this. There is some bug or hidden restriction. It isn't working.
Maybe we can look at it latter. Reducing the flush size may actually be
beneficial.

> 
>> +
>> +       pte_unmap_unlock(orig_pte, ptl);
>> +       arch_leave_lazy_mmu_mode();
>> +
>> +       cond_resched();
>> +       return ret;
>> +}
>> +
>> +#ifdef CONFIG_HUGETLB_PAGE
>> +static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
>> +                                     unsigned long start, unsigned long end,
>> +                                     struct mm_walk *walk)
>> +{
>> +       unsigned long n_pages = (end - start)/PAGE_SIZE;
>> +       struct pagemap_scan_private *p = walk->private;
>> +       struct vm_area_struct *vma = walk->vma;
>> +       bool is_written, interesting = true;
>> +       struct hstate *h = hstate_vma(vma);
>> +       unsigned long bitmap;
>> +       spinlock_t *ptl;
>> +       int ret = 0;
>> +       pte_t ptent;
>> +
>> +       if (IS_PM_SCAN_WP(p->flags) && n_pages < HPAGE_SIZE/PAGE_SIZE)
>> +               return -EINVAL;
> 
> Shouldn't this be checked after the `n_pages` is reduced? BTW, maybe
> check it only if the page `is_written` to avoid the conditional for
> all walks where WP is not used?
No, we (you and me) had reached to this current state after discussion. We
are returning -EINVAL here for the user to understand that he had not
passed the entire hugetlb page address range when specified WP op.

> 
>> +
>> +       if (n_pages > p->max_pages - p->found_pages)
>> +               n_pages = p->max_pages - p->found_pages;
> [...]
> 
> Proposing the relaxed API (and, while at it, less generic constant and
> field names) below.
> 
>> +static int pagemap_scan_args_valid(struct pm_scan_arg *arg, unsigned long start,
>> +                                  unsigned long end, struct page_region __user *vec)
>> +{
>> +       /* Detect illegal size, flags, len and masks */
>> +       if (arg->size != sizeof(struct pm_scan_arg))
>> +               return -EINVAL;
> 
>> +       if (!arg->flags)
>> +               return -EINVAL;
>> +       if (arg->flags & ~PM_SCAN_OPS)
>> +               return -EINVAL;
> 
> if (arg->flags & ~PM_SCAN_FLAGS)
>   return -EINVAL;
We need if (!flags) condition for the case when flag is zero.

> 
>> +       if (!(end - start))
>> +               return -EINVAL;
> 
> Remove. Nothing bad will happen when trying to scan an empty range.
We should return error. It is user's fault for not providing the correct range.

> 
>> +       if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
>> +            arg->return_mask) & ~PM_SCAN_BITS_ALL)
>> +               return -EINVAL;
> 
> if ((...) & ~PM_SCAN_PAGE_CATEGORIES)
>   return -EINVAL;
> 
>> +       if (!arg->required_mask && !arg->anyof_mask &&
>> +           !arg->excluded_mask)
>> +               return -EINVAL;
> 
> Remove. Will inspect all pages then.
> 
>> +       if (!arg->return_mask)
>> +               return -EINVAL;
> 
> Remove. Will not return any ranges then. (But will WP, if requested.)
> 
>> +       /* Validate memory range */
>> +       if (!IS_ALIGNED(start, PAGE_SIZE))
>> +               return -EINVAL;
>> +       if (!access_ok((void __user *)start, end - start))
>> +               return -EFAULT;
> 
>> +       if (IS_PM_SCAN_GET(arg->flags)) {
> 
> Remove. Do GET always.
> 
>> +               if (arg->vec_len == 0)
>> +                       return -EINVAL;
> 
> Remove. Will end the walk at the first matching page (and return its
> address in `start`).
> 
>> +               if (!vec)
>> +                       return -EINVAL;
>> +               if (!access_ok((void __user *)vec,
>> +                              arg->vec_len * sizeof(struct page_region)))
>> +                       return -EFAULT;
> 
> Check only if vec_len != 0. BTW, return EFAULT for `!vec`.
> 
>> +       }
>> +
>> +       if (IS_PM_SCAN_WP(arg->flags) && !IS_PM_SCAN_GET(arg->flags) &&
>> +           arg->max_pages)
>> +               return -EINVAL;
> 
> With return_mask == 0, arg->max_pages will be ignored anyway. We can
> just document that this limits the pages reported in the output
> vector. (And from that follows that if not returning anything, the
> value doesn't make a difference.)
I'm sorry. I don't see any value in channgig the interface again. The
interface is based on requirement, not the simplification of the code.

> 
> [...]
>> +       p.max_pages = (arg.max_pages) ? arg.max_pages : ULONG_MAX;
>> +       p.found_pages = 0;
>> +       p.required_mask = arg.required_mask;
>> +       p.anyof_mask = arg.anyof_mask;
>> +       p.excluded_mask = arg.excluded_mask;
>> +       p.return_mask = arg.return_mask;
>> +       p.flags = arg.flags;
>> +       p.flags |= ((p.required_mask | p.anyof_mask | p.excluded_mask) &
>> +                   PAGE_IS_WRITTEN) ? PM_SCAN_REQUIRE_UFFD : 0;
>> +       p.cur_buf.start = p.cur_buf.len = p.cur_buf.flags = 0;
>> +       p.vec_buf = NULL;
>> +       p.vec_buf_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
> 
> This needs to be less by 1 to account for the entry in p.cur_buf.
No, that would be bug.

> 
>> +       /*
>> +        * Allocate smaller buffer to get output from inside the page walk
>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>> +        * we want to return output to user in compact form where no two
>> +        * consecutive regions should be continuous and have the same flags.
>> +        * So store the latest element in p.cur_buf between different walks and
>> +        * store the p.cur_buf at the end of the walk to the user buffer.
>> +        */
>> +       if (IS_PM_SCAN_GET(p.flags)) {
> 
> if (p.vec_len != 0)
No, this is wrong.

> 
>> +               p.vec_buf = kmalloc_array(p.vec_buf_len, sizeof(*p.vec_buf),
>> +                                         GFP_KERNEL);
>> +               if (!p.vec_buf)
>> +                       return -ENOMEM;
>> +       }
>> +
>> +       if (IS_PM_SCAN_WP(p.flags)) {
>> +               mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
>> +                                       mm, start, end);
>> +               mmu_notifier_invalidate_range_start(&range);
>> +       }
> 
> Please add a comment why this is needed.
This was added when we had left using change_pmd_range(). This is a
notification that protection for the range has been changed.

> 
>> +       walk_start = walk_end = start;
>> +       while (walk_end < end && !ret) {
>> +               if (IS_PM_SCAN_GET(p.flags)) {
> 
> if (p.vec_len)
> 
>> +                       p.vec_buf_index = 0;
>> +
>> +                       /*
>> +                        * All data is copied to cur_buf first. When more data
>> +                        * is found, we push cur_buf to vec_buf and copy new
>> +                        * data to cur_buf. Subtract 1 from length as the
>> +                        * index of cur_buf isn't counted in length.
>> +                        */
>> +                       empty_slots = arg.vec_len - vec_index;
>> +                       p.vec_buf_len = min(p.vec_buf_len, empty_slots - 1);
>> +               }
>> +
>> +               ret = mmap_read_lock_killable(mm);
>> +               if (ret)
>> +                       goto return_status;
> 
> This could be _interruptible() now as we can return early overwriting
> `start` so that userspace can detect that the walk was interrupted by
> a signal (start != end).
> 
>> +               walk_end = min((walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK, end);
>> +
>> +               ret = walk_page_range(mm, walk_start, walk_end,
>> +                                     &pagemap_scan_ops, &p);
>> +               mmap_read_unlock(mm);
>> +
>> +               if (ret && ret != PM_SCAN_FOUND_MAX_PAGES &&
>> +                   ret != PM_SCAN_END_WALK)
>> +                       goto return_status;
> 
> So now there is no difference in the two special error values.
Inside pagemap_scan_pmd_entry() the purpose is different.

> 
>> +               walk_start = walk_end;
>> +               if (IS_PM_SCAN_GET(p.flags) && p.vec_buf_index) {
> 
> `if (p.vec_buf_index)` should be enough, as if not reporting the
> ranges, the index will never increase.
Agreed. Will update.

> 
>> +                       if (copy_to_user(&vec[vec_index], p.vec_buf,
>> +                                        p.vec_buf_index * sizeof(*p.vec_buf))) {
>> +                               /*
>> +                                * Return error even though the OP succeeded
>> +                                */
>> +                               ret = -EFAULT;
>> +                               goto return_status;
>> +                       }
>> +                       vec_index += p.vec_buf_index;
> 
> p.vec_buf_index = 0;
> 
> ... so that there is no need to do that at every start of the loop iteration.
Okay.

> 
>> +               }
>> +       }
>> +
>> +       if (p.cur_buf.len) {
>> +               if (copy_to_user(&vec[vec_index], &p.cur_buf, sizeof(p.cur_buf))) {
>> +                       ret = -EFAULT;
>> +                       goto return_status;
>> +               }
>> +               vec_index++;
>> +       }
>> +
>> +       ret = vec_index;
>> +
>> +return_status:
> 
> The label name looks too narrowing, considering what is being done
> here now. Maybe just 'out', as there is no other non-trivial exit path
> from the function?
Okay.

> 
>> +       arg.start = (unsigned long)walk_end;
>> +       if (copy_to_user(&uarg->start, &arg.start, sizeof(arg.start)))
>> +               ret = -EFAULT;
>> +
>> +       if (IS_PM_SCAN_WP(p.flags))
>> +               mmu_notifier_invalidate_range_end(&range);
> 
> With removal of the OP_GET, there is only a single flag left. I don't
> think it is useful then to hide it behind a macro. It should be
> readable as `p.flags & PM_SCAN_DO_WP`.
> 
> [...]
>> +/* Bits are set in flags of the page_region and masks in pm_scan_args */
>> +#define PAGE_IS_WRITTEN                (1 << 0)
>> +#define PAGE_IS_FILE           (1 << 1)
>> +#define PAGE_IS_PRESENT                (1 << 2)
>> +#define PAGE_IS_SWAPPED                (1 << 3)
>> +#define PAGE_IS_PFNZERO                (1 << 4)
> 
> Please add PAGE_IS_UFFD_WP_ENABLED (or a shorter name) - this would be
> populated from `p->vma_can_wp` in the `test_walk` implementation
> above.
> 
> Best Regards
> Michał Mirosław
Andrei Vagin July 3, 2023, 3:07 p.m. UTC | #5
On Mon, Jul 03, 2023 at 11:47:37AM +0500, Muhammad Usama Anjum wrote:
> On 6/30/23 8:01 PM, Andrei Vagin wrote:
> > On Wed, Jun 28, 2023 at 02:54:23PM +0500, Muhammad Usama Anjum wrote:
> >> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> >> the info about page table entries. The following operations are supported
> >> in this ioctl:
> >> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
> >>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT), swapped
> >>   (PAGE_IS_SWAPPED) or page has pfn zero (PAGE_IS_PFNZERO).
> >> - Find pages which have been written-to and/or write protect the pages
> >>   (atomic PM_SCAN_OP_GET + PM_SCAN_OP_WP)
> >>
> >> This IOCTL can be extended to get information about more PTE bits. The
> >> entire address range passed by user [start, end) is scanned until either
> >> the user provided buffer is full or max_pages have been found.
> >>
> >> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> > 
> > <snip>
> > 
> >> +
> >> +static long do_pagemap_scan(struct mm_struct *mm, unsigned long __arg)
> >> +{
> >> +	struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)__arg;
> >> +	unsigned long long start, end, walk_start, walk_end;
> >> +	unsigned long empty_slots, vec_index = 0;
> >> +	struct mmu_notifier_range range;
> >> +	struct page_region __user *vec;
> >> +	struct pagemap_scan_private p;
> >> +	struct pm_scan_arg arg;
> >> +	int ret = 0;
> >> +
> >> +	if (copy_from_user(&arg, uarg, sizeof(arg)))
> >> +		return -EFAULT;
> >> +
> >> +	start = untagged_addr((unsigned long)arg.start);
> >> +	end = untagged_addr((unsigned long)arg.end);
> >> +	vec = (struct page_region __user *)untagged_addr((unsigned long)arg.vec);
> >> +
> >> +	ret = pagemap_scan_args_valid(&arg, start, end, vec);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	p.max_pages = (arg.max_pages) ? arg.max_pages : ULONG_MAX;
> >> +	p.found_pages = 0;
> >> +	p.required_mask = arg.required_mask;
> >> +	p.anyof_mask = arg.anyof_mask;
> >> +	p.excluded_mask = arg.excluded_mask;
> >> +	p.return_mask = arg.return_mask;
> >> +	p.flags = arg.flags;
> >> +	p.flags |= ((p.required_mask | p.anyof_mask | p.excluded_mask) &
> >> +		    PAGE_IS_WRITTEN) ? PM_SCAN_REQUIRE_UFFD : 0;
> >> +	p.cur_buf.start = p.cur_buf.len = p.cur_buf.flags = 0;
> >> +	p.vec_buf = NULL;
> >> +	p.vec_buf_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
> >> +
> >> +	/*
> >> +	 * Allocate smaller buffer to get output from inside the page walk
> >> +	 * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> >> +	 * we want to return output to user in compact form where no two
> >> +	 * consecutive regions should be continuous and have the same flags.
> >> +	 * So store the latest element in p.cur_buf between different walks and
> >> +	 * store the p.cur_buf at the end of the walk to the user buffer.
> >> +	 */
> >> +	if (IS_PM_SCAN_GET(p.flags)) {
> >> +		p.vec_buf = kmalloc_array(p.vec_buf_len, sizeof(*p.vec_buf),
> >> +					  GFP_KERNEL);
> >> +		if (!p.vec_buf)
> >> +			return -ENOMEM;
> >> +	}
> >> +
> >> +	if (IS_PM_SCAN_WP(p.flags)) {
> >> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
> >> +					mm, start, end);
> >> +		mmu_notifier_invalidate_range_start(&range);
> >> +	}
> >> +
> >> +	walk_start = walk_end = start;
> >> +	while (walk_end < end && !ret) {
> >> +		if (IS_PM_SCAN_GET(p.flags)) {
> >> +			p.vec_buf_index = 0;
> >> +
> >> +			/*
> >> +			 * All data is copied to cur_buf first. When more data
> >> +			 * is found, we push cur_buf to vec_buf and copy new
> >> +			 * data to cur_buf. Subtract 1 from length as the
> >> +			 * index of cur_buf isn't counted in length.
> >> +			 */
> >> +			empty_slots = arg.vec_len - vec_index;
> >> +			p.vec_buf_len = min(p.vec_buf_len, empty_slots - 1);
> >> +		}
> >> +
> >> +		ret = mmap_read_lock_killable(mm);
> >> +		if (ret)
> >> +			goto return_status;
> >> +
> >> +		walk_end = min((walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK, end);
> >> +
> >> +		ret = walk_page_range(mm, walk_start, walk_end,
> >> +				      &pagemap_scan_ops, &p);
> >> +		mmap_read_unlock(mm);
> >> +
> >> +		if (ret && ret != PM_SCAN_FOUND_MAX_PAGES &&
> >> +		    ret != PM_SCAN_END_WALK)
> >> +			goto return_status;
> >> +
> >> +		walk_start = walk_end;
> >> +		if (IS_PM_SCAN_GET(p.flags) && p.vec_buf_index) {
> >> +			if (copy_to_user(&vec[vec_index], p.vec_buf,
> >> +					 p.vec_buf_index * sizeof(*p.vec_buf))) {
> >> +				/*
> >> +				 * Return error even though the OP succeeded
> >> +				 */
> >> +				ret = -EFAULT;
> >> +				goto return_status;
> >> +			}
> >> +			vec_index += p.vec_buf_index;
> >> +		}
> >> +	}
> >> +
> >> +	if (p.cur_buf.len) {
> >> +		if (copy_to_user(&vec[vec_index], &p.cur_buf, sizeof(p.cur_buf))) {
> >> +			ret = -EFAULT;
> >> +			goto return_status;
> >> +		}
> >> +		vec_index++;
> >> +	}
> >> +
> >> +	ret = vec_index;
> >> +
> >> +return_status:
> >> +	arg.start = (unsigned long)walk_end;
> > 
> > This doesn't look right. pagemap_scan_pmd_entry can stop early. For
> > example, it can happen when it hits the max_pages limit. Do I miss
> > something?
> The walk_page_range() calls pagemap_scan_pmd_entry(). So whatever status is
> returned from pagemap_scan_pmd_entry(), walk_page_range() returns to this
> function where we are handling the status code. After while loop starts,
> there is only 1 return path. Hence there isn't any path missing where we'll
> miss setting arg.start.

I mean walk_end isn't actually the end address. The end adress should be
the next page after the last revised page. Here we don't know whether
all pages in [walk_start, walk_end) has been revised.

> 
> > 
> >> +	if (copy_to_user(&uarg->start, &arg.start, sizeof(arg.start)))
> >> +		ret = -EFAULT;
> >> +
> >> +	if (IS_PM_SCAN_WP(p.flags))
> >> +		mmu_notifier_invalidate_range_end(&range);
> >> +
> >> +	kfree(p.vec_buf);
> >> +	return ret;
> >> +}
> >> +
> 
> -- 
> BR,
> Muhammad Usama Anjum
kernel test robot July 4, 2023, 10:15 p.m. UTC | #6
Hi Muhammad,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master next-20230704]
[cannot apply to v6.4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Muhammad-Usama-Anjum/userfaultfd-UFFD_FEATURE_WP_ASYNC/20230628-180259
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230628095426.1886064-3-usama.anjum%40collabora.com
patch subject: [PATCH v22 2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs
config: arc-allyesconfig (https://download.01.org/0day-ci/archive/20230705/202307050619.OArmlJne-lkp@intel.com/config)
compiler: arceb-elf-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230705/202307050619.OArmlJne-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202307050619.OArmlJne-lkp@intel.com/

All errors (new ones prefixed by >>):

   fs/proc/task_mmu.c: In function 'pagemap_scan_pmd_entry':
>> fs/proc/task_mmu.c:1996:39: error: 'HPAGE_SIZE' undeclared (first use in this function); did you mean 'PAGE_SIZE'?
    1996 |                             n_pages < HPAGE_SIZE/PAGE_SIZE) {
         |                                       ^~~~~~~~~~
         |                                       PAGE_SIZE
   fs/proc/task_mmu.c:1996:39: note: each undeclared identifier is reported only once for each function it appears in


vim +1996 fs/proc/task_mmu.c

  1959	
  1960	static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
  1961					  unsigned long end, struct mm_walk *walk)
  1962	{
  1963		bool is_written, flush = false, interesting = true;
  1964		struct pagemap_scan_private *p = walk->private;
  1965		struct vm_area_struct *vma = walk->vma;
  1966		unsigned long bitmap, addr = end;
  1967		pte_t *pte, *orig_pte, ptent;
  1968		spinlock_t *ptl;
  1969		int ret = 0;
  1970	
  1971		arch_enter_lazy_mmu_mode();
  1972	
  1973	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
  1974		ptl = pmd_trans_huge_lock(pmd, vma);
  1975		if (ptl) {
  1976			unsigned long n_pages = (end - start)/PAGE_SIZE;
  1977	
  1978			if (n_pages > p->max_pages - p->found_pages)
  1979				n_pages = p->max_pages - p->found_pages;
  1980	
  1981			is_written = !is_pmd_uffd_wp(*pmd);
  1982	
  1983			bitmap = PM_SCAN_FLAGS(is_written, false,
  1984					       pmd_present(*pmd), is_swap_pmd(*pmd),
  1985					       pmd_present(*pmd) && is_zero_pfn(pmd_pfn(*pmd)));
  1986	
  1987			if (IS_PM_SCAN_GET(p->flags))
  1988				interesting = pagemap_scan_is_interesting_page(bitmap, p);
  1989	
  1990			if (interesting) {
  1991				/*
  1992				 * Break huge page into small pages if the WP operation
  1993				 * need to be performed is on a portion of the huge page.
  1994				 */
  1995				if (is_written && IS_PM_SCAN_WP(p->flags) &&
> 1996				    n_pages < HPAGE_SIZE/PAGE_SIZE) {
  1997					spin_unlock(ptl);
  1998	
  1999					split_huge_pmd(vma, pmd, start);
  2000					goto process_smaller_pages;
  2001				}
  2002	
  2003				if (IS_PM_SCAN_GET(p->flags))
  2004					ret = pagemap_scan_output(bitmap, p, start, n_pages);
  2005	
  2006				if (IS_PM_SCAN_WP(p->flags) && is_written && ret >= 0) {
  2007					make_uffd_wp_pmd(vma, start, pmd);
  2008					flush_tlb_range(vma, start, end);
  2009				}
  2010			}
  2011	
  2012			spin_unlock(ptl);
  2013			arch_leave_lazy_mmu_mode();
  2014	
  2015			return ret;
  2016		}
  2017	
  2018	process_smaller_pages:
  2019	#endif
  2020	
  2021		orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
  2022		if (!pte) {
  2023			walk->action = ACTION_AGAIN;
  2024			return 0;
  2025		}
  2026	
  2027		for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
  2028			ptent = ptep_get(pte);
  2029			is_written = !is_pte_uffd_wp(ptent);
  2030	
  2031			bitmap = PM_SCAN_FLAGS(is_written, pagemap_scan_is_file(vma, ptent, addr),
  2032					       pte_present(ptent), is_swap_pte(ptent),
  2033					       pte_present(ptent) && is_zero_pfn(pte_pfn(ptent)));
  2034	
  2035			if (IS_PM_SCAN_GET(p->flags)) {
  2036				interesting = pagemap_scan_is_interesting_page(bitmap, p);
  2037				if (interesting)
  2038					ret = pagemap_scan_output(bitmap, p, addr, 1);
  2039			}
  2040	
  2041			if (IS_PM_SCAN_WP(p->flags) && is_written && interesting &&
  2042			    ret >= 0) {
  2043				make_uffd_wp_pte(vma, addr, pte);
  2044				flush = true;
  2045			}
  2046		}
  2047	
  2048		if (flush)
  2049			flush_tlb_range(vma, start, addr);
  2050	
  2051		pte_unmap_unlock(orig_pte, ptl);
  2052		arch_leave_lazy_mmu_mode();
  2053	
  2054		cond_resched();
  2055		return ret;
  2056	}
  2057
Muhammad Usama Anjum July 6, 2023, 5:17 a.m. UTC | #7
On 7/3/23 8:07 PM, Andrei Vagin wrote:
> On Mon, Jul 03, 2023 at 11:47:37AM +0500, Muhammad Usama Anjum wrote:
>> On 6/30/23 8:01 PM, Andrei Vagin wrote:
>>> On Wed, Jun 28, 2023 at 02:54:23PM +0500, Muhammad Usama Anjum wrote:
>>>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>>>> the info about page table entries. The following operations are supported
>>>> in this ioctl:
>>>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT), swapped
>>>>   (PAGE_IS_SWAPPED) or page has pfn zero (PAGE_IS_PFNZERO).
>>>> - Find pages which have been written-to and/or write protect the pages
>>>>   (atomic PM_SCAN_OP_GET + PM_SCAN_OP_WP)
>>>>
>>>> This IOCTL can be extended to get information about more PTE bits. The
>>>> entire address range passed by user [start, end) is scanned until either
>>>> the user provided buffer is full or max_pages have been found.
>>>>
>>>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
>>>
>>> <snip>
>>>
>>>> +
>>>> +static long do_pagemap_scan(struct mm_struct *mm, unsigned long __arg)
>>>> +{
>>>> +	struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)__arg;
>>>> +	unsigned long long start, end, walk_start, walk_end;
>>>> +	unsigned long empty_slots, vec_index = 0;
>>>> +	struct mmu_notifier_range range;
>>>> +	struct page_region __user *vec;
>>>> +	struct pagemap_scan_private p;
>>>> +	struct pm_scan_arg arg;
>>>> +	int ret = 0;
>>>> +
>>>> +	if (copy_from_user(&arg, uarg, sizeof(arg)))
>>>> +		return -EFAULT;
>>>> +
>>>> +	start = untagged_addr((unsigned long)arg.start);
>>>> +	end = untagged_addr((unsigned long)arg.end);
>>>> +	vec = (struct page_region __user *)untagged_addr((unsigned long)arg.vec);
>>>> +
>>>> +	ret = pagemap_scan_args_valid(&arg, start, end, vec);
>>>> +	if (ret)
>>>> +		return ret;
>>>> +
>>>> +	p.max_pages = (arg.max_pages) ? arg.max_pages : ULONG_MAX;
>>>> +	p.found_pages = 0;
>>>> +	p.required_mask = arg.required_mask;
>>>> +	p.anyof_mask = arg.anyof_mask;
>>>> +	p.excluded_mask = arg.excluded_mask;
>>>> +	p.return_mask = arg.return_mask;
>>>> +	p.flags = arg.flags;
>>>> +	p.flags |= ((p.required_mask | p.anyof_mask | p.excluded_mask) &
>>>> +		    PAGE_IS_WRITTEN) ? PM_SCAN_REQUIRE_UFFD : 0;
>>>> +	p.cur_buf.start = p.cur_buf.len = p.cur_buf.flags = 0;
>>>> +	p.vec_buf = NULL;
>>>> +	p.vec_buf_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
>>>> +
>>>> +	/*
>>>> +	 * Allocate smaller buffer to get output from inside the page walk
>>>> +	 * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>>>> +	 * we want to return output to user in compact form where no two
>>>> +	 * consecutive regions should be continuous and have the same flags.
>>>> +	 * So store the latest element in p.cur_buf between different walks and
>>>> +	 * store the p.cur_buf at the end of the walk to the user buffer.
>>>> +	 */
>>>> +	if (IS_PM_SCAN_GET(p.flags)) {
>>>> +		p.vec_buf = kmalloc_array(p.vec_buf_len, sizeof(*p.vec_buf),
>>>> +					  GFP_KERNEL);
>>>> +		if (!p.vec_buf)
>>>> +			return -ENOMEM;
>>>> +	}
>>>> +
>>>> +	if (IS_PM_SCAN_WP(p.flags)) {
>>>> +		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
>>>> +					mm, start, end);
>>>> +		mmu_notifier_invalidate_range_start(&range);
>>>> +	}
>>>> +
>>>> +	walk_start = walk_end = start;
>>>> +	while (walk_end < end && !ret) {
>>>> +		if (IS_PM_SCAN_GET(p.flags)) {
>>>> +			p.vec_buf_index = 0;
>>>> +
>>>> +			/*
>>>> +			 * All data is copied to cur_buf first. When more data
>>>> +			 * is found, we push cur_buf to vec_buf and copy new
>>>> +			 * data to cur_buf. Subtract 1 from length as the
>>>> +			 * index of cur_buf isn't counted in length.
>>>> +			 */
>>>> +			empty_slots = arg.vec_len - vec_index;
>>>> +			p.vec_buf_len = min(p.vec_buf_len, empty_slots - 1);
>>>> +		}
>>>> +
>>>> +		ret = mmap_read_lock_killable(mm);
>>>> +		if (ret)
>>>> +			goto return_status;
>>>> +
>>>> +		walk_end = min((walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK, end);
>>>> +
>>>> +		ret = walk_page_range(mm, walk_start, walk_end,
>>>> +				      &pagemap_scan_ops, &p);
>>>> +		mmap_read_unlock(mm);
>>>> +
>>>> +		if (ret && ret != PM_SCAN_FOUND_MAX_PAGES &&
>>>> +		    ret != PM_SCAN_END_WALK)
>>>> +			goto return_status;
>>>> +
>>>> +		walk_start = walk_end;
>>>> +		if (IS_PM_SCAN_GET(p.flags) && p.vec_buf_index) {
>>>> +			if (copy_to_user(&vec[vec_index], p.vec_buf,
>>>> +					 p.vec_buf_index * sizeof(*p.vec_buf))) {
>>>> +				/*
>>>> +				 * Return error even though the OP succeeded
>>>> +				 */
>>>> +				ret = -EFAULT;
>>>> +				goto return_status;
>>>> +			}
>>>> +			vec_index += p.vec_buf_index;
>>>> +		}
>>>> +	}
>>>> +
>>>> +	if (p.cur_buf.len) {
>>>> +		if (copy_to_user(&vec[vec_index], &p.cur_buf, sizeof(p.cur_buf))) {
>>>> +			ret = -EFAULT;
>>>> +			goto return_status;
>>>> +		}
>>>> +		vec_index++;
>>>> +	}
>>>> +
>>>> +	ret = vec_index;
>>>> +
>>>> +return_status:
>>>> +	arg.start = (unsigned long)walk_end;
>>>
>>> This doesn't look right. pagemap_scan_pmd_entry can stop early. For
>>> example, it can happen when it hits the max_pages limit. Do I miss
>>> something?
>> The walk_page_range() calls pagemap_scan_pmd_entry(). So whatever status is
>> returned from pagemap_scan_pmd_entry(), walk_page_range() returns to this
>> function where we are handling the status code. After while loop starts,
>> there is only 1 return path. Hence there isn't any path missing where we'll
>> miss setting arg.start.
> 
> I mean walk_end isn't actually the end address. The end adress should be
> the next page after the last revised page. Here we don't know whether
> all pages in [walk_start, walk_end) has been revised.
Sorry, understood. Let me post the new patch series.

> 
>>
>>>
>>>> +	if (copy_to_user(&uarg->start, &arg.start, sizeof(arg.start)))
>>>> +		ret = -EFAULT;
>>>> +
>>>> +	if (IS_PM_SCAN_WP(p.flags))
>>>> +		mmu_notifier_invalidate_range_end(&range);
>>>> +
>>>> +	kfree(p.vec_buf);
>>>> +	return ret;
>>>> +}
>>>> +
>>
>> -- 
>> BR,
>> Muhammad Usama Anjum
diff mbox series

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 507cd4e59d07..be747fb89827 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@ 
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1765,11 +1766,575 @@  static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+#define PM_SCAN_REQUIRE_UFFD	(1ULL << 63)
+
+#define PM_SCAN_FOUND_MAX_PAGES	(1)
+#define PM_SCAN_END_WALK	(-256)
+
+#define PM_SCAN_BITS_ALL	(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED |	\
+				 PAGE_IS_PFNZERO)
+#define PM_SCAN_OPS		(PM_SCAN_OP_GET | PM_SCAN_OP_WP)
+#define IS_PM_SCAN_GET(flags)	(flags & PM_SCAN_OP_GET)
+#define IS_PM_SCAN_WP(flags)	(flags & PM_SCAN_OP_WP)
+
+#define PM_SCAN_FLAGS(wt, file, present, swap, pfnzero)	\
+	((wt) | ((file) << 1) | ((present) << 2) |	\
+	 ((swap) << 3) | ((pfnzero) << 4))
+
+struct pagemap_scan_private {
+	struct page_region *vec_buf, cur_buf;
+	unsigned long vec_buf_len, vec_buf_index, max_pages, found_pages;
+	unsigned long long flags, required_mask, anyof_mask, excluded_mask, return_mask;
+};
+
+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	return (pte_present(pte) && pte_uffd_wp(pte)) ||
+	       pte_swp_uffd_wp_any(pte);
+}
+
+static inline void make_uffd_wp_pte(struct vm_area_struct *vma,
+				    unsigned long addr, pte_t *pte)
+{
+	pte_t ptent = ptep_get(pte);
+
+	if (pte_present(ptent)) {
+		pte_t old_pte;
+
+		old_pte = ptep_modify_prot_start(vma, addr, pte);
+		ptent = pte_mkuffd_wp(ptent);
+		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
+	} else if (is_swap_pte(ptent)) {
+		ptent = pte_swp_mkuffd_wp(ptent);
+		set_pte_at(vma->vm_mm, addr, pte, ptent);
+	} else {
+		set_pte_at(vma->vm_mm, addr, pte,
+			   make_pte_marker(PTE_MARKER_UFFD_WP));
+	}
+}
+
+static inline bool pagemap_scan_is_file(struct vm_area_struct *vma, pte_t ptent,
+					unsigned long addr)
+{
+	struct page *page = NULL;
+	swp_entry_t entry;
+
+	if (pte_present(ptent)) {
+		page = vm_normal_page(vma, addr, ptent);
+	} else {
+		entry = pte_to_swp_entry(ptent);
+		if (is_pfn_swap_entry(entry))
+			page = pfn_swap_entry_to_page(entry);
+	}
+
+	if (page && !PageAnon(page))
+		return true;
+
+	return false;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	return (pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+	       (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd));
+}
+
+static inline void make_uffd_wp_pmd(struct vm_area_struct *vma,
+				    unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t old, pmd = *pmdp;
+
+	if (pmd_present(pmd)) {
+		old = pmdp_invalidate_ad(vma, addr, pmdp);
+		pmd = pmd_mkuffd_wp(old);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+		pmd = pmd_swp_mkuffd_wp(pmd);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	}
+}
+#endif
+
+#ifdef CONFIG_HUGETLB_PAGE
+static inline bool is_huge_pte_uffd_wp(pte_t pte)
+{
+	return (pte_present(pte) && huge_pte_uffd_wp(pte)) ||
+	       pte_swp_uffd_wp_any(pte);
+}
+
+static inline void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t *ptep,
+					 pte_t ptent)
+{
+	if (is_hugetlb_entry_hwpoisoned(ptent) || is_pte_marker(ptent))
+		return;
+
+	if (is_hugetlb_entry_migration(ptent))
+		set_huge_pte_at(vma->vm_mm, addr, ptep,
+				pte_swp_mkuffd_wp(ptent));
+	else if (!huge_pte_none(ptent))
+		huge_ptep_modify_prot_commit(vma, addr, ptep, ptent,
+					     huge_pte_mkuffd_wp(ptent));
+	else
+		set_huge_pte_at(vma->vm_mm, addr, ptep,
+				make_pte_marker(PTE_MARKER_UFFD_WP));
+}
+
+static inline bool pagemap_scan_is_huge_file(pte_t pte)
+{
+	if (pte_present(pte) && (!PageAnon(pte_page(pte))))
+		return true;
+
+	return false;
+}
+#endif
+
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
+				  struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if ((p->flags & PM_SCAN_REQUIRE_UFFD) && (!userfaultfd_wp_async(vma) ||
+	    !userfaultfd_wp_use_markers(vma)))
+		return -EPERM;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+
+	return 0;
+}
+
+static int pagemap_scan_output(unsigned long bitmap,
+			       struct pagemap_scan_private *p,
+			       unsigned long addr, unsigned int n_pages)
+{
+	struct page_region *cur_buf = &p->cur_buf;
+
+	if (!n_pages)
+		return -EINVAL;
+
+	bitmap &= p->return_mask;
+
+	if (cur_buf->flags == bitmap &&
+	    cur_buf->start + cur_buf->len * PAGE_SIZE == addr) {
+		cur_buf->len += n_pages;
+		p->found_pages += n_pages;
+	} else {
+		if (cur_buf->len) {
+			if (p->vec_buf_index >= p->vec_buf_len)
+				return PM_SCAN_END_WALK;
+
+			memcpy(&p->vec_buf[p->vec_buf_index], cur_buf,
+			       sizeof(*p->vec_buf));
+			p->vec_buf_index++;
+		}
+
+		cur_buf->start = addr;
+		cur_buf->len = n_pages;
+		cur_buf->flags = bitmap;
+		p->found_pages += n_pages;
+	}
+
+	if (p->found_pages == p->max_pages)
+		return PM_SCAN_FOUND_MAX_PAGES;
+
+	return 0;
+}
+
+static bool pagemap_scan_is_interesting_page(unsigned long bitmap,
+					     struct pagemap_scan_private *p)
+{
+	if ((p->required_mask & bitmap) != p->required_mask)
+		return false;
+	if (p->anyof_mask && !(p->anyof_mask & bitmap))
+		return false;
+	if (p->excluded_mask & bitmap)
+		return false;
+
+	return true;
+}
+
+static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+				  unsigned long end, struct mm_walk *walk)
+{
+	bool is_written, flush = false, interesting = true;
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long bitmap, addr = end;
+	pte_t *pte, *orig_pte, ptent;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	arch_enter_lazy_mmu_mode();
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		unsigned long n_pages = (end - start)/PAGE_SIZE;
+
+		if (n_pages > p->max_pages - p->found_pages)
+			n_pages = p->max_pages - p->found_pages;
+
+		is_written = !is_pmd_uffd_wp(*pmd);
+
+		bitmap = PM_SCAN_FLAGS(is_written, false,
+				       pmd_present(*pmd), is_swap_pmd(*pmd),
+				       pmd_present(*pmd) && is_zero_pfn(pmd_pfn(*pmd)));
+
+		if (IS_PM_SCAN_GET(p->flags))
+			interesting = pagemap_scan_is_interesting_page(bitmap, p);
+
+		if (interesting) {
+			/*
+			 * Break huge page into small pages if the WP operation
+			 * need to be performed is on a portion of the huge page.
+			 */
+			if (is_written && IS_PM_SCAN_WP(p->flags) &&
+			    n_pages < HPAGE_SIZE/PAGE_SIZE) {
+				spin_unlock(ptl);
+
+				split_huge_pmd(vma, pmd, start);
+				goto process_smaller_pages;
+			}
+
+			if (IS_PM_SCAN_GET(p->flags))
+				ret = pagemap_scan_output(bitmap, p, start, n_pages);
+
+			if (IS_PM_SCAN_WP(p->flags) && is_written && ret >= 0) {
+				make_uffd_wp_pmd(vma, start, pmd);
+				flush_tlb_range(vma, start, end);
+			}
+		}
+
+		spin_unlock(ptl);
+		arch_leave_lazy_mmu_mode();
+
+		return ret;
+	}
+
+process_smaller_pages:
+#endif
+
+	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+	if (!pte) {
+		walk->action = ACTION_AGAIN;
+		return 0;
+	}
+
+	for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
+		ptent = ptep_get(pte);
+		is_written = !is_pte_uffd_wp(ptent);
+
+		bitmap = PM_SCAN_FLAGS(is_written, pagemap_scan_is_file(vma, ptent, addr),
+				       pte_present(ptent), is_swap_pte(ptent),
+				       pte_present(ptent) && is_zero_pfn(pte_pfn(ptent)));
+
+		if (IS_PM_SCAN_GET(p->flags)) {
+			interesting = pagemap_scan_is_interesting_page(bitmap, p);
+			if (interesting)
+				ret = pagemap_scan_output(bitmap, p, addr, 1);
+		}
+
+		if (IS_PM_SCAN_WP(p->flags) && is_written && interesting &&
+		    ret >= 0) {
+			make_uffd_wp_pte(vma, addr, pte);
+			flush = true;
+		}
+	}
+
+	if (flush)
+		flush_tlb_range(vma, start, addr);
+
+	pte_unmap_unlock(orig_pte, ptl);
+	arch_leave_lazy_mmu_mode();
+
+	cond_resched();
+	return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
+				      unsigned long start, unsigned long end,
+				      struct mm_walk *walk)
+{
+	unsigned long n_pages = (end - start)/PAGE_SIZE;
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	bool is_written, interesting = true;
+	struct hstate *h = hstate_vma(vma);
+	unsigned long bitmap;
+	spinlock_t *ptl;
+	int ret = 0;
+	pte_t ptent;
+
+	if (IS_PM_SCAN_WP(p->flags) && n_pages < HPAGE_SIZE/PAGE_SIZE)
+		return -EINVAL;
+
+	if (n_pages > p->max_pages - p->found_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	if (IS_PM_SCAN_WP(p->flags)) {
+		i_mmap_lock_write(vma->vm_file->f_mapping);
+		ptl = huge_pte_lock(h, vma->vm_mm, ptep);
+	}
+
+	ptent = huge_ptep_get(ptep);
+	is_written = !is_huge_pte_uffd_wp(ptent);
+
+	bitmap = PM_SCAN_FLAGS(is_written, pagemap_scan_is_huge_file(ptent),
+			       pte_present(ptent), is_swap_pte(ptent),
+			       pte_present(ptent) && is_zero_pfn(pte_pfn(ptent)));
+
+	if (IS_PM_SCAN_GET(p->flags))
+		interesting = pagemap_scan_is_interesting_page(bitmap, p);
+
+	if (interesting) {
+		/*
+		 * Partial hugetlb page clear isn't supported
+		 */
+		if (is_written && IS_PM_SCAN_WP(p->flags) &&
+		    n_pages < HPAGE_SIZE/PAGE_SIZE) {
+			ret = PM_SCAN_END_WALK;
+			goto unlock_and_return;
+		}
+
+		if (IS_PM_SCAN_GET(p->flags))
+			ret = pagemap_scan_output(bitmap, p, start, n_pages);
+
+		if (IS_PM_SCAN_WP(p->flags) && is_written && ret >= 0) {
+			make_uffd_wp_huge_pte(vma, start, ptep, ptent);
+			flush_hugetlb_tlb_range(vma, start, end);
+		}
+	}
+
+unlock_and_return:
+	if (IS_PM_SCAN_WP(p->flags)) {
+		spin_unlock(ptl);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	}
+
+	return ret;
+}
+#else
+#define pagemap_scan_hugetlb_entry NULL
+#endif
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
+				 int depth, struct mm_walk *walk)
+{
+	unsigned long n_pages = (end - addr)/PAGE_SIZE;
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int ret = 0;
+
+	if (!vma || !IS_PM_SCAN_GET(p->flags))
+		return 0;
+
+	if (n_pages > p->max_pages - p->found_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	ret = pagemap_scan_output(PM_SCAN_FLAGS(false, false, false, false, false),
+				  p, addr, n_pages);
+
+	return ret;
+}
+
+static const struct mm_walk_ops pagemap_scan_ops = {
+	.test_walk = pagemap_scan_test_walk,
+	.pmd_entry = pagemap_scan_pmd_entry,
+	.pte_hole = pagemap_scan_pte_hole,
+	.hugetlb_entry = pagemap_scan_hugetlb_entry,
+};
+
+static int pagemap_scan_args_valid(struct pm_scan_arg *arg, unsigned long start,
+				   unsigned long end, struct page_region __user *vec)
+{
+	/* Detect illegal size, flags, len and masks */
+	if (arg->size != sizeof(struct pm_scan_arg))
+		return -EINVAL;
+	if (!arg->flags)
+		return -EINVAL;
+	if (arg->flags & ~PM_SCAN_OPS)
+		return -EINVAL;
+	if (!(end - start))
+		return -EINVAL;
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
+	     arg->return_mask) & ~PM_SCAN_BITS_ALL)
+		return -EINVAL;
+	if (!arg->required_mask && !arg->anyof_mask &&
+	    !arg->excluded_mask)
+		return -EINVAL;
+	if (!arg->return_mask)
+		return -EINVAL;
+
+	/* Validate memory range */
+	if (!IS_ALIGNED(start, PAGE_SIZE))
+		return -EINVAL;
+	if (!access_ok((void __user *)start, end - start))
+		return -EFAULT;
+
+	if (IS_PM_SCAN_GET(arg->flags)) {
+		if (arg->vec_len == 0)
+			return -EINVAL;
+		if (!vec)
+			return -EINVAL;
+		if (!access_ok((void __user *)vec,
+			       arg->vec_len * sizeof(struct page_region)))
+			return -EFAULT;
+	}
+
+	if (IS_PM_SCAN_WP(arg->flags) && !IS_PM_SCAN_GET(arg->flags) &&
+	    arg->max_pages)
+		return -EINVAL;
+
+	return 0;
+}
+
+static long do_pagemap_scan(struct mm_struct *mm, unsigned long __arg)
+{
+	struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)__arg;
+	unsigned long long start, end, walk_start, walk_end;
+	unsigned long empty_slots, vec_index = 0;
+	struct mmu_notifier_range range;
+	struct page_region __user *vec;
+	struct pagemap_scan_private p;
+	struct pm_scan_arg arg;
+	int ret = 0;
+
+	if (copy_from_user(&arg, uarg, sizeof(arg)))
+		return -EFAULT;
+
+	start = untagged_addr((unsigned long)arg.start);
+	end = untagged_addr((unsigned long)arg.end);
+	vec = (struct page_region __user *)untagged_addr((unsigned long)arg.vec);
+
+	ret = pagemap_scan_args_valid(&arg, start, end, vec);
+	if (ret)
+		return ret;
+
+	p.max_pages = (arg.max_pages) ? arg.max_pages : ULONG_MAX;
+	p.found_pages = 0;
+	p.required_mask = arg.required_mask;
+	p.anyof_mask = arg.anyof_mask;
+	p.excluded_mask = arg.excluded_mask;
+	p.return_mask = arg.return_mask;
+	p.flags = arg.flags;
+	p.flags |= ((p.required_mask | p.anyof_mask | p.excluded_mask) &
+		    PAGE_IS_WRITTEN) ? PM_SCAN_REQUIRE_UFFD : 0;
+	p.cur_buf.start = p.cur_buf.len = p.cur_buf.flags = 0;
+	p.vec_buf = NULL;
+	p.vec_buf_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
+
+	/*
+	 * Allocate smaller buffer to get output from inside the page walk
+	 * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
+	 * we want to return output to user in compact form where no two
+	 * consecutive regions should be continuous and have the same flags.
+	 * So store the latest element in p.cur_buf between different walks and
+	 * store the p.cur_buf at the end of the walk to the user buffer.
+	 */
+	if (IS_PM_SCAN_GET(p.flags)) {
+		p.vec_buf = kmalloc_array(p.vec_buf_len, sizeof(*p.vec_buf),
+					  GFP_KERNEL);
+		if (!p.vec_buf)
+			return -ENOMEM;
+	}
+
+	if (IS_PM_SCAN_WP(p.flags)) {
+		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
+					mm, start, end);
+		mmu_notifier_invalidate_range_start(&range);
+	}
+
+	walk_start = walk_end = start;
+	while (walk_end < end && !ret) {
+		if (IS_PM_SCAN_GET(p.flags)) {
+			p.vec_buf_index = 0;
+
+			/*
+			 * All data is copied to cur_buf first. When more data
+			 * is found, we push cur_buf to vec_buf and copy new
+			 * data to cur_buf. Subtract 1 from length as the
+			 * index of cur_buf isn't counted in length.
+			 */
+			empty_slots = arg.vec_len - vec_index;
+			p.vec_buf_len = min(p.vec_buf_len, empty_slots - 1);
+		}
+
+		ret = mmap_read_lock_killable(mm);
+		if (ret)
+			goto return_status;
+
+		walk_end = min((walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK, end);
+
+		ret = walk_page_range(mm, walk_start, walk_end,
+				      &pagemap_scan_ops, &p);
+		mmap_read_unlock(mm);
+
+		if (ret && ret != PM_SCAN_FOUND_MAX_PAGES &&
+		    ret != PM_SCAN_END_WALK)
+			goto return_status;
+
+		walk_start = walk_end;
+		if (IS_PM_SCAN_GET(p.flags) && p.vec_buf_index) {
+			if (copy_to_user(&vec[vec_index], p.vec_buf,
+					 p.vec_buf_index * sizeof(*p.vec_buf))) {
+				/*
+				 * Return error even though the OP succeeded
+				 */
+				ret = -EFAULT;
+				goto return_status;
+			}
+			vec_index += p.vec_buf_index;
+		}
+	}
+
+	if (p.cur_buf.len) {
+		if (copy_to_user(&vec[vec_index], &p.cur_buf, sizeof(p.cur_buf))) {
+			ret = -EFAULT;
+			goto return_status;
+		}
+		vec_index++;
+	}
+
+	ret = vec_index;
+
+return_status:
+	arg.start = (unsigned long)walk_end;
+	if (copy_to_user(&uarg->start, &arg.start, sizeof(arg.start)))
+		ret = -EFAULT;
+
+	if (IS_PM_SCAN_WP(p.flags))
+		mmu_notifier_invalidate_range_end(&range);
+
+	kfree(p.vec_buf);
+	return ret;
+}
+
+static long do_pagemap_cmd(struct file *file, unsigned int cmd,
+			   unsigned long arg)
+{
+	struct mm_struct *mm = file->private_data;
+
+	switch (cmd) {
+	case PAGEMAP_SCAN:
+		return do_pagemap_scan(mm, arg);
+
+	default:
+		return -EINVAL;
+	}
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = do_pagemap_cmd,
+	.compat_ioctl	= do_pagemap_cmd,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index beb7c63d2871..a6e773c3e2b4 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -261,6 +261,7 @@  long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long cp_flags);
 
 bool is_hugetlb_entry_migration(pte_t pte);
+bool is_hugetlb_entry_hwpoisoned(pte_t pte);
 void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
 
 #else /* !CONFIG_HUGETLB_PAGE */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..ac684d99e68f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,59 @@  typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
+
+/* Bits are set in flags of the page_region and masks in pm_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+#define PAGE_IS_PFNZERO		(1 << 4)
+
+/*
+ * struct page_region - Page region with flags
+ * @start:	Start of the region
+ * @len:	Length of the region in pages
+ * @bitmap:	Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 flags;
+};
+
+/*
+ * struct pm_scan_arg - Pagemap ioctl argument
+ * @size:		Size of the structure
+ * @flags:		Flags for the IOCTL
+ * @start:		Starting address of the region
+ *			(Ending address of the walk is also returned in it)
+ * @end:		Ending address of the region
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pm_scan_arg {
+	__u64 size;
+	__u64 flags;
+	__u64 start;
+	__u64 end;
+	__u64 vec;
+	__u64 vec_len;
+	__u64 max_pages;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Supported flags */
+#define PM_SCAN_OP_GET	(1 << 0)
+#define PM_SCAN_OP_WP	(1 << 1)
+
 #endif /* _UAPI_LINUX_FS_H */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0db13167b1ee..7e60f0f3fd03 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4980,7 +4980,7 @@  bool is_hugetlb_entry_migration(pte_t pte)
 		return false;
 }
 
-static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
+bool is_hugetlb_entry_hwpoisoned(pte_t pte)
 {
 	swp_entry_t swp;