diff mbox series

[v10,3/6] fs/proc/task_mmu: Implement IOCTL to get and/or the clear info about PTEs

Message ID 20230202112915.867409-4-usama.anjum@collabora.com (mailing list archive)
State New
Headers show
Series Implement IOCTL to get and/or the clear info about PTEs | expand

Commit Message

Muhammad Usama Anjum Feb. 2, 2023, 11:29 a.m. UTC
This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
the info about page table entries. The following operations are supported
in this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
  file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
  (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
  pages have been written-to.
- Find pages which have been written-to and write protect the pages
  (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)

To get information about which pages have been written-to and/or write
protect the pages, following must be performed first in order:
- The userfaultfd file descriptor is created with userfaultfd syscall.
- The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
- The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
  through UFFDIO_REGISTER IOCTL.
Then the any part of the registered memory or the whole memory region
can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
PAGEMAP_SCAN IOCTL.

struct pagemap_scan_args is used as the argument of the IOCTL. In this
struct:
- The range is specified through start and len.
- The output buffer of struct page_region array and size is specified as
  vec and vec_len.
- The optional maximum requested pages are specified in the max_pages.
- The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
  is the only added flag at this time.
- The masks are specified in required_mask, anyof_mask, excluded_ mask
  and return_mask.

This IOCTL can be extended to get information about more PTE bits. This
IOCTL doesn't support hugetlbs at the moment. No information about
hugetlb can be obtained. This patch has evolved from a basic patch from
Gabriel Krisman Bertazi.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message

Change in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code

Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl

Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to _u64
- Improve code quality

Changes in v5:
- Remove tlb flushing even for clear operation

Changes in v4:
- Update the interface and implementation

Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more
  error checking

Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl
---
 fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  50 +++++++
 2 files changed, 340 insertions(+)

Comments

Peter Xu Feb. 8, 2023, 10:15 p.m. UTC | #1
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>   (PAGE_IS_SWAPPED).
> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>   pages have been written-to.
> - Find pages which have been written-to and write protect the pages
>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
> 
> To get information about which pages have been written-to and/or write
> protect the pages, following must be performed first in order:
> - The userfaultfd file descriptor is created with userfaultfd syscall.
> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
>   through UFFDIO_REGISTER IOCTL.
> Then the any part of the registered memory or the whole memory region
> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
> PAGEMAP_SCAN IOCTL.
> 
> struct pagemap_scan_args is used as the argument of the IOCTL. In this
> struct:
> - The range is specified through start and len.
> - The output buffer of struct page_region array and size is specified as
>   vec and vec_len.
> - The optional maximum requested pages are specified in the max_pages.
> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
>   is the only added flag at this time.
> - The masks are specified in required_mask, anyof_mask, excluded_ mask
>   and return_mask.
> 
> This IOCTL can be extended to get information about more PTE bits. This
> IOCTL doesn't support hugetlbs at the moment. No information about
> hugetlb can be obtained. This patch has evolved from a basic patch from
> Gabriel Krisman Bertazi.
> 
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> ---
> Changes in v10:
> - move changes in tools/include/uapi/linux/fs.h to separate patch
> - update commit message
> 
> Change in v8:
> - Correct is_pte_uffd_wp()
> - Improve readability and error checks
> - Remove some un-needed code
> 
> Changes in v7:
> - Rebase on top of latest next
> - Fix some corner cases
> - Base soft-dirty on the uffd wp async
> - Update the terminologies
> - Optimize the memory usage inside the ioctl
> 
> Changes in v6:
> - Rename variables and update comments
> - Make IOCTL independent of soft_dirty config
> - Change masks and bitmap type to _u64
> - Improve code quality
> 
> Changes in v5:
> - Remove tlb flushing even for clear operation
> 
> Changes in v4:
> - Update the interface and implementation
> 
> Changes in v3:
> - Tighten the user-kernel interface by using explicit types and add more
>   error checking
> 
> Changes in v2:
> - Convert the interface from syscall to ioctl
> - Remove pidfd support as it doesn't make sense in ioctl
> ---
>  fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fs.h |  50 +++++++
>  2 files changed, 340 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e35a0398db63..c6bde19d63d9 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,7 @@
>  #include <linux/shmem_fs.h>
>  #include <linux/uaccess.h>
>  #include <linux/pkeys.h>
> +#include <linux/minmax.h>
>  
>  #include <asm/elf.h>
>  #include <asm/tlb.h>
> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>  }
>  #endif
>  
> +static inline bool is_pte_uffd_wp(pte_t pte)
> +{
> +	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> +	    (pte_swp_uffd_wp_any(pte)))
> +		return true;
> +	return false;

Sorry I should have mentioned this earlier: you can directly return here.

  return (pte_present(pte) && pte_uffd_wp(pte)) || pte_swp_uffd_wp_any(pte);

> +}
> +
> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
> +{
> +	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
> +	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
> +		return true;
> +	return false;

Same here.

> +}
> +
>  #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>  		unsigned long addr, pmd_t *pmdp)
> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
>  	return 0;
>  }
>  
> +#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
> +					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE |	PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
> +#define IS_GET_OP(a)			(a->vec)
> +#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))
> +
> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)	\
> +	(wt | file << 1 | present << 2 | swap << 3)
> +#define IS_WT_REQUIRED(a)				\
> +	((a->required_mask & PAGE_IS_WRITTEN) ||	\
> +	 (a->anyof_mask & PAGE_IS_WRITTEN))
> +
> +struct pagemap_scan_private {
> +	struct page_region *vec;
> +	struct page_region prev;
> +	unsigned long vec_len, vec_index;
> +	unsigned int max_pages, found_pages, flags;
> +	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
> +};
> +
> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +
> +	if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))

Should this be:

       (IS_WT_REQUIRED(p) && (!userfaultfd_wp(vma) || !userfaultfd_wp_async(vma)))

Instead?

> +		return -EPERM;
> +	if (vma->vm_flags & VM_PFNMAP)
> +		return 1;
> +	return 0;
> +}
> +
> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> +				      struct pagemap_scan_private *p, unsigned long addr,
> +				      unsigned int len)
> +{
> +	unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
> +	bool cpy = true;
> +	struct page_region *prev = &p->prev;

Nit: switch the above two lines?

> +
> +	if (HAS_NO_SPACE(p))
> +		return -ENOSPC;
> +
> +	if (p->max_pages && p->found_pages + len >= p->max_pages)
> +		len = p->max_pages - p->found_pages;

If "p->found_pages + len >= p->max_pages", shouldn't this already return -ENOSPC?

> +	if (!len)
> +		return -EINVAL;
> +
> +	if (p->required_mask)
> +		cpy = ((p->required_mask & cur) == p->required_mask);
> +	if (cpy && p->anyof_mask)
> +		cpy = (p->anyof_mask & cur);
> +	if (cpy && p->excluded_mask)
> +		cpy = !(p->excluded_mask & cur);
> +	bitmap = cur & p->return_mask;
> +	if (cpy && bitmap) {
> +		if ((prev->len) && (prev->bitmap == bitmap) &&
> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
> +			prev->len += len;
> +			p->found_pages += len;
> +		} else if (p->vec_index < p->vec_len) {
> +			if (prev->len) {
> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> +				p->vec_index++;
> +			}

IIUC you can have:

  int pagemap_scan_deposit(p)
  {
        if (p->vec_index >= p->vec_len)
                return -ENOSPC;

        if (p->prev->len) {
                memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
                p->vec_index++;
        }

        return 0;
  }

Then call it here.  I think it can also be called below to replace
export_prev_to_out().

> +			prev->start = addr;
> +			prev->len = len;
> +			prev->bitmap = bitmap;
> +			p->found_pages += len;
> +		} else {
> +			return -ENOSPC;
> +		}
> +	}
> +	return 0;
> +}
> +
> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
> +				     unsigned long *vec_index)
> +{
> +	struct page_region *prev = &p->prev;
> +
> +	if (prev->len) {
> +		if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
> +			return -EFAULT;
> +		p->vec_index++;
> +		(*vec_index)++;
> +		prev->len = 0;
> +	}
> +	return 0;
> +}
> +
> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +					 unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long addr = end;

This assignment is useless?

> +	spinlock_t *ptl;
> +	int ret = 0;
> +	pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		bool pmd_wt;
> +
> +		pmd_wt = !is_pmd_uffd_wp(*pmd);
> +		/*
> +		 * Break huge page into small pages if operation needs to be performed is
> +		 * on a portion of the huge page.
> +		 */
> +		if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
> +			spin_unlock(ptl);
> +			split_huge_pmd(vma, pmd, start);
> +			goto process_smaller_pages;
> +		}
> +		if (IS_GET_OP(p))
> +			ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
> +						  is_swap_pmd(*pmd), p, start,
> +						  (end - start)/PAGE_SIZE);
> +		spin_unlock(ptl);
> +		if (!ret) {
> +			if (pmd_wt && IS_WP_ENGAGE_OP(p))
> +				uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
> +		}
> +		return ret;
> +	}
> +process_smaller_pages:
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
> +	if (IS_GET_OP(p)) {
> +		for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
> +			ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
> +						  pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
> +			if (ret)
> +				break;
> +		}
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
> +	if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
> +		uffd_wp_range(walk->mm, vma, start, addr - start, true);
> +
> +	cond_resched();
> +	return ret;
> +}
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
> +				 struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	int ret = 0;
> +
> +	if (vma)
> +		ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
> +					  (end - addr)/PAGE_SIZE);
> +	return ret;
> +}
> +
> +/* No hugetlb support is present. */
> +static const struct mm_walk_ops pagemap_scan_ops = {
> +	.test_walk = pagemap_scan_test_walk,
> +	.pmd_entry = pagemap_scan_pmd_entry,
> +	.pte_hole = pagemap_scan_pte_hole,
> +};
> +
> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
> +{
> +	unsigned long empty_slots, vec_index = 0;
> +	unsigned long __user start, end;
> +	unsigned long __start, __end;
> +	struct page_region __user *vec;
> +	struct pagemap_scan_private p;
> +	int ret = 0;
> +
> +	start = (unsigned long)untagged_addr(arg->start);
> +	vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
> +
> +	/* Validate memory ranges */
> +	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
> +		return -EINVAL;
> +	if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
> +	    (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
> +		return -EINVAL;
> +
> +	/* Detect illegal flags and masks */
> +	if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
> +	    (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
> +	    (arg->return_mask & ~PAGEMAP_BITS_ALL))
> +		return -EINVAL;
> +	if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
> +				!arg->return_mask))
> +		return -EINVAL;
> +	/* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
> +	if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
> +	    (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
> +		return -EINVAL;

I think you said you'll clean this up a bit.  I don't think so..

> +
> +	end = start + arg->len;
> +	p.max_pages = arg->max_pages;
> +	p.found_pages = 0;
> +	p.flags = arg->flags;
> +	p.required_mask = arg->required_mask;
> +	p.anyof_mask = arg->anyof_mask;
> +	p.excluded_mask = arg->excluded_mask;
> +	p.return_mask = arg->return_mask;
> +	p.prev.len = 0;
> +	p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> +
> +	if (IS_GET_OP(arg)) {
> +		p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
> +		if (!p.vec)
> +			return -ENOMEM;
> +	} else {
> +		p.vec = NULL;
> +	}
> +	__start = __end = start;
> +	while (!ret && __end < end) {
> +		p.vec_index = 0;
> +		empty_slots = arg->vec_len - vec_index;
> +		if (p.vec_len > empty_slots)
> +			p.vec_len = empty_slots;
> +
> +		__end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
> +		if (__end > end)
> +			__end = end;
> +
> +		mmap_read_lock(mm);
> +		ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
> +		mmap_read_unlock(mm);
> +		if (!(!ret || ret == -ENOSPC))
> +			goto free_data;
> +
> +		__start = __end;
> +		if (IS_GET_OP(arg) && p.vec_index) {
> +			if (copy_to_user(&vec[vec_index], p.vec,
> +					 p.vec_index * sizeof(struct page_region))) {
> +				ret = -EFAULT;
> +				goto free_data;
> +			}
> +			vec_index += p.vec_index;
> +		}

I think you can move copy_to_user() to outside the loop, then call
pagemap_scan_deposit() before copy_to_user(), then I think you can drop
the ugly export_prev_to_out()..

> +	}
> +	ret = export_prev_to_out(&p, vec, &vec_index);
> +	if (!ret)
> +		ret = vec_index;
> +free_data:
> +	if (IS_GET_OP(arg))
> +		kfree(p.vec);
> +
> +	return ret;
> +}
> +
> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +	struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
> +	struct mm_struct *mm = file->private_data;
> +	struct pagemap_scan_arg argument;
> +
> +	if (cmd == PAGEMAP_SCAN) {
> +		if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
> +			return -EFAULT;
> +		return do_pagemap_cmd(mm, &argument);
> +	}
> +	return -EINVAL;
> +}
> +
>  const struct file_operations proc_pagemap_operations = {
>  	.llseek		= mem_lseek, /* borrow this */
>  	.read		= pagemap_read,
>  	.open		= pagemap_open,
>  	.release	= pagemap_release,
> +	.unlocked_ioctl = pagemap_scan_ioctl,
> +	.compat_ioctl	= pagemap_scan_ioctl,
>  };
>  #endif /* CONFIG_PROC_PAGE_MONITOR */
>  
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..1ae9a8684b48 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
>  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>  			 RWF_APPEND)
>  
> +/* Pagemap ioctl */
> +#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
> +
> +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
> +#define PAGE_IS_WRITTEN		(1 << 0)
> +#define PAGE_IS_FILE		(1 << 1)
> +#define PAGE_IS_PRESENT		(1 << 2)
> +#define PAGE_IS_SWAPPED		(1 << 3)
> +
> +/*
> + * struct page_region - Page region with bitmap flags
> + * @start:	Start of the region
> + * @len:	Length of the region
> + * bitmap:	Bits sets for the region
> + */
> +struct page_region {
> +	__u64 start;
> +	__u64 len;
> +	__u64 bitmap;
> +};
> +
> +/*
> + * struct pagemap_scan_arg - Pagemap ioctl argument
> + * @start:		Starting address of the region
> + * @len:		Length of the region (All the pages in this length are included)
> + * @vec:		Address of page_region struct array for output
> + * @vec_len:		Length of the page_region struct array
> + * @max_pages:		Optional max return pages
> + * @flags:		Flags for the IOCTL
> + * @required_mask:	Required mask - All of these bits have to be set in the PTE
> + * @anyof_mask:		Any mask - Any of these bits are set in the PTE
> + * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
> + * @return_mask:	Bits that are to be reported in page_region
> + */
> +struct pagemap_scan_arg {
> +	__u64 start;
> +	__u64 len;
> +	__u64 vec;
> +	__u64 vec_len;
> +	__u32 max_pages;
> +	__u32 flags;
> +	__u64 required_mask;
> +	__u64 anyof_mask;
> +	__u64 excluded_mask;
> +	__u64 return_mask;
> +};
> +
> +/* Special flags */
> +#define PAGEMAP_WP_ENGAGE	(1 << 0)
> +
>  #endif /* _UAPI_LINUX_FS_H */
> -- 
> 2.30.2
>
Cyrill Gorcunov Feb. 8, 2023, 10:22 p.m. UTC | #2
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
...
Hi Muhammad! I'm really sorry for not commenting this code, just out of time and i
fear cant look with precise care at least for some time, hopefully other CRIU guys
pick it up. Anyway, here a few comment from a glance.

> +
> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> +				      struct pagemap_scan_private *p, unsigned long addr,
> +				      unsigned int len)
> +{

This is a big function and usually it's a flag to not declare it as "inline" until
there very serious reson to.

> +	unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
> +	bool cpy = true;
> +	struct page_region *prev = &p->prev;
> +
> +	if (HAS_NO_SPACE(p))
> +		return -ENOSPC;
> +
> +	if (p->max_pages && p->found_pages + len >= p->max_pages)
> +		len = p->max_pages - p->found_pages;
> +	if (!len)
> +		return -EINVAL;
> +
> +	if (p->required_mask)
> +		cpy = ((p->required_mask & cur) == p->required_mask);
> +	if (cpy && p->anyof_mask)
> +		cpy = (p->anyof_mask & cur);
> +	if (cpy && p->excluded_mask)
> +		cpy = !(p->excluded_mask & cur);
> +	bitmap = cur & p->return_mask;
> +	if (cpy && bitmap) {

You can exit early here simply

	if (!cpy || !bitmap)
		return 0;

saving one tab for the code below.

> +		if ((prev->len) && (prev->bitmap == bitmap) &&
> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
> +			prev->len += len;
> +			p->found_pages += len;
> +		} else if (p->vec_index < p->vec_len) {
> +			if (prev->len) {
> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> +				p->vec_index++;
> +			}
> +			prev->start = addr;
> +			prev->len = len;
> +			prev->bitmap = bitmap;
> +			p->found_pages += len;
> +		} else {
> +			return -ENOSPC;
> +		}
> +	}
> +	return 0;
> +}
> +
> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
> +				     unsigned long *vec_index)
> +{

No need for inline either.

> +	struct page_region *prev = &p->prev;
> +
> +	if (prev->len) {
> +		if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
> +			return -EFAULT;
> +		p->vec_index++;
> +		(*vec_index)++;
> +		prev->len = 0;
> +	}
> +	return 0;
> +}
> +
> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +					 unsigned long end, struct mm_walk *walk)
> +{

Same, no need for inline. I've a few comments more in my mind will try to
collect them tomorrow.
Muhammad Usama Anjum Feb. 13, 2023, 8:19 a.m. UTC | #3
Hi Cyrill,

Thank you for your time and review.

On 2/9/23 3:22 AM, Cyrill Gorcunov wrote:
> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> ...
> Hi Muhammad! I'm really sorry for not commenting this code, just out of time and i
> fear cant look with precise care at least for some time, hopefully other CRIU guys
> pick it up. Anyway, here a few comment from a glance.
> 
>> +
>> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>> +				      struct pagemap_scan_private *p, unsigned long addr,
>> +				      unsigned int len)
>> +{
> 
> This is a big function and usually it's a flag to not declare it as "inline" until
> there very serious reson to.
I'll remove all these inline in next revision.

> 
>> +	unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
>> +	bool cpy = true;
>> +	struct page_region *prev = &p->prev;
>> +
>> +	if (HAS_NO_SPACE(p))
>> +		return -ENOSPC;
>> +
>> +	if (p->max_pages && p->found_pages + len >= p->max_pages)
>> +		len = p->max_pages - p->found_pages;
>> +	if (!len)
>> +		return -EINVAL;
>> +
>> +	if (p->required_mask)
>> +		cpy = ((p->required_mask & cur) == p->required_mask);
>> +	if (cpy && p->anyof_mask)
>> +		cpy = (p->anyof_mask & cur);
>> +	if (cpy && p->excluded_mask)
>> +		cpy = !(p->excluded_mask & cur);
>> +	bitmap = cur & p->return_mask;
>> +	if (cpy && bitmap) {
> 
> You can exit early here simply
> 
> 	if (!cpy || !bitmap)
> 		return 0;
I'm avoiding an extra return here.

> 
> saving one tab for the code below.
> 
>> +		if ((prev->len) && (prev->bitmap == bitmap) &&
>> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
>> +			prev->len += len;
>> +			p->found_pages += len;
>> +		} else if (p->vec_index < p->vec_len) {
>> +			if (prev->len) {
>> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>> +				p->vec_index++;
>> +			}
>> +			prev->start = addr;
>> +			prev->len = len;
>> +			prev->bitmap = bitmap;
>> +			p->found_pages += len;
>> +		} else {
>> +			return -ENOSPC;
>> +		}
>> +	}
>> +	return 0;
>> +}
>> +
>> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
>> +				     unsigned long *vec_index)
>> +{
> 
> No need for inline either.
> 
>> +	struct page_region *prev = &p->prev;
>> +
>> +	if (prev->len) {
>> +		if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
>> +			return -EFAULT;
>> +		p->vec_index++;
>> +		(*vec_index)++;
>> +		prev->len = 0;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> +					 unsigned long end, struct mm_walk *walk)
>> +{
> 
> Same, no need for inline. I've a few comments more in my mind will try to
> collect them tomorrow.
Your review would be much appreciated.
Muhammad Usama Anjum Feb. 13, 2023, 12:55 p.m. UTC | #4
On 2/9/23 3:15 AM, Peter Xu wrote:
> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>>   (PAGE_IS_SWAPPED).
>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>>   pages have been written-to.
>> - Find pages which have been written-to and write protect the pages
>>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>
>> To get information about which pages have been written-to and/or write
>> protect the pages, following must be performed first in order:
>> - The userfaultfd file descriptor is created with userfaultfd syscall.
>> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
>> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
>>   through UFFDIO_REGISTER IOCTL.
>> Then the any part of the registered memory or the whole memory region
>> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
>> PAGEMAP_SCAN IOCTL.
>>
>> struct pagemap_scan_args is used as the argument of the IOCTL. In this
>> struct:
>> - The range is specified through start and len.
>> - The output buffer of struct page_region array and size is specified as
>>   vec and vec_len.
>> - The optional maximum requested pages are specified in the max_pages.
>> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
>>   is the only added flag at this time.
>> - The masks are specified in required_mask, anyof_mask, excluded_ mask
>>   and return_mask.
>>
>> This IOCTL can be extended to get information about more PTE bits. This
>> IOCTL doesn't support hugetlbs at the moment. No information about
>> hugetlb can be obtained. This patch has evolved from a basic patch from
>> Gabriel Krisman Bertazi.
>>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
>> ---
>> Changes in v10:
>> - move changes in tools/include/uapi/linux/fs.h to separate patch
>> - update commit message
>>
>> Change in v8:
>> - Correct is_pte_uffd_wp()
>> - Improve readability and error checks
>> - Remove some un-needed code
>>
>> Changes in v7:
>> - Rebase on top of latest next
>> - Fix some corner cases
>> - Base soft-dirty on the uffd wp async
>> - Update the terminologies
>> - Optimize the memory usage inside the ioctl
>>
>> Changes in v6:
>> - Rename variables and update comments
>> - Make IOCTL independent of soft_dirty config
>> - Change masks and bitmap type to _u64
>> - Improve code quality
>>
>> Changes in v5:
>> - Remove tlb flushing even for clear operation
>>
>> Changes in v4:
>> - Update the interface and implementation
>>
>> Changes in v3:
>> - Tighten the user-kernel interface by using explicit types and add more
>>   error checking
>>
>> Changes in v2:
>> - Convert the interface from syscall to ioctl
>> - Remove pidfd support as it doesn't make sense in ioctl
>> ---
>>  fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/fs.h |  50 +++++++
>>  2 files changed, 340 insertions(+)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index e35a0398db63..c6bde19d63d9 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/shmem_fs.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/pkeys.h>
>> +#include <linux/minmax.h>
>>  
>>  #include <asm/elf.h>
>>  #include <asm/tlb.h>
>> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>>  }
>>  #endif
>>  
>> +static inline bool is_pte_uffd_wp(pte_t pte)
>> +{
>> +	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> +	    (pte_swp_uffd_wp_any(pte)))
>> +		return true;
>> +	return false;
> 
> Sorry I should have mentioned this earlier: you can directly return here.
No problem at all. I'm replacing these two helper functions with following
in next version so that !present pages don't show as dirty:

static inline bool is_pte_written(pte_t pte)
{
	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
	    (pte_swp_uffd_wp_any(pte)))
		return false;
	return (pte_present(pte) || is_swap_pte(pte));
}

static inline bool is_pmd_written(pmd_t pmd)
{
	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
		return false;
	return (pmd_present(pmd) || is_swap_pmd(pmd));
}

> 
>   return (pte_present(pte) && pte_uffd_wp(pte)) || pte_swp_uffd_wp_any(pte);
> 
>> +}
>> +
>> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
>> +{
>> +	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
>> +	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
>> +		return true;
>> +	return false;
> 
> Same here.
> 
>> +}
>> +
>>  #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>  static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>>  		unsigned long addr, pmd_t *pmdp)
>> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
>>  	return 0;
>>  }
>>  
>> +#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
>> +					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>> +#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE |	PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>> +#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
>> +#define IS_GET_OP(a)			(a->vec)
>> +#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))
>> +
>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)	\
>> +	(wt | file << 1 | present << 2 | swap << 3)
>> +#define IS_WT_REQUIRED(a)				\
>> +	((a->required_mask & PAGE_IS_WRITTEN) ||	\
>> +	 (a->anyof_mask & PAGE_IS_WRITTEN))
>> +
>> +struct pagemap_scan_private {
>> +	struct page_region *vec;
>> +	struct page_region prev;
>> +	unsigned long vec_len, vec_index;
>> +	unsigned int max_pages, found_pages, flags;
>> +	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
>> +};
>> +
>> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
>> +{
>> +	struct pagemap_scan_private *p = walk->private;
>> +	struct vm_area_struct *vma = walk->vma;
>> +
>> +	if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
> 
> Should this be:
> 
>        (IS_WT_REQUIRED(p) && (!userfaultfd_wp(vma) || !userfaultfd_wp_async(vma)))
> 
> Instead?
Correct. I'll fix this.

> 
>> +		return -EPERM;
>> +	if (vma->vm_flags & VM_PFNMAP)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>> +				      struct pagemap_scan_private *p, unsigned long addr,
>> +				      unsigned int len)
>> +{
>> +	unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
>> +	bool cpy = true;
>> +	struct page_region *prev = &p->prev;
> 
> Nit: switch the above two lines?
I'll fix this.

> 
>> +
>> +	if (HAS_NO_SPACE(p))
>> +		return -ENOSPC;
>> +
>> +	if (p->max_pages && p->found_pages + len >= p->max_pages)
>> +		len = p->max_pages - p->found_pages;
> 
> If "p->found_pages + len >= p->max_pages", shouldn't this already return -ENOSPC?
Length calculation is happening in the funtions calling this function. I'll
move this out of here to make things logically better.

> 
>> +	if (!len)
>> +		return -EINVAL;
>> +
>> +	if (p->required_mask)
>> +		cpy = ((p->required_mask & cur) == p->required_mask);
>> +	if (cpy && p->anyof_mask)
>> +		cpy = (p->anyof_mask & cur);
>> +	if (cpy && p->excluded_mask)
>> +		cpy = !(p->excluded_mask & cur);
>> +	bitmap = cur & p->return_mask;
>> +	if (cpy && bitmap) {
>> +		if ((prev->len) && (prev->bitmap == bitmap) &&
>> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
>> +			prev->len += len;
>> +			p->found_pages += len;
>> +		} else if (p->vec_index < p->vec_len) {
>> +			if (prev->len) {
>> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>> +				p->vec_index++;
>> +			}
> 
> IIUC you can have:
> 
>   int pagemap_scan_deposit(p)
>   {
>         if (p->vec_index >= p->vec_len)
>                 return -ENOSPC;
> 
>         if (p->prev->len) {
>                 memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>                 p->vec_index++;
>         }
> 
>         return 0;
>   }
> 
> Then call it here.  I think it can also be called below to replace
> export_prev_to_out().
No this isn't possible. We fill up prev until the next range doesn't merge
with it. At that point, we put prev into the output buffer and new range is
put into prev. Now that we have shifted to smaller page walks of <= 512
entries. We want to visit all ranges before finally putting the prev to
output. Sorry to have this some what complex method. The problem is that we
want to merge the consective matching regions into one entry in the output.
So to achieve this among multiple different page walks, the prev is being used.

Lets suppose we want to visit memory from 0x7FFF00000000 to 7FFF00400000
having length of 1024 pages and all of the memory has been written.
walk_page_range() will be called 2 times. In the first call, prev will be
set having length of 512. In second call, prev will be updated to 1024 as
the previous range stored in prev could be extended. After this, the prev
will be stored to the user output buffer consuming only 1 struct of page_range.

If we store prev back to output memory in every walk_page_range() call, we
wouldn't get 1 struct of page_range with length 1024. Instead we would get
2 elements of page_range structs with half the length.

> 
>> +			prev->start = addr;
>> +			prev->len = len;
>> +			prev->bitmap = bitmap;
>> +			p->found_pages += len;
>> +		} else {
>> +			return -ENOSPC;
>> +		}
>> +	}
>> +	return 0;
>> +}
>> +
>> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
>> +				     unsigned long *vec_index)
>> +{
>> +	struct page_region *prev = &p->prev;
>> +
>> +	if (prev->len) {
>> +		if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
>> +			return -EFAULT;
>> +		p->vec_index++;
>> +		(*vec_index)++;
>> +		prev->len = 0;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> +					 unsigned long end, struct mm_walk *walk)
>> +{
>> +	struct pagemap_scan_private *p = walk->private;
>> +	struct vm_area_struct *vma = walk->vma;
>> +	unsigned long addr = end;
> 
> This assignment is useless?
No, this assignement gets used when only the WP_ENGAGE operation is used on
normal size pages.

> 
>> +	spinlock_t *ptl;
>> +	int ret = 0;
>> +	pte_t *pte;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	ptl = pmd_trans_huge_lock(pmd, vma);
>> +	if (ptl) {
>> +		bool pmd_wt;
>> +
>> +		pmd_wt = !is_pmd_uffd_wp(*pmd);
>> +		/*
>> +		 * Break huge page into small pages if operation needs to be performed is
>> +		 * on a portion of the huge page.
>> +		 */
>> +		if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
>> +			spin_unlock(ptl);
>> +			split_huge_pmd(vma, pmd, start);
>> +			goto process_smaller_pages;
>> +		}
>> +		if (IS_GET_OP(p))
>> +			ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
>> +						  is_swap_pmd(*pmd), p, start,
>> +						  (end - start)/PAGE_SIZE);
>> +		spin_unlock(ptl);
>> +		if (!ret) {
>> +			if (pmd_wt && IS_WP_ENGAGE_OP(p))
>> +				uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
>> +		}
>> +		return ret;
>> +	}
>> +process_smaller_pages:
>> +	if (pmd_trans_unstable(pmd))
>> +		return 0;
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
>> +	if (IS_GET_OP(p)) {
>> +		for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
>> +			ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
>> +						  pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
>> +			if (ret)
>> +				break;
>> +		}
>> +	}
>> +	pte_unmap_unlock(pte - 1, ptl);
>> +	if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
>> +		uffd_wp_range(walk->mm, vma, start, addr - start, true);
>> +
>> +	cond_resched();
>> +	return ret;
>> +}
>> +
>> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
>> +				 struct mm_walk *walk)
>> +{
>> +	struct pagemap_scan_private *p = walk->private;
>> +	struct vm_area_struct *vma = walk->vma;
>> +	int ret = 0;
>> +
>> +	if (vma)
>> +		ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
>> +					  (end - addr)/PAGE_SIZE);
>> +	return ret;
>> +}
>> +
>> +/* No hugetlb support is present. */
>> +static const struct mm_walk_ops pagemap_scan_ops = {
>> +	.test_walk = pagemap_scan_test_walk,
>> +	.pmd_entry = pagemap_scan_pmd_entry,
>> +	.pte_hole = pagemap_scan_pte_hole,
>> +};
>> +
>> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
>> +{
>> +	unsigned long empty_slots, vec_index = 0;
>> +	unsigned long __user start, end;
>> +	unsigned long __start, __end;
>> +	struct page_region __user *vec;
>> +	struct pagemap_scan_private p;
>> +	int ret = 0;
>> +
>> +	start = (unsigned long)untagged_addr(arg->start);
>> +	vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
>> +
>> +	/* Validate memory ranges */
>> +	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
>> +		return -EINVAL;
>> +	if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
>> +	    (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
>> +		return -EINVAL;
>> +
>> +	/* Detect illegal flags and masks */
>> +	if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
>> +	    (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
>> +	    (arg->return_mask & ~PAGEMAP_BITS_ALL))
>> +		return -EINVAL;
>> +	if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
>> +				!arg->return_mask))
>> +		return -EINVAL;
>> +	/* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
>> +	if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
>> +	    (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
>> +		return -EINVAL;
> 
> I think you said you'll clean this up a bit.  I don't think so..
You had showed a really clean way to put all these error checking
conditions. But I wasn't able to put the current error checking conditions
in that much nice way. I'd done at least something to make them look
better. Sorry, I'll revisit and try to come up with easier to follow error
checking conditions.
Peter Xu Feb. 13, 2023, 9:42 p.m. UTC | #5
On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote:
> On 2/9/23 3:15 AM, Peter Xu wrote:
> > On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> >> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> >> the info about page table entries. The following operations are supported
> >> in this ioctl:
> >> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
> >>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
> >>   (PAGE_IS_SWAPPED).
> >> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
> >>   pages have been written-to.
> >> - Find pages which have been written-to and write protect the pages
> >>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
> >>
> >> To get information about which pages have been written-to and/or write
> >> protect the pages, following must be performed first in order:
> >> - The userfaultfd file descriptor is created with userfaultfd syscall.
> >> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
> >> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
> >>   through UFFDIO_REGISTER IOCTL.
> >> Then the any part of the registered memory or the whole memory region
> >> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
> >> PAGEMAP_SCAN IOCTL.
> >>
> >> struct pagemap_scan_args is used as the argument of the IOCTL. In this
> >> struct:
> >> - The range is specified through start and len.
> >> - The output buffer of struct page_region array and size is specified as
> >>   vec and vec_len.
> >> - The optional maximum requested pages are specified in the max_pages.
> >> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
> >>   is the only added flag at this time.
> >> - The masks are specified in required_mask, anyof_mask, excluded_ mask
> >>   and return_mask.
> >>
> >> This IOCTL can be extended to get information about more PTE bits. This
> >> IOCTL doesn't support hugetlbs at the moment. No information about
> >> hugetlb can be obtained. This patch has evolved from a basic patch from
> >> Gabriel Krisman Bertazi.
> >>
> >> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> >> ---
> >> Changes in v10:
> >> - move changes in tools/include/uapi/linux/fs.h to separate patch
> >> - update commit message
> >>
> >> Change in v8:
> >> - Correct is_pte_uffd_wp()
> >> - Improve readability and error checks
> >> - Remove some un-needed code
> >>
> >> Changes in v7:
> >> - Rebase on top of latest next
> >> - Fix some corner cases
> >> - Base soft-dirty on the uffd wp async
> >> - Update the terminologies
> >> - Optimize the memory usage inside the ioctl
> >>
> >> Changes in v6:
> >> - Rename variables and update comments
> >> - Make IOCTL independent of soft_dirty config
> >> - Change masks and bitmap type to _u64
> >> - Improve code quality
> >>
> >> Changes in v5:
> >> - Remove tlb flushing even for clear operation
> >>
> >> Changes in v4:
> >> - Update the interface and implementation
> >>
> >> Changes in v3:
> >> - Tighten the user-kernel interface by using explicit types and add more
> >>   error checking
> >>
> >> Changes in v2:
> >> - Convert the interface from syscall to ioctl
> >> - Remove pidfd support as it doesn't make sense in ioctl
> >> ---
> >>  fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
> >>  include/uapi/linux/fs.h |  50 +++++++
> >>  2 files changed, 340 insertions(+)
> >>
> >> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >> index e35a0398db63..c6bde19d63d9 100644
> >> --- a/fs/proc/task_mmu.c
> >> +++ b/fs/proc/task_mmu.c
> >> @@ -19,6 +19,7 @@
> >>  #include <linux/shmem_fs.h>
> >>  #include <linux/uaccess.h>
> >>  #include <linux/pkeys.h>
> >> +#include <linux/minmax.h>
> >>  
> >>  #include <asm/elf.h>
> >>  #include <asm/tlb.h>
> >> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> >>  }
> >>  #endif
> >>  
> >> +static inline bool is_pte_uffd_wp(pte_t pte)
> >> +{
> >> +	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> >> +	    (pte_swp_uffd_wp_any(pte)))
> >> +		return true;
> >> +	return false;
> > 
> > Sorry I should have mentioned this earlier: you can directly return here.
> No problem at all. I'm replacing these two helper functions with following
> in next version so that !present pages don't show as dirty:
> 
> static inline bool is_pte_written(pte_t pte)
> {
> 	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> 	    (pte_swp_uffd_wp_any(pte)))
> 		return false;
> 	return (pte_present(pte) || is_swap_pte(pte));
> }

Could you explain why you don't want to return dirty for !present?  A page
can be written then swapped out.  Don't you want to know that happened
(from dirty tracking POV)?

The code looks weird to me too..  We only have three types of ptes: (1)
present, (2) swap, (3) none.

Then, "(pte_present() || is_swap_pte())" is the same as !pte_none().  Is
that what you're really looking for?

> 
> static inline bool is_pmd_written(pmd_t pmd)
> {
> 	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
> 	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
> 		return false;
> 	return (pmd_present(pmd) || is_swap_pmd(pmd));
> }

[...]

> >> +	bitmap = cur & p->return_mask;
> >> +	if (cpy && bitmap) {
> >> +		if ((prev->len) && (prev->bitmap == bitmap) &&
> >> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
> >> +			prev->len += len;
> >> +			p->found_pages += len;
> >> +		} else if (p->vec_index < p->vec_len) {
> >> +			if (prev->len) {
> >> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> >> +				p->vec_index++;
> >> +			}
> > 
> > IIUC you can have:
> > 
> >   int pagemap_scan_deposit(p)
> >   {
> >         if (p->vec_index >= p->vec_len)
> >                 return -ENOSPC;
> > 
> >         if (p->prev->len) {
> >                 memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> >                 p->vec_index++;
> >         }
> > 
> >         return 0;
> >   }
> > 
> > Then call it here.  I think it can also be called below to replace
> > export_prev_to_out().
> No this isn't possible. We fill up prev until the next range doesn't merge
> with it. At that point, we put prev into the output buffer and new range is
> put into prev. Now that we have shifted to smaller page walks of <= 512
> entries. We want to visit all ranges before finally putting the prev to
> output. Sorry to have this some what complex method. The problem is that we
> want to merge the consective matching regions into one entry in the output.
> So to achieve this among multiple different page walks, the prev is being used.
> 
> Lets suppose we want to visit memory from 0x7FFF00000000 to 7FFF00400000
> having length of 1024 pages and all of the memory has been written.
> walk_page_range() will be called 2 times. In the first call, prev will be
> set having length of 512. In second call, prev will be updated to 1024 as
> the previous range stored in prev could be extended. After this, the prev
> will be stored to the user output buffer consuming only 1 struct of page_range.
> 
> If we store prev back to output memory in every walk_page_range() call, we
> wouldn't get 1 struct of page_range with length 1024. Instead we would get
> 2 elements of page_range structs with half the length.

I didn't mean to merge PREV for each pgtable walk.  What I meant is I think
with such a pagemap_scan_deposit() you can rewrite it as:

if (cpy && bitmap) {
        if ((prev->len) && (prev->bitmap == bitmap) &&
            (prev->start + prev->len * PAGE_SIZE == addr)) {
                prev->len += len;
                p->found_pages += len;
        } else {
                if (pagemap_scan_deposit(p))
                        return -ENOSPC;
                prev->start = addr;
                prev->len = len;
                prev->bitmap = bitmap;
                p->found_pages += len;
        }
}

Then you can reuse pagemap_scan_deposit() when before returning to
userspace, just to flush PREV to p->vec properly in a single helper.
It also makes the code slightly easier to read.
Muhammad Usama Anjum Feb. 14, 2023, 7:57 a.m. UTC | #6
On 2/14/23 2:42 AM, Peter Xu wrote:
> On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote:
>> On 2/9/23 3:15 AM, Peter Xu wrote:
>>> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
>>>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>>>> the info about page table entries. The following operations are supported
>>>> in this ioctl:
>>>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>>>>   (PAGE_IS_SWAPPED).
>>>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>>>>   pages have been written-to.
>>>> - Find pages which have been written-to and write protect the pages
>>>>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>>>
>>>> To get information about which pages have been written-to and/or write
>>>> protect the pages, following must be performed first in order:
>>>> - The userfaultfd file descriptor is created with userfaultfd syscall.
>>>> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
>>>> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
>>>>   through UFFDIO_REGISTER IOCTL.
>>>> Then the any part of the registered memory or the whole memory region
>>>> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
>>>> PAGEMAP_SCAN IOCTL.
>>>>
>>>> struct pagemap_scan_args is used as the argument of the IOCTL. In this
>>>> struct:
>>>> - The range is specified through start and len.
>>>> - The output buffer of struct page_region array and size is specified as
>>>>   vec and vec_len.
>>>> - The optional maximum requested pages are specified in the max_pages.
>>>> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
>>>>   is the only added flag at this time.
>>>> - The masks are specified in required_mask, anyof_mask, excluded_ mask
>>>>   and return_mask.
>>>>
>>>> This IOCTL can be extended to get information about more PTE bits. This
>>>> IOCTL doesn't support hugetlbs at the moment. No information about
>>>> hugetlb can be obtained. This patch has evolved from a basic patch from
>>>> Gabriel Krisman Bertazi.
>>>>
>>>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
>>>> ---
>>>> Changes in v10:
>>>> - move changes in tools/include/uapi/linux/fs.h to separate patch
>>>> - update commit message
>>>>
>>>> Change in v8:
>>>> - Correct is_pte_uffd_wp()
>>>> - Improve readability and error checks
>>>> - Remove some un-needed code
>>>>
>>>> Changes in v7:
>>>> - Rebase on top of latest next
>>>> - Fix some corner cases
>>>> - Base soft-dirty on the uffd wp async
>>>> - Update the terminologies
>>>> - Optimize the memory usage inside the ioctl
>>>>
>>>> Changes in v6:
>>>> - Rename variables and update comments
>>>> - Make IOCTL independent of soft_dirty config
>>>> - Change masks and bitmap type to _u64
>>>> - Improve code quality
>>>>
>>>> Changes in v5:
>>>> - Remove tlb flushing even for clear operation
>>>>
>>>> Changes in v4:
>>>> - Update the interface and implementation
>>>>
>>>> Changes in v3:
>>>> - Tighten the user-kernel interface by using explicit types and add more
>>>>   error checking
>>>>
>>>> Changes in v2:
>>>> - Convert the interface from syscall to ioctl
>>>> - Remove pidfd support as it doesn't make sense in ioctl
>>>> ---
>>>>  fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
>>>>  include/uapi/linux/fs.h |  50 +++++++
>>>>  2 files changed, 340 insertions(+)
>>>>
>>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>>> index e35a0398db63..c6bde19d63d9 100644
>>>> --- a/fs/proc/task_mmu.c
>>>> +++ b/fs/proc/task_mmu.c
>>>> @@ -19,6 +19,7 @@
>>>>  #include <linux/shmem_fs.h>
>>>>  #include <linux/uaccess.h>
>>>>  #include <linux/pkeys.h>
>>>> +#include <linux/minmax.h>
>>>>  
>>>>  #include <asm/elf.h>
>>>>  #include <asm/tlb.h>
>>>> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>>>>  }
>>>>  #endif
>>>>  
>>>> +static inline bool is_pte_uffd_wp(pte_t pte)
>>>> +{
>>>> +	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>>>> +	    (pte_swp_uffd_wp_any(pte)))
>>>> +		return true;
>>>> +	return false;
>>>
>>> Sorry I should have mentioned this earlier: you can directly return here.
>> No problem at all. I'm replacing these two helper functions with following
>> in next version so that !present pages don't show as dirty:
>>
>> static inline bool is_pte_written(pte_t pte)
>> {
>> 	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> 	    (pte_swp_uffd_wp_any(pte)))
>> 		return false;
>> 	return (pte_present(pte) || is_swap_pte(pte));
>> }
> 
> Could you explain why you don't want to return dirty for !present?  A page
> can be written then swapped out.  Don't you want to know that happened
> (from dirty tracking POV)?
> 
> The code looks weird to me too..  We only have three types of ptes: (1)
> present, (2) swap, (3) none.
> 
> Then, "(pte_present() || is_swap_pte())" is the same as !pte_none().  Is
> that what you're really looking for?
Yes, this is what I've been trying to do. I'll use !pte_none() to make it
simpler.

> 
>>
>> static inline bool is_pmd_written(pmd_t pmd)
>> {
>> 	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
>> 	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
>> 		return false;
>> 	return (pmd_present(pmd) || is_swap_pmd(pmd));
>> }
> 
> [...]
> 
>>>> +	bitmap = cur & p->return_mask;
>>>> +	if (cpy && bitmap) {
>>>> +		if ((prev->len) && (prev->bitmap == bitmap) &&
>>>> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
>>>> +			prev->len += len;
>>>> +			p->found_pages += len;
>>>> +		} else if (p->vec_index < p->vec_len) {
>>>> +			if (prev->len) {
>>>> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>>>> +				p->vec_index++;
>>>> +			}
>>>
>>> IIUC you can have:
>>>
>>>   int pagemap_scan_deposit(p)
>>>   {
>>>         if (p->vec_index >= p->vec_len)
>>>                 return -ENOSPC;
>>>
>>>         if (p->prev->len) {
>>>                 memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>>>                 p->vec_index++;
>>>         }
>>>
>>>         return 0;
>>>   }
>>>
>>> Then call it here.  I think it can also be called below to replace
>>> export_prev_to_out().
>> No this isn't possible. We fill up prev until the next range doesn't merge
>> with it. At that point, we put prev into the output buffer and new range is
>> put into prev. Now that we have shifted to smaller page walks of <= 512
>> entries. We want to visit all ranges before finally putting the prev to
>> output. Sorry to have this some what complex method. The problem is that we
>> want to merge the consective matching regions into one entry in the output.
>> So to achieve this among multiple different page walks, the prev is being used.
>>
>> Lets suppose we want to visit memory from 0x7FFF00000000 to 7FFF00400000
>> having length of 1024 pages and all of the memory has been written.
>> walk_page_range() will be called 2 times. In the first call, prev will be
>> set having length of 512. In second call, prev will be updated to 1024 as
>> the previous range stored in prev could be extended. After this, the prev
>> will be stored to the user output buffer consuming only 1 struct of page_range.
>>
>> If we store prev back to output memory in every walk_page_range() call, we
>> wouldn't get 1 struct of page_range with length 1024. Instead we would get
>> 2 elements of page_range structs with half the length.
> 
> I didn't mean to merge PREV for each pgtable walk.  What I meant is I think
> with such a pagemap_scan_deposit() you can rewrite it as:
> 
> if (cpy && bitmap) {
>         if ((prev->len) && (prev->bitmap == bitmap) &&
>             (prev->start + prev->len * PAGE_SIZE == addr)) {
>                 prev->len += len;
>                 p->found_pages += len;
>         } else {
>                 if (pagemap_scan_deposit(p))
>                         return -ENOSPC;
>                 prev->start = addr;
>                 prev->len = len;
>                 prev->bitmap = bitmap;
>                 p->found_pages += len;
>         }
> }
> 
> Then you can reuse pagemap_scan_deposit() when before returning to
> userspace, just to flush PREV to p->vec properly in a single helper.
> It also makes the code slightly easier to read.
Yeah, this would have worked as you have described. But in
pagemap_scan_output(), we are flushing prev to p->vec. But later in
export_prev_to_out() we need to flush prev to user_memory directly.


>
Peter Xu Feb. 14, 2023, 8:59 p.m. UTC | #7
On Tue, Feb 14, 2023 at 12:57:21PM +0500, Muhammad Usama Anjum wrote:
> On 2/14/23 2:42 AM, Peter Xu wrote:
> > On Mon, Feb 13, 2023 at 05:55:19PM +0500, Muhammad Usama Anjum wrote:
> >> On 2/9/23 3:15 AM, Peter Xu wrote:
> >>> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> >>>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> >>>> the info about page table entries. The following operations are supported
> >>>> in this ioctl:
> >>>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
> >>>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
> >>>>   (PAGE_IS_SWAPPED).
> >>>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
> >>>>   pages have been written-to.
> >>>> - Find pages which have been written-to and write protect the pages
> >>>>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
> >>>>
> >>>> To get information about which pages have been written-to and/or write
> >>>> protect the pages, following must be performed first in order:
> >>>> - The userfaultfd file descriptor is created with userfaultfd syscall.
> >>>> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
> >>>> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
> >>>>   through UFFDIO_REGISTER IOCTL.
> >>>> Then the any part of the registered memory or the whole memory region
> >>>> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
> >>>> PAGEMAP_SCAN IOCTL.
> >>>>
> >>>> struct pagemap_scan_args is used as the argument of the IOCTL. In this
> >>>> struct:
> >>>> - The range is specified through start and len.
> >>>> - The output buffer of struct page_region array and size is specified as
> >>>>   vec and vec_len.
> >>>> - The optional maximum requested pages are specified in the max_pages.
> >>>> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
> >>>>   is the only added flag at this time.
> >>>> - The masks are specified in required_mask, anyof_mask, excluded_ mask
> >>>>   and return_mask.
> >>>>
> >>>> This IOCTL can be extended to get information about more PTE bits. This
> >>>> IOCTL doesn't support hugetlbs at the moment. No information about
> >>>> hugetlb can be obtained. This patch has evolved from a basic patch from
> >>>> Gabriel Krisman Bertazi.
> >>>>
> >>>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> >>>> ---
> >>>> Changes in v10:
> >>>> - move changes in tools/include/uapi/linux/fs.h to separate patch
> >>>> - update commit message
> >>>>
> >>>> Change in v8:
> >>>> - Correct is_pte_uffd_wp()
> >>>> - Improve readability and error checks
> >>>> - Remove some un-needed code
> >>>>
> >>>> Changes in v7:
> >>>> - Rebase on top of latest next
> >>>> - Fix some corner cases
> >>>> - Base soft-dirty on the uffd wp async
> >>>> - Update the terminologies
> >>>> - Optimize the memory usage inside the ioctl
> >>>>
> >>>> Changes in v6:
> >>>> - Rename variables and update comments
> >>>> - Make IOCTL independent of soft_dirty config
> >>>> - Change masks and bitmap type to _u64
> >>>> - Improve code quality
> >>>>
> >>>> Changes in v5:
> >>>> - Remove tlb flushing even for clear operation
> >>>>
> >>>> Changes in v4:
> >>>> - Update the interface and implementation
> >>>>
> >>>> Changes in v3:
> >>>> - Tighten the user-kernel interface by using explicit types and add more
> >>>>   error checking
> >>>>
> >>>> Changes in v2:
> >>>> - Convert the interface from syscall to ioctl
> >>>> - Remove pidfd support as it doesn't make sense in ioctl
> >>>> ---
> >>>>  fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
> >>>>  include/uapi/linux/fs.h |  50 +++++++
> >>>>  2 files changed, 340 insertions(+)
> >>>>
> >>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >>>> index e35a0398db63..c6bde19d63d9 100644
> >>>> --- a/fs/proc/task_mmu.c
> >>>> +++ b/fs/proc/task_mmu.c
> >>>> @@ -19,6 +19,7 @@
> >>>>  #include <linux/shmem_fs.h>
> >>>>  #include <linux/uaccess.h>
> >>>>  #include <linux/pkeys.h>
> >>>> +#include <linux/minmax.h>
> >>>>  
> >>>>  #include <asm/elf.h>
> >>>>  #include <asm/tlb.h>
> >>>> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
> >>>>  }
> >>>>  #endif
> >>>>  
> >>>> +static inline bool is_pte_uffd_wp(pte_t pte)
> >>>> +{
> >>>> +	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> >>>> +	    (pte_swp_uffd_wp_any(pte)))
> >>>> +		return true;
> >>>> +	return false;
> >>>
> >>> Sorry I should have mentioned this earlier: you can directly return here.
> >> No problem at all. I'm replacing these two helper functions with following
> >> in next version so that !present pages don't show as dirty:
> >>
> >> static inline bool is_pte_written(pte_t pte)
> >> {
> >> 	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> >> 	    (pte_swp_uffd_wp_any(pte)))
> >> 		return false;
> >> 	return (pte_present(pte) || is_swap_pte(pte));
> >> }
> > 
> > Could you explain why you don't want to return dirty for !present?  A page
> > can be written then swapped out.  Don't you want to know that happened
> > (from dirty tracking POV)?
> > 
> > The code looks weird to me too..  We only have three types of ptes: (1)
> > present, (2) swap, (3) none.
> > 
> > Then, "(pte_present() || is_swap_pte())" is the same as !pte_none().  Is
> > that what you're really looking for?
> Yes, this is what I've been trying to do. I'll use !pte_none() to make it
> simpler.

Ah I think I see what you wanted to do now.. But I'm afraid it won't work
for all cases.

So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't
persist on anon (but none) ptes, then we got it lost and we cannot identify
it from pages being written.  Your solution will solve problem for
anonymous, but I think it'll break file memories.

Example:

Consider one shmem page that got mapped, write protected (using UFFDIO_WP
ioctl), written again (removing uffd-wp bit automatically), then zapped.
The pte will be pte_none() but it's actually written, afaiu.

Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need
to install pte markers for anonymous too (then it will work similarly like
shmem/hugetlbfs, that we'll report writting to zero pages), then you'll
need to have the new UFFD_FEATURE_WP_ASYNC depend on it.  With that I think
you can keep using the old check and it should start to work.

Please let me know if my understanding is correct above.

I'll see whether I can quickly play with UFFD_FEATURE_WP_ZEROPAGE with some
patch at the meantime.  That's something we wanted before too, when the app
cares about zero pages on anon.  We used to populate the pages before doing
ioctl(UFFDIO_WP) to make sure zero pages will be repoted too, but that flag
should be more efficient.

> 
> > 
> >>
> >> static inline bool is_pmd_written(pmd_t pmd)
> >> {
> >> 	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
> >> 	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
> >> 		return false;
> >> 	return (pmd_present(pmd) || is_swap_pmd(pmd));
> >> }
> > 
> > [...]
> > 
> >>>> +	bitmap = cur & p->return_mask;
> >>>> +	if (cpy && bitmap) {
> >>>> +		if ((prev->len) && (prev->bitmap == bitmap) &&
> >>>> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
> >>>> +			prev->len += len;
> >>>> +			p->found_pages += len;
> >>>> +		} else if (p->vec_index < p->vec_len) {
> >>>> +			if (prev->len) {
> >>>> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> >>>> +				p->vec_index++;
> >>>> +			}
> >>>
> >>> IIUC you can have:
> >>>
> >>>   int pagemap_scan_deposit(p)
> >>>   {
> >>>         if (p->vec_index >= p->vec_len)
> >>>                 return -ENOSPC;
> >>>
> >>>         if (p->prev->len) {
> >>>                 memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> >>>                 p->vec_index++;
> >>>         }
> >>>
> >>>         return 0;
> >>>   }
> >>>
> >>> Then call it here.  I think it can also be called below to replace
> >>> export_prev_to_out().
> >> No this isn't possible. We fill up prev until the next range doesn't merge
> >> with it. At that point, we put prev into the output buffer and new range is
> >> put into prev. Now that we have shifted to smaller page walks of <= 512
> >> entries. We want to visit all ranges before finally putting the prev to
> >> output. Sorry to have this some what complex method. The problem is that we
> >> want to merge the consective matching regions into one entry in the output.
> >> So to achieve this among multiple different page walks, the prev is being used.
> >>
> >> Lets suppose we want to visit memory from 0x7FFF00000000 to 7FFF00400000
> >> having length of 1024 pages and all of the memory has been written.
> >> walk_page_range() will be called 2 times. In the first call, prev will be
> >> set having length of 512. In second call, prev will be updated to 1024 as
> >> the previous range stored in prev could be extended. After this, the prev
> >> will be stored to the user output buffer consuming only 1 struct of page_range.
> >>
> >> If we store prev back to output memory in every walk_page_range() call, we
> >> wouldn't get 1 struct of page_range with length 1024. Instead we would get
> >> 2 elements of page_range structs with half the length.
> > 
> > I didn't mean to merge PREV for each pgtable walk.  What I meant is I think
> > with such a pagemap_scan_deposit() you can rewrite it as:
> > 
> > if (cpy && bitmap) {
> >         if ((prev->len) && (prev->bitmap == bitmap) &&
> >             (prev->start + prev->len * PAGE_SIZE == addr)) {
> >                 prev->len += len;
> >                 p->found_pages += len;
> >         } else {
> >                 if (pagemap_scan_deposit(p))
> >                         return -ENOSPC;
> >                 prev->start = addr;
> >                 prev->len = len;
> >                 prev->bitmap = bitmap;
> >                 p->found_pages += len;
> >         }
> > }
> > 
> > Then you can reuse pagemap_scan_deposit() when before returning to
> > userspace, just to flush PREV to p->vec properly in a single helper.
> > It also makes the code slightly easier to read.
> Yeah, this would have worked as you have described. But in
> pagemap_scan_output(), we are flushing prev to p->vec. But later in
> export_prev_to_out() we need to flush prev to user_memory directly.

I think there's a loop to copy_to_user().  Could you use the new helper so
the copy_to_user() loop will work without export_prev_to_out()?

I really hope we can get rid of export_prev_to_out().  Thanks,
Muhammad Usama Anjum Feb. 15, 2023, 10:03 a.m. UTC | #8
On 2/15/23 1:59 AM, Peter Xu wrote:
[..]
>>>> static inline bool is_pte_written(pte_t pte)
>>>> {
>>>> 	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>>>> 	    (pte_swp_uffd_wp_any(pte)))
>>>> 		return false;
>>>> 	return (pte_present(pte) || is_swap_pte(pte));
>>>> }
>>>
>>> Could you explain why you don't want to return dirty for !present?  A page
>>> can be written then swapped out.  Don't you want to know that happened
>>> (from dirty tracking POV)?
>>>
>>> The code looks weird to me too..  We only have three types of ptes: (1)
>>> present, (2) swap, (3) none.
>>>
>>> Then, "(pte_present() || is_swap_pte())" is the same as !pte_none().  Is
>>> that what you're really looking for?
>> Yes, this is what I've been trying to do. I'll use !pte_none() to make it
>> simpler.
> 
> Ah I think I see what you wanted to do now.. But I'm afraid it won't work
> for all cases.
> 
> So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't
> persist on anon (but none) ptes, then we got it lost and we cannot identify
> it from pages being written.  Your solution will solve problem for
> anonymous, but I think it'll break file memories.
> 
> Example:
> 
> Consider one shmem page that got mapped, write protected (using UFFDIO_WP
> ioctl), written again (removing uffd-wp bit automatically), then zapped.
> The pte will be pte_none() but it's actually written, afaiu.
> 
> Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need
> to install pte markers for anonymous too (then it will work similarly like
> shmem/hugetlbfs, that we'll report writting to zero pages), then you'll
> need to have the new UFFD_FEATURE_WP_ASYNC depend on it.  With that I think
> you can keep using the old check and it should start to work.
> 
> Please let me know if my understanding is correct above.
Thank you for identifying it. Your understanding seems on point. I'll have
research things up about PTE Markers. I'm looking at your patches about it
[1]. Can you refer me to "mm alignment sessions" discussion in form of
presentation or if any transcript is available?

> 
> I'll see whether I can quickly play with UFFD_FEATURE_WP_ZEROPAGE with some
> patch at the meantime.  That's something we wanted before too, when the app
> cares about zero pages on anon.  We used to populate the pages before doing
> ioctl(UFFDIO_WP) to make sure zero pages will be repoted too, but that flag
> should be more efficient.
Is this discussion public? For what application you were looking into this?
I'll dig down to see how can I contribute to it.

> 
>>
>>>
>>>>
>>>> static inline bool is_pmd_written(pmd_t pmd)
>>>> {
>>>> 	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
>>>> 	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
>>>> 		return false;
>>>> 	return (pmd_present(pmd) || is_swap_pmd(pmd));
>>>> }
>>>
>>> [...]
>>>
>>>>>> +	bitmap = cur & p->return_mask;
>>>>>> +	if (cpy && bitmap) {
>>>>>> +		if ((prev->len) && (prev->bitmap == bitmap) &&
>>>>>> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
>>>>>> +			prev->len += len;
>>>>>> +			p->found_pages += len;
>>>>>> +		} else if (p->vec_index < p->vec_len) {
>>>>>> +			if (prev->len) {
>>>>>> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>>>>>> +				p->vec_index++;
>>>>>> +			}
>>>>>
>>>>> IIUC you can have:
>>>>>
>>>>>   int pagemap_scan_deposit(p)
>>>>>   {
>>>>>         if (p->vec_index >= p->vec_len)
>>>>>                 return -ENOSPC;
>>>>>
>>>>>         if (p->prev->len) {
>>>>>                 memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>>>>>                 p->vec_index++;
>>>>>         }
>>>>>
>>>>>         return 0;
>>>>>   }
>>>>>
>>>>> Then call it here.  I think it can also be called below to replace
>>>>> export_prev_to_out().
>>>> No this isn't possible. We fill up prev until the next range doesn't merge
>>>> with it. At that point, we put prev into the output buffer and new range is
>>>> put into prev. Now that we have shifted to smaller page walks of <= 512
>>>> entries. We want to visit all ranges before finally putting the prev to
>>>> output. Sorry to have this some what complex method. The problem is that we
>>>> want to merge the consective matching regions into one entry in the output.
>>>> So to achieve this among multiple different page walks, the prev is being used.
>>>>
>>>> Lets suppose we want to visit memory from 0x7FFF00000000 to 7FFF00400000
>>>> having length of 1024 pages and all of the memory has been written.
>>>> walk_page_range() will be called 2 times. In the first call, prev will be
>>>> set having length of 512. In second call, prev will be updated to 1024 as
>>>> the previous range stored in prev could be extended. After this, the prev
>>>> will be stored to the user output buffer consuming only 1 struct of page_range.
>>>>
>>>> If we store prev back to output memory in every walk_page_range() call, we
>>>> wouldn't get 1 struct of page_range with length 1024. Instead we would get
>>>> 2 elements of page_range structs with half the length.
>>>
>>> I didn't mean to merge PREV for each pgtable walk.  What I meant is I think
>>> with such a pagemap_scan_deposit() you can rewrite it as:
>>>
>>> if (cpy && bitmap) {
>>>         if ((prev->len) && (prev->bitmap == bitmap) &&
>>>             (prev->start + prev->len * PAGE_SIZE == addr)) {
>>>                 prev->len += len;
>>>                 p->found_pages += len;
>>>         } else {
>>>                 if (pagemap_scan_deposit(p))
>>>                         return -ENOSPC;
>>>                 prev->start = addr;
>>>                 prev->len = len;
>>>                 prev->bitmap = bitmap;
>>>                 p->found_pages += len;
>>>         }
>>> }
>>>
>>> Then you can reuse pagemap_scan_deposit() when before returning to
>>> userspace, just to flush PREV to p->vec properly in a single helper.
>>> It also makes the code slightly easier to read.
>> Yeah, this would have worked as you have described. But in
>> pagemap_scan_output(), we are flushing prev to p->vec. But later in
>> export_prev_to_out() we need to flush prev to user_memory directly.
> 
> I think there's a loop to copy_to_user().  Could you use the new helper so
> the copy_to_user() loop will work without export_prev_to_out()?
> 
> I really hope we can get rid of export_prev_to_out().  Thanks,
I truly understand how you feel about export_prev_to_out(). It is really
difficult to understand. Even I had to made a hard try to come up with the
current code to avoid consuming a lot of kernel's memory while giving user
the compact output. I can surely map both of these with a dirty looking
macro. But I'm unable to find a decent macro to replace these. I think I'll
put a comment some where to explain whats going-on.
Peter Xu Feb. 15, 2023, 9:12 p.m. UTC | #9
On Wed, Feb 15, 2023 at 03:03:09PM +0500, Muhammad Usama Anjum wrote:
> On 2/15/23 1:59 AM, Peter Xu wrote:
> [..]
> >>>> static inline bool is_pte_written(pte_t pte)
> >>>> {
> >>>> 	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> >>>> 	    (pte_swp_uffd_wp_any(pte)))
> >>>> 		return false;
> >>>> 	return (pte_present(pte) || is_swap_pte(pte));
> >>>> }
> >>>
> >>> Could you explain why you don't want to return dirty for !present?  A page
> >>> can be written then swapped out.  Don't you want to know that happened
> >>> (from dirty tracking POV)?
> >>>
> >>> The code looks weird to me too..  We only have three types of ptes: (1)
> >>> present, (2) swap, (3) none.
> >>>
> >>> Then, "(pte_present() || is_swap_pte())" is the same as !pte_none().  Is
> >>> that what you're really looking for?
> >> Yes, this is what I've been trying to do. I'll use !pte_none() to make it
> >> simpler.
> > 
> > Ah I think I see what you wanted to do now.. But I'm afraid it won't work
> > for all cases.
> > 
> > So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't
> > persist on anon (but none) ptes, then we got it lost and we cannot identify
> > it from pages being written.  Your solution will solve problem for
> > anonymous, but I think it'll break file memories.
> > 
> > Example:
> > 
> > Consider one shmem page that got mapped, write protected (using UFFDIO_WP
> > ioctl), written again (removing uffd-wp bit automatically), then zapped.
> > The pte will be pte_none() but it's actually written, afaiu.
> > 
> > Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need
> > to install pte markers for anonymous too (then it will work similarly like
> > shmem/hugetlbfs, that we'll report writting to zero pages), then you'll
> > need to have the new UFFD_FEATURE_WP_ASYNC depend on it.  With that I think
> > you can keep using the old check and it should start to work.
> > 
> > Please let me know if my understanding is correct above.
> Thank you for identifying it. Your understanding seems on point. I'll have
> research things up about PTE Markers. I'm looking at your patches about it
> [1]. Can you refer me to "mm alignment sessions" discussion in form of
> presentation or if any transcript is available?

No worry now, after a second thought I think zero page is better than pte
markers, and I've got a patch that works for it here by injecting zero
pages for anonymous:

https://lore.kernel.org/all/20230215210257.224243-1-peterx@redhat.com/

I think we'd also better to enforce your new WP_ASYNC feature bit to depend
on this one, so fail the UFFDIO_API if WP_ASYNC && !WP_ZEROPAGE.

Could you please try by rebasing your work upon this one?  Hope it'll work
for you already.  Note again that you'll need to go back to the old
is_pte|pmd_written() to make things work always, I think.

[...]

> I truly understand how you feel about export_prev_to_out(). It is really
> difficult to understand. Even I had to made a hard try to come up with the
> current code to avoid consuming a lot of kernel's memory while giving user
> the compact output. I can surely map both of these with a dirty looking
> macro. But I'm unable to find a decent macro to replace these. I think I'll
> put a comment some where to explain whats going-on.

So maybe I still missed something? I'll read the new version when it comes.

Thanks,
Mike Rapoport Feb. 17, 2023, 10:10 a.m. UTC | #10
On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>   (PAGE_IS_SWAPPED).
> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>   pages have been written-to.
> - Find pages which have been written-to and write protect the pages
>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
> 
> To get information about which pages have been written-to and/or write
> protect the pages, following must be performed first in order:
> - The userfaultfd file descriptor is created with userfaultfd syscall.
> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
>   through UFFDIO_REGISTER IOCTL.
> Then the any part of the registered memory or the whole memory region
> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
> PAGEMAP_SCAN IOCTL.
> 
> struct pagemap_scan_args is used as the argument of the IOCTL. In this
> struct:
> - The range is specified through start and len.
> - The output buffer of struct page_region array and size is specified as
>   vec and vec_len.
> - The optional maximum requested pages are specified in the max_pages.
> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
>   is the only added flag at this time.
> - The masks are specified in required_mask, anyof_mask, excluded_ mask
>   and return_mask.
> 
> This IOCTL can be extended to get information about more PTE bits. This
> IOCTL doesn't support hugetlbs at the moment. No information about
> hugetlb can be obtained. This patch has evolved from a basic patch from
> Gabriel Krisman Bertazi.
> 
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> ---
> Changes in v10:
> - move changes in tools/include/uapi/linux/fs.h to separate patch
> - update commit message
> 
> Change in v8:
> - Correct is_pte_uffd_wp()
> - Improve readability and error checks
> - Remove some un-needed code
> 
> Changes in v7:
> - Rebase on top of latest next
> - Fix some corner cases
> - Base soft-dirty on the uffd wp async
> - Update the terminologies
> - Optimize the memory usage inside the ioctl
> 
> Changes in v6:
> - Rename variables and update comments
> - Make IOCTL independent of soft_dirty config
> - Change masks and bitmap type to _u64
> - Improve code quality
> 
> Changes in v5:
> - Remove tlb flushing even for clear operation
> 
> Changes in v4:
> - Update the interface and implementation
> 
> Changes in v3:
> - Tighten the user-kernel interface by using explicit types and add more
>   error checking
> 
> Changes in v2:
> - Convert the interface from syscall to ioctl
> - Remove pidfd support as it doesn't make sense in ioctl
> ---
>  fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
>  include/uapi/linux/fs.h |  50 +++++++
>  2 files changed, 340 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e35a0398db63..c6bde19d63d9 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,7 @@
>  #include <linux/shmem_fs.h>
>  #include <linux/uaccess.h>
>  #include <linux/pkeys.h>
> +#include <linux/minmax.h>
>  
>  #include <asm/elf.h>
>  #include <asm/tlb.h>
> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>  }
>  #endif
>  
> +static inline bool is_pte_uffd_wp(pte_t pte)
> +{
> +	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> +	    (pte_swp_uffd_wp_any(pte)))
> +		return true;
> +	return false;
> +}
> +
> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
> +{
> +	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
> +	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
> +		return true;
> +	return false;
> +}
> +
>  #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>  		unsigned long addr, pmd_t *pmdp)
> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
>  	return 0;
>  }
>  
> +#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
> +					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE |	PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
> +#define IS_GET_OP(a)			(a->vec)
> +#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))
> +
> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)	\
> +	(wt | file << 1 | present << 2 | swap << 3)
> +#define IS_WT_REQUIRED(a)				\
> +	((a->required_mask & PAGE_IS_WRITTEN) ||	\
> +	 (a->anyof_mask & PAGE_IS_WRITTEN))

All these macros are specific to pagemap_scan_ioctl() and should be
namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc.

Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and
I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make
HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros.

And I'd also make IS_GET_OP() more explicit by defining a PAGEMAP_WP_GET or
similar flag rather than using arg->vec.

> +
> +struct pagemap_scan_private {
> +	struct page_region *vec;
> +	struct page_region prev;
> +	unsigned long vec_len, vec_index;
> +	unsigned int max_pages, found_pages, flags;
> +	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
> +};
> +
> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)

Please keep the lines under 80 characters limit.

> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +
> +	if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
> +		return -EPERM;
> +	if (vma->vm_flags & VM_PFNMAP)
> +		return 1;
> +	return 0;
> +}
> +
> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> +				      struct pagemap_scan_private *p, unsigned long addr,
> +				      unsigned int len)
> +{
> +	unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
> +	bool cpy = true;
> +	struct page_region *prev = &p->prev;
> +
> +	if (HAS_NO_SPACE(p))
> +		return -ENOSPC;
> +
> +	if (p->max_pages && p->found_pages + len >= p->max_pages)
> +		len = p->max_pages - p->found_pages;
> +	if (!len)
> +		return -EINVAL;
> +
> +	if (p->required_mask)
> +		cpy = ((p->required_mask & cur) == p->required_mask);
> +	if (cpy && p->anyof_mask)
> +		cpy = (p->anyof_mask & cur);
> +	if (cpy && p->excluded_mask)
> +		cpy = !(p->excluded_mask & cur);
> +	bitmap = cur & p->return_mask;
> +	if (cpy && bitmap) {
> +		if ((prev->len) && (prev->bitmap == bitmap) &&
> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
> +			prev->len += len;
> +			p->found_pages += len;
> +		} else if (p->vec_index < p->vec_len) {
> +			if (prev->len) {
> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> +				p->vec_index++;
> +			}
> +			prev->start = addr;
> +			prev->len = len;
> +			prev->bitmap = bitmap;
> +			p->found_pages += len;
> +		} else {
> +			return -ENOSPC;
> +		}
> +	}
> +	return 0;

Please don't save on empty lines. Empty lines between logical pieces
improve readability.

> +}
> +
> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
> +				     unsigned long *vec_index)
> +{
> +	struct page_region *prev = &p->prev;
> +
> +	if (prev->len) {
> +		if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
> +			return -EFAULT;
> +		p->vec_index++;
> +		(*vec_index)++;
> +		prev->len = 0;
> +	}
> +	return 0;
> +}
> +
> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +					 unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long addr = end;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +	pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		bool pmd_wt;
> +
> +		pmd_wt = !is_pmd_uffd_wp(*pmd);
> +		/*
> +		 * Break huge page into small pages if operation needs to be performed is
> +		 * on a portion of the huge page.
> +		 */
> +		if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
> +			spin_unlock(ptl);
> +			split_huge_pmd(vma, pmd, start);
> +			goto process_smaller_pages;
> +		}
> +		if (IS_GET_OP(p))
> +			ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
> +						  is_swap_pmd(*pmd), p, start,
> +						  (end - start)/PAGE_SIZE);
> +		spin_unlock(ptl);
> +		if (!ret) {
> +			if (pmd_wt && IS_WP_ENGAGE_OP(p))
> +				uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
> +		}
> +		return ret;
> +	}
> +process_smaller_pages:
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
> +	if (IS_GET_OP(p)) {
> +		for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
> +			ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
> +						  pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
> +			if (ret)
> +				break;
> +		}
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
> +	if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
> +		uffd_wp_range(walk->mm, vma, start, addr - start, true);
> +
> +	cond_resched();
> +	return ret;
> +}
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
> +				 struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	int ret = 0;
> +
> +	if (vma)
> +		ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
> +					  (end - addr)/PAGE_SIZE);
> +	return ret;
> +}
> +
> +/* No hugetlb support is present. */
> +static const struct mm_walk_ops pagemap_scan_ops = {
> +	.test_walk = pagemap_scan_test_walk,
> +	.pmd_entry = pagemap_scan_pmd_entry,
> +	.pte_hole = pagemap_scan_pte_hole,
> +};
> +
> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
> +{
> +	unsigned long empty_slots, vec_index = 0;
> +	unsigned long __user start, end;
> +	unsigned long __start, __end;
> +	struct page_region __user *vec;
> +	struct pagemap_scan_private p;
> +	int ret = 0;
> +
> +	start = (unsigned long)untagged_addr(arg->start);
> +	vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
> +
> +	/* Validate memory ranges */
> +	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
> +		return -EINVAL;
> +	if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
> +	    (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
> +		return -EINVAL;
> +
> +	/* Detect illegal flags and masks */
> +	if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
> +	    (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
> +	    (arg->return_mask & ~PAGEMAP_BITS_ALL))
> +		return -EINVAL;
> +	if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
> +				!arg->return_mask))
> +		return -EINVAL;
> +	/* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
> +	if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
> +	    (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
> +		return -EINVAL;

I'd split argument validation into a separate function and split the OR'ed
conditions into separate if statements, e.g

bool pm_scan_args_valid(struct pagemap_scan_arg *arg)
{
	if (IS_GET_OP(arg)) {
		if (!arg->return_mask)
			return false;
		if (!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask)
			return false;
	}

	/* ... */

	return true;
}

> +
> +	end = start + arg->len;
> +	p.max_pages = arg->max_pages;
> +	p.found_pages = 0;
> +	p.flags = arg->flags;
> +	p.required_mask = arg->required_mask;
> +	p.anyof_mask = arg->anyof_mask;
> +	p.excluded_mask = arg->excluded_mask;
> +	p.return_mask = arg->return_mask;
> +	p.prev.len = 0;
> +	p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> +
> +	if (IS_GET_OP(arg)) {
> +		p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
> +		if (!p.vec)
> +			return -ENOMEM;
> +	} else {
> +		p.vec = NULL;
> +	}
> +	__start = __end = start;
> +	while (!ret && __end < end) {
> +		p.vec_index = 0;
> +		empty_slots = arg->vec_len - vec_index;
> +		if (p.vec_len > empty_slots)
> +			p.vec_len = empty_slots;
> +
> +		__end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
> +		if (__end > end)
> +			__end = end;
> +
> +		mmap_read_lock(mm);
> +		ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
> +		mmap_read_unlock(mm);
> +		if (!(!ret || ret == -ENOSPC))
> +			goto free_data;
> +
> +		__start = __end;
> +		if (IS_GET_OP(arg) && p.vec_index) {
> +			if (copy_to_user(&vec[vec_index], p.vec,
> +					 p.vec_index * sizeof(struct page_region))) {
> +				ret = -EFAULT;
> +				goto free_data;
> +			}
> +			vec_index += p.vec_index;
> +		}
> +	}
> +	ret = export_prev_to_out(&p, vec, &vec_index);
> +	if (!ret)
> +		ret = vec_index;
> +free_data:
> +	if (IS_GET_OP(arg))
> +		kfree(p.vec);
> +
> +	return ret;
> +}
> +
> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +	struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
> +	struct mm_struct *mm = file->private_data;
> +	struct pagemap_scan_arg argument;
> +
> +	if (cmd == PAGEMAP_SCAN) {
> +		if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
> +			return -EFAULT;
> +		return do_pagemap_cmd(mm, &argument);
> +	}
> +	return -EINVAL;
> +}
> +
>  const struct file_operations proc_pagemap_operations = {
>  	.llseek		= mem_lseek, /* borrow this */
>  	.read		= pagemap_read,
>  	.open		= pagemap_open,
>  	.release	= pagemap_release,
> +	.unlocked_ioctl = pagemap_scan_ioctl,
> +	.compat_ioctl	= pagemap_scan_ioctl,
>  };
>  #endif /* CONFIG_PROC_PAGE_MONITOR */
>  
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..1ae9a8684b48 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
>  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>  			 RWF_APPEND)
>  
> +/* Pagemap ioctl */
> +#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
> +
> +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
> +#define PAGE_IS_WRITTEN		(1 << 0)
> +#define PAGE_IS_FILE		(1 << 1)
> +#define PAGE_IS_PRESENT		(1 << 2)
> +#define PAGE_IS_SWAPPED		(1 << 3)
> +
> +/*
> + * struct page_region - Page region with bitmap flags
> + * @start:	Start of the region
> + * @len:	Length of the region
> + * bitmap:	Bits sets for the region
> + */
> +struct page_region {
> +	__u64 start;
> +	__u64 len;
> +	__u64 bitmap;
> +};
> +
> +/*
> + * struct pagemap_scan_arg - Pagemap ioctl argument
> + * @start:		Starting address of the region
> + * @len:		Length of the region (All the pages in this length are included)
> + * @vec:		Address of page_region struct array for output
> + * @vec_len:		Length of the page_region struct array
> + * @max_pages:		Optional max return pages
> + * @flags:		Flags for the IOCTL
> + * @required_mask:	Required mask - All of these bits have to be set in the PTE
> + * @anyof_mask:		Any mask - Any of these bits are set in the PTE
> + * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
> + * @return_mask:	Bits that are to be reported in page_region
> + */
> +struct pagemap_scan_arg {
> +	__u64 start;
> +	__u64 len;
> +	__u64 vec;
> +	__u64 vec_len;
> +	__u32 max_pages;
> +	__u32 flags;
> +	__u64 required_mask;
> +	__u64 anyof_mask;
> +	__u64 excluded_mask;
> +	__u64 return_mask;
> +};
> +
> +/* Special flags */
> +#define PAGEMAP_WP_ENGAGE	(1 << 0)
> +
>  #endif /* _UAPI_LINUX_FS_H */
> -- 
> 2.30.2
>
Muhammad Usama Anjum Feb. 17, 2023, 10:39 a.m. UTC | #11
On 2/16/23 2:12 AM, Peter Xu wrote:
> On Wed, Feb 15, 2023 at 03:03:09PM +0500, Muhammad Usama Anjum wrote:
>> On 2/15/23 1:59 AM, Peter Xu wrote:
>> [..]
>>>>>> static inline bool is_pte_written(pte_t pte)
>>>>>> {
>>>>>> 	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>>>>>> 	    (pte_swp_uffd_wp_any(pte)))
>>>>>> 		return false;
>>>>>> 	return (pte_present(pte) || is_swap_pte(pte));
>>>>>> }
>>>>>
>>>>> Could you explain why you don't want to return dirty for !present?  A page
>>>>> can be written then swapped out.  Don't you want to know that happened
>>>>> (from dirty tracking POV)?
>>>>>
>>>>> The code looks weird to me too..  We only have three types of ptes: (1)
>>>>> present, (2) swap, (3) none.
>>>>>
>>>>> Then, "(pte_present() || is_swap_pte())" is the same as !pte_none().  Is
>>>>> that what you're really looking for?
>>>> Yes, this is what I've been trying to do. I'll use !pte_none() to make it
>>>> simpler.
>>>
>>> Ah I think I see what you wanted to do now.. But I'm afraid it won't work
>>> for all cases.
>>>
>>> So IIUC the problem is anon pte can be empty, but since uffd-wp bit doesn't
>>> persist on anon (but none) ptes, then we got it lost and we cannot identify
>>> it from pages being written.  Your solution will solve problem for
>>> anonymous, but I think it'll break file memories.
>>>
>>> Example:
>>>
>>> Consider one shmem page that got mapped, write protected (using UFFDIO_WP
>>> ioctl), written again (removing uffd-wp bit automatically), then zapped.
>>> The pte will be pte_none() but it's actually written, afaiu.
>>>
>>> Maybe it's time we should introduce UFFD_FEATURE_WP_ZEROPAGE, so we'll need
>>> to install pte markers for anonymous too (then it will work similarly like
>>> shmem/hugetlbfs, that we'll report writting to zero pages), then you'll
>>> need to have the new UFFD_FEATURE_WP_ASYNC depend on it.  With that I think
>>> you can keep using the old check and it should start to work.
>>>
>>> Please let me know if my understanding is correct above.
>> Thank you for identifying it. Your understanding seems on point. I'll have
>> research things up about PTE Markers. I'm looking at your patches about it
>> [1]. Can you refer me to "mm alignment sessions" discussion in form of
>> presentation or if any transcript is available?
> 
> No worry now, after a second thought I think zero page is better than pte
> markers, and I've got a patch that works for it here by injecting zero
> pages for anonymous:
> 
> https://lore.kernel.org/all/20230215210257.224243-1-peterx@redhat.com/
> 
> I think we'd also better to enforce your new WP_ASYNC feature bit to depend
> on this one, so fail the UFFDIO_API if WP_ASYNC && !WP_ZEROPAGE.
> 
> Could you please try by rebasing your work upon this one?  Hope it'll work
> for you already.  Note again that you'll need to go back to the old
> is_pte|pmd_written() to make things work always, I think.
Thank you so much for sending the ZEROPAGE patch. I've rebased my patches
on top of it and my all tests for anon memory are passing. Now we don't
need to touch the page before engaging wp. This is what we wanted to
achieve. So !wp flag can be easily translated to soft-dirty flag
(is_pte_soft_dirty = is_pte_wp).

I've only a few file mem and shmem tests. I'll write more tests.

> 
> [...]
> 
>> I truly understand how you feel about export_prev_to_out(). It is really
>> difficult to understand. Even I had to made a hard try to come up with the
>> current code to avoid consuming a lot of kernel's memory while giving user
>> the compact output. I can surely map both of these with a dirty looking
>> macro. But I'm unable to find a decent macro to replace these. I think I'll
>> put a comment some where to explain whats going-on.
> 
> So maybe I still missed something? I'll read the new version when it comes.
Lets reconvene in next patches if you feel like they can be improved.

> 
> Thanks,
>
Michał Mirosław Feb. 17, 2023, 3:18 p.m. UTC | #12
On Thu, 2 Feb 2023 at 12:30, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
[...]
> - The masks are specified in required_mask, anyof_mask, excluded_ mask
>   and return_mask.
[...]

May I suggest a slightly modified interface for the flags?

As I understand, the return_mask is what is applied to page flags to
aggregate the list.
This is a separate thing, and I think it doesn't need changes except
maybe an improvement
in the documentation and visual distinction.

For the page-selection mechanism, currently required_mask and
excluded_mask have conflicting
responsibilities. I suggest to rework that to:
1. negated_flags: page flags which are to be negated before applying
the page selection using following masks;
2. required_flags: flags which all have to be set in the
(negation-applied) page flags;
3. anyof_flags: flags of which at least one has to be set in the
(negation-applied) page flags;

IOW, the resulting algorithm would be:

tested_flags = page_flags ^ negated_flags;
if (~tested_flags & required_flags)
  skip page;
if (!(tested_flags & anyof_flags))
  skip_page;

aggregate_on(page_flags & return_flags);

Best Regards
Michał Mirosław
Nadav Amit Feb. 19, 2023, 1:52 p.m. UTC | #13
On 2/2/23 1:29 PM, Muhammad Usama Anjum wrote:
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>    file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>    (PAGE_IS_SWAPPED).
> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>    pages have been written-to.
> - Find pages which have been written-to and write protect the pages
>    (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>
> To get information about which pages have been written-to and/or write
> protect the pages, following must be performed first in order:
> - The userfaultfd file descriptor is created with userfaultfd syscall.
> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
>    through UFFDIO_REGISTER IOCTL.
> Then the any part of the registered memory or the whole memory region
> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
> PAGEMAP_SCAN IOCTL.
>
> struct pagemap_scan_args is used as the argument of the IOCTL. In this
> struct:
> - The range is specified through start and len.
> - The output buffer of struct page_region array and size is specified as
>    vec and vec_len.
> - The optional maximum requested pages are specified in the max_pages.
> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
>    is the only added flag at this time.
> - The masks are specified in required_mask, anyof_mask, excluded_ mask
>    and return_mask.
>
> This IOCTL can be extended to get information about more PTE bits. This
> IOCTL doesn't support hugetlbs at the moment. No information about
> hugetlb can be obtained. This patch has evolved from a basic patch from
> Gabriel Krisman Bertazi.

I was not involved before, so I am not commenting on the API and code to 
avoid making unhelpful noise.

Having said that, some things in the code seem quite dirty and make 
understanding the code hard to read.

> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> ---
> Changes in v10:
> - move changes in tools/include/uapi/linux/fs.h to separate patch
> - update commit message
>
> Change in v8:
> - Correct is_pte_uffd_wp()
> - Improve readability and error checks
> - Remove some un-needed code
>
> Changes in v7:
> - Rebase on top of latest next
> - Fix some corner cases
> - Base soft-dirty on the uffd wp async
> - Update the terminologies
> - Optimize the memory usage inside the ioctl
>
> Changes in v6:
> - Rename variables and update comments
> - Make IOCTL independent of soft_dirty config
> - Change masks and bitmap type to _u64
> - Improve code quality
>
> Changes in v5:
> - Remove tlb flushing even for clear operation
>
> Changes in v4:
> - Update the interface and implementation
>
> Changes in v3:
> - Tighten the user-kernel interface by using explicit types and add more
>    error checking
>
> Changes in v2:
> - Convert the interface from syscall to ioctl
> - Remove pidfd support as it doesn't make sense in ioctl
> ---
>   fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
>   include/uapi/linux/fs.h |  50 +++++++
>   2 files changed, 340 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index e35a0398db63..c6bde19d63d9 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,7 @@
>   #include <linux/shmem_fs.h>
>   #include <linux/uaccess.h>
>   #include <linux/pkeys.h>
> +#include <linux/minmax.h>
>   
>   #include <asm/elf.h>
>   #include <asm/tlb.h>
> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>   }
>   #endif
>   
> +static inline bool is_pte_uffd_wp(pte_t pte)
> +{
> +	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
> +	    (pte_swp_uffd_wp_any(pte)))
> +		return true;
> +	return false;
> +}
> +
> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
> +{
> +	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
> +	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
> +		return true;
> +	return false;
> +}
> +
>   #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>   static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>   		unsigned long addr, pmd_t *pmdp)
> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
>   	return 0;
>   }
>   
> +#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
> +					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE |	PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> +#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
> +#define IS_GET_OP(a)			(a->vec)
> +#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))

I think that in general it is better to have an inline function instead 
of macros when possible, as it is clearer and checks types. Anyhow, IMHO 
most of these macros are better be open-coded.

> +
> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)	\
> +	(wt | file << 1 | present << 2 | swap << 3)
> +#define IS_WT_REQUIRED(a)				\
> +	((a->required_mask & PAGE_IS_WRITTEN) ||	\
> +	 (a->anyof_mask & PAGE_IS_WRITTEN))
> +
> +struct pagemap_scan_private {
> +	struct page_region *vec;
> +	struct page_region prev;
> +	unsigned long vec_len, vec_index;
> +	unsigned int max_pages, found_pages, flags;
> +	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
> +};
> +
> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +
> +	if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
> +		return -EPERM;
> +	if (vma->vm_flags & VM_PFNMAP)
> +		return 1;
> +	return 0;
> +}
> +
> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> +				      struct pagemap_scan_private *p, unsigned long addr,
> +				      unsigned int len)
> +{
> +	unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
> +	bool cpy = true;
> +	struct page_region *prev = &p->prev;
> +
> +	if (HAS_NO_SPACE(p))
> +		return -ENOSPC;
> +
> +	if (p->max_pages && p->found_pages + len >= p->max_pages)
> +		len = p->max_pages - p->found_pages;
> +	if (!len)
> +		return -EINVAL;
> +
> +	if (p->required_mask)
> +		cpy = ((p->required_mask & cur) == p->required_mask);
> +	if (cpy && p->anyof_mask)
> +		cpy = (p->anyof_mask & cur);
> +	if (cpy && p->excluded_mask)
> +		cpy = !(p->excluded_mask & cur);
> +	bitmap = cur & p->return_mask;
> +	if (cpy && bitmap) {
> +		if ((prev->len) && (prev->bitmap == bitmap) &&
> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
> +			prev->len += len;
The use of "len" both for bytes and pages is very confusing. Consider 
changing the name to n_pages or something similar.
> +			p->found_pages += len;
> +		} else if (p->vec_index < p->vec_len) {
> +			if (prev->len) {
> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
> +				p->vec_index++;
> +			}
> +			prev->start = addr;
> +			prev->len = len;
> +			prev->bitmap = bitmap;
> +			p->found_pages += len;
> +		} else {
> +			return -ENOSPC;
> +		}
> +	}
> +	return 0;
> +}
> +
> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
> +				     unsigned long *vec_index)
> +{
> +	struct page_region *prev = &p->prev;
> +
> +	if (prev->len) {
> +		if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
> +			return -EFAULT;
> +		p->vec_index++;
> +		(*vec_index)++;
> +		prev->len = 0;
> +	}
> +	return 0;
> +}
> +
> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +					 unsigned long end, struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	unsigned long addr = end;
> +	spinlock_t *ptl;
> +	int ret = 0;
> +	pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		bool pmd_wt;
> +
> +		pmd_wt = !is_pmd_uffd_wp(*pmd);
> +		/*
> +		 * Break huge page into small pages if operation needs to be performed is
> +		 * on a portion of the huge page.
> +		 */
> +		if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
> +			spin_unlock(ptl);
> +			split_huge_pmd(vma, pmd, start);
> +			goto process_smaller_pages;
I think that such goto's are really confusing and should be avoided. And 
using 'else' (could have easily prevented the need for goto). It is not 
the best solution though, since I think it would have been better to 
invert the conditions.
> +		}
> +		if (IS_GET_OP(p))
> +			ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
> +						  is_swap_pmd(*pmd), p, start,
> +						  (end - start)/PAGE_SIZE);
> +		spin_unlock(ptl);
> +		if (!ret) {
> +			if (pmd_wt && IS_WP_ENGAGE_OP(p))
> +				uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
> +		}
> +		return ret;
> +	}
> +process_smaller_pages:
> +	if (pmd_trans_unstable(pmd))
> +		return 0;
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
> +	if (IS_GET_OP(p)) {
> +		for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
> +			ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
> +						  pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
> +			if (ret)
> +				break;
> +		}
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
We might have not entered the loop and pte-1 would be wrong.
> +	if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
What does 'addr - start' mean? If you want to say they are not equal, 
why not say so?
> +		uffd_wp_range(walk->mm, vma, start, addr - start, true);
> +
> +	cond_resched();
> +	return ret;
> +}
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
> +				 struct mm_walk *walk)
> +{
> +	struct pagemap_scan_private *p = walk->private;
> +	struct vm_area_struct *vma = walk->vma;
> +	int ret = 0;
> +
> +	if (vma)
> +		ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
> +					  (end - addr)/PAGE_SIZE);
> +	return ret;
> +}
> +
> +/* No hugetlb support is present. */
> +static const struct mm_walk_ops pagemap_scan_ops = {
> +	.test_walk = pagemap_scan_test_walk,
> +	.pmd_entry = pagemap_scan_pmd_entry,
> +	.pte_hole = pagemap_scan_pte_hole,
> +};
> +
> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
> +{
> +	unsigned long empty_slots, vec_index = 0;
> +	unsigned long __user start, end;

The whole point of __user (attribute) is to be assigned to pointers.

> +	unsigned long __start, __end;

I think such names do not convey sufficient information.

> +	struct page_region __user *vec;
> +	struct pagemap_scan_private p;
> +	int ret = 0;
> +
> +	start = (unsigned long)untagged_addr(arg->start);
> +	vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
> +
> +	/* Validate memory ranges */
> +	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
> +		return -EINVAL;
> +	if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
> +	    (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
> +		return -EINVAL;
> +
> +	/* Detect illegal flags and masks */
> +	if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
> +	    (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
> +	    (arg->return_mask & ~PAGEMAP_BITS_ALL))

Using bitwise or to check

     (arg->required_mask | arg->anyof_mask |
      arg->excluded_mask | arg->return_mask) & ~PAGE_MAP_BITS_ALL

Would have been much cleaner, IMHO.

> +		return -EINVAL;
> +	if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
> +				!arg->return_mask))
> +		return -EINVAL;
> +	/* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
> +	if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
> +	    (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
> +		return -EINVAL;
> +
> +	end = start + arg->len;
> +	p.max_pages = arg->max_pages;
> +	p.found_pages = 0;
> +	p.flags = arg->flags;
> +	p.required_mask = arg->required_mask;
> +	p.anyof_mask = arg->anyof_mask;
> +	p.excluded_mask = arg->excluded_mask;
> +	p.return_mask = arg->return_mask;
> +	p.prev.len = 0;
> +	p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> +
> +	if (IS_GET_OP(arg)) {
> +		p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
> +		if (!p.vec)
> +			return -ENOMEM;
> +	} else {
> +		p.vec = NULL;
I find it cleaner to initialize 'p.vec = NULL' unconditionally before 
IS_GET_OP() check.
> +	}
> +	__start = __end = start;
> +	while (!ret && __end < end) {
> +		p.vec_index = 0;
> +		empty_slots = arg->vec_len - vec_index;
> +		if (p.vec_len > empty_slots)
> +			p.vec_len = empty_slots;
> +
> +		__end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
> +		if (__end > end)
> +			__end = end;
Easier to understand using min().
> +
> +		mmap_read_lock(mm);
> +		ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
> +		mmap_read_unlock(mm);
> +		if (!(!ret || ret == -ENOSPC))

Double negations complicate things unnecessarily.

And if you already "break" on ret, why do you check the condition in the 
while loop?

> +			goto free_data;
> +
> +		__start = __end;
> +		if (IS_GET_OP(arg) && p.vec_index) {
> +			if (copy_to_user(&vec[vec_index], p.vec,
> +					 p.vec_index * sizeof(struct page_region))) {
> +				ret = -EFAULT;
> +				goto free_data;
> +			}
> +			vec_index += p.vec_index;
> +		}
> +	}
> +	ret = export_prev_to_out(&p, vec, &vec_index);
> +	if (!ret)
> +		ret = vec_index;
> +free_data:
> +	if (IS_GET_OP(arg))
> +		kfree(p.vec);
Just call it unconditionally.
> +
> +	return ret;
> +}
> +
> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +	struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
> +	struct mm_struct *mm = file->private_data;
> +	struct pagemap_scan_arg argument;
> +
> +	if (cmd == PAGEMAP_SCAN) {
> +		if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
> +			return -EFAULT;
> +		return do_pagemap_cmd(mm, &argument);
> +	}
> +	return -EINVAL;
> +}
> +
>   const struct file_operations proc_pagemap_operations = {
>   	.llseek		= mem_lseek, /* borrow this */
>   	.read		= pagemap_read,
>   	.open		= pagemap_open,
>   	.release	= pagemap_release,
> +	.unlocked_ioctl = pagemap_scan_ioctl,
> +	.compat_ioctl	= pagemap_scan_ioctl,
>   };
>   #endif /* CONFIG_PROC_PAGE_MONITOR */
>   
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index b7b56871029c..1ae9a8684b48 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
>   #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>   			 RWF_APPEND)
>   
> +/* Pagemap ioctl */
> +#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
> +
> +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
> +#define PAGE_IS_WRITTEN		(1 << 0)
> +#define PAGE_IS_FILE		(1 << 1)
> +#define PAGE_IS_PRESENT		(1 << 2)
> +#define PAGE_IS_SWAPPED		(1 << 3)

These names are way too generic and are likely to be misused for the 
wrong purpose. The "_IS_" part seems confusing as well. So I think the 
naming needs to be fixed and some new type (using typedef) or enum 
should be introduced to hold these flags. I understand it is part of 
uapi and it is less common there, but it is not unheard of and does make 
things clearer.


> +
> +/*
> + * struct page_region - Page region with bitmap flags
> + * @start:	Start of the region
> + * @len:	Length of the region
> + * bitmap:	Bits sets for the region
> + */
> +struct page_region {
> +	__u64 start;
> +	__u64 len;

I presume in bytes. Would be useful to mention.

> +	__u64 bitmap;
> +};
> +
> +/*
> + * struct pagemap_scan_arg - Pagemap ioctl argument
> + * @start:		Starting address of the region
> + * @len:		Length of the region (All the pages in this length are included)
> + * @vec:		Address of page_region struct array for output
> + * @vec_len:		Length of the page_region struct array
> + * @max_pages:		Optional max return pages
> + * @flags:		Flags for the IOCTL
> + * @required_mask:	Required mask - All of these bits have to be set in the PTE
> + * @anyof_mask:		Any mask - Any of these bits are set in the PTE
> + * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
> + * @return_mask:	Bits that are to be reported in page_region
> + */
> +struct pagemap_scan_arg {
> +	__u64 start;
> +	__u64 len;
> +	__u64 vec;
> +	__u64 vec_len;
> +	__u32 max_pages;
> +	__u32 flags;
> +	__u64 required_mask;
> +	__u64 anyof_mask;
> +	__u64 excluded_mask;
> +	__u64 return_mask;
> +};
> +
> +/* Special flags */
> +#define PAGEMAP_WP_ENGAGE	(1 << 0)
> +
>   #endif /* _UAPI_LINUX_FS_H */
Muhammad Usama Anjum Feb. 20, 2023, 10:38 a.m. UTC | #14
On 2/17/23 3:10 PM, Mike Rapoport wrote:
> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>>   (PAGE_IS_SWAPPED).
>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>>   pages have been written-to.
>> - Find pages which have been written-to and write protect the pages
>>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>
>> To get information about which pages have been written-to and/or write
>> protect the pages, following must be performed first in order:
>> - The userfaultfd file descriptor is created with userfaultfd syscall.
>> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
>> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
>>   through UFFDIO_REGISTER IOCTL.
>> Then the any part of the registered memory or the whole memory region
>> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
>> PAGEMAP_SCAN IOCTL.
>>
>> struct pagemap_scan_args is used as the argument of the IOCTL. In this
>> struct:
>> - The range is specified through start and len.
>> - The output buffer of struct page_region array and size is specified as
>>   vec and vec_len.
>> - The optional maximum requested pages are specified in the max_pages.
>> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
>>   is the only added flag at this time.
>> - The masks are specified in required_mask, anyof_mask, excluded_ mask
>>   and return_mask.
>>
>> This IOCTL can be extended to get information about more PTE bits. This
>> IOCTL doesn't support hugetlbs at the moment. No information about
>> hugetlb can be obtained. This patch has evolved from a basic patch from
>> Gabriel Krisman Bertazi.
>>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
>> ---
>> Changes in v10:
>> - move changes in tools/include/uapi/linux/fs.h to separate patch
>> - update commit message
>>
>> Change in v8:
>> - Correct is_pte_uffd_wp()
>> - Improve readability and error checks
>> - Remove some un-needed code
>>
>> Changes in v7:
>> - Rebase on top of latest next
>> - Fix some corner cases
>> - Base soft-dirty on the uffd wp async
>> - Update the terminologies
>> - Optimize the memory usage inside the ioctl
>>
>> Changes in v6:
>> - Rename variables and update comments
>> - Make IOCTL independent of soft_dirty config
>> - Change masks and bitmap type to _u64
>> - Improve code quality
>>
>> Changes in v5:
>> - Remove tlb flushing even for clear operation
>>
>> Changes in v4:
>> - Update the interface and implementation
>>
>> Changes in v3:
>> - Tighten the user-kernel interface by using explicit types and add more
>>   error checking
>>
>> Changes in v2:
>> - Convert the interface from syscall to ioctl
>> - Remove pidfd support as it doesn't make sense in ioctl
>> ---
>>  fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
>>  include/uapi/linux/fs.h |  50 +++++++
>>  2 files changed, 340 insertions(+)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index e35a0398db63..c6bde19d63d9 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/shmem_fs.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/pkeys.h>
>> +#include <linux/minmax.h>
>>  
>>  #include <asm/elf.h>
>>  #include <asm/tlb.h>
>> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>>  }
>>  #endif
>>  
>> +static inline bool is_pte_uffd_wp(pte_t pte)
>> +{
>> +	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> +	    (pte_swp_uffd_wp_any(pte)))
>> +		return true;
>> +	return false;
>> +}
>> +
>> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
>> +{
>> +	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
>> +	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
>> +		return true;
>> +	return false;
>> +}
>> +
>>  #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>  static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>>  		unsigned long addr, pmd_t *pmdp)
>> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode, struct file *file)
>>  	return 0;
>>  }
>>  
>> +#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
>> +					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>> +#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE |	PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>> +#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
>> +#define IS_GET_OP(a)			(a->vec)
>> +#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))
>> +
>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)	\
>> +	(wt | file << 1 | present << 2 | swap << 3)
>> +#define IS_WT_REQUIRED(a)				\
>> +	((a->required_mask & PAGE_IS_WRITTEN) ||	\
>> +	 (a->anyof_mask & PAGE_IS_WRITTEN))
> 
> All these macros are specific to pagemap_scan_ioctl() and should be
> namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc.
> 
> Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and
> I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make
> HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros.
Will do in next version.

> 
> And I'd also make IS_GET_OP() more explicit by defining a PAGEMAP_WP_GET or
> similar flag rather than using arg->vec.
I had in the first revisions. But explicit GET_OP was removed in the
previous iterations after some feedback. Peter has also suggested this.
I'll add the GET_OP flag again.

> 
>> +
>> +struct pagemap_scan_private {
>> +	struct page_region *vec;
>> +	struct page_region prev;
>> +	unsigned long vec_len, vec_index;
>> +	unsigned int max_pages, found_pages, flags;
>> +	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
>> +};
>> +
>> +static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
> 
> Please keep the lines under 80 characters limit.
> 
>> +{
>> +	struct pagemap_scan_private *p = walk->private;
>> +	struct vm_area_struct *vma = walk->vma;
>> +
>> +	if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
>> +		return -EPERM;
>> +	if (vma->vm_flags & VM_PFNMAP)
>> +		return 1;
>> +	return 0;
>> +}
>> +
>> +static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>> +				      struct pagemap_scan_private *p, unsigned long addr,
>> +				      unsigned int len)
>> +{
>> +	unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
>> +	bool cpy = true;
>> +	struct page_region *prev = &p->prev;
>> +
>> +	if (HAS_NO_SPACE(p))
>> +		return -ENOSPC;
>> +
>> +	if (p->max_pages && p->found_pages + len >= p->max_pages)
>> +		len = p->max_pages - p->found_pages;
>> +	if (!len)
>> +		return -EINVAL;
>> +
>> +	if (p->required_mask)
>> +		cpy = ((p->required_mask & cur) == p->required_mask);
>> +	if (cpy && p->anyof_mask)
>> +		cpy = (p->anyof_mask & cur);
>> +	if (cpy && p->excluded_mask)
>> +		cpy = !(p->excluded_mask & cur);
>> +	bitmap = cur & p->return_mask;
>> +	if (cpy && bitmap) {
>> +		if ((prev->len) && (prev->bitmap == bitmap) &&
>> +		    (prev->start + prev->len * PAGE_SIZE == addr)) {
>> +			prev->len += len;
>> +			p->found_pages += len;
>> +		} else if (p->vec_index < p->vec_len) {
>> +			if (prev->len) {
>> +				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
>> +				p->vec_index++;
>> +			}
>> +			prev->start = addr;
>> +			prev->len = len;
>> +			prev->bitmap = bitmap;
>> +			p->found_pages += len;
>> +		} else {
>> +			return -ENOSPC;
>> +		}
>> +	}
>> +	return 0;
> 
> Please don't save on empty lines. Empty lines between logical pieces
> improve readability.
Sorry, I'll add them.

> 
>> +}
>> +
>> +static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
>> +				     unsigned long *vec_index)
>> +{
>> +	struct page_region *prev = &p->prev;
>> +
>> +	if (prev->len) {
>> +		if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
>> +			return -EFAULT;
>> +		p->vec_index++;
>> +		(*vec_index)++;
>> +		prev->len = 0;
>> +	}
>> +	return 0;
>> +}
>> +
>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> +					 unsigned long end, struct mm_walk *walk)
>> +{
>> +	struct pagemap_scan_private *p = walk->private;
>> +	struct vm_area_struct *vma = walk->vma;
>> +	unsigned long addr = end;
>> +	spinlock_t *ptl;
>> +	int ret = 0;
>> +	pte_t *pte;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +	ptl = pmd_trans_huge_lock(pmd, vma);
>> +	if (ptl) {
>> +		bool pmd_wt;
>> +
>> +		pmd_wt = !is_pmd_uffd_wp(*pmd);
>> +		/*
>> +		 * Break huge page into small pages if operation needs to be performed is
>> +		 * on a portion of the huge page.
>> +		 */
>> +		if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
>> +			spin_unlock(ptl);
>> +			split_huge_pmd(vma, pmd, start);
>> +			goto process_smaller_pages;
>> +		}
>> +		if (IS_GET_OP(p))
>> +			ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
>> +						  is_swap_pmd(*pmd), p, start,
>> +						  (end - start)/PAGE_SIZE);
>> +		spin_unlock(ptl);
>> +		if (!ret) {
>> +			if (pmd_wt && IS_WP_ENGAGE_OP(p))
>> +				uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
>> +		}
>> +		return ret;
>> +	}
>> +process_smaller_pages:
>> +	if (pmd_trans_unstable(pmd))
>> +		return 0;
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
>> +	if (IS_GET_OP(p)) {
>> +		for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
>> +			ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
>> +						  pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
>> +			if (ret)
>> +				break;
>> +		}
>> +	}
>> +	pte_unmap_unlock(pte - 1, ptl);
>> +	if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
>> +		uffd_wp_range(walk->mm, vma, start, addr - start, true);
>> +
>> +	cond_resched();
>> +	return ret;
>> +}
>> +
>> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
>> +				 struct mm_walk *walk)
>> +{
>> +	struct pagemap_scan_private *p = walk->private;
>> +	struct vm_area_struct *vma = walk->vma;
>> +	int ret = 0;
>> +
>> +	if (vma)
>> +		ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
>> +					  (end - addr)/PAGE_SIZE);
>> +	return ret;
>> +}
>> +
>> +/* No hugetlb support is present. */
>> +static const struct mm_walk_ops pagemap_scan_ops = {
>> +	.test_walk = pagemap_scan_test_walk,
>> +	.pmd_entry = pagemap_scan_pmd_entry,
>> +	.pte_hole = pagemap_scan_pte_hole,
>> +};
>> +
>> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
>> +{
>> +	unsigned long empty_slots, vec_index = 0;
>> +	unsigned long __user start, end;
>> +	unsigned long __start, __end;
>> +	struct page_region __user *vec;
>> +	struct pagemap_scan_private p;
>> +	int ret = 0;
>> +
>> +	start = (unsigned long)untagged_addr(arg->start);
>> +	vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
>> +
>> +	/* Validate memory ranges */
>> +	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
>> +		return -EINVAL;
>> +	if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
>> +	    (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
>> +		return -EINVAL;
>> +
>> +	/* Detect illegal flags and masks */
>> +	if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
>> +	    (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
>> +	    (arg->return_mask & ~PAGEMAP_BITS_ALL))
>> +		return -EINVAL;
>> +	if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
>> +				!arg->return_mask))
>> +		return -EINVAL;
>> +	/* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
>> +	if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
>> +	    (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
>> +		return -EINVAL;
> 
> I'd split argument validation into a separate function and split the OR'ed
> conditions into separate if statements, e.g
> 
> bool pm_scan_args_valid(struct pagemap_scan_arg *arg)
> {
> 	if (IS_GET_OP(arg)) {
> 		if (!arg->return_mask)
> 			return false;
> 		if (!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask)
> 			return false;
> 	}
> 
> 	/* ... */
> 
> 	return true;
> }
This seems a very good way. Thank you so much!

> 
>> +
>> +	end = start + arg->len;
>> +	p.max_pages = arg->max_pages;
>> +	p.found_pages = 0;
>> +	p.flags = arg->flags;
>> +	p.required_mask = arg->required_mask;
>> +	p.anyof_mask = arg->anyof_mask;
>> +	p.excluded_mask = arg->excluded_mask;
>> +	p.return_mask = arg->return_mask;
>> +	p.prev.len = 0;
>> +	p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>> +
>> +	if (IS_GET_OP(arg)) {
>> +		p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
>> +		if (!p.vec)
>> +			return -ENOMEM;
>> +	} else {
>> +		p.vec = NULL;
>> +	}
>> +	__start = __end = start;
>> +	while (!ret && __end < end) {
>> +		p.vec_index = 0;
>> +		empty_slots = arg->vec_len - vec_index;
>> +		if (p.vec_len > empty_slots)
>> +			p.vec_len = empty_slots;
>> +
>> +		__end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
>> +		if (__end > end)
>> +			__end = end;
>> +
>> +		mmap_read_lock(mm);
>> +		ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
>> +		mmap_read_unlock(mm);
>> +		if (!(!ret || ret == -ENOSPC))
>> +			goto free_data;
>> +
>> +		__start = __end;
>> +		if (IS_GET_OP(arg) && p.vec_index) {
>> +			if (copy_to_user(&vec[vec_index], p.vec,
>> +					 p.vec_index * sizeof(struct page_region))) {
>> +				ret = -EFAULT;
>> +				goto free_data;
>> +			}
>> +			vec_index += p.vec_index;
>> +		}
>> +	}
>> +	ret = export_prev_to_out(&p, vec, &vec_index);
>> +	if (!ret)
>> +		ret = vec_index;
>> +free_data:
>> +	if (IS_GET_OP(arg))
>> +		kfree(p.vec);
>> +
>> +	return ret;
>> +}
>> +
>> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> +{
>> +	struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
>> +	struct mm_struct *mm = file->private_data;
>> +	struct pagemap_scan_arg argument;
>> +
>> +	if (cmd == PAGEMAP_SCAN) {
>> +		if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
>> +			return -EFAULT;
>> +		return do_pagemap_cmd(mm, &argument);
>> +	}
>> +	return -EINVAL;
>> +}
>> +
>>  const struct file_operations proc_pagemap_operations = {
>>  	.llseek		= mem_lseek, /* borrow this */
>>  	.read		= pagemap_read,
>>  	.open		= pagemap_open,
>>  	.release	= pagemap_release,
>> +	.unlocked_ioctl = pagemap_scan_ioctl,
>> +	.compat_ioctl	= pagemap_scan_ioctl,
>>  };
>>  #endif /* CONFIG_PROC_PAGE_MONITOR */
>>  
>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>> index b7b56871029c..1ae9a8684b48 100644
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
>>  #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>>  			 RWF_APPEND)
>>  
>> +/* Pagemap ioctl */
>> +#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
>> +
>> +/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
>> +#define PAGE_IS_WRITTEN		(1 << 0)
>> +#define PAGE_IS_FILE		(1 << 1)
>> +#define PAGE_IS_PRESENT		(1 << 2)
>> +#define PAGE_IS_SWAPPED		(1 << 3)
>> +
>> +/*
>> + * struct page_region - Page region with bitmap flags
>> + * @start:	Start of the region
>> + * @len:	Length of the region
>> + * bitmap:	Bits sets for the region
>> + */
>> +struct page_region {
>> +	__u64 start;
>> +	__u64 len;
>> +	__u64 bitmap;
>> +};
>> +
>> +/*
>> + * struct pagemap_scan_arg - Pagemap ioctl argument
>> + * @start:		Starting address of the region
>> + * @len:		Length of the region (All the pages in this length are included)
>> + * @vec:		Address of page_region struct array for output
>> + * @vec_len:		Length of the page_region struct array
>> + * @max_pages:		Optional max return pages
>> + * @flags:		Flags for the IOCTL
>> + * @required_mask:	Required mask - All of these bits have to be set in the PTE
>> + * @anyof_mask:		Any mask - Any of these bits are set in the PTE
>> + * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
>> + * @return_mask:	Bits that are to be reported in page_region
>> + */
>> +struct pagemap_scan_arg {
>> +	__u64 start;
>> +	__u64 len;
>> +	__u64 vec;
>> +	__u64 vec_len;
>> +	__u32 max_pages;
>> +	__u32 flags;
>> +	__u64 required_mask;
>> +	__u64 anyof_mask;
>> +	__u64 excluded_mask;
>> +	__u64 return_mask;
>> +};
>> +
>> +/* Special flags */
>> +#define PAGEMAP_WP_ENGAGE	(1 << 0)
>> +
>>  #endif /* _UAPI_LINUX_FS_H */
>> -- 
>> 2.30.2
>>
>
Muhammad Usama Anjum Feb. 20, 2023, 11:38 a.m. UTC | #15
On 2/20/23 3:38 PM, Muhammad Usama Anjum wrote:
>>> +#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
>>> +					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>>> +#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE |	PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>>> +#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
>>> +#define IS_GET_OP(a)			(a->vec)
>>> +#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))
>>> +
>>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)	\
>>> +	(wt | file << 1 | present << 2 | swap << 3)
>>> +#define IS_WT_REQUIRED(a)				\
>>> +	((a->required_mask & PAGE_IS_WRITTEN) ||	\
>>> +	 (a->anyof_mask & PAGE_IS_WRITTEN))
>> All these macros are specific to pagemap_scan_ioctl() and should be
>> namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc.
>>
>> Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and
>> I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make
>> HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros.
> Will do in next version.
> 

IS_WP_ENGAGE_OP() and IS_GET_OP() which can be renamed to
PM_SCAN_OP_IS_WP() and PM_SCAN_OP_IS_GET() seem better to me instead of
open code as they seem more readable to me. I can open code if you insist.
Mike Rapoport Feb. 20, 2023, 1:17 p.m. UTC | #16
On Mon, Feb 20, 2023 at 04:38:10PM +0500, Muhammad Usama Anjum wrote:
> On 2/20/23 3:38 PM, Muhammad Usama Anjum wrote:
> >>> +#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
> >>> +					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> >>> +#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE |	PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
> >>> +#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
> >>> +#define IS_GET_OP(a)			(a->vec)
> >>> +#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))
> >>> +
> >>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)	\
> >>> +	(wt | file << 1 | present << 2 | swap << 3)
> >>> +#define IS_WT_REQUIRED(a)				\
> >>> +	((a->required_mask & PAGE_IS_WRITTEN) ||	\
> >>> +	 (a->anyof_mask & PAGE_IS_WRITTEN))
> >> All these macros are specific to pagemap_scan_ioctl() and should be
> >> namespaced accordingly, e.g. PM_SCAN_BITS_ALL, PM_SCAN_BITMAP etc.
> >>
> >> Also, IS_<opname>_OP() will be more readable as PM_SCAN_OP_IS_<opname> and
> >> I'd suggest to open code IS_WP_ENGAGE_OP() and IS_GET_OP() and make
> >> HAS_NO_SPACE() and IS_WT_REQUIRED() static inlines rather than macros.
> > Will do in next version.
> > 
> 
> IS_WP_ENGAGE_OP() and IS_GET_OP() which can be renamed to
> PM_SCAN_OP_IS_WP() and PM_SCAN_OP_IS_GET() seem better to me instead of
> open code as they seem more readable to me. I can open code if you insist.

I'd suggest to see how the rework of pagemap_scan_pmd_entry() paves out. An
open-coded '&' is surely clearer than a macro/function, but if it's buried
in a long sequence of conditions, it may be not such clear win.
 
> -- 
> BR,
> Muhammad Usama Anjum
Muhammad Usama Anjum Feb. 20, 2023, 1:24 p.m. UTC | #17
Hello Nadav,

Thank you so much for reviewing!

On 2/19/23 6:52 PM, Nadav Amit wrote:
> 
> On 2/2/23 1:29 PM, Muhammad Usama Anjum wrote:
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>    file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>>    (PAGE_IS_SWAPPED).
>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>>    pages have been written-to.
>> - Find pages which have been written-to and write protect the pages
>>    (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>
>> To get information about which pages have been written-to and/or write
>> protect the pages, following must be performed first in order:
>> - The userfaultfd file descriptor is created with userfaultfd syscall.
>> - The UFFD_FEATURE_WP_ASYNC feature is set by UFFDIO_API IOCTL.
>> - The memory range is registered with UFFDIO_REGISTER_MODE_WP mode
>>    through UFFDIO_REGISTER IOCTL.
>> Then the any part of the registered memory or the whole memory region
>> can be write protected using the UFFDIO_WRITEPROTECT IOCTL or
>> PAGEMAP_SCAN IOCTL.
>>
>> struct pagemap_scan_args is used as the argument of the IOCTL. In this
>> struct:
>> - The range is specified through start and len.
>> - The output buffer of struct page_region array and size is specified as
>>    vec and vec_len.
>> - The optional maximum requested pages are specified in the max_pages.
>> - The flags can be specified in the flags field. The PAGEMAP_WP_ENGAGE
>>    is the only added flag at this time.
>> - The masks are specified in required_mask, anyof_mask, excluded_ mask
>>    and return_mask.
>>
>> This IOCTL can be extended to get information about more PTE bits. This
>> IOCTL doesn't support hugetlbs at the moment. No information about
>> hugetlb can be obtained. This patch has evolved from a basic patch from
>> Gabriel Krisman Bertazi.
> 
> I was not involved before, so I am not commenting on the API and code to
> avoid making unhelpful noise.
> 
> Having said that, some things in the code seem quite dirty and make
> understanding the code hard to read.
There is a new proposal about the flags in the interface. I'll include you
there.

> 
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
>> ---
>> Changes in v10:
>> - move changes in tools/include/uapi/linux/fs.h to separate patch
>> - update commit message
>>
>> Change in v8:
>> - Correct is_pte_uffd_wp()
>> - Improve readability and error checks
>> - Remove some un-needed code
>>
>> Changes in v7:
>> - Rebase on top of latest next
>> - Fix some corner cases
>> - Base soft-dirty on the uffd wp async
>> - Update the terminologies
>> - Optimize the memory usage inside the ioctl
>>
>> Changes in v6:
>> - Rename variables and update comments
>> - Make IOCTL independent of soft_dirty config
>> - Change masks and bitmap type to _u64
>> - Improve code quality
>>
>> Changes in v5:
>> - Remove tlb flushing even for clear operation
>>
>> Changes in v4:
>> - Update the interface and implementation
>>
>> Changes in v3:
>> - Tighten the user-kernel interface by using explicit types and add more
>>    error checking
>>
>> Changes in v2:
>> - Convert the interface from syscall to ioctl
>> - Remove pidfd support as it doesn't make sense in ioctl
>> ---
>>   fs/proc/task_mmu.c      | 290 ++++++++++++++++++++++++++++++++++++++++
>>   include/uapi/linux/fs.h |  50 +++++++
>>   2 files changed, 340 insertions(+)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index e35a0398db63..c6bde19d63d9 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,7 @@
>>   #include <linux/shmem_fs.h>
>>   #include <linux/uaccess.h>
>>   #include <linux/pkeys.h>
>> +#include <linux/minmax.h>
>>     #include <asm/elf.h>
>>   #include <asm/tlb.h>
>> @@ -1135,6 +1136,22 @@ static inline void clear_soft_dirty(struct
>> vm_area_struct *vma,
>>   }
>>   #endif
>>   +static inline bool is_pte_uffd_wp(pte_t pte)
>> +{
>> +    if ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> +        (pte_swp_uffd_wp_any(pte)))
>> +        return true;
>> +    return false;
>> +}
>> +
>> +static inline bool is_pmd_uffd_wp(pmd_t pmd)
>> +{
>> +    if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
>> +        (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
>> +        return true;
>> +    return false;
>> +}
>> +
>>   #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
>>   static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
>>           unsigned long addr, pmd_t *pmdp)
>> @@ -1763,11 +1780,284 @@ static int pagemap_release(struct inode *inode,
>> struct file *file)
>>       return 0;
>>   }
>>   +#define PAGEMAP_BITS_ALL        (PAGE_IS_WRITTEN | PAGE_IS_FILE |    \
>> +                     PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
>> +#define PAGEMAP_NON_WRITTEN_BITS    (PAGE_IS_FILE |    PAGE_IS_PRESENT |
>> PAGE_IS_SWAPPED)
>> +#define IS_WP_ENGAGE_OP(a)        (a->flags & PAGEMAP_WP_ENGAGE)
>> +#define IS_GET_OP(a)            (a->vec)
>> +#define HAS_NO_SPACE(p)            (p->max_pages && (p->found_pages ==
>> p->max_pages))
> 
> I think that in general it is better to have an inline function instead of
> macros when possible, as it is clearer and checks types. Anyhow, IMHO most
> of these macros are better be open-coded.
I'll update most of these in next version.

> 
>> +
>> +#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)    \
>> +    (wt | file << 1 | present << 2 | swap << 3)
>> +#define IS_WT_REQUIRED(a)                \
>> +    ((a->required_mask & PAGE_IS_WRITTEN) ||    \
>> +     (a->anyof_mask & PAGE_IS_WRITTEN))
>> +
>> +struct pagemap_scan_private {
>> +    struct page_region *vec;
>> +    struct page_region prev;
>> +    unsigned long vec_len, vec_index;
>> +    unsigned int max_pages, found_pages, flags;
>> +    unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
>> +};
>> +
>> +static int pagemap_scan_test_walk(unsigned long start, unsigned long
>> end, struct mm_walk *walk)
>> +{
>> +    struct pagemap_scan_private *p = walk->private;
>> +    struct vm_area_struct *vma = walk->vma;
>> +
>> +    if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) &&
>> !userfaultfd_wp_async(vma))
>> +        return -EPERM;
>> +    if (vma->vm_flags & VM_PFNMAP)
>> +        return 1;
>> +    return 0;
>> +}
>> +
>> +static inline int pagemap_scan_output(bool wt, bool file, bool pres,
>> bool swap,
>> +                      struct pagemap_scan_private *p, unsigned long addr,
>> +                      unsigned int len)
>> +{
>> +    unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
>> +    bool cpy = true;
>> +    struct page_region *prev = &p->prev;
>> +
>> +    if (HAS_NO_SPACE(p))
>> +        return -ENOSPC;
>> +
>> +    if (p->max_pages && p->found_pages + len >= p->max_pages)
>> +        len = p->max_pages - p->found_pages;
>> +    if (!len)
>> +        return -EINVAL;
>> +
>> +    if (p->required_mask)
>> +        cpy = ((p->required_mask & cur) == p->required_mask);
>> +    if (cpy && p->anyof_mask)
>> +        cpy = (p->anyof_mask & cur);
>> +    if (cpy && p->excluded_mask)
>> +        cpy = !(p->excluded_mask & cur);
>> +    bitmap = cur & p->return_mask;
>> +    if (cpy && bitmap) {
>> +        if ((prev->len) && (prev->bitmap == bitmap) &&
>> +            (prev->start + prev->len * PAGE_SIZE == addr)) {
>> +            prev->len += len;
> The use of "len" both for bytes and pages is very confusing. Consider
> changing the name to n_pages or something similar.
Will update in next version.

>> +            p->found_pages += len;
>> +        } else if (p->vec_index < p->vec_len) {
>> +            if (prev->len) {
>> +                memcpy(&p->vec[p->vec_index], prev, sizeof(struct
>> page_region));
>> +                p->vec_index++;
>> +            }
>> +            prev->start = addr;
>> +            prev->len = len;
>> +            prev->bitmap = bitmap;
>> +            p->found_pages += len;
>> +        } else {
>> +            return -ENOSPC;
>> +        }
>> +    }
>> +    return 0;
>> +}
>> +
>> +static inline int export_prev_to_out(struct pagemap_scan_private *p,
>> struct page_region __user *vec,
>> +                     unsigned long *vec_index)
>> +{
>> +    struct page_region *prev = &p->prev;
>> +
>> +    if (prev->len) {
>> +        if (copy_to_user(&vec[*vec_index], prev, sizeof(struct
>> page_region)))
>> +            return -EFAULT;
>> +        p->vec_index++;
>> +        (*vec_index)++;
>> +        prev->len = 0;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> +                     unsigned long end, struct mm_walk *walk)
>> +{
>> +    struct pagemap_scan_private *p = walk->private;
>> +    struct vm_area_struct *vma = walk->vma;
>> +    unsigned long addr = end;
>> +    spinlock_t *ptl;
>> +    int ret = 0;
>> +    pte_t *pte;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +    ptl = pmd_trans_huge_lock(pmd, vma);
>> +    if (ptl) {
>> +        bool pmd_wt;
>> +
>> +        pmd_wt = !is_pmd_uffd_wp(*pmd);
>> +        /*
>> +         * Break huge page into small pages if operation needs to be
>> performed is
>> +         * on a portion of the huge page.
>> +         */
>> +        if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
>> +            spin_unlock(ptl);
>> +            split_huge_pmd(vma, pmd, start);
>> +            goto process_smaller_pages;
> I think that such goto's are really confusing and should be avoided. And
> using 'else' (could have easily prevented the need for goto). It is not the
> best solution though, since I think it would have been better to invert the
> conditions.
Yeah, else can be used here. But then we'll have to add a tab to all the
code after adding else. We have already so many tabs and very less space to
right code. Not sure which is better.

>> +        }
>> +        if (IS_GET_OP(p))
>> +            ret = pagemap_scan_output(pmd_wt, vma->vm_file,
>> pmd_present(*pmd),
>> +                          is_swap_pmd(*pmd), p, start,
>> +                          (end - start)/PAGE_SIZE);
>> +        spin_unlock(ptl);
>> +        if (!ret) {
>> +            if (pmd_wt && IS_WP_ENGAGE_OP(p))
>> +                uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
>> +        }
>> +        return ret;
>> +    }
>> +process_smaller_pages:
>> +    if (pmd_trans_unstable(pmd))
>> +        return 0;
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>> +    pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
>> +    if (IS_GET_OP(p)) {
>> +        for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
>> +            ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
>> +                          pte_present(*pte), is_swap_pte(*pte), p, addr,
>> 1);
>> +            if (ret)
>> +                break;
>> +        }
>> +    }
>> +    pte_unmap_unlock(pte - 1, ptl);
> We might have not entered the loop and pte-1 would be wrong.
>> +    if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
> What does 'addr - start' mean? If you want to say they are not equal, why
> not say so?
This has been revamped in the next version.

>> +        uffd_wp_range(walk->mm, vma, start, addr - start, true);
>> +
>> +    cond_resched();
>> +    return ret;
>> +}
>> +
>> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
>> int depth,
>> +                 struct mm_walk *walk)
>> +{
>> +    struct pagemap_scan_private *p = walk->private;
>> +    struct vm_area_struct *vma = walk->vma;
>> +    int ret = 0;
>> +
>> +    if (vma)
>> +        ret = pagemap_scan_output(false, vma->vm_file, false, false, p,
>> addr,
>> +                      (end - addr)/PAGE_SIZE);
>> +    return ret;
>> +}
>> +
>> +/* No hugetlb support is present. */
>> +static const struct mm_walk_ops pagemap_scan_ops = {
>> +    .test_walk = pagemap_scan_test_walk,
>> +    .pmd_entry = pagemap_scan_pmd_entry,
>> +    .pte_hole = pagemap_scan_pte_hole,
>> +};
>> +
>> +static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg
>> *arg)
>> +{
>> +    unsigned long empty_slots, vec_index = 0;
>> +    unsigned long __user start, end;
> 
> The whole point of __user (attribute) is to be assigned to pointers.
I'll remove it.

> 
>> +    unsigned long __start, __end;
> 
> I think such names do not convey sufficient information.
I'll update it.

> 
>> +    struct page_region __user *vec;
>> +    struct pagemap_scan_private p;
>> +    int ret = 0;
>> +
>> +    start = (unsigned long)untagged_addr(arg->start);
>> +    vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
>> +
>> +    /* Validate memory ranges */
>> +    if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user
>> *)start, arg->len)))
>> +        return -EINVAL;
>> +    if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
>> +        (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct
>> page_region)))))
>> +        return -EINVAL;
>> +
>> +    /* Detect illegal flags and masks */
>> +    if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask &
>> ~PAGEMAP_BITS_ALL) ||
>> +        (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask &
>> ~PAGEMAP_BITS_ALL) ||
>> +        (arg->return_mask & ~PAGEMAP_BITS_ALL))
> 
> Using bitwise or to check
> 
>     (arg->required_mask | arg->anyof_mask |
>      arg->excluded_mask | arg->return_mask) & ~PAGE_MAP_BITS_ALL
> 
> Would have been much cleaner, IMHO.
I'll update it.

> 
>> +        return -EINVAL;
>> +    if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask &&
>> !arg->excluded_mask) ||
>> +                !arg->return_mask))
>> +        return -EINVAL;
>> +    /* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also
>> specified. */
>> +    if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask &
>> PAGEMAP_NON_WRITTEN_BITS) ||
>> +        (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
>> +        return -EINVAL;
>> +
>> +    end = start + arg->len;
>> +    p.max_pages = arg->max_pages;
>> +    p.found_pages = 0;
>> +    p.flags = arg->flags;
>> +    p.required_mask = arg->required_mask;
>> +    p.anyof_mask = arg->anyof_mask;
>> +    p.excluded_mask = arg->excluded_mask;
>> +    p.return_mask = arg->return_mask;
>> +    p.prev.len = 0;
>> +    p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>> +
>> +    if (IS_GET_OP(arg)) {
>> +        p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>> GFP_KERNEL);
>> +        if (!p.vec)
>> +            return -ENOMEM;
>> +    } else {
>> +        p.vec = NULL;
> I find it cleaner to initialize 'p.vec = NULL' unconditionally before
> IS_GET_OP() check.
It'll get updated.

>> +    }
>> +    __start = __end = start;
>> +    while (!ret && __end < end) {
>> +        p.vec_index = 0;
>> +        empty_slots = arg->vec_len - vec_index;
>> +        if (p.vec_len > empty_slots)
>> +            p.vec_len = empty_slots;
>> +
>> +        __end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
>> +        if (__end > end)
>> +            __end = end;
> Easier to understand using min().
Will update.

>> +
>> +        mmap_read_lock(mm);
>> +        ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
>> +        mmap_read_unlock(mm);
>> +        if (!(!ret || ret == -ENOSPC))
> 
> Double negations complicate things unnecessarily.
> 
> And if you already "break" on ret, why do you check the condition in the
> while loop?
Ohh, good catch.

> 
>> +            goto free_data;
>> +
>> +        __start = __end;
>> +        if (IS_GET_OP(arg) && p.vec_index) {
>> +            if (copy_to_user(&vec[vec_index], p.vec,
>> +                     p.vec_index * sizeof(struct page_region))) {
>> +                ret = -EFAULT;
>> +                goto free_data;
>> +            }
>> +            vec_index += p.vec_index;
>> +        }
>> +    }
>> +    ret = export_prev_to_out(&p, vec, &vec_index);
>> +    if (!ret)
>> +        ret = vec_index;
>> +free_data:
>> +    if (IS_GET_OP(arg))
>> +        kfree(p.vec);
> Just call it unconditionally.
I didn't know it. I'll do it.

>> +
>> +    return ret;
>> +}
>> +
>> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
>> unsigned long arg)
>> +{
>> +    struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg
>> __user *)arg;
>> +    struct mm_struct *mm = file->private_data;
>> +    struct pagemap_scan_arg argument;
>> +
>> +    if (cmd == PAGEMAP_SCAN) {
>> +        if (copy_from_user(&argument, uarg, sizeof(struct
>> pagemap_scan_arg)))
>> +            return -EFAULT;
>> +        return do_pagemap_cmd(mm, &argument);
>> +    }
>> +    return -EINVAL;
>> +}
>> +
>>   const struct file_operations proc_pagemap_operations = {
>>       .llseek        = mem_lseek, /* borrow this */
>>       .read        = pagemap_read,
>>       .open        = pagemap_open,
>>       .release    = pagemap_release,
>> +    .unlocked_ioctl = pagemap_scan_ioctl,
>> +    .compat_ioctl    = pagemap_scan_ioctl,
>>   };
>>   #endif /* CONFIG_PROC_PAGE_MONITOR */
>>   diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>> index b7b56871029c..1ae9a8684b48 100644
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
>>   #define RWF_SUPPORTED    (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>>                RWF_APPEND)
>>   +/* Pagemap ioctl */
>> +#define PAGEMAP_SCAN    _IOWR('f', 16, struct pagemap_scan_arg)
>> +
>> +/* Bits are set in the bitmap of the page_region and masks in
>> pagemap_scan_args */
>> +#define PAGE_IS_WRITTEN        (1 << 0)
>> +#define PAGE_IS_FILE        (1 << 1)
>> +#define PAGE_IS_PRESENT        (1 << 2)
>> +#define PAGE_IS_SWAPPED        (1 << 3)
> 
> These names are way too generic and are likely to be misused for the wrong
> purpose. The "_IS_" part seems confusing as well. So I think the naming
> needs to be fixed and some new type (using typedef) or enum should be
> introduced to hold these flags. I understand it is part of uapi and it is
> less common there, but it is not unheard of and does make things clearer.
Do you think PM_SCAN_PAGE_IS_* work here?

> 
> 
>> +
>> +/*
>> + * struct page_region - Page region with bitmap flags
>> + * @start:    Start of the region
>> + * @len:    Length of the region
>> + * bitmap:    Bits sets for the region
>> + */
>> +struct page_region {
>> +    __u64 start;
>> +    __u64 len;
> 
> I presume in bytes. Would be useful to mention.
Length of region in pages.

> 
>> +    __u64 bitmap;
>> +};
>> +
>> +/*
>> + * struct pagemap_scan_arg - Pagemap ioctl argument
>> + * @start:        Starting address of the region
>> + * @len:        Length of the region (All the pages in this length are
>> included)
>> + * @vec:        Address of page_region struct array for output
>> + * @vec_len:        Length of the page_region struct array
>> + * @max_pages:        Optional max return pages
>> + * @flags:        Flags for the IOCTL
>> + * @required_mask:    Required mask - All of these bits have to be set
>> in the PTE
>> + * @anyof_mask:        Any mask - Any of these bits are set in the PTE
>> + * @excluded_mask:    Exclude mask - None of these bits are set in the PTE
>> + * @return_mask:    Bits that are to be reported in page_region
>> + */
>> +struct pagemap_scan_arg {
>> +    __u64 start;
>> +    __u64 len;
>> +    __u64 vec;
>> +    __u64 vec_len;
>> +    __u32 max_pages;
>> +    __u32 flags;
>> +    __u64 required_mask;
>> +    __u64 anyof_mask;
>> +    __u64 excluded_mask;
>> +    __u64 return_mask;
>> +};
>> +
>> +/* Special flags */
>> +#define PAGEMAP_WP_ENGAGE    (1 << 0)
>> +
>>   #endif /* _UAPI_LINUX_FS_H */
Mike Rapoport Feb. 20, 2023, 1:26 p.m. UTC | #18
Hi,

On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>   (PAGE_IS_SWAPPED).
> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>   pages have been written-to.
> - Find pages which have been written-to and write protect the pages
>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
> 
> +/*
> + * struct pagemap_scan_arg - Pagemap ioctl argument
> + * @start:		Starting address of the region
> + * @len:		Length of the region (All the pages in this length are included)
> + * @vec:		Address of page_region struct array for output
> + * @vec_len:		Length of the page_region struct array
> + * @max_pages:		Optional max return pages
> + * @flags:		Flags for the IOCTL
> + * @required_mask:	Required mask - All of these bits have to be set in the PTE
> + * @anyof_mask:		Any mask - Any of these bits are set in the PTE
> + * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
> + * @return_mask:	Bits that are to be reported in page_region
> + */
> +struct pagemap_scan_arg {
> +	__u64 start;
> +	__u64 len;
> +	__u64 vec;
> +	__u64 vec_len;
> +	__u32 max_pages;
> +	__u32 flags;
> +	__u64 required_mask;
> +	__u64 anyof_mask;
> +	__u64 excluded_mask;
> +	__u64 return_mask;
> +};

After Nadav's comment I've realized I missed the API part :)

A few quick notes for now:
* The arg struct is fixed, so it would be impossible to extend the API
later. Following the clone3() example, I'd add 'size' field to the
pagemam_scan_arg so that it would be possible to add new fields afterwards.
* Please make flags __u64, just in case
* Put size and flags at the beginning of the struct, e.g.

strucr pagemap_scan_arg {
	size_t size;
	__u64 flags;
	/* all the rest */
};

> +
> +/* Special flags */
> +#define PAGEMAP_WP_ENGAGE	(1 << 0)
> +
>  #endif /* _UAPI_LINUX_FS_H */
> -- 
> 2.30.2
>
Muhammad Usama Anjum Feb. 21, 2023, 7:02 a.m. UTC | #19
On 2/20/23 6:26 PM, Mike Rapoport wrote:
> Hi,
> 
> On Thu, Feb 02, 2023 at 04:29:12PM +0500, Muhammad Usama Anjum wrote:
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>>   (PAGE_IS_SWAPPED).
>> - Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
>>   pages have been written-to.
>> - Find pages which have been written-to and write protect the pages
>>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
>>
>> +/*
>> + * struct pagemap_scan_arg - Pagemap ioctl argument
>> + * @start:		Starting address of the region
>> + * @len:		Length of the region (All the pages in this length are included)
>> + * @vec:		Address of page_region struct array for output
>> + * @vec_len:		Length of the page_region struct array
>> + * @max_pages:		Optional max return pages
>> + * @flags:		Flags for the IOCTL
>> + * @required_mask:	Required mask - All of these bits have to be set in the PTE
>> + * @anyof_mask:		Any mask - Any of these bits are set in the PTE
>> + * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
>> + * @return_mask:	Bits that are to be reported in page_region
>> + */
>> +struct pagemap_scan_arg {
>> +	__u64 start;
>> +	__u64 len;
>> +	__u64 vec;
>> +	__u64 vec_len;
>> +	__u32 max_pages;
>> +	__u32 flags;
>> +	__u64 required_mask;
>> +	__u64 anyof_mask;
>> +	__u64 excluded_mask;
>> +	__u64 return_mask;
>> +};
> 
> After Nadav's comment I've realized I missed the API part :)
> 
> A few quick notes for now:
> * The arg struct is fixed, so it would be impossible to extend the API
> later. Following the clone3() example, I'd add 'size' field to the
> pagemam_scan_arg so that it would be possible to add new fields afterwards.
> * Please make flags __u64, just in case
> * Put size and flags at the beginning of the struct, e.g.
> 
> strucr pagemap_scan_arg {
> 	size_t size;
> 	__u64 flags;
> 	/* all the rest */
> };
Updated. Thank you so much!

> 
>> +
>> +/* Special flags */
>> +#define PAGEMAP_WP_ENGAGE	(1 << 0)
>> +
>>  #endif /* _UAPI_LINUX_FS_H */
>> -- 
>> 2.30.2
>>
>
Muhammad Usama Anjum Feb. 21, 2023, 10:28 a.m. UTC | #20
Hi Michał,

Thank you so much for comment!

On 2/17/23 8:18 PM, Michał Mirosław wrote:
> On Thu, 2 Feb 2023 at 12:30, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
> [...]
>> - The masks are specified in required_mask, anyof_mask, excluded_ mask
>>   and return_mask.
> [...]

The interface was suggested by Andrei back on the review of v3 [1]:
> I mean we should be able to specify for what pages we need to get info
> for. An ioctl argument can have these four fields:
> * required bits (rmask & mask == mask) - all bits from this mask have to
be set.
> * any of these bits (amask & mask != 0) - any of these bits is set.
> * exclude masks (emask & mask == 0) = none of these bits are set.
> * return mask - bits that have to be reported to user.

> 
> May I suggest a slightly modified interface for the flags?
I've added everyone who may be interested in making interface better.

> 
> As I understand, the return_mask is what is applied to page flags to
> aggregate the list.
> This is a separate thing, and I think it doesn't need changes except
> maybe an improvement
> in the documentation and visual distinction.
> 
> For the page-selection mechanism, currently required_mask and
> excluded_mask have conflicting
They are opposite of each other:
All the set bits in required_mask must be set for the page to be selected.
All the set bits in excluded_mask must _not_ be set for the page to be
selected.

> responsibilities. I suggest to rework that to:
> 1. negated_flags: page flags which are to be negated before applying
> the page selection using following masks;
Sorry I'm unable to understand the negation (which is XOR?). Lets look at
the truth table:
Page Flag	negated_flags		
0		0			0
0		1			1
1		0			1
1		1			0

If a page flag is 0 and negated_flag is 1, the result would be 1 which has
changed the page flag. It isn't making sense to me. Why the page flag bit
is being fliped?

When Anrdei had proposed these masks, they seemed like a fancy way of
filtering inside kernel and it was straight forward to understand. These
masks would help his use cases for CRIU. So I'd included it. Please can you
elaborate what is the purpose of negation?

> 2. required_flags: flags which all have to be set in the
> (negation-applied) page flags;
> 3. anyof_flags: flags of which at least one has to be set in the
> (negation-applied) page flags;
> 
> IOW, the resulting algorithm would be:
> 
> tested_flags = page_flags ^ negated_flags;
> if (~tested_flags & required_flags)
>   skip page;
> if (!(tested_flags & anyof_flags))
>   skip_page;
> 
> aggregate_on(page_flags & return_flags);
> 
> Best Regards
> Michał Mirosław

[1] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com
Michał Mirosław Feb. 21, 2023, 12:42 p.m. UTC | #21
On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
>
> Hi Michał,
>
> Thank you so much for comment!
>
> On 2/17/23 8:18 PM, Michał Mirosław wrote:
[...]
> > For the page-selection mechanism, currently required_mask and
> > excluded_mask have conflicting
> They are opposite of each other:
> All the set bits in required_mask must be set for the page to be selected.
> All the set bits in excluded_mask must _not_ be set for the page to be
> selected.
>
> > responsibilities. I suggest to rework that to:
> > 1. negated_flags: page flags which are to be negated before applying
> > the page selection using following masks;
> Sorry I'm unable to understand the negation (which is XOR?). Lets look at
> the truth table:
> Page Flag       negated_flags
> 0               0                       0
> 0               1                       1
> 1               0                       1
> 1               1                       0
>
> If a page flag is 0 and negated_flag is 1, the result would be 1 which has
> changed the page flag. It isn't making sense to me. Why the page flag bit
> is being fliped?
>
> When Anrdei had proposed these masks, they seemed like a fancy way of
> filtering inside kernel and it was straight forward to understand. These
> masks would help his use cases for CRIU. So I'd included it. Please can you
> elaborate what is the purpose of negation?

The XOR is a way to invert the tested value of a flag (from positive
to negative and the other way) without having the API with invalid
values (with required_flags and excluded_flags you need to define a
rule about what happens if a flag is present in both of the masks -
either prioritise one mask over the other or reject the call).
(Note: the XOR is applied only to the value of the flags for the
purpose of testing page-selection criteria.)

So:
1. if a flag is not set in negated_flags, but set in required_flags,
then it means "this flag must be one" - equivalent to it being set in
required_flag (in your current version of the API).
2. if a flag is set in negated_flags and also in required_flags, then
it means "this flag must be zero" - equivalent to it being set in
excluded_flags.

The same thing goes for anyof_flags: if a flag is set in anyof_flags,
then for it to be considered matched:
1. it must have a value of 1 if it is not set in negated_flags
2. it must have a value of 0 if it is set in negated_flags

BTW, I think I assumed that both conditions (all flags in
required_flags and at least one in anyof_flags is present) need to be
true for the page to be selected - is this your intention? The example
code has a bug though, in that if anyof_flags is zero it will never
match. Let me fix the selection part:

// calc. a mask of flags that have expected ("active") values
tested_flags = page_flags ^ negated_flags;
// are all required flags in "active" state? [== all zero when negated]
if (~tested_flags & required_mask)
  skip page;
// is any extra flag "active"?
if (anyof_flags && !(tested_flags & anyof_flags))
  skip page;


Best Regards
Michał Mirosław
Muhammad Usama Anjum Feb. 22, 2023, 10:11 a.m. UTC | #22
On 2/21/23 5:42 PM, Michał Mirosław wrote:
> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>>
>> Hi Michał,
>>
>> Thank you so much for comment!
>>
>> On 2/17/23 8:18 PM, Michał Mirosław wrote:
> [...]
>>> For the page-selection mechanism, currently required_mask and
>>> excluded_mask have conflicting
>> They are opposite of each other:
>> All the set bits in required_mask must be set for the page to be selected.
>> All the set bits in excluded_mask must _not_ be set for the page to be
>> selected.
>>
>>> responsibilities. I suggest to rework that to:
>>> 1. negated_flags: page flags which are to be negated before applying
>>> the page selection using following masks;
>> Sorry I'm unable to understand the negation (which is XOR?). Lets look at
>> the truth table:
>> Page Flag       negated_flags
>> 0               0                       0
>> 0               1                       1
>> 1               0                       1
>> 1               1                       0
>>
>> If a page flag is 0 and negated_flag is 1, the result would be 1 which has
>> changed the page flag. It isn't making sense to me. Why the page flag bit
>> is being fliped?
>>
>> When Anrdei had proposed these masks, they seemed like a fancy way of
>> filtering inside kernel and it was straight forward to understand. These
>> masks would help his use cases for CRIU. So I'd included it. Please can you
>> elaborate what is the purpose of negation?
> 
> The XOR is a way to invert the tested value of a flag (from positive
> to negative and the other way) without having the API with invalid
> values (with required_flags and excluded_flags you need to define a
> rule about what happens if a flag is present in both of the masks -
> either prioritise one mask over the other or reject the call).
At minimum, one mask (required, any or excluded) must be specified. For a
page to get selected, the page flags must fulfill the criterion of all the
specified masks.

If a flag is present in both required_mask and excluded_mask, the
required_mask would select a page. But exculded_mask would drop the page.
So page page would be dropped. It is responsibility of the user to
correctly specify the flags.

matched = true;
if (p->required_mask)
	matched = ((p->required_mask & bitmap) == p->required_mask);
if (matched && p->anyof_mask)
	matched = (p->anyof_mask & bitmap);
if (matched && p->excluded_mask)
	matched = !(p->excluded_mask & bitmap);

if (matched && bitmap) {
	// page selected
}

Do you accept/like this behavior of masks after explaintation?

> (Note: the XOR is applied only to the value of the flags for the
> purpose of testing page-selection criteria.)
> 
> So:
> 1. if a flag is not set in negated_flags, but set in required_flags,
> then it means "this flag must be one" - equivalent to it being set in
> required_flag (in your current version of the API).
> 2. if a flag is set in negated_flags and also in required_flags, then
> it means "this flag must be zero" - equivalent to it being set in
> excluded_flags.
Lets translate words into table:
pageflags	required_flags	negated_flags	matched
1		1		0		yes
0		1		1		yes

> 
> The same thing goes for anyof_flags: if a flag is set in anyof_flags,
> then for it to be considered matched:
> 1. it must have a value of 1 if it is not set in negated_flags
> 2. it must have a value of 0 if it is set in negated_flags

pageflags	anyof_flags	negated_flags	matched
1		1		0		yes
0		1		1		yes

> 
> BTW, I think I assumed that both conditions (all flags in
> required_flags and at least one in anyof_flags is present) need to be
> true for the page to be selected - is this your intention? 
All the masks are optional. If all or any of the 3 masks are specified, the
page flags must pass these masks to get selected.

> The example
> code has a bug though, in that if anyof_flags is zero it will never
> match. Let me fix the selection part:
> 
> // calc. a mask of flags that have expected ("active") values
> tested_flags = page_flags ^ negated_flags;
> // are all required flags in "active" state? [== all zero when negated]
> if (~tested_flags & required_mask)
>   skip page;
> // is any extra flag "active"?
> if (anyof_flags && !(tested_flags & anyof_flags))
>   skip page;
> 
After taking a while to understand this and compare with already present
flag system, `negated flags` is comparatively difficult to understand while
already present flags seem easier.

> 
> Best Regards
> Michał Mirosław
Michał Mirosław Feb. 22, 2023, 10:44 a.m. UTC | #23
On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 2/21/23 5:42 PM, Michał Mirosław wrote:
> > On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >>
> >> Hi Michał,
> >>
> >> Thank you so much for comment!
> >>
> >> On 2/17/23 8:18 PM, Michał Mirosław wrote:
> > [...]
> >>> For the page-selection mechanism, currently required_mask and
> >>> excluded_mask have conflicting
> >> They are opposite of each other:
> >> All the set bits in required_mask must be set for the page to be selected.
> >> All the set bits in excluded_mask must _not_ be set for the page to be
> >> selected.
> >>
> >>> responsibilities. I suggest to rework that to:
> >>> 1. negated_flags: page flags which are to be negated before applying
> >>> the page selection using following masks;
> >> Sorry I'm unable to understand the negation (which is XOR?). Lets look at
> >> the truth table:
> >> Page Flag       negated_flags
> >> 0               0                       0
> >> 0               1                       1
> >> 1               0                       1
> >> 1               1                       0
> >>
> >> If a page flag is 0 and negated_flag is 1, the result would be 1 which has
> >> changed the page flag. It isn't making sense to me. Why the page flag bit
> >> is being fliped?
> >>
> >> When Anrdei had proposed these masks, they seemed like a fancy way of
> >> filtering inside kernel and it was straight forward to understand. These
> >> masks would help his use cases for CRIU. So I'd included it. Please can you
> >> elaborate what is the purpose of negation?
> >
> > The XOR is a way to invert the tested value of a flag (from positive
> > to negative and the other way) without having the API with invalid
> > values (with required_flags and excluded_flags you need to define a
> > rule about what happens if a flag is present in both of the masks -
> > either prioritise one mask over the other or reject the call).
> At minimum, one mask (required, any or excluded) must be specified. For a
> page to get selected, the page flags must fulfill the criterion of all the
> specified masks.

[Please see the comment below.]

[...]
> Lets translate words into table:
[Yes, those tables captured the intent correctly.]

> > BTW, I think I assumed that both conditions (all flags in
> > required_flags and at least one in anyof_flags is present) need to be
> > true for the page to be selected - is this your intention?
> All the masks are optional. If all or any of the 3 masks are specified, the
> page flags must pass these masks to get selected.

This explanation contradicts in part the introductory paragraph, but
this version seems more useful as you can pass all masks zero to have
all pages selected.

> > The example
> > code has a bug though, in that if anyof_flags is zero it will never
> > match. Let me fix the selection part:
> >
> > // calc. a mask of flags that have expected ("active") values
> > tested_flags = page_flags ^ negated_flags;
> > // are all required flags in "active" state? [== all zero when negated]
> > if (~tested_flags & required_mask)
> >   skip page;
> > // is any extra flag "active"?
> > if (anyof_flags && !(tested_flags & anyof_flags))
> >   skip page;
> >
> After taking a while to understand this and compare with already present
> flag system, `negated flags` is comparatively difficult to understand while
> already present flags seem easier.

Maybe replacing negated_flags in the API with matched_values =
~negated_flags would make this better?

We compare having to understand XOR vs having to understand ordering
of required_flags and excluded_flags.
IOW my proposal is to replace branches in the masks interpretation (if
in one set then matches but if in another set then doesn't; if flags
match ... ) with plain calculation (flag is matching when equals
~negated_flags; if flags match the masks ...).

Best Regards
Michał Mirosław
Muhammad Usama Anjum Feb. 22, 2023, 11:06 a.m. UTC | #24
On 2/22/23 3:44 PM, Michał Mirosław wrote:
> On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 2/21/23 5:42 PM, Michał Mirosław wrote:
>>> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>>
>>>> Hi Michał,
>>>>
>>>> Thank you so much for comment!
>>>>
>>>> On 2/17/23 8:18 PM, Michał Mirosław wrote:
>>> [...]
>>>>> For the page-selection mechanism, currently required_mask and
>>>>> excluded_mask have conflicting
>>>> They are opposite of each other:
>>>> All the set bits in required_mask must be set for the page to be selected.
>>>> All the set bits in excluded_mask must _not_ be set for the page to be
>>>> selected.
>>>>
>>>>> responsibilities. I suggest to rework that to:
>>>>> 1. negated_flags: page flags which are to be negated before applying
>>>>> the page selection using following masks;
>>>> Sorry I'm unable to understand the negation (which is XOR?). Lets look at
>>>> the truth table:
>>>> Page Flag       negated_flags
>>>> 0               0                       0
>>>> 0               1                       1
>>>> 1               0                       1
>>>> 1               1                       0
>>>>
>>>> If a page flag is 0 and negated_flag is 1, the result would be 1 which has
>>>> changed the page flag. It isn't making sense to me. Why the page flag bit
>>>> is being fliped?
>>>>
>>>> When Anrdei had proposed these masks, they seemed like a fancy way of
>>>> filtering inside kernel and it was straight forward to understand. These
>>>> masks would help his use cases for CRIU. So I'd included it. Please can you
>>>> elaborate what is the purpose of negation?
>>>
>>> The XOR is a way to invert the tested value of a flag (from positive
>>> to negative and the other way) without having the API with invalid
>>> values (with required_flags and excluded_flags you need to define a
>>> rule about what happens if a flag is present in both of the masks -
>>> either prioritise one mask over the other or reject the call).
>> At minimum, one mask (required, any or excluded) must be specified. For a
>> page to get selected, the page flags must fulfill the criterion of all the
>> specified masks.
> 
> [Please see the comment below.]
> 
> [...]
>> Lets translate words into table:
> [Yes, those tables captured the intent correctly.]
> 
>>> BTW, I think I assumed that both conditions (all flags in
>>> required_flags and at least one in anyof_flags is present) need to be
>>> true for the page to be selected - is this your intention?
>> All the masks are optional. If all or any of the 3 masks are specified, the
>> page flags must pass these masks to get selected.
> 
> This explanation contradicts in part the introductory paragraph, but
> this version seems more useful as you can pass all masks zero to have
> all pages selected.
Sorry, I wrote it wrongly. (All the masks are not optional.) Let me
rephrase. All or at least any 1 of the 3 masks (required, any, exclude)
must be specified. The return_mask must always be specified. Error is
returned if all 3 masks (required, anyof, exclude) are zero or return_mask
is zero.

> 
>>> The example
>>> code has a bug though, in that if anyof_flags is zero it will never
>>> match. Let me fix the selection part:
>>>
>>> // calc. a mask of flags that have expected ("active") values
>>> tested_flags = page_flags ^ negated_flags;
>>> // are all required flags in "active" state? [== all zero when negated]
>>> if (~tested_flags & required_mask)
>>>   skip page;
>>> // is any extra flag "active"?
>>> if (anyof_flags && !(tested_flags & anyof_flags))
>>>   skip page;
>>>
>> After taking a while to understand this and compare with already present
>> flag system, `negated flags` is comparatively difficult to understand while
>> already present flags seem easier.
> 
> Maybe replacing negated_flags in the API with matched_values =
> ~negated_flags would make this better?
> 
> We compare having to understand XOR vs having to understand ordering
> of required_flags and excluded_flags.
There is no ordering in current masks scheme. No mask is preferable. For a
page to get selected, all the definitions of the masks must be fulfilled.
You have come up with good example that what if required_mask =
exclude_mask. In this case, no page will fulfill the criterion and hence no
page would be selected. It is user's fault that he isn't understanding the
definitions of these masks correctly.

Now thinking about it, I can add a error check which would return error if
a bit in required and excluded masks matches. Would you like it? Lets put
this check in place.
(Previously I'd left it for user's wisdom not to do this. If he'll specify
same masks in them, he'll get no addresses out of the syscall.)

> IOW my proposal is to replace branches in the masks interpretation (if
> in one set then matches but if in another set then doesn't; if flags
> match ... ) with plain calculation (flag is matching when equals
> ~negated_flags; if flags match the masks ...).
> 
> Best Regards
> Michał Mirosław
Michał Mirosław Feb. 22, 2023, 11:48 a.m. UTC | #25
On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
>
> On 2/22/23 3:44 PM, Michał Mirosław wrote:
> > On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >> On 2/21/23 5:42 PM, Michał Mirosław wrote:
> >>> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
> >>>>
> >>>> Hi Michał,
> >>>>
> >>>> Thank you so much for comment!
> >>>>
> >>>> On 2/17/23 8:18 PM, Michał Mirosław wrote:
> >>> [...]
> >>>>> For the page-selection mechanism, currently required_mask and
> >>>>> excluded_mask have conflicting
> >>>> They are opposite of each other:
> >>>> All the set bits in required_mask must be set for the page to be selected.
> >>>> All the set bits in excluded_mask must _not_ be set for the page to be
> >>>> selected.
> >>>>
> >>>>> responsibilities. I suggest to rework that to:
> >>>>> 1. negated_flags: page flags which are to be negated before applying
> >>>>> the page selection using following masks;
> >>>> Sorry I'm unable to understand the negation (which is XOR?). Lets look at
> >>>> the truth table:
> >>>> Page Flag       negated_flags
> >>>> 0               0                       0
> >>>> 0               1                       1
> >>>> 1               0                       1
> >>>> 1               1                       0
> >>>>
> >>>> If a page flag is 0 and negated_flag is 1, the result would be 1 which has
> >>>> changed the page flag. It isn't making sense to me. Why the page flag bit
> >>>> is being fliped?
> >>>>
> >>>> When Anrdei had proposed these masks, they seemed like a fancy way of
> >>>> filtering inside kernel and it was straight forward to understand. These
> >>>> masks would help his use cases for CRIU. So I'd included it. Please can you
> >>>> elaborate what is the purpose of negation?
> >>>
> >>> The XOR is a way to invert the tested value of a flag (from positive
> >>> to negative and the other way) without having the API with invalid
> >>> values (with required_flags and excluded_flags you need to define a
> >>> rule about what happens if a flag is present in both of the masks -
> >>> either prioritise one mask over the other or reject the call).
> >> At minimum, one mask (required, any or excluded) must be specified. For a
> >> page to get selected, the page flags must fulfill the criterion of all the
> >> specified masks.
> >
> > [Please see the comment below.]
> >
> > [...]
> >> Lets translate words into table:
> > [Yes, those tables captured the intent correctly.]
> >
> >>> BTW, I think I assumed that both conditions (all flags in
> >>> required_flags and at least one in anyof_flags is present) need to be
> >>> true for the page to be selected - is this your intention?
> >> All the masks are optional. If all or any of the 3 masks are specified, the
> >> page flags must pass these masks to get selected.
> >
> > This explanation contradicts in part the introductory paragraph, but
> > this version seems more useful as you can pass all masks zero to have
> > all pages selected.
> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me
> rephrase. All or at least any 1 of the 3 masks (required, any, exclude)
> must be specified. The return_mask must always be specified. Error is
> returned if all 3 masks (required, anyof, exclude) are zero or return_mask
> is zero.

Why do you need those restrictions? I'd guess it is valid to request a
list of all pages with zero return_mask - this will return a compact
list of used ranges of the virtual address space.

> >> After taking a while to understand this and compare with already present
> >> flag system, `negated flags` is comparatively difficult to understand while
> >> already present flags seem easier.
> >
> > Maybe replacing negated_flags in the API with matched_values =
> > ~negated_flags would make this better?
> >
> > We compare having to understand XOR vs having to understand ordering
> > of required_flags and excluded_flags.
> There is no ordering in current masks scheme. No mask is preferable. For a
> page to get selected, all the definitions of the masks must be fulfilled.
> You have come up with good example that what if required_mask =
> exclude_mask. In this case, no page will fulfill the criterion and hence no
> page would be selected. It is user's fault that he isn't understanding the
> definitions of these masks correctly.
>
> Now thinking about it, I can add a error check which would return error if
> a bit in required and excluded masks matches. Would you like it? Lets put
> this check in place.
> (Previously I'd left it for user's wisdom not to do this. If he'll specify
> same masks in them, he'll get no addresses out of the syscall.)

This error case is (one of) the problems I propose avoiding. You also
need much more text to describe the requred/excluded flags
interactions and edge cases than saying that a flag must have a value
equal to corresponding bit in ~negated_flags to be matched by
requried/anyof masks.

> > IOW my proposal is to replace branches in the masks interpretation (if
> > in one set then matches but if in another set then doesn't; if flags
> > match ... ) with plain calculation (flag is matching when equals
> > ~negated_flags; if flags match the masks ...).

Best Regards
Michał Mirosław
Nadav Amit Feb. 22, 2023, 7:10 p.m. UTC | #26
> On Feb 20, 2023, at 5:24 AM, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote:
> 
>>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>> +                     unsigned long end, struct mm_walk *walk)
>>> +{
>>> +    struct pagemap_scan_private *p = walk->private;
>>> +    struct vm_area_struct *vma = walk->vma;
>>> +    unsigned long addr = end;
>>> +    spinlock_t *ptl;
>>> +    int ret = 0;
>>> +    pte_t *pte;
>>> +
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +    ptl = pmd_trans_huge_lock(pmd, vma);
>>> +    if (ptl) {
>>> +        bool pmd_wt;
>>> +
>>> +        pmd_wt = !is_pmd_uffd_wp(*pmd);
>>> +        /*
>>> +         * Break huge page into small pages if operation needs to be
>>> performed is
>>> +         * on a portion of the huge page.
>>> +         */
>>> +        if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
>>> +            spin_unlock(ptl);
>>> +            split_huge_pmd(vma, pmd, start);
>>> +            goto process_smaller_pages;
>> I think that such goto's are really confusing and should be avoided. And
>> using 'else' (could have easily prevented the need for goto). It is not the
>> best solution though, since I think it would have been better to invert the
>> conditions.
> Yeah, else can be used here. But then we'll have to add a tab to all the
> code after adding else. We have already so many tabs and very less space to
> right code. Not sure which is better.

goto’s are usually not the right solution. You can extract things into a different
function if you have to.

I’m not sure why IS_GET_OP(p) might be false and what’s the meaning of taking the
lock and dropping it in such a case. I think that the code can be simplified and
additional condition nesting can be avoided.

>>> --- a/include/uapi/linux/fs.h
>>> +++ b/include/uapi/linux/fs.h
>>> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
>>>  #define RWF_SUPPORTED    (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>>>               RWF_APPEND)
>>>  +/* Pagemap ioctl */
>>> +#define PAGEMAP_SCAN    _IOWR('f', 16, struct pagemap_scan_arg)
>>> +
>>> +/* Bits are set in the bitmap of the page_region and masks in
>>> pagemap_scan_args */
>>> +#define PAGE_IS_WRITTEN        (1 << 0)
>>> +#define PAGE_IS_FILE        (1 << 1)
>>> +#define PAGE_IS_PRESENT        (1 << 2)
>>> +#define PAGE_IS_SWAPPED        (1 << 3)
>> 
>> These names are way too generic and are likely to be misused for the wrong
>> purpose. The "_IS_" part seems confusing as well. So I think the naming
>> needs to be fixed and some new type (using typedef) or enum should be
>> introduced to hold these flags. I understand it is part of uapi and it is
>> less common there, but it is not unheard of and does make things clearer.
> Do you think PM_SCAN_PAGE_IS_* work here?

Can we lose the IS somehow?

> 
>> 
>> 
>>> +
>>> +/*
>>> + * struct page_region - Page region with bitmap flags
>>> + * @start:    Start of the region
>>> + * @len:    Length of the region
>>> + * bitmap:    Bits sets for the region
>>> + */
>>> +struct page_region {
>>> +    __u64 start;
>>> +    __u64 len;
>> 
>> I presume in bytes. Would be useful to mention.
> Length of region in pages.

Very unintuitive to me I must say. If the start is an address, I would expect
the len to be in bytes.
Muhammad Usama Anjum Feb. 23, 2023, 6:44 a.m. UTC | #27
On 2/22/23 4:48 PM, Michał Mirosław wrote:
> On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>>
>> On 2/22/23 3:44 PM, Michał Mirosław wrote:
>>> On Wed, 22 Feb 2023 at 11:11, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 2/21/23 5:42 PM, Michał Mirosław wrote:
>>>>> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
>>>>>>
>>>>>> Hi Michał,
>>>>>>
>>>>>> Thank you so much for comment!
>>>>>>
>>>>>> On 2/17/23 8:18 PM, Michał Mirosław wrote:
>>>>> [...]
>>>>>>> For the page-selection mechanism, currently required_mask and
>>>>>>> excluded_mask have conflicting
>>>>>> They are opposite of each other:
>>>>>> All the set bits in required_mask must be set for the page to be selected.
>>>>>> All the set bits in excluded_mask must _not_ be set for the page to be
>>>>>> selected.
>>>>>>
>>>>>>> responsibilities. I suggest to rework that to:
>>>>>>> 1. negated_flags: page flags which are to be negated before applying
>>>>>>> the page selection using following masks;
>>>>>> Sorry I'm unable to understand the negation (which is XOR?). Lets look at
>>>>>> the truth table:
>>>>>> Page Flag       negated_flags
>>>>>> 0               0                       0
>>>>>> 0               1                       1
>>>>>> 1               0                       1
>>>>>> 1               1                       0
>>>>>>
>>>>>> If a page flag is 0 and negated_flag is 1, the result would be 1 which has
>>>>>> changed the page flag. It isn't making sense to me. Why the page flag bit
>>>>>> is being fliped?
>>>>>>
>>>>>> When Anrdei had proposed these masks, they seemed like a fancy way of
>>>>>> filtering inside kernel and it was straight forward to understand. These
>>>>>> masks would help his use cases for CRIU. So I'd included it. Please can you
>>>>>> elaborate what is the purpose of negation?
>>>>>
>>>>> The XOR is a way to invert the tested value of a flag (from positive
>>>>> to negative and the other way) without having the API with invalid
>>>>> values (with required_flags and excluded_flags you need to define a
>>>>> rule about what happens if a flag is present in both of the masks -
>>>>> either prioritise one mask over the other or reject the call).
>>>> At minimum, one mask (required, any or excluded) must be specified. For a
>>>> page to get selected, the page flags must fulfill the criterion of all the
>>>> specified masks.
>>>
>>> [Please see the comment below.]
>>>
>>> [...]
>>>> Lets translate words into table:
>>> [Yes, those tables captured the intent correctly.]
>>>
>>>>> BTW, I think I assumed that both conditions (all flags in
>>>>> required_flags and at least one in anyof_flags is present) need to be
>>>>> true for the page to be selected - is this your intention?
>>>> All the masks are optional. If all or any of the 3 masks are specified, the
>>>> page flags must pass these masks to get selected.
>>>
>>> This explanation contradicts in part the introductory paragraph, but
>>> this version seems more useful as you can pass all masks zero to have
>>> all pages selected.
>> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me
>> rephrase. All or at least any 1 of the 3 masks (required, any, exclude)
>> must be specified. The return_mask must always be specified. Error is
>> returned if all 3 masks (required, anyof, exclude) are zero or return_mask
>> is zero.
> 
> Why do you need those restrictions? I'd guess it is valid to request a
> list of all pages with zero return_mask - this will return a compact
> list of used ranges of the virtual address space.
At the time, we are supporting 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE,
PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that user mention his
flags of interest in the return_mask. If he wants only 1 flag, he'll
specify it. Definitely if user wants only 1 flag, initially it doesn't make
any sense to mention in the return mask. But we want uniformity. If user
want, 2 or more flags in returned, return_mask becomes compulsory. So to
keep things simple and generic for any number of flags of interest
returned, the return_mask must be specified even if the flag of interest is
only 1.

> 
>>>> After taking a while to understand this and compare with already present
>>>> flag system, `negated flags` is comparatively difficult to understand while
>>>> already present flags seem easier.
>>>
>>> Maybe replacing negated_flags in the API with matched_values =
>>> ~negated_flags would make this better?
>>>
>>> We compare having to understand XOR vs having to understand ordering
>>> of required_flags and excluded_flags.
>> There is no ordering in current masks scheme. No mask is preferable. For a
>> page to get selected, all the definitions of the masks must be fulfilled.
>> You have come up with good example that what if required_mask =
>> exclude_mask. In this case, no page will fulfill the criterion and hence no
>> page would be selected. It is user's fault that he isn't understanding the
>> definitions of these masks correctly.
>>
>> Now thinking about it, I can add a error check which would return error if
>> a bit in required and excluded masks matches. Would you like it? Lets put
>> this check in place.
>> (Previously I'd left it for user's wisdom not to do this. If he'll specify
>> same masks in them, he'll get no addresses out of the syscall.)
> 
> This error case is (one of) the problems I propose avoiding. You also
> need much more text to describe the requred/excluded flags
> interactions and edge cases than saying that a flag must have a value
> equal to corresponding bit in ~negated_flags to be matched by
> requried/anyof masks.
I've found excluded_mask very intuitive as compared to negated_mask which
is so difficult to understand that I don't know how to use it correctly.
Lets take an example, I want pages which are PAGE_IS_WRITTEN and are not
PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or
PAGE_IS_SWAPPED. This can be specified as:

required_mask = PAGE_IS_WRITTEN
excluded_mask = PAGE_IS_FILE
anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP

(a) assume page_flags = 0b1111
skip page as 0b1111 & 0b0010 = true

(b) assume page_flags = 0b1001
select page as 0b1001 & 0b0010 = false

It seemed intuitive. Right? How would you achieve same thing with negated_mask?

required_mask = PAGE_IS_WRITTEN
negated_mask = PAGE_IS_FILE
anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP

(1) assume page_flags = 0b1111
tested_flags = 0b1111 ^ 0b0010 = 0b1101

(2) assume page_flags = 0b1001
tested_flags = 0b1001 ^ 0b0010 = 0b1011

In (1), we wanted to skip pages which have PAGE_IS_FILE set. But
negated_mask has just masked it and page is still getting tested if it
should be selected and it would get selected. It is wrong.

In (2), the PAGE_IS_FILE bit of page_flags was 0 and got updated to 1 or
PAGE_IS_FILE in tested_flags.

> 
>>> IOW my proposal is to replace branches in the masks interpretation (if
>>> in one set then matches but if in another set then doesn't; if flags
>>> match ... ) with plain calculation (flag is matching when equals
>>> ~negated_flags; if flags match the masks ...).
> 
> Best Regards
> Michał Mirosław
Muhammad Usama Anjum Feb. 23, 2023, 7:10 a.m. UTC | #28
Hi Nadav, Mike, Michał,

Can you please share your thoughts at [A] below?

On 2/23/23 12:10 AM, Nadav Amit wrote:
> 
> 
>> On Feb 20, 2023, at 5:24 AM, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote:
>>
>>>> +static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>>> +                     unsigned long end, struct mm_walk *walk)
>>>> +{
>>>> +    struct pagemap_scan_private *p = walk->private;
>>>> +    struct vm_area_struct *vma = walk->vma;
>>>> +    unsigned long addr = end;
>>>> +    spinlock_t *ptl;
>>>> +    int ret = 0;
>>>> +    pte_t *pte;
>>>> +
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +    ptl = pmd_trans_huge_lock(pmd, vma);
>>>> +    if (ptl) {
>>>> +        bool pmd_wt;
>>>> +
>>>> +        pmd_wt = !is_pmd_uffd_wp(*pmd);
>>>> +        /*
>>>> +         * Break huge page into small pages if operation needs to be
>>>> performed is
>>>> +         * on a portion of the huge page.
>>>> +         */
>>>> +        if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
>>>> +            spin_unlock(ptl);
>>>> +            split_huge_pmd(vma, pmd, start);
>>>> +            goto process_smaller_pages;
>>> I think that such goto's are really confusing and should be avoided. And
>>> using 'else' (could have easily prevented the need for goto). It is not the
>>> best solution though, since I think it would have been better to invert the
>>> conditions.
>> Yeah, else can be used here. But then we'll have to add a tab to all the
>> code after adding else. We have already so many tabs and very less space to
>> right code. Not sure which is better.
> 
> goto’s are usually not the right solution. You can extract things into a different
> function if you have to.
> 
> I’m not sure why IS_GET_OP(p) might be false and what’s the meaning of taking the
> lock and dropping it in such a case. I think that the code can be simplified and
> additional condition nesting can be avoided.
Lock is taken and we check if pmd has UFFD_WP set or not. In the next
version, the GET check has been removed as we have dropped WP_ENGAGE + !GET
operation. So get is always specified and condition isn't needed.

Please comment on next version if you want anything more optimized.

> 
>>>> --- a/include/uapi/linux/fs.h
>>>> +++ b/include/uapi/linux/fs.h
>>>> @@ -305,4 +305,54 @@ typedef int __bitwise __kernel_rwf_t;
>>>>  #define RWF_SUPPORTED    (RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
>>>>               RWF_APPEND)
>>>>  +/* Pagemap ioctl */
>>>> +#define PAGEMAP_SCAN    _IOWR('f', 16, struct pagemap_scan_arg)
>>>> +
>>>> +/* Bits are set in the bitmap of the page_region and masks in
>>>> pagemap_scan_args */
>>>> +#define PAGE_IS_WRITTEN        (1 << 0)
>>>> +#define PAGE_IS_FILE        (1 << 1)
>>>> +#define PAGE_IS_PRESENT        (1 << 2)
>>>> +#define PAGE_IS_SWAPPED        (1 << 3)
>>>
>>> These names are way too generic and are likely to be misused for the wrong
>>> purpose. The "_IS_" part seems confusing as well. So I think the naming
>>> needs to be fixed and some new type (using typedef) or enum should be
>>> introduced to hold these flags. I understand it is part of uapi and it is
>>> less common there, but it is not unheard of and does make things clearer.
>> Do you think PM_SCAN_PAGE_IS_* work here?
> 
> Can we lose the IS somehow?
[A] Do you think these names would work better: PM_SCAN_WRITTEN_PAGE,
PM_SCAN_FILE_PAGE, PM_SCAN_SWAP_PAGE, PM_SCAN_PRESENT_PAGE?

> 
>>
>>>
>>>
>>>> +
>>>> +/*
>>>> + * struct page_region - Page region with bitmap flags
>>>> + * @start:    Start of the region
>>>> + * @len:    Length of the region
>>>> + * bitmap:    Bits sets for the region
>>>> + */
>>>> +struct page_region {
>>>> +    __u64 start;
>>>> +    __u64 len;
>>>
>>> I presume in bytes. Would be useful to mention.
>> Length of region in pages.
> 
> Very unintuitive to me I must say. If the start is an address, I would expect
> the len to be in bytes.
The PAGEMAP_SCAN ioctl is working on page granularity level. We tell the
user if a page has certain flags are not. Keeping length in bytes doesn't
makes sense.

>
Michał Mirosław Feb. 23, 2023, 8:41 a.m. UTC | #29
On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
>
> On 2/22/23 4:48 PM, Michał Mirosław wrote:
> > On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
[...]
> >>>>> BTW, I think I assumed that both conditions (all flags in
> >>>>> required_flags and at least one in anyof_flags is present) need to be
> >>>>> true for the page to be selected - is this your intention?
> >>>> All the masks are optional. If all or any of the 3 masks are specified, the
> >>>> page flags must pass these masks to get selected.
> >>>
> >>> This explanation contradicts in part the introductory paragraph, but
> >>> this version seems more useful as you can pass all masks zero to have
> >>> all pages selected.
> >> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me
> >> rephrase. All or at least any 1 of the 3 masks (required, any, exclude)
> >> must be specified. The return_mask must always be specified. Error is
> >> returned if all 3 masks (required, anyof, exclude) are zero or return_mask
> >> is zero.
> >
> > Why do you need those restrictions? I'd guess it is valid to request a
> > list of all pages with zero return_mask - this will return a compact
> > list of used ranges of the virtual address space.
> At the time, we are supporting 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE,
> PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that user mention his
> flags of interest in the return_mask. If he wants only 1 flag, he'll
> specify it. Definitely if user wants only 1 flag, initially it doesn't make
> any sense to mention in the return mask. But we want uniformity. If user
> want, 2 or more flags in returned, return_mask becomes compulsory. So to
> keep things simple and generic for any number of flags of interest
> returned, the return_mask must be specified even if the flag of interest is
> only 1.

I'm not sure why do we want uniformity in the case of 1 flag? If a
user specifies a single required flag, I'd expect he doesn't need to
look at the flags returned as those will duplicate the information
from mere presence of a page. A user might also require a single flag,
but want all of them returned. Both requests - return 1 flag and
return 0 flags would give meaningful output, so why force one way or
the other? Allowing two will also enable users to express the intent:
they need either just a list of pages, or they need a list with
per-page flags - the need would follow from the code structure or
other factors.

> >>>> After taking a while to understand this and compare with already present
> >>>> flag system, `negated flags` is comparatively difficult to understand while
> >>>> already present flags seem easier.
> >>>
> >>> Maybe replacing negated_flags in the API with matched_values =
> >>> ~negated_flags would make this better?
> >>>
> >>> We compare having to understand XOR vs having to understand ordering
> >>> of required_flags and excluded_flags.
> >> There is no ordering in current masks scheme. No mask is preferable. For a
> >> page to get selected, all the definitions of the masks must be fulfilled.
> >> You have come up with good example that what if required_mask =
> >> exclude_mask. In this case, no page will fulfill the criterion and hence no
> >> page would be selected. It is user's fault that he isn't understanding the
> >> definitions of these masks correctly.
> >>
> >> Now thinking about it, I can add a error check which would return error if
> >> a bit in required and excluded masks matches. Would you like it? Lets put
> >> this check in place.
> >> (Previously I'd left it for user's wisdom not to do this. If he'll specify
> >> same masks in them, he'll get no addresses out of the syscall.)
> >
> > This error case is (one of) the problems I propose avoiding. You also
> > need much more text to describe the requred/excluded flags
> > interactions and edge cases than saying that a flag must have a value
> > equal to corresponding bit in ~negated_flags to be matched by
> > requried/anyof masks.
> I've found excluded_mask very intuitive as compared to negated_mask which
> is so difficult to understand that I don't know how to use it correctly.
> Lets take an example, I want pages which are PAGE_IS_WRITTEN and are not
> PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or
> PAGE_IS_SWAPPED. This can be specified as:
>
> required_mask = PAGE_IS_WRITTEN
> excluded_mask = PAGE_IS_FILE
> anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP
>
> (a) assume page_flags = 0b1111
> skip page as 0b1111 & 0b0010 = true
>
> (b) assume page_flags = 0b1001
> select page as 0b1001 & 0b0010 = false
>
> It seemed intuitive. Right? How would you achieve same thing with negated_mask?
>
> required_mask = PAGE_IS_WRITTEN
> negated_mask = PAGE_IS_FILE
> anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP
>
> (1) assume page_flags = 0b1111
> tested_flags = 0b1111 ^ 0b0010 = 0b1101
>
> (2) assume page_flags = 0b1001
> tested_flags = 0b1001 ^ 0b0010 = 0b1011
>
> In (1), we wanted to skip pages which have PAGE_IS_FILE set. But
> negated_mask has just masked it and page is still getting tested if it
> should be selected and it would get selected. It is wrong.
>
> In (2), the PAGE_IS_FILE bit of page_flags was 0 and got updated to 1 or
> PAGE_IS_FILE in tested_flags.

I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so:

required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE;
negated_flags = PAGE_IS_FILE; // flags I want zero

I also require one of PAGE_IS_PRESENT=1 or PAGE_IS_SWAP=1, so:

anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAP;

Another case: I want to analyse a process' working set:

required_mask = 0;
negated_flags = PAGE_IS_FILE;
anyof_mask = PAGE_IS_FILE | PAGE_IS_WRITTEN;

-> gathering pages modified [WRITTEN=1] or not backed by a file [FILE=0].

To clarify a bit: negated_flags doesn't mask anything: the field
inverts values of the flags (marks some "active low", if you consider
electronic signal analogy).

Best Regards
Michał Mirosław
Muhammad Usama Anjum Feb. 23, 2023, 9:23 a.m. UTC | #30
On 2/23/23 1:41 PM, Michał Mirosław wrote:
> On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>>
>> On 2/22/23 4:48 PM, Michał Mirosław wrote:
>>> On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
> [...]
>>>>>>> BTW, I think I assumed that both conditions (all flags in
>>>>>>> required_flags and at least one in anyof_flags is present) need to be
>>>>>>> true for the page to be selected - is this your intention?
>>>>>> All the masks are optional. If all or any of the 3 masks are specified, the
>>>>>> page flags must pass these masks to get selected.
>>>>>
>>>>> This explanation contradicts in part the introductory paragraph, but
>>>>> this version seems more useful as you can pass all masks zero to have
>>>>> all pages selected.
>>>> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me
>>>> rephrase. All or at least any 1 of the 3 masks (required, any, exclude)
>>>> must be specified. The return_mask must always be specified. Error is
>>>> returned if all 3 masks (required, anyof, exclude) are zero or return_mask
>>>> is zero.
>>>
>>> Why do you need those restrictions? I'd guess it is valid to request a
>>> list of all pages with zero return_mask - this will return a compact
>>> list of used ranges of the virtual address space.
>> At the time, we are supporting 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE,
>> PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that user mention his
>> flags of interest in the return_mask. If he wants only 1 flag, he'll
>> specify it. Definitely if user wants only 1 flag, initially it doesn't make
>> any sense to mention in the return mask. But we want uniformity. If user
>> want, 2 or more flags in returned, return_mask becomes compulsory. So to
>> keep things simple and generic for any number of flags of interest
>> returned, the return_mask must be specified even if the flag of interest is
>> only 1.
> 
> I'm not sure why do we want uniformity in the case of 1 flag? If a
> user specifies a single required flag, I'd expect he doesn't need to
> look at the flags returned as those will duplicate the information
> from mere presence of a page. A user might also require a single flag,
> but want all of them returned. Both requests - return 1 flag and
> return 0 flags would give meaningful output, so why force one way or
> the other? Allowing two will also enable users to express the intent:
> they need either just a list of pages, or they need a list with
> per-page flags - the need would follow from the code structure or
> other factors.
We can add as much flexibility as much people ask by keeping code simple.
But it is going to be dirty to add error check which detects if return_mask
= 0 and if there is only 1 flag of interest mentioned by the user. The
following mentioned error check is essential to return deterministic
output. Do you think this case is worth it to support and we don't want to
go with the generality for both 1 or more flag cases?

if (return_mask == 0 && hweight_long(required_mask | any_mask) != 1)
	return error;

> 
>>>>>> After taking a while to understand this and compare with already present
>>>>>> flag system, `negated flags` is comparatively difficult to understand while
>>>>>> already present flags seem easier.
>>>>>
>>>>> Maybe replacing negated_flags in the API with matched_values =
>>>>> ~negated_flags would make this better?
>>>>>
>>>>> We compare having to understand XOR vs having to understand ordering
>>>>> of required_flags and excluded_flags.
>>>> There is no ordering in current masks scheme. No mask is preferable. For a
>>>> page to get selected, all the definitions of the masks must be fulfilled.
>>>> You have come up with good example that what if required_mask =
>>>> exclude_mask. In this case, no page will fulfill the criterion and hence no
>>>> page would be selected. It is user's fault that he isn't understanding the
>>>> definitions of these masks correctly.
>>>>
>>>> Now thinking about it, I can add a error check which would return error if
>>>> a bit in required and excluded masks matches. Would you like it? Lets put
>>>> this check in place.
>>>> (Previously I'd left it for user's wisdom not to do this. If he'll specify
>>>> same masks in them, he'll get no addresses out of the syscall.)
>>>
>>> This error case is (one of) the problems I propose avoiding. You also
>>> need much more text to describe the requred/excluded flags
>>> interactions and edge cases than saying that a flag must have a value
>>> equal to corresponding bit in ~negated_flags to be matched by
>>> requried/anyof masks.
>> I've found excluded_mask very intuitive as compared to negated_mask which
>> is so difficult to understand that I don't know how to use it correctly.
>> Lets take an example, I want pages which are PAGE_IS_WRITTEN and are not
>> PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or
>> PAGE_IS_SWAPPED. This can be specified as:
>>
>> required_mask = PAGE_IS_WRITTEN
>> excluded_mask = PAGE_IS_FILE
>> anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP
>>
>> (a) assume page_flags = 0b1111
>> skip page as 0b1111 & 0b0010 = true
>>
>> (b) assume page_flags = 0b1001
>> select page as 0b1001 & 0b0010 = false
>>
>> It seemed intuitive. Right? How would you achieve same thing with negated_mask?
>>
>> required_mask = PAGE_IS_WRITTEN
>> negated_mask = PAGE_IS_FILE
>> anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP
>>
>> (1) assume page_flags = 0b1111
>> tested_flags = 0b1111 ^ 0b0010 = 0b1101
>>
>> (2) assume page_flags = 0b1001
>> tested_flags = 0b1001 ^ 0b0010 = 0b1011
>>
>> In (1), we wanted to skip pages which have PAGE_IS_FILE set. But
>> negated_mask has just masked it and page is still getting tested if it
>> should be selected and it would get selected. It is wrong.
>>
>> In (2), the PAGE_IS_FILE bit of page_flags was 0 and got updated to 1 or
>> PAGE_IS_FILE in tested_flags.
> 
> I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so:
> 
> required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE;
> negated_flags = PAGE_IS_FILE; // flags I want zero
You want PAGE_IS_FILE to be zero and at the same time you are requiring the
PAGE_IS_FILE. It is confusing. Lets go with excluded mask and excluded_mask
must never have any bit matching with required_mask. Lets stay with this as
it is intuitive and would be easy to use from the user's perspective.
Andrei and Danylo had suggested these mask scheme and have use cases for
this. Andrei and Danylo can please comment as well.

> 
> I also require one of PAGE_IS_PRESENT=1 or PAGE_IS_SWAP=1, so:
> 
> anyof_mask = PAGE_IS_PRESENT | PAGE_IS_SWAP;
> 
> Another case: I want to analyse a process' working set:
> 
> required_mask = 0;
> negated_flags = PAGE_IS_FILE;
> anyof_mask = PAGE_IS_FILE | PAGE_IS_WRITTEN;
> 
> -> gathering pages modified [WRITTEN=1] or not backed by a file [FILE=0].
> 
> To clarify a bit: negated_flags doesn't mask anything: the field
> inverts values of the flags (marks some "active low", if you consider
> electronic signal analogy).
> 
> Best Regards
> Michał Mirosław
Michał Mirosław Feb. 23, 2023, 9:42 a.m. UTC | #31
On Thu, 23 Feb 2023 at 10:23, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
>
> On 2/23/23 1:41 PM, Michał Mirosław wrote:
> > On Thu, 23 Feb 2023 at 07:44, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >>
> >> On 2/22/23 4:48 PM, Michał Mirosław wrote:
> >>> On Wed, 22 Feb 2023 at 12:06, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
> > [...]
> >>>>>>> BTW, I think I assumed that both conditions (all flags in
> >>>>>>> required_flags and at least one in anyof_flags is present) need to be
> >>>>>>> true for the page to be selected - is this your intention?
> >>>>>> All the masks are optional. If all or any of the 3 masks are specified, the
> >>>>>> page flags must pass these masks to get selected.
> >>>>>
> >>>>> This explanation contradicts in part the introductory paragraph, but
> >>>>> this version seems more useful as you can pass all masks zero to have
> >>>>> all pages selected.
> >>>> Sorry, I wrote it wrongly. (All the masks are not optional.) Let me
> >>>> rephrase. All or at least any 1 of the 3 masks (required, any, exclude)
> >>>> must be specified. The return_mask must always be specified. Error is
> >>>> returned if all 3 masks (required, anyof, exclude) are zero or return_mask
> >>>> is zero.
> >>>
> >>> Why do you need those restrictions? I'd guess it is valid to request a
> >>> list of all pages with zero return_mask - this will return a compact
> >>> list of used ranges of the virtual address space.
> >> At the time, we are supporting 4 flags (PAGE_IS_WRITTEN, PAGE_IS_FILE,
> >> PAGE_IS_PRESENT and PAGE_IS_SWAPPED). The idea is that user mention his
> >> flags of interest in the return_mask. If he wants only 1 flag, he'll
> >> specify it. Definitely if user wants only 1 flag, initially it doesn't make
> >> any sense to mention in the return mask. But we want uniformity. If user
> >> want, 2 or more flags in returned, return_mask becomes compulsory. So to
> >> keep things simple and generic for any number of flags of interest
> >> returned, the return_mask must be specified even if the flag of interest is
> >> only 1.
> >
> > I'm not sure why do we want uniformity in the case of 1 flag? If a
> > user specifies a single required flag, I'd expect he doesn't need to
> > look at the flags returned as those will duplicate the information
> > from mere presence of a page. A user might also require a single flag,
> > but want all of them returned. Both requests - return 1 flag and
> > return 0 flags would give meaningful output, so why force one way or
> > the other? Allowing two will also enable users to express the intent:
> > they need either just a list of pages, or they need a list with
> > per-page flags - the need would follow from the code structure or
> > other factors.
> We can add as much flexibility as much people ask by keeping code simple.
> But it is going to be dirty to add error check which detects if return_mask
> = 0 and if there is only 1 flag of interest mentioned by the user. The
> following mentioned error check is essential to return deterministic
> output. Do you think this case is worth it to support and we don't want to
> go with the generality for both 1 or more flag cases?
>
> if (return_mask == 0 && hweight_long(required_mask | any_mask) != 1)
>         return error;

Why would you want to add this error check? If a user requires
multiple flags but cares only about a list of matching pages, then it
would be natural to express this intent as return_mask = 0.

> >>>>>> After taking a while to understand this and compare with already present
> >>>>>> flag system, `negated flags` is comparatively difficult to understand while
> >>>>>> already present flags seem easier.
> >>>>>
> >>>>> Maybe replacing negated_flags in the API with matched_values =
> >>>>> ~negated_flags would make this better?
> >>>>>
> >>>>> We compare having to understand XOR vs having to understand ordering
> >>>>> of required_flags and excluded_flags.
> >>>> There is no ordering in current masks scheme. No mask is preferable. For a
> >>>> page to get selected, all the definitions of the masks must be fulfilled.
> >>>> You have come up with good example that what if required_mask =
> >>>> exclude_mask. In this case, no page will fulfill the criterion and hence no
> >>>> page would be selected. It is user's fault that he isn't understanding the
> >>>> definitions of these masks correctly.
> >>>>
> >>>> Now thinking about it, I can add a error check which would return error if
> >>>> a bit in required and excluded masks matches. Would you like it? Lets put
> >>>> this check in place.
> >>>> (Previously I'd left it for user's wisdom not to do this. If he'll specify
> >>>> same masks in them, he'll get no addresses out of the syscall.)
> >>>
> >>> This error case is (one of) the problems I propose avoiding. You also
> >>> need much more text to describe the requred/excluded flags
> >>> interactions and edge cases than saying that a flag must have a value
> >>> equal to corresponding bit in ~negated_flags to be matched by
> >>> requried/anyof masks.
> >> I've found excluded_mask very intuitive as compared to negated_mask which
> >> is so difficult to understand that I don't know how to use it correctly.
> >> Lets take an example, I want pages which are PAGE_IS_WRITTEN and are not
> >> PAGE_IS_FILE. In addition, the pages must be PAGE_IS_PRESENT or
> >> PAGE_IS_SWAPPED. This can be specified as:
> >>
> >> required_mask = PAGE_IS_WRITTEN
> >> excluded_mask = PAGE_IS_FILE
> >> anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP
> >>
> >> (a) assume page_flags = 0b1111
> >> skip page as 0b1111 & 0b0010 = true
> >>
> >> (b) assume page_flags = 0b1001
> >> select page as 0b1001 & 0b0010 = false
> >>
> >> It seemed intuitive. Right? How would you achieve same thing with negated_mask?
> >>
> >> required_mask = PAGE_IS_WRITTEN
> >> negated_mask = PAGE_IS_FILE
> >> anyof_mask = PAGE_IS_PRESETNT | PAGE_IS_SWAP
> >>
> >> (1) assume page_flags = 0b1111
> >> tested_flags = 0b1111 ^ 0b0010 = 0b1101
> >>
> >> (2) assume page_flags = 0b1001
> >> tested_flags = 0b1001 ^ 0b0010 = 0b1011
> >>
> >> In (1), we wanted to skip pages which have PAGE_IS_FILE set. But
> >> negated_mask has just masked it and page is still getting tested if it
> >> should be selected and it would get selected. It is wrong.
> >>
> >> In (2), the PAGE_IS_FILE bit of page_flags was 0 and got updated to 1 or
> >> PAGE_IS_FILE in tested_flags.
> >
> > I require flags PAGE_IS_WRITTEN=1, PAGE_IS_FILE=0, so:
> >
> > required_mask = PAGE_IS_WRITTEN | PAGE_IS_FILE;
> > negated_flags = PAGE_IS_FILE; // flags I want zero
> You want PAGE_IS_FILE to be zero and at the same time you are requiring the
> PAGE_IS_FILE. It is confusing.

Ok, I believe the misunderstanding comes from the naming. I "require"
the flag to be a particular value - hence include it in
"required_flags" and specify the required value in ~negated_flags. You
"require" the flag to be set (equal 1) and so include it in
"required_flags" and you "require" the flag to be clear (equal to 0)
so include it in "excluded_flags". Both approaches are correct, but I
would not consider one "easier" than the other. The former is more
general, though - makes any_of also able to match on flags cleared and
removes the possibility of a conflicting case of a flag present in
both sets.

Maybe considered_flags or matched_flags then would make the field
better understandable?

Best Regards
Michał Mirosław
Nadav Amit Feb. 23, 2023, 5:11 p.m. UTC | #32
> On Feb 22, 2023, at 11:10 PM, Muhammad Usama Anjum <usama.anjum@collabora.com> wrote:
> 
> Hi Nadav, Mike, Michał,
> 
> Can you please share your thoughts at [A] below?

I promised I won't talk about the API, but was persuaded to reconsider. I have a
general question regarding the suitablity of currently proposed high-level API.
To explore some alternatives, I'd like to suggest an alternative that may have
some advantages. If these have already been considered and dismissed, feel free
to ignore.

I believe we have two distinct usage scenarios: (1) vectored reads from pagemap,
and (2) atomic UFFD WP-read/protect. It's possible that these require separate
interfaces

Regarding vectored reads, I believe the simplest solution is to maintain the
current pagemap entry format for output and extend it if necessary. The input
can be a vector of ranges. I'm uncertain about the purpose of fields such
as 'anyof_mask' in 'pagemap_scan_arg', so I can't confirm their necessity and
whether the input need to be made. more complicated. There is a possibility
that fields such as 'anyof_mask' might expose internal APIs, so I hope they’re
not required.

For the atomic operation of 'PAGE_IS_WRITTEN' + 'PAGEMAP_WP_ENGAGE', a different
mechanism might be necessary. This function appears to be UFFD-specific.
Instead of the proposed IOCTL, an alternative option is to
use 'UFFD_FEATURE_WP_ASYNC' to log the pages that were written, similar to
page-modification logging on Intel. Since this feature appears to be specific
to UFFD, I believe it would be more appropriate to include the log as part of
the UFFD mechanism rather than the pagemap.

From my experience with UFFD, proper ordering of events  is crucial, although it
is not always done well. Therefore, we should aim for improvement, not
regression. I believe that utilizing the pagemap-based mechanism for WP'ing
might be a step in the wrong direction. I think that it would have been better
to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and
events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the
file descriptor unless the log is full.

I am sorry that I chime in that late, but I think the complications that the
proposed mechanism might raise are not negligible. And anyhow this patch-set
still requires quite a bit of work before it can be merged.
Andrei Vagin Feb. 24, 2023, 2:20 a.m. UTC | #33
On Tue, Feb 21, 2023 at 4:42 AM Michał Mirosław <emmir@google.com> wrote:
>
> On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
> >
> > Hi Michał,
> >
> > Thank you so much for comment!
> >
> > On 2/17/23 8:18 PM, Michał Mirosław wrote:
> [...]
> > > For the page-selection mechanism, currently required_mask and
> > > excluded_mask have conflicting
> > They are opposite of each other:
> > All the set bits in required_mask must be set for the page to be selected.
> > All the set bits in excluded_mask must _not_ be set for the page to be
> > selected.
> >
> > > responsibilities. I suggest to rework that to:
> > > 1. negated_flags: page flags which are to be negated before applying
> > > the page selection using following masks;
> > Sorry I'm unable to understand the negation (which is XOR?). Lets look at
> > the truth table:
> > Page Flag       negated_flags
> > 0               0                       0
> > 0               1                       1
> > 1               0                       1
> > 1               1                       0
> >
> > If a page flag is 0 and negated_flag is 1, the result would be 1 which has
> > changed the page flag. It isn't making sense to me. Why the page flag bit
> > is being fliped?
> >
> > When Anrdei had proposed these masks, they seemed like a fancy way of
> > filtering inside kernel and it was straight forward to understand. These
> > masks would help his use cases for CRIU. So I'd included it. Please can you
> > elaborate what is the purpose of negation?
>
> The XOR is a way to invert the tested value of a flag (from positive
> to negative and the other way) without having the API with invalid
> values (with required_flags and excluded_flags you need to define a
> rule about what happens if a flag is present in both of the masks -
> either prioritise one mask over the other or reject the call).
> (Note: the XOR is applied only to the value of the flags for the
> purpose of testing page-selection criteria.)

Michał,

Your API isn't much different from the current one, but it requires
a bit more brain activity for understanding.

The current set of masks can be easy translated to the new one:
negated_flags = excluded_flags
required_flags_new = excluded_flags | required_flags

As for invalid values, I think it is an advantage of the current API.
I mean we can easily detect invalid values and return EINVAL. With your
API, such mistakes will be undetectable.

As for priorities, I don't see this problem here If I don't miss something.

We can rewrite the code this way:
```
if (required_mask && ((page_flags & required_mask) != required_mask)
  skip page;
if (anyof_mask && !(page_flags & anyof_mask))
  skip page;
if (page_flags & excluded_mask)
  skip page;
```

I think the result is always the same no matter in what order each
mask is applied.

Thanks,
Andrei
Michał Mirosław Feb. 25, 2023, 9:38 a.m. UTC | #34
On Fri, 24 Feb 2023 at 03:20, Andrei Vagin <avagin@gmail.com> wrote:
>
> On Tue, Feb 21, 2023 at 4:42 AM Michał Mirosław <emmir@google.com> wrote:
> >
> > On Tue, 21 Feb 2023 at 11:28, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> > >
> > > Hi Michał,
> > >
> > > Thank you so much for comment!
> > >
> > > On 2/17/23 8:18 PM, Michał Mirosław wrote:
> > [...]
> > > > For the page-selection mechanism, currently required_mask and
> > > > excluded_mask have conflicting
> > > They are opposite of each other:
> > > All the set bits in required_mask must be set for the page to be selected.
> > > All the set bits in excluded_mask must _not_ be set for the page to be
> > > selected.
> > >
> > > > responsibilities. I suggest to rework that to:
> > > > 1. negated_flags: page flags which are to be negated before applying
> > > > the page selection using following masks;
> > > Sorry I'm unable to understand the negation (which is XOR?). Lets look at
> > > the truth table:
> > > Page Flag       negated_flags
> > > 0               0                       0
> > > 0               1                       1
> > > 1               0                       1
> > > 1               1                       0
> > >
> > > If a page flag is 0 and negated_flag is 1, the result would be 1 which has
> > > changed the page flag. It isn't making sense to me. Why the page flag bit
> > > is being fliped?
> > >
> > > When Anrdei had proposed these masks, they seemed like a fancy way of
> > > filtering inside kernel and it was straight forward to understand. These
> > > masks would help his use cases for CRIU. So I'd included it. Please can you
> > > elaborate what is the purpose of negation?
> >
> > The XOR is a way to invert the tested value of a flag (from positive
> > to negative and the other way) without having the API with invalid
> > values (with required_flags and excluded_flags you need to define a
> > rule about what happens if a flag is present in both of the masks -
> > either prioritise one mask over the other or reject the call).
> > (Note: the XOR is applied only to the value of the flags for the
> > purpose of testing page-selection criteria.)
>
> Michał,
>
> Your API isn't much different from the current one, but it requires
> a bit more brain activity for understanding.
>
> The current set of masks can be easy translated to the new one:
> negated_flags = excluded_flags
> required_flags_new = excluded_flags | required_flags
>
> As for invalid values, I think it is an advantage of the current API.
> I mean we can easily detect invalid values and return EINVAL. With your
> API, such mistakes will be undetectable.
>
> As for priorities, I don't see this problem here If I don't miss something.
>
> We can rewrite the code this way:
> ```
> if (required_mask && ((page_flags & required_mask) != required_mask)
>   skip page;
> if (anyof_mask && !(page_flags & anyof_mask))
>   skip page;
> if (page_flags & excluded_mask)
>   skip page;
> ```
>
> I think the result is always the same no matter in what order each
> mask is applied.

Hi,

I would not want the discussion to wander into easier/harder territory
as that highty depends on experience one has. What I'm arguing about
is the consistency of the API. Let me expand a bit on that.

We have two ways to look at the page_flags:
 A. the field represents a *set of elements* (tags, attributes)
present on the page;
 B. the field represents a bitfield (structure; a fixed set of boolean
fields having a value of 0 or 1)

From A follows the include/exclude way of API design for matching the
flags, and from B the matched mask (which flags to check) + value set
(what values to require).

My argument is that B is consistent with how the flags are used in the
kernel: we don't have operations that add or remove flags, but we have
operations that set or change their value.

Best Regards
Michał Mirosław
Peter Xu Feb. 27, 2023, 9:18 p.m. UTC | #35
On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
> From my experience with UFFD, proper ordering of events  is crucial, although it
> is not always done well. Therefore, we should aim for improvement, not
> regression. I believe that utilizing the pagemap-based mechanism for WP'ing
> might be a step in the wrong direction. I think that it would have been better
> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and
> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the
> file descriptor unless the log is full.

Yes this is an interesting question to think about..

Keeping the data in the pgtable has one good thing that it doesn't need any
complexity on maintaining the log, and no possibility of "log full".

If there's possible "log full" then the next question is whether we should
let the worker wait the monitor if the monitor is not fast enough to
collect those data.  It adds some slight dependency on the two threads, I
think it can make the tracking harder or impossible in latency sensitive
workloads.

The other thing is we can also make the log "never gonna full" by making it
a bitmap covering any registered ranges, but I don't either know whether
it'll be worth it for the effort.

Thanks,
Nadav Amit Feb. 27, 2023, 11:09 p.m. UTC | #36
> On Feb 27, 2023, at 1:18 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> !! External Email
> 
> On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
>> From my experience with UFFD, proper ordering of events  is crucial, although it
>> is not always done well. Therefore, we should aim for improvement, not
>> regression. I believe that utilizing the pagemap-based mechanism for WP'ing
>> might be a step in the wrong direction. I think that it would have been better
>> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and
>> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the
>> file descriptor unless the log is full.
> 
> Yes this is an interesting question to think about..
> 
> Keeping the data in the pgtable has one good thing that it doesn't need any
> complexity on maintaining the log, and no possibility of "log full".

I understand your concern, but I think that eventually it might be simpler
to maintain, since the logic of how to process the log is moved to userspace.

At the same time, handling inputs from pagemap and uffd handlers and sync’ing
them would not be too easy for userspace.

But yes, allocation on the heap for userfaultfd_wait_queue-like entries would
be needed, and there are some issues of ordering the events (I think all #PF
and other events should be ordered regardless) and how not to traverse all
async-userfaultfd_wait_queue’s (except those that block if the log is full)
when a wakeup is needed.

> 
> If there's possible "log full" then the next question is whether we should
> let the worker wait the monitor if the monitor is not fast enough to
> collect those data.  It adds some slight dependency on the two threads, I
> think it can make the tracking harder or impossible in latency sensitive
> workloads.

Again, I understand your concern. But this model that I propose is not new.
It is used with PML (page-modification logging) and KVM, and IIRC there is
a similar interface between KVM and QEMU to provide this information. There
are endless other examples for similar producer-consumer mechanisms that
might lead to stall in extreme cases. 

> 
> The other thing is we can also make the log "never gonna full" by making it
> a bitmap covering any registered ranges, but I don't either know whether
> it'll be worth it for the effort.

I do not see a benefit of half-log half-scan. It tries to take the
data-structure of one format and combine it with another.

Anyhow, I was just giving my 2 cents. Admittedly, I did not follow the
threads of previous versions and I did not see userspace components that
use the API to say something smart. Personally, I do not find the current
API proposal to be very consistent and simple, and it seems to me that it
lets pagemap do userfaultfd-related tasks, which might be considered
inappropriate and non-intuitive.

If I derailed the discussion, I apologize.
Peter Xu Feb. 28, 2023, 3:55 p.m. UTC | #37
On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote:
> 
> 
> > On Feb 27, 2023, at 1:18 PM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > !! External Email
> > 
> > On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
> >> From my experience with UFFD, proper ordering of events  is crucial, although it
> >> is not always done well. Therefore, we should aim for improvement, not
> >> regression. I believe that utilizing the pagemap-based mechanism for WP'ing
> >> might be a step in the wrong direction. I think that it would have been better
> >> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and
> >> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the
> >> file descriptor unless the log is full.
> > 
> > Yes this is an interesting question to think about..
> > 
> > Keeping the data in the pgtable has one good thing that it doesn't need any
> > complexity on maintaining the log, and no possibility of "log full".
> 
> I understand your concern, but I think that eventually it might be simpler
> to maintain, since the logic of how to process the log is moved to userspace.
> 
> At the same time, handling inputs from pagemap and uffd handlers and sync’ing
> them would not be too easy for userspace.

I do not expect a common uffd-wp async user to provide a fault handler at
all.  In my imagination it's in most cases used standalone from other uffd
modes; it means all the faults will still be handled by the kernel.  Here
we only leverage the accuracy of userfaultfd comparing to soft-dirty, so
not really real "user"-faults.

> 
> But yes, allocation on the heap for userfaultfd_wait_queue-like entries would
> be needed, and there are some issues of ordering the events (I think all #PF
> and other events should be ordered regardless) and how not to traverse all
> async-userfaultfd_wait_queue’s (except those that block if the log is full)
> when a wakeup is needed.

Will there be an ordering requirement for an async mode?  Considering it
should be async to whatever else, I would think it's not a problem, but
maybe I missed something.

> 
> > 
> > If there's possible "log full" then the next question is whether we should
> > let the worker wait the monitor if the monitor is not fast enough to
> > collect those data.  It adds some slight dependency on the two threads, I
> > think it can make the tracking harder or impossible in latency sensitive
> > workloads.
> 
> Again, I understand your concern. But this model that I propose is not new.
> It is used with PML (page-modification logging) and KVM, and IIRC there is
> a similar interface between KVM and QEMU to provide this information. There
> are endless other examples for similar producer-consumer mechanisms that
> might lead to stall in extreme cases. 

Yes, I'm not against thinking of using similar structures here.  It's just
that it's definitely more complicated on the interface, at least we need
yet one more interface to setup the rings and define its interfaces.

Note that although Muhammud is defining another new interface here too for
pagemap, I don't think it's strictly needed for uffd-wp async mode.  One
can use uffd-wp async mode with PM_UFFD_WP which is with current pagemap
interface already.

So what Muhammud is proposing here are two things to me: (1) uffd-wp async,
plus (2) a new pagemap interface (which will closely work with (1) only if
we need atomicity on get-dirty and reprotect).

Defining new interface for uffd-wp async mode will be something extra, so
IMHO besides the heap allocation on the rings, we need to also justify
whether that is needed.  That's why I think it's fine to go with what
Muhammud proposed, because it's a minimum changeset at least for userfault
to support an async mode, and anything else can be done on top if necessary.

Going a bit back to the "lead to stall in extreme cases" above, just also
want to mention that the VM use case is slightly different - dirty tracking
is only heavily used during migration afaict, and it's a short period.  Not
a lot of people will complain performance degrades during that period
because that's just rare.  And, even without the ring the perf is really
bad during migration anyway... Especially when huge pages are used to back
the guest RAM.

Here it's slightly different to me: it's about tracking dirty pages during
any possible workload, and it can be monitored periodically and frequently.
So IMHO stricter than a VM use case where migration is the only period to
use it.

> 
> > 
> > The other thing is we can also make the log "never gonna full" by making it
> > a bitmap covering any registered ranges, but I don't either know whether
> > it'll be worth it for the effort.
> 
> I do not see a benefit of half-log half-scan. It tries to take the
> data-structure of one format and combine it with another.

What I'm saying here is not half-log / half-scan, but use a single bitmap
to store what page is dirty, just like KVM_GET_DIRTY_LOG.  I think it
avoids any above "stall" issue.

> 
> Anyhow, I was just giving my 2 cents. Admittedly, I did not follow the
> threads of previous versions and I did not see userspace components that
> use the API to say something smart.

Actually similar here. :) So I'm probably not the best one to describe what
is the best to look as API.

What I know is I think the new pagemap interface is welcomed by CRIU
developers, so it may be something good with/without userfaultfd getting
involved already.  I see this as "let's add one more bit for uffd-wp" in
the new interface only.

Quotting some link I got from Muhammud before with CRIU usage:

https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com
https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com

> Personally, I do not find the current API proposal to be very consistent
> and simple, and it seems to me that it lets pagemap do
> userfaultfd-related tasks, which might be considered inappropriate and
> non-intuitive.

Yes, I agree.  I just don't know what's the best way to avoid this.

The issue here IIUC is Muhammud needs one operation to do what Windows does
with getWriteWatch() API.  It means we need to mix up GET and PROTECT in a
single shot.  If we want to use pagemap as GET, then no choice to PROTECT
also here to me.

I think it'll be the same to soft-dirty if it's used, it means we'll extend
soft-dirty modifications from clear_refs to pagemap too which I also don't
think it's as clean.

> 
> If I derailed the discussion, I apologize.

Not at all.  I just wished you joined earlier!
Nadav Amit Feb. 28, 2023, 5:21 p.m. UTC | #38
> On Feb 28, 2023, at 7:55 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> !! External Email
> 
> On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote:
>> 
>> 
>>> On Feb 27, 2023, at 1:18 PM, Peter Xu <peterx@redhat.com> wrote:
>>> 
>>> !! External Email
>>> 
>>> On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
>>>> From my experience with UFFD, proper ordering of events  is crucial, although it
>>>> is not always done well. Therefore, we should aim for improvement, not
>>>> regression. I believe that utilizing the pagemap-based mechanism for WP'ing
>>>> might be a step in the wrong direction. I think that it would have been better
>>>> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and
>>>> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the
>>>> file descriptor unless the log is full.
>>> 
>>> Yes this is an interesting question to think about..
>>> 
>>> Keeping the data in the pgtable has one good thing that it doesn't need any
>>> complexity on maintaining the log, and no possibility of "log full".
>> 
>> I understand your concern, but I think that eventually it might be simpler
>> to maintain, since the logic of how to process the log is moved to userspace.
>> 
>> At the same time, handling inputs from pagemap and uffd handlers and sync’ing
>> them would not be too easy for userspace.
> 
> I do not expect a common uffd-wp async user to provide a fault handler at
> all.  In my imagination it's in most cases used standalone from other uffd
> modes; it means all the faults will still be handled by the kernel.  Here
> we only leverage the accuracy of userfaultfd comparing to soft-dirty, so
> not really real "user"-faults.

If that is the only use-case, it might make sense. But I guess most users would
most likely use some library (and not syscalls directly). So slightly
complicating the API for better generality may be reasonable.

> 
>> 
>> But yes, allocation on the heap for userfaultfd_wait_queue-like entries would
>> be needed, and there are some issues of ordering the events (I think all #PF
>> and other events should be ordered regardless) and how not to traverse all
>> async-userfaultfd_wait_queue’s (except those that block if the log is full)
>> when a wakeup is needed.
> 
> Will there be an ordering requirement for an async mode?  Considering it
> should be async to whatever else, I would think it's not a problem, but
> maybe I missed something.

You may be right, but I am not sure. I am still not sure what use-cases are
targeted in this patch-set. For CRIU checkpoint use-case (when the app is
not running), I guess the current interface makes sense. But if there are
use-cases in which this you do care about UFFD-events this can become an
issue.

But even in some obvious use-cases, this might be the wrong interface for
major performance issues. If we think about some incremental copying of
modified pages (a-la pre-copy live-migration or to create point-in-time
snapshots), it seems to me much more efficient for application to have a
log than traversing all the page-tables.


>> 
>>> 
>>> If there's possible "log full" then the next question is whether we should
>>> let the worker wait the monitor if the monitor is not fast enough to
>>> collect those data.  It adds some slight dependency on the two threads, I
>>> think it can make the tracking harder or impossible in latency sensitive
>>> workloads.
>> 
>> Again, I understand your concern. But this model that I propose is not new.
>> It is used with PML (page-modification logging) and KVM, and IIRC there is
>> a similar interface between KVM and QEMU to provide this information. There
>> are endless other examples for similar producer-consumer mechanisms that
>> might lead to stall in extreme cases.
> 
> Yes, I'm not against thinking of using similar structures here.  It's just
> that it's definitely more complicated on the interface, at least we need
> yet one more interface to setup the rings and define its interfaces.
> 
> Note that although Muhammud is defining another new interface here too for
> pagemap, I don't think it's strictly needed for uffd-wp async mode.  One
> can use uffd-wp async mode with PM_UFFD_WP which is with current pagemap
> interface already.
> 
> So what Muhammud is proposing here are two things to me: (1) uffd-wp async,
> plus (2) a new pagemap interface (which will closely work with (1) only if
> we need atomicity on get-dirty and reprotect).
> 
> Defining new interface for uffd-wp async mode will be something extra, so
> IMHO besides the heap allocation on the rings, we need to also justify
> whether that is needed.  That's why I think it's fine to go with what
> Muhammud proposed, because it's a minimum changeset at least for userfault
> to support an async mode, and anything else can be done on top if necessary.
> 
> Going a bit back to the "lead to stall in extreme cases" above, just also
> want to mention that the VM use case is slightly different - dirty tracking
> is only heavily used during migration afaict, and it's a short period.  Not
> a lot of people will complain performance degrades during that period
> because that's just rare.  And, even without the ring the perf is really
> bad during migration anyway... Especially when huge pages are used to back
> the guest RAM.
> 
> Here it's slightly different to me: it's about tracking dirty pages during
> any possible workload, and it can be monitored periodically and frequently.
> So IMHO stricter than a VM use case where migration is the only period to
> use it.

I still don’t get the use-cases. "monitored periodically and frequently” is
not a use-case. And as I said before, actually, monitoring frequently is
more performant with a log than with scanning all the page-tables.

> 
>> 
>>> 
>>> The other thing is we can also make the log "never gonna full" by making it
>>> a bitmap covering any registered ranges, but I don't either know whether
>>> it'll be worth it for the effort.
>> 
>> I do not see a benefit of half-log half-scan. It tries to take the
>> data-structure of one format and combine it with another.
> 
> What I'm saying here is not half-log / half-scan, but use a single bitmap
> to store what page is dirty, just like KVM_GET_DIRTY_LOG.  I think it
> avoids any above "stall" issue.

Oh, I never went into the KVM details before - stupid me. If that’s what
eventually was proven to work for KVM/QEMU, then it really sounds like
the pagemap solution that Muhammad proposed.

But still not convoluting pagemap with userfaultfd (and especially
uffd-wp) can be beneficial. Linus already threw some comments here and
there about disliking uffd-wp, and I’m not sure adding uffd-wp specific
stuff to pagemap would be welcomed.

Anyhow, thanks for all the explanations. Eventually, I understand that
using bitmaps can be more efficient than a log if the bits are condensed.
Peter Xu Feb. 28, 2023, 7:31 p.m. UTC | #39
On Tue, Feb 28, 2023 at 05:21:20PM +0000, Nadav Amit wrote:
> 
> 
> > On Feb 28, 2023, at 7:55 AM, Peter Xu <peterx@redhat.com> wrote:
> > 
> > !! External Email
> > 
> > On Mon, Feb 27, 2023 at 11:09:12PM +0000, Nadav Amit wrote:
> >> 
> >> 
> >>> On Feb 27, 2023, at 1:18 PM, Peter Xu <peterx@redhat.com> wrote:
> >>> 
> >>> !! External Email
> >>> 
> >>> On Thu, Feb 23, 2023 at 05:11:11PM +0000, Nadav Amit wrote:
> >>>> From my experience with UFFD, proper ordering of events  is crucial, although it
> >>>> is not always done well. Therefore, we should aim for improvement, not
> >>>> regression. I believe that utilizing the pagemap-based mechanism for WP'ing
> >>>> might be a step in the wrong direction. I think that it would have been better
> >>>> to emit a 'UFFD_FEATURE_WP_ASYNC' WP-log (and ordered) with UFFD #PF and
> >>>> events. The 'UFFD_FEATURE_WP_ASYNC'-log may not need to wake waiters on the
> >>>> file descriptor unless the log is full.
> >>> 
> >>> Yes this is an interesting question to think about..
> >>> 
> >>> Keeping the data in the pgtable has one good thing that it doesn't need any
> >>> complexity on maintaining the log, and no possibility of "log full".
> >> 
> >> I understand your concern, but I think that eventually it might be simpler
> >> to maintain, since the logic of how to process the log is moved to userspace.
> >> 
> >> At the same time, handling inputs from pagemap and uffd handlers and sync’ing
> >> them would not be too easy for userspace.
> > 
> > I do not expect a common uffd-wp async user to provide a fault handler at
> > all.  In my imagination it's in most cases used standalone from other uffd
> > modes; it means all the faults will still be handled by the kernel.  Here
> > we only leverage the accuracy of userfaultfd comparing to soft-dirty, so
> > not really real "user"-faults.
> 
> If that is the only use-case, it might make sense. But I guess most users would
> most likely use some library (and not syscalls directly). So slightly
> complicating the API for better generality may be reasonable.
> 
> > 
> >> 
> >> But yes, allocation on the heap for userfaultfd_wait_queue-like entries would
> >> be needed, and there are some issues of ordering the events (I think all #PF
> >> and other events should be ordered regardless) and how not to traverse all
> >> async-userfaultfd_wait_queue’s (except those that block if the log is full)
> >> when a wakeup is needed.
> > 
> > Will there be an ordering requirement for an async mode?  Considering it
> > should be async to whatever else, I would think it's not a problem, but
> > maybe I missed something.
> 
> You may be right, but I am not sure. I am still not sure what use-cases are
> targeted in this patch-set. For CRIU checkpoint use-case (when the app is
> not running), I guess the current interface makes sense. But if there are
> use-cases in which this you do care about UFFD-events this can become an
> issue.
> 
> But even in some obvious use-cases, this might be the wrong interface for
> major performance issues. If we think about some incremental copying of
> modified pages (a-la pre-copy live-migration or to create point-in-time
> snapshots), it seems to me much more efficient for application to have a
> log than traversing all the page-tables.

IMHO snapshots may not need a log at all - it needs CoW before the write
happens.  Nor is the case for swapping with userfaults, IIUC.  IOW in those
cases people don't care which page got dirtied, but care on data not being
modified until the app allows it to.

But I get the point, and I agree collecting by scanning is slower.

> 
> 
> >> 
> >>> 
> >>> If there's possible "log full" then the next question is whether we should
> >>> let the worker wait the monitor if the monitor is not fast enough to
> >>> collect those data.  It adds some slight dependency on the two threads, I
> >>> think it can make the tracking harder or impossible in latency sensitive
> >>> workloads.
> >> 
> >> Again, I understand your concern. But this model that I propose is not new.
> >> It is used with PML (page-modification logging) and KVM, and IIRC there is
> >> a similar interface between KVM and QEMU to provide this information. There
> >> are endless other examples for similar producer-consumer mechanisms that
> >> might lead to stall in extreme cases.
> > 
> > Yes, I'm not against thinking of using similar structures here.  It's just
> > that it's definitely more complicated on the interface, at least we need
> > yet one more interface to setup the rings and define its interfaces.
> > 
> > Note that although Muhammud is defining another new interface here too for
> > pagemap, I don't think it's strictly needed for uffd-wp async mode.  One
> > can use uffd-wp async mode with PM_UFFD_WP which is with current pagemap
> > interface already.
> > 
> > So what Muhammud is proposing here are two things to me: (1) uffd-wp async,
> > plus (2) a new pagemap interface (which will closely work with (1) only if
> > we need atomicity on get-dirty and reprotect).
> > 
> > Defining new interface for uffd-wp async mode will be something extra, so
> > IMHO besides the heap allocation on the rings, we need to also justify
> > whether that is needed.  That's why I think it's fine to go with what
> > Muhammud proposed, because it's a minimum changeset at least for userfault
> > to support an async mode, and anything else can be done on top if necessary.
> > 
> > Going a bit back to the "lead to stall in extreme cases" above, just also
> > want to mention that the VM use case is slightly different - dirty tracking
> > is only heavily used during migration afaict, and it's a short period.  Not
> > a lot of people will complain performance degrades during that period
> > because that's just rare.  And, even without the ring the perf is really
> > bad during migration anyway... Especially when huge pages are used to back
> > the guest RAM.
> > 
> > Here it's slightly different to me: it's about tracking dirty pages during
> > any possible workload, and it can be monitored periodically and frequently.
> > So IMHO stricter than a VM use case where migration is the only period to
> > use it.
> 
> I still don’t get the use-cases. "monitored periodically and frequently” is
> not a use-case. And as I said before, actually, monitoring frequently is
> more performant with a log than with scanning all the page-tables.

Feel free to ignore this part if we're not taking about using a ring
structure.  My previous comment was mostly for that.  Bitmaps won't have
this issue.  Here I see a bitmap as one way to implement a log, where it's
recorded by one bit per page.  My comment was that we should be careful on
using rings.

Side note: actually kvm dirty ring is even trickier; see the soft-full
(kvm_dirty_ring.soft_limit) besides the hard-full event to make sure
hard-full won't really trigger (or we're prone to lose dirty bits).  I
don't think we'll have the same issue here so we can trigger hard-full, but
it's still unwanted to halt the threads being tracked for dirty pages.  I
don't know whether there'll be other side effects by the ring, though..

> 
> > 
> >> 
> >>> 
> >>> The other thing is we can also make the log "never gonna full" by making it
> >>> a bitmap covering any registered ranges, but I don't either know whether
> >>> it'll be worth it for the effort.
> >> 
> >> I do not see a benefit of half-log half-scan. It tries to take the
> >> data-structure of one format and combine it with another.
> > 
> > What I'm saying here is not half-log / half-scan, but use a single bitmap
> > to store what page is dirty, just like KVM_GET_DIRTY_LOG.  I think it
> > avoids any above "stall" issue.
> 
> Oh, I never went into the KVM details before - stupid me. If that’s what
> eventually was proven to work for KVM/QEMU, then it really sounds like
> the pagemap solution that Muhammad proposed.
> 
> But still not convoluting pagemap with userfaultfd (and especially
> uffd-wp) can be beneficial. Linus already threw some comments here and
> there about disliking uffd-wp, and I’m not sure adding uffd-wp specific
> stuff to pagemap would be welcomed.

Yes I also don't know..  As I mentioned I'm not super happy with the
interface either, but that's the simplest I can think of so far.

IOW, from an "userfaultfd-side reviewer" POV I'm fine if someone wants to
leverage the concepts of uffd-wp and its internals using a separate but
very light weighted patch just to impl async mode of uffd-wp.  But I'm
always open to any suggestions too.  It's just that when there're multiple
options and when we're not confident on either way, I normally prefer the
simplest and cleanest (even if less efficient).

> Anyhow, thanks for all the explanations. Eventually, I understand that
> using bitmaps can be more efficient than a log if the bits are condensed.

Note that I think what Muhammad (sorry, Muhammad! I think I spelled your
name wrongly before starting from some email..) proposed is not a bitmap,
but an array of ranges that can coalesce the result into very condensed
form.  Pros and cons.

Again, I can't comment much on that API, but since there're a bunch of
other developers looking at that and they're also potential future users,
I'll trust their judgement and just focus more on the other side of things.

Thanks,
Nadav Amit March 1, 2023, 1:59 a.m. UTC | #40
> On Feb 28, 2023, at 11:31 AM, Peter Xu <peterx@redhat.com> wrote:
> 
>> Anyhow, thanks for all the explanations. Eventually, I understand that
>> using bitmaps can be more efficient than a log if the bits are condensed.
> 
> Note that I think what Muhammad (sorry, Muhammad! I think I spelled your
> name wrongly before starting from some email..) proposed is not a bitmap,
> but an array of ranges that can coalesce the result into very condensed
> form.  Pros and cons.
> 
> Again, I can't comment much on that API, but since there're a bunch of
> other developers looking at that and they're also potential future users,
> I'll trust their judgement and just focus more on the other side of things.


Thanks Peter for your patience.

I would just note that I understood that Muhammad did not propose a condensed
bitmap, and that was a hint that handling a condensed bitmap (at least on x86)
can be done rather efficiently. I am not sure about other representations.

Thanks for your explanations again, Peter.
diff mbox series

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..c6bde19d63d9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@ 
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1135,6 +1136,22 @@  static inline void clear_soft_dirty(struct vm_area_struct *vma,
 }
 #endif
 
+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	if ((pte_present(pte) && pte_uffd_wp(pte)) ||
+	    (pte_swp_uffd_wp_any(pte)))
+		return true;
+	return false;
+}
+
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	if ((pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+	    (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd)))
+		return true;
+	return false;
+}
+
 #if defined(CONFIG_MEM_SOFT_DIRTY) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
 static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmdp)
@@ -1763,11 +1780,284 @@  static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+#define PAGEMAP_BITS_ALL		(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+					 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PAGEMAP_NON_WRITTEN_BITS	(PAGE_IS_FILE |	PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define IS_WP_ENGAGE_OP(a)		(a->flags & PAGEMAP_WP_ENGAGE)
+#define IS_GET_OP(a)			(a->vec)
+#define HAS_NO_SPACE(p)			(p->max_pages && (p->found_pages == p->max_pages))
+
+#define PAGEMAP_SCAN_BITMAP(wt, file, present, swap)	\
+	(wt | file << 1 | present << 2 | swap << 3)
+#define IS_WT_REQUIRED(a)				\
+	((a->required_mask & PAGE_IS_WRITTEN) ||	\
+	 (a->anyof_mask & PAGE_IS_WRITTEN))
+
+struct pagemap_scan_private {
+	struct page_region *vec;
+	struct page_region prev;
+	unsigned long vec_len, vec_index;
+	unsigned int max_pages, found_pages, flags;
+	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (IS_WT_REQUIRED(p) && !userfaultfd_wp(vma) && !userfaultfd_wp_async(vma))
+		return -EPERM;
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+	return 0;
+}
+
+static inline int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
+				      struct pagemap_scan_private *p, unsigned long addr,
+				      unsigned int len)
+{
+	unsigned long bitmap, cur = PAGEMAP_SCAN_BITMAP(wt, file, pres, swap);
+	bool cpy = true;
+	struct page_region *prev = &p->prev;
+
+	if (HAS_NO_SPACE(p))
+		return -ENOSPC;
+
+	if (p->max_pages && p->found_pages + len >= p->max_pages)
+		len = p->max_pages - p->found_pages;
+	if (!len)
+		return -EINVAL;
+
+	if (p->required_mask)
+		cpy = ((p->required_mask & cur) == p->required_mask);
+	if (cpy && p->anyof_mask)
+		cpy = (p->anyof_mask & cur);
+	if (cpy && p->excluded_mask)
+		cpy = !(p->excluded_mask & cur);
+	bitmap = cur & p->return_mask;
+	if (cpy && bitmap) {
+		if ((prev->len) && (prev->bitmap == bitmap) &&
+		    (prev->start + prev->len * PAGE_SIZE == addr)) {
+			prev->len += len;
+			p->found_pages += len;
+		} else if (p->vec_index < p->vec_len) {
+			if (prev->len) {
+				memcpy(&p->vec[p->vec_index], prev, sizeof(struct page_region));
+				p->vec_index++;
+			}
+			prev->start = addr;
+			prev->len = len;
+			prev->bitmap = bitmap;
+			p->found_pages += len;
+		} else {
+			return -ENOSPC;
+		}
+	}
+	return 0;
+}
+
+static inline int export_prev_to_out(struct pagemap_scan_private *p, struct page_region __user *vec,
+				     unsigned long *vec_index)
+{
+	struct page_region *prev = &p->prev;
+
+	if (prev->len) {
+		if (copy_to_user(&vec[*vec_index], prev, sizeof(struct page_region)))
+			return -EFAULT;
+		p->vec_index++;
+		(*vec_index)++;
+		prev->len = 0;
+	}
+	return 0;
+}
+
+static inline int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+					 unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = end;
+	spinlock_t *ptl;
+	int ret = 0;
+	pte_t *pte;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		bool pmd_wt;
+
+		pmd_wt = !is_pmd_uffd_wp(*pmd);
+		/*
+		 * Break huge page into small pages if operation needs to be performed is
+		 * on a portion of the huge page.
+		 */
+		if (pmd_wt && IS_WP_ENGAGE_OP(p) && (end - start < HPAGE_SIZE)) {
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
+		}
+		if (IS_GET_OP(p))
+			ret = pagemap_scan_output(pmd_wt, vma->vm_file, pmd_present(*pmd),
+						  is_swap_pmd(*pmd), p, start,
+						  (end - start)/PAGE_SIZE);
+		spin_unlock(ptl);
+		if (!ret) {
+			if (pmd_wt && IS_WP_ENGAGE_OP(p))
+				uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true);
+		}
+		return ret;
+	}
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd))
+		return 0;
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+	if (IS_GET_OP(p)) {
+		for (addr = start; addr < end; pte++, addr += PAGE_SIZE) {
+			ret = pagemap_scan_output(!is_pte_uffd_wp(*pte), vma->vm_file,
+						  pte_present(*pte), is_swap_pte(*pte), p, addr, 1);
+			if (ret)
+				break;
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	if ((!ret || ret == -ENOSPC) && IS_WP_ENGAGE_OP(p) && (addr - start))
+		uffd_wp_range(walk->mm, vma, start, addr - start, true);
+
+	cond_resched();
+	return ret;
+}
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end, int depth,
+				 struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int ret = 0;
+
+	if (vma)
+		ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
+					  (end - addr)/PAGE_SIZE);
+	return ret;
+}
+
+/* No hugetlb support is present. */
+static const struct mm_walk_ops pagemap_scan_ops = {
+	.test_walk = pagemap_scan_test_walk,
+	.pmd_entry = pagemap_scan_pmd_entry,
+	.pte_hole = pagemap_scan_pte_hole,
+};
+
+static long do_pagemap_cmd(struct mm_struct *mm, struct pagemap_scan_arg *arg)
+{
+	unsigned long empty_slots, vec_index = 0;
+	unsigned long __user start, end;
+	unsigned long __start, __end;
+	struct page_region __user *vec;
+	struct pagemap_scan_private p;
+	int ret = 0;
+
+	start = (unsigned long)untagged_addr(arg->start);
+	vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
+
+	/* Validate memory ranges */
+	if ((!IS_ALIGNED(start, PAGE_SIZE)) || (!access_ok((void __user *)start, arg->len)))
+		return -EINVAL;
+	if (IS_GET_OP(arg) && ((arg->vec_len == 0) ||
+	    (!access_ok((void __user *)vec, arg->vec_len * sizeof(struct page_region)))))
+		return -EINVAL;
+
+	/* Detect illegal flags and masks */
+	if ((arg->flags & ~PAGEMAP_WP_ENGAGE) || (arg->required_mask & ~PAGEMAP_BITS_ALL) ||
+	    (arg->anyof_mask & ~PAGEMAP_BITS_ALL) || (arg->excluded_mask & ~PAGEMAP_BITS_ALL) ||
+	    (arg->return_mask & ~PAGEMAP_BITS_ALL))
+		return -EINVAL;
+	if (IS_GET_OP(arg) && ((!arg->required_mask && !arg->anyof_mask && !arg->excluded_mask) ||
+				!arg->return_mask))
+		return -EINVAL;
+	/* The non-WT flags cannot be obtained if PAGEMAP_WP_ENGAGE is also specified. */
+	if (IS_WP_ENGAGE_OP(arg) && ((arg->required_mask & PAGEMAP_NON_WRITTEN_BITS) ||
+	    (arg->anyof_mask & PAGEMAP_NON_WRITTEN_BITS)))
+		return -EINVAL;
+
+	end = start + arg->len;
+	p.max_pages = arg->max_pages;
+	p.found_pages = 0;
+	p.flags = arg->flags;
+	p.required_mask = arg->required_mask;
+	p.anyof_mask = arg->anyof_mask;
+	p.excluded_mask = arg->excluded_mask;
+	p.return_mask = arg->return_mask;
+	p.prev.len = 0;
+	p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
+
+	if (IS_GET_OP(arg)) {
+		p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region), GFP_KERNEL);
+		if (!p.vec)
+			return -ENOMEM;
+	} else {
+		p.vec = NULL;
+	}
+	__start = __end = start;
+	while (!ret && __end < end) {
+		p.vec_index = 0;
+		empty_slots = arg->vec_len - vec_index;
+		if (p.vec_len > empty_slots)
+			p.vec_len = empty_slots;
+
+		__end = (__start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
+		if (__end > end)
+			__end = end;
+
+		mmap_read_lock(mm);
+		ret = walk_page_range(mm, __start, __end, &pagemap_scan_ops, &p);
+		mmap_read_unlock(mm);
+		if (!(!ret || ret == -ENOSPC))
+			goto free_data;
+
+		__start = __end;
+		if (IS_GET_OP(arg) && p.vec_index) {
+			if (copy_to_user(&vec[vec_index], p.vec,
+					 p.vec_index * sizeof(struct page_region))) {
+				ret = -EFAULT;
+				goto free_data;
+			}
+			vec_index += p.vec_index;
+		}
+	}
+	ret = export_prev_to_out(&p, vec, &vec_index);
+	if (!ret)
+		ret = vec_index;
+free_data:
+	if (IS_GET_OP(arg))
+		kfree(p.vec);
+
+	return ret;
+}
+
+static long pagemap_scan_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct pagemap_scan_arg __user *uarg = (struct pagemap_scan_arg __user *)arg;
+	struct mm_struct *mm = file->private_data;
+	struct pagemap_scan_arg argument;
+
+	if (cmd == PAGEMAP_SCAN) {
+		if (copy_from_user(&argument, uarg, sizeof(struct pagemap_scan_arg)))
+			return -EFAULT;
+		return do_pagemap_cmd(mm, &argument);
+	}
+	return -EINVAL;
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = pagemap_scan_ioctl,
+	.compat_ioctl	= pagemap_scan_ioctl,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..1ae9a8684b48 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,54 @@  typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pagemap_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pagemap_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region
+ * bitmap:	Bits sets for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pagemap_scan_arg - Pagemap ioctl argument
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @flags:		Flags for the IOCTL
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pagemap_scan_arg {
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u32 max_pages;
+	__u32 flags;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Special flags */
+#define PAGEMAP_WP_ENGAGE	(1 << 0)
+
 #endif /* _UAPI_LINUX_FS_H */