
[RESEND,v15,2/5] fs/proc/task_mmu: Implement IOCTL to get and optionally clear info about PTEs

Message ID 20230420060156.895881-3-usama.anjum@collabora.com (mailing list archive)
State Mainlined, archived
Series Implement IOCTL to get and optionally clear info about PTEs

Commit Message

Muhammad Usama Anjum April 20, 2023, 6:01 a.m. UTC
This IOCTL, PAGEMAP_SCAN, on the pagemap file can be used to get and/or
clear information about page table entries. The following operations are
supported by this ioctl:
- Get information about whether pages have been written to
  (PAGE_IS_WRITTEN), are file-mapped (PAGE_IS_FILE), present
  (PAGE_IS_PRESENT) or swapped (PAGE_IS_SWAPPED).
- Find pages which have been written to and write-protect them
  (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)

This IOCTL can be extended to get information about more PTE bits.

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
Changes in v15:
- Build fix:
  - Use generic tlb flush function in pagemap_scan_pmd_entry() instead of
    using x86 specific flush function in do_pagemap_scan()
  - Remove #ifdef from pagemap_scan_hugetlb_entry()
  - Use mm instead of undefined vma->vm_mm

Changes in v14:
- Fix build error caused by #ifdef added at last minute in some configs

Changes in v13:
- Review updates
- mmap_read_lock_killable() instead of mmap_read_lock()
- Replace uffd_wp_range() with helpers, which increases performance
  drastically for OP_WP operations by reducing the number of TLB
  flushes, etc.
- Add MMU_NOTIFY_PROTECTION_VMA notification for the memory range

Changes in v12:
- Add hugetlb support to cover all memory types
- Merge "userfaultfd: Define dummy uffd_wp_range()" with this patch
- Review updates to the code

Changes in v11:
- Find written pages in a better way
- Fix a corner case (thanks Paul)
- Improve the code/comments
- remove ENGAGE_WP + ! GET operation
- shorten the commit message in favour of moving documentation to
  pagemap.rst

Changes in v10:
- move changes in tools/include/uapi/linux/fs.h to separate patch
- update commit message

Change in v8:
- Correct is_pte_uffd_wp()
- Improve readability and error checks
- Remove some un-needed code

Changes in v7:
- Rebase on top of latest next
- Fix some corner cases
- Base soft-dirty on the uffd wp async
- Update the terminologies
- Optimize the memory usage inside the ioctl

Changes in v6:
- Rename variables and update comments
- Make IOCTL independent of soft_dirty config
- Change masks and bitmap type to _u64
- Improve code quality

Changes in v5:
- Remove tlb flushing even for clear operation

Changes in v4:
- Update the interface and implementation

Changes in v3:
- Tighten the user-kernel interface by using explicit types and add more
  error checking

Changes in v2:
- Convert the interface from syscall to ioctl
- Remove pidfd support as it doesn't make sense in ioctl

---
 fs/proc/task_mmu.c      | 481 ++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/fs.h |  53 +++++
 2 files changed, 534 insertions(+)

Comments

Muhammad Usama Anjum April 26, 2023, 7:06 a.m. UTC | #1
On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
> +/* Supported flags */
> +#define PM_SCAN_OP_GET	(1 << 0)
> +#define PM_SCAN_OP_WP	(1 << 1)
These are the only flag options available in the PAGEMAP_SCAN IOCTL.
PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can
be specified as needed, but it cannot be specified without
PM_SCAN_OP_GET. (Standalone WP was removed after you had asked me not to
duplicate functionality that can be achieved with UFFDIO_WRITEPROTECT.)

1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
vs
2) UFFDIO_WRITEPROTECT

After removing the use of uffd_wp_range() from the PAGEMAP_SCAN IOCTL, we
are getting really good performance, comparable to what we get when
depending on the SOFT_DIRTY flag in the PTE. But when we want to perform
write protection, PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than
UFFDIO_WRITEPROTECT, both performance- and behavior-wise.

I've got results from someone else suggesting that UFFDIO_WRITEPROTECT
somehow blocks page faults, which PAGEMAP_SCAN doesn't. I still need to
verify this, as I don't have tests comparing the two one-to-one.

What are your thoughts about it? Have you thought about making
UFFDIO_WRITEPROTECT perform better?

I'm sorry to bring up the word "performance" here. We actually need better
performance to emulate a Windows syscall; that is why we are adding this
functionality. So either we need to see what can be improved in
UFFDIO_WRITEPROTECT, or can I please add standalone PM_SCAN_OP_WP back to
pagemap_ioctl?

Thank you so much for the help.
Peter Xu April 26, 2023, 2:13 p.m. UTC | #2
Hi, Muhammad,

On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote:
> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
> > +/* Supported flags */
> > +#define PM_SCAN_OP_GET	(1 << 0)
> > +#define PM_SCAN_OP_WP	(1 << 1)
> We have only these flag options available in PAGEMAP_SCAN IOCTL.
> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can
> be specified as need. But PM_SCAN_OP_WP cannot be specified without
> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate
> functionality which can be achieved by UFFDIO_WRITEPROTECT.)
> 
> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
> vs
> 2) UFFDIO_WRITEPROTECT
> 
> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are
> getting really good performance which is comparable just like we are
> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp,
> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT
> performance and behavior wise.
> 
> I've got the results from someone else that UFFDIO_WRITEPROTECT block
> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this
> as I don't have tests comparing them one-to-one.
> 
> What are your thoughts about it? Have you thought about making
> UFFDIO_WRITEPROTECT perform better?
> 
> I'm sorry to mention the word "performance" here. Actually we want better
> performance to emulate Windows syscall. That is why we are adding this
> functionality. So either we need to see what can be improved in
> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in
> pagemap_ioctl?

I'm fine if you want to add it back if it works for you.  Though before
that, could you remind me why there can be a difference on performance?

Thanks,
Muhammad Usama Anjum April 27, 2023, 3:31 p.m. UTC | #3
Hi Peter,

Thank you for your reply.

On 4/26/23 7:13 PM, Peter Xu wrote:
> Hi, Muhammad,
> 
> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote:
>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
>>> +/* Supported flags */
>>> +#define PM_SCAN_OP_GET	(1 << 0)
>>> +#define PM_SCAN_OP_WP	(1 << 1)
>> We have only these flag options available in PAGEMAP_SCAN IOCTL.
>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can
>> be specified as need. But PM_SCAN_OP_WP cannot be specified without
>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate
>> functionality which can be achieved by UFFDIO_WRITEPROTECT.)
>>
>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
>> vs
>> 2) UFFDIO_WRITEPROTECT
>>
>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are
>> getting really good performance which is comparable just like we are
>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp,
>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT
>> performance and behavior wise.
>>
>> I've got the results from someone else that UFFDIO_WRITEPROTECT block
>> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this
>> as I don't have tests comparing them one-to-one.
>>
>> What are your thoughts about it? Have you thought about making
>> UFFDIO_WRITEPROTECT perform better?
>>
>> I'm sorry to mention the word "performance" here. Actually we want better
>> performance to emulate Windows syscall. That is why we are adding this
>> functionality. So either we need to see what can be improved in
>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in
>> pagemap_ioctl?
> 
> I'm fine if you want to add it back if it works for you.  Though before
> that, could you remind me why there can be a difference on performance?
The only difference I can see is that UFFDIO_WRITEPROTECT acquires the mm
read lock once for the entire duration, while the PAGEMAP_SCAN IOCTL
acquires and releases it for each PMD to keep the intermediate buffer
short.

That alone probably won't convince you, so I'll write a test to measure
the exact difference and show you the numbers.

> 
> Thanks,
>
Muhammad Usama Anjum May 22, 2023, 10:24 a.m. UTC | #4
On 4/26/23 7:13 PM, Peter Xu wrote:
> Hi, Muhammad,
> 
> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote:
>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
>>> +/* Supported flags */
>>> +#define PM_SCAN_OP_GET	(1 << 0)
>>> +#define PM_SCAN_OP_WP	(1 << 1)
>> We have only these flag options available in PAGEMAP_SCAN IOCTL.
>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can
>> be specified as need. But PM_SCAN_OP_WP cannot be specified without
>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate
>> functionality which can be achieved by UFFDIO_WRITEPROTECT.)
>>
>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
>> vs
>> 2) UFFDIO_WRITEPROTECT
>>
>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are
>> getting really good performance which is comparable just like we are
>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp,
>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT
>> performance and behavior wise.
>>
>> I've got the results from someone else that UFFDIO_WRITEPROTECT block
>> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this
>> as I don't have tests comparing them one-to-one.
>>
>> What are your thoughts about it? Have you thought about making
>> UFFDIO_WRITEPROTECT perform better?
>>
>> I'm sorry to mention the word "performance" here. Actually we want better
>> performance to emulate Windows syscall. That is why we are adding this
>> functionality. So either we need to see what can be improved in
>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in
>> pagemap_ioctl?
> 
> I'm fine if you want to add it back if it works for you.  Though before
> that, could you remind me why there can be a difference on performance?
I've looked at the code again and I think I've found something. Let's look
at the exact performance numbers:

I've run two different tests. In the first test, UFFDIO_WRITEPROTECT is
used to engage WP; in the second, PM_SCAN_OP_WP is used. I've measured the
average write time to the same memory that is being write-protected, and
the total execution time of these APIs:

**avg write time:**
| No of pages            | 2000 | 8192 | 100000 | 500000 |
|------------------------|------|------|--------|--------|
| UFFDIO_WRITEPROTECT    | 2200 | 2300 | 4100   | 4200   |
| PM_SCAN_OP_WP          | 2000 | 2300 | 2500   | 2800   |

**Execution time measured in rdtsc:**
| No of pages            | 2000 | 8192  | 100000 | 500000 |
|------------------------|------|-------|--------|--------|
| UFFDIO_WRITEPROTECT    | 3200 | 14000 | 59000  | 58000  |
| PM_SCAN_OP_WP          | 1900 | 7000  | 38000  | 40000  |

The average write time with UFFDIO_WRITEPROTECT is 1.3 times slower, and
its execution time is 1.5 times slower. So UFFDIO_WRITEPROTECT both slows
down writes to the pages and takes longer to execute.

This shows that PM_SCAN_OP_WP performs better than UFFDIO_WRITEPROTECT.
Although PM_SCAN_OP_WP and UFFDIO_WRITEPROTECT are implemented
differently, we should have seen no difference in performance, yet the
difference here is quite large. PM_SCAN_OP_WP takes the mm read lock and
uses walk_page_range(), which finds the VMAs for the address ranges and
walks them, with pagemap_scan_pmd_entry() handling most of the work,
including TLB flushing. UFFDIO_WRITEPROTECT also takes the mm lock and
iterates through all the page directory levels until a PTE is found, then
updates the flags there and flushes the TLB for every PTE.

My next deduction is that we get worse performance because we flush the
TLB one page at a time in the UFFDIO_WRITEPROTECT case, while we flush the
TLB for (mostly) 512 pages at a time in the PM_SCAN_OP_WP case. I've just
verified this by adding some logs to change_pte_range() and
pagemap_scan_pmd_entry(); the logs are attached. I allocated 1000 pages of
memory and write-protected them with UFFDIO_WRITEPROTECT and with
PM_SCAN_OP_WP. The logs show that UFFDIO_WRITEPROTECT flushed the TLB 1000
times, one page at a time, while PM_SCAN_OP_WP flushed it only 3 times, in
larger batches. I've learned from past experience that TLB flushes are
very expensive. Is this perhaps what we need to improve if we don't want
to add PM_SCAN_OP_WP?

UFFDIO_WRITEPROTECT uses change_pte_range(), which is a very generic
function, and I'm not sure we can simply skip the TLB flushes there when
uffd_wp is true. We could try to flush somewhere else and hopefully do
only one flush if possible, but it won't be so straightforward to move
away from the generic function. Thoughts?


> Thanks,
>
Muhammad Usama Anjum May 22, 2023, 11:26 a.m. UTC | #5
On 5/22/23 3:24 PM, Muhammad Usama Anjum wrote:
> On 4/26/23 7:13 PM, Peter Xu wrote:
>> Hi, Muhammad,
>>
>> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote:
>>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
>>>> +/* Supported flags */
>>>> +#define PM_SCAN_OP_GET	(1 << 0)
>>>> +#define PM_SCAN_OP_WP	(1 << 1)
>>> We have only these flag options available in PAGEMAP_SCAN IOCTL.
>>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can
>>> be specified as need. But PM_SCAN_OP_WP cannot be specified without
>>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate
>>> functionality which can be achieved by UFFDIO_WRITEPROTECT.)
>>>
>>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
>>> vs
>>> 2) UFFDIO_WRITEPROTECT
>>>
>>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are
>>> getting really good performance which is comparable just like we are
>>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp,
>>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT
>>> performance and behavior wise.
>>>
>>> I've got the results from someone else that UFFDIO_WRITEPROTECT block
>>> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this
>>> as I don't have tests comparing them one-to-one.
>>>
>>> What are your thoughts about it? Have you thought about making
>>> UFFDIO_WRITEPROTECT perform better?
>>>
>>> I'm sorry to mention the word "performance" here. Actually we want better
>>> performance to emulate Windows syscall. That is why we are adding this
>>> functionality. So either we need to see what can be improved in
>>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in
>>> pagemap_ioctl?
>>
>> I'm fine if you want to add it back if it works for you.  Though before
>> that, could you remind me why there can be a difference on performance?
> I've looked at the code again and I think I've found something. Lets look
> at exact performance numbers:
> 
> I've run 2 different tests. In first test UFFDIO_WRITEPROTECT is being used
> for engaging WP. In second test PM_SCAN_OP_WP is being used. I've measured
> the average write time to the same memory which is being WP-ed and total
> time of execution of these APIs:
> 
> **avg write time:**
> | No of pages            | 2000 | 8192 | 100000 | 500000 |
> |------------------------|------|------|--------|--------|
> | UFFDIO_WRITEPROTECT    | 2200 | 2300 | 4100   | 4200   |
> | PM_SCAN_OP_WP          | 2000 | 2300 | 2500   | 2800   |
> 
> **Execution time measured in rdtsc:**
> | No of pages            | 2000 | 8192  | 100000 | 500000 |
> |------------------------|------|-------|--------|--------|
> | UFFDIO_WRITEPROTECT    | 3200 | 14000 | 59000  | 58000  |
> | PM_SCAN_OP_WP          | 1900 | 7000  | 38000  | 40000  |
> 
> Avg write time for UFFDIO_WRITEPROTECT is 1.3 times slow. The execution
> time is 1.5 times slower in the case of UFFDIO_WRITEPROTECT. So
> UFFDIO_WRITEPROTECT is making writes slower to the pages and execution time
> is also slower.
> 
> This proves that PM_SCAN_OP_WP is better than UFFDIO_WRITEPROTECT. Although
> PM_SCAN_OP_WP and UFFDIO_WRITEPROTECT have been implemented differently. We
> should have seen no difference in performance. But we have quite a lot of
> difference in performance here. PM_SCAN_OP_WP takes read mm lock, uses
> walk_page_range() to walk over pages which finds VMAs from address ranges
> to walk over them and pagemap_scan_pmd_entry() is handling most of the work
> including tlb flushing. UFFDIO_WRITEPROTECT is also taking the mm lock and
> iterating from all the different page directories until a pte is found and
> then flags are updated there and tlb is flushed for every pte.
> 
> My next deduction would be that we are getting worse performance as we are
> flushing tlb for one page at a time in case of UFFDIO_WRITEPROTECT. While
> we flush tlb for 512 pages (moslty) at a time in case of PM_SCAN_OP_WP.
> I've just verified this by adding some logs to the change_pte_range() and
> pagemap_scan_pmd_entry(). Logs are attached. I've allocated memory of 1000
> pages and write-protected it with UFFDIO_WRITEPROTECT and PM_SCAN_OP_WP.
> The logs show that UFFDIO_WRITEPROTECT has flushed tlb 1000 times of size 1
> page each time. While PM_SCAN_OP_WP has flushed only 3 times of bigger
> sizes. I've learned over my last experience that tlb flush is very
> expensive. Probably this is what we need to improve if we don't want to add
> PM_SCAN_OP_WP?
> 
> The UFFDIO_WRITEPROTECT uses change_pte_range() which is very generic
> function and I'm not sure if can try to not do tlb flushes if uffd_wp is
> true. We can try to do flush somewhere else and hopefully we should do only
> one flush if possible. It will not be so straight forward to move away from
> generic fundtion. Thoughts?
I've just tested this theory by skipping the per-PTE flushes and doing a
single flush over the entire range in uffd_wp_range(). It didn't improve
the situation either, so I was wrong that the TLB flushes were the cause.


> 
> 
>> Thanks,
>>
>
Peter Xu May 23, 2023, 7:43 p.m. UTC | #6
Hi, Muhammad,

On Mon, May 22, 2023 at 04:26:07PM +0500, Muhammad Usama Anjum wrote:
> On 5/22/23 3:24 PM, Muhammad Usama Anjum wrote:
> > On 4/26/23 7:13 PM, Peter Xu wrote:
> >> Hi, Muhammad,
> >>
> >> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote:
> >>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
> >>>> +/* Supported flags */
> >>>> +#define PM_SCAN_OP_GET	(1 << 0)
> >>>> +#define PM_SCAN_OP_WP	(1 << 1)
> >>> We have only these flag options available in PAGEMAP_SCAN IOCTL.
> >>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can
> >>> be specified as need. But PM_SCAN_OP_WP cannot be specified without
> >>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate
> >>> functionality which can be achieved by UFFDIO_WRITEPROTECT.)
> >>>
> >>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
> >>> vs
> >>> 2) UFFDIO_WRITEPROTECT
> >>>
> >>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are
> >>> getting really good performance which is comparable just like we are
> >>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp,
> >>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT
> >>> performance and behavior wise.
> >>>
> >>> I've got the results from someone else that UFFDIO_WRITEPROTECT block
> >>> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this
> >>> as I don't have tests comparing them one-to-one.
> >>>
> >>> What are your thoughts about it? Have you thought about making
> >>> UFFDIO_WRITEPROTECT perform better?
> >>>
> >>> I'm sorry to mention the word "performance" here. Actually we want better
> >>> performance to emulate Windows syscall. That is why we are adding this
> >>> functionality. So either we need to see what can be improved in
> >>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in
> >>> pagemap_ioctl?
> >>
> >> I'm fine if you want to add it back if it works for you.  Though before
> >> that, could you remind me why there can be a difference on performance?
> > I've looked at the code again and I think I've found something. Lets look
> > at exact performance numbers:
> > 
> > I've run 2 different tests. In first test UFFDIO_WRITEPROTECT is being used
> > for engaging WP. In second test PM_SCAN_OP_WP is being used. I've measured
> > the average write time to the same memory which is being WP-ed and total
> > time of execution of these APIs:

What are the steps of the test?  Is it as simple as "writeprotect",
"unprotect", then writing all pages in a single thread?

Is UFFDIO_WRITEPROTECT sent in one range covering all pages?

Maybe you can attach the test program here too.

> > 
> > **avg write time:**
> > | No of pages            | 2000 | 8192 | 100000 | 500000 |
> > |------------------------|------|------|--------|--------|
> > | UFFDIO_WRITEPROTECT    | 2200 | 2300 | 4100   | 4200   |
> > | PM_SCAN_OP_WP          | 2000 | 2300 | 2500   | 2800   |
> > 
> > **Execution time measured in rdtsc:**
> > | No of pages            | 2000 | 8192  | 100000 | 500000 |
> > |------------------------|------|-------|--------|--------|
> > | UFFDIO_WRITEPROTECT    | 3200 | 14000 | 59000  | 58000  |
> > | PM_SCAN_OP_WP          | 1900 | 7000  | 38000  | 40000  |
> > 
> > Avg write time for UFFDIO_WRITEPROTECT is 1.3 times slow. The execution
> > time is 1.5 times slower in the case of UFFDIO_WRITEPROTECT. So
> > UFFDIO_WRITEPROTECT is making writes slower to the pages and execution time
> > is also slower.
> > 
> > This proves that PM_SCAN_OP_WP is better than UFFDIO_WRITEPROTECT. Although
> > PM_SCAN_OP_WP and UFFDIO_WRITEPROTECT have been implemented differently. We
> > should have seen no difference in performance. But we have quite a lot of
> > difference in performance here. PM_SCAN_OP_WP takes read mm lock, uses
> > walk_page_range() to walk over pages which finds VMAs from address ranges
> > to walk over them and pagemap_scan_pmd_entry() is handling most of the work
> > including tlb flushing. UFFDIO_WRITEPROTECT is also taking the mm lock and
> > iterating from all the different page directories until a pte is found and
> > then flags are updated there and tlb is flushed for every pte.
> > 
> > My next deduction would be that we are getting worse performance as we are
> > flushing tlb for one page at a time in case of UFFDIO_WRITEPROTECT. While
> > we flush tlb for 512 pages (moslty) at a time in case of PM_SCAN_OP_WP.
> > I've just verified this by adding some logs to the change_pte_range() and
> > pagemap_scan_pmd_entry(). Logs are attached. I've allocated memory of 1000
> > pages and write-protected it with UFFDIO_WRITEPROTECT and PM_SCAN_OP_WP.
> > The logs show that UFFDIO_WRITEPROTECT has flushed tlb 1000 times of size 1
> > page each time. While PM_SCAN_OP_WP has flushed only 3 times of bigger
> > sizes. I've learned over my last experience that tlb flush is very
> > expensive. Probably this is what we need to improve if we don't want to add
> > PM_SCAN_OP_WP?
> > 
> > The UFFDIO_WRITEPROTECT uses change_pte_range() which is very generic
> > function and I'm not sure if can try to not do tlb flushes if uffd_wp is
> > true. We can try to do flush somewhere else and hopefully we should do only
> > one flush if possible. It will not be so straight forward to move away from
> > generic fundtion. Thoughts?
> I've just tested this theory of not doing per pte flushes and only did one
> flush on entire range in uffd_wp_range(). But it didn't improve the
> situation either. I was wrong that tlb flushes may be the cause.

I had a feeling that you were trapping tlb_flush_pte_range(), which
doesn't actually send any TLB flushes but only updates the mmu_gather
object with the address range for future invalidations.

That's probably why it didn't show any effect when you commented it out.

I am not sure whether the wr-protect path difference can be caused by the
arch hooks, namely arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode().

On x86 I saw that it's actually hooked onto some PV calls.  I had a feeling
that this is for optimization only, but maybe it's still a good idea you
also take that into your new code:

static inline void arch_enter_lazy_mmu_mode(void)
{
	PVOP_VCALL0(mmu.lazy_mode.enter);
}

The other thing is that I think you're flushing the TLB outside the
pgtable lock in your new code.  IIUC that's racy, see:

commit 6ce64428d62026a10cb5d80138ff2f90cc21d367
Author: Nadav Amit <namit@vmware.com>
Date:   Fri Mar 12 21:08:17 2021 -0800

    mm/userfaultfd: fix memory corruption due to writeprotect

So you may want to put it at least into the pgtable lock critical section,
or IIUC you can also do inc_tlb_flush_pending() then
dec_tlb_flush_pending(), just like __tlb_gather_mmu(), to make sure
do_wp_page() properly flushes the page when it unluckily hits one of these
pages.

That's also the spot (the flush_tlb_page() in 6ce64428d) that made me
wonder whether it caused the slowness in writing to those pages.  But it
really depends on your test program; e.g., if it's single-threaded I don't
think it'll trigger, because by the time of the write
mm_tlb_flush_pending() should already return 0, so the TLB flush should
logically not be needed.  If you want, you can double-check that.

So in short, I had a feeling that the new PM_SCAN_OP_WP is faster just
because it misses something here and there - which means even if it's
faster, it may also be prone to race conditions etc., so we'd better
figure it out.

Thanks,
Muhammad Usama Anjum May 24, 2023, 11:26 a.m. UTC | #7
On 5/24/23 12:43 AM, Peter Xu wrote:
> Hi, Muhammad,
> 
> On Mon, May 22, 2023 at 04:26:07PM +0500, Muhammad Usama Anjum wrote:
>> On 5/22/23 3:24 PM, Muhammad Usama Anjum wrote:
>>> On 4/26/23 7:13 PM, Peter Xu wrote:
>>>> Hi, Muhammad,
>>>>
>>>> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote:
>>>>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
>>>>>> +/* Supported flags */
>>>>>> +#define PM_SCAN_OP_GET	(1 << 0)
>>>>>> +#define PM_SCAN_OP_WP	(1 << 1)
>>>>> We have only these flag options available in PAGEMAP_SCAN IOCTL.
>>>>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can
>>>>> be specified as need. But PM_SCAN_OP_WP cannot be specified without
>>>>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate
>>>>> functionality which can be achieved by UFFDIO_WRITEPROTECT.)
>>>>>
>>>>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
>>>>> vs
>>>>> 2) UFFDIO_WRITEPROTECT
>>>>>
>>>>> After removing the usage of uffd_wp_range() from PAGEMAP_SCAN IOCTL, we are
>>>>> getting really good performance which is comparable just like we are
>>>>> depending on SOFT_DIRTY flags in the PTE. But when we want to perform wp,
>>>>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than UFFDIO_WRITEPROTECT
>>>>> performance and behavior wise.
>>>>>
>>>>> I've got the results from someone else that UFFDIO_WRITEPROTECT block
>>>>> pagefaults somehow which PAGEMAP_IOCTL doesn't. I still need to verify this
>>>>> as I don't have tests comparing them one-to-one.
>>>>>
>>>>> What are your thoughts about it? Have you thought about making
>>>>> UFFDIO_WRITEPROTECT perform better?
>>>>>
>>>>> I'm sorry to mention the word "performance" here. Actually we want better
>>>>> performance to emulate Windows syscall. That is why we are adding this
>>>>> functionality. So either we need to see what can be improved in
>>>>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in
>>>>> pagemap_ioctl?
>>>>
>>>> I'm fine if you want to add it back if it works for you.  Though before
>>>> that, could you remind me why there can be a difference on performance?
>>> I've looked at the code again and I think I've found something. Lets look
>>> at exact performance numbers:
>>>
>>> I've run 2 different tests. In first test UFFDIO_WRITEPROTECT is being used
>>> for engaging WP. In second test PM_SCAN_OP_WP is being used. I've measured
>>> the average write time to the same memory which is being WP-ed and total
>>> time of execution of these APIs:
> 
> What is the steps of the test?  Is it as simple as "writeprotect",
> "unprotect", then write all pages in a single thread?
> 
> Is UFFDIO_WRITEPROTECT sent in one range covering all pages?
> 
> Maybe you can attach the test program here too.

I hadn't attached the test earlier as I thought you wouldn't be interested
in running it. I've attached it now. The test has multiple threads: one
thread gets the status of the flags and resets them, while the other
threads write to that memory. In main(), we call the pagemap_scan ioctl to
get the status of the flags and reset the memory area, while the memory is
written to in N threads.

I usually run the test as follows, where the memory area is 100000 pages:
./win2_linux 8 100000 1 1 0

I'm running the tests on real hardware and the results are pretty
consistent. I'm also testing only on x86_64. PM_SCAN_OP_WP wins every time
compared to UFFDIO_WRITEPROTECT.

The standalone PM_SCAN_OP_WP op doesn't work in v15, so please find the
updated WIP code here:
https://gitlab.collabora.com/usama.anjum/linux-mainline/-/commits/memwatchv16/

> 
>>>
>>> **avg write time:**
>>> | No of pages            | 2000 | 8192 | 100000 | 500000 |
>>> |------------------------|------|------|--------|--------|
>>> | UFFDIO_WRITEPROTECT    | 2200 | 2300 | 4100   | 4200   |
>>> | PM_SCAN_OP_WP          | 2000 | 2300 | 2500   | 2800   |
>>>
>>> **Execution time measured in rdtsc:**
>>> | No of pages            | 2000 | 8192  | 100000 | 500000 |
>>> |------------------------|------|-------|--------|--------|
>>> | UFFDIO_WRITEPROTECT    | 3200 | 14000 | 59000  | 58000  |
>>> | PM_SCAN_OP_WP          | 1900 | 7000  | 38000  | 40000  |
>>>
>>> Avg write time for UFFDIO_WRITEPROTECT is 1.3 times slow. The execution
>>> time is 1.5 times slower in the case of UFFDIO_WRITEPROTECT. So
>>> UFFDIO_WRITEPROTECT is making writes slower to the pages and execution time
>>> is also slower.
>>>
>>> This proves that PM_SCAN_OP_WP is better than UFFDIO_WRITEPROTECT. Although
>>> PM_SCAN_OP_WP and UFFDIO_WRITEPROTECT have been implemented differently. We
>>> should have seen no difference in performance. But we have quite a lot of
>>> difference in performance here. PM_SCAN_OP_WP takes read mm lock, uses
>>> walk_page_range() to walk over pages which finds VMAs from address ranges
>>> to walk over them and pagemap_scan_pmd_entry() is handling most of the work
>>> including tlb flushing. UFFDIO_WRITEPROTECT is also taking the mm lock and
>>> iterating from all the different page directories until a pte is found and
>>> then flags are updated there and tlb is flushed for every pte.
>>>
>>> My next deduction would be that we are getting worse performance because
>>> we flush the tlb one page at a time with UFFDIO_WRITEPROTECT, while we
>>> flush the tlb for (mostly) 512 pages at a time with PM_SCAN_OP_WP. I've
>>> verified this by adding some logs to change_pte_range() and
>>> pagemap_scan_pmd_entry(); the logs are attached. I allocated 1000 pages
>>> and write-protected them with UFFDIO_WRITEPROTECT and PM_SCAN_OP_WP. The
>>> logs show that UFFDIO_WRITEPROTECT flushed the tlb 1000 times, one page
>>> each time, while PM_SCAN_OP_WP flushed only 3 times, with larger sizes.
>>> I've learned from past experience that tlb flushes are very expensive.
>>> Probably this is what we need to improve if we don't want to add
>>> PM_SCAN_OP_WP?
>>>
>>> UFFDIO_WRITEPROTECT uses change_pte_range(), which is a very generic
>>> function, and I'm not sure we can skip the tlb flushes there when uffd_wp
>>> is true. We could try to flush somewhere else and hopefully do only one
>>> flush if possible. It will not be so straightforward to move away from the
>>> generic function. Thoughts?
>> I've just tested this theory by not doing per-pte flushes and doing only
>> one flush on the entire range in uffd_wp_range(). But it didn't improve the
>> situation either. I was wrong that tlb flushes may be the cause.
> 
> I had a feeling that you were trapping tlb_flush_pte_range(), which is
> actually not sending any TLB flushes but updating the mmu_gather object
> with the addr range for future invalidations.
> 
> That's probably why it didn't show an effect when you commented it out.
Yeah, probably.

> 
> I am not sure whether the wr-protect path difference can be caused by the
> arch hooks, namely arch_enter_lazy_mmu_mode() / arch_leave_lazy_mmu_mode().
> 
> On x86 I saw that it's actually hooked onto some PV calls.  I had a feeling
> that this is for optimization only, but maybe it's still a good idea you
> also take that into your new code:
> 
> static inline void arch_enter_lazy_mmu_mode(void)
> {
> 	PVOP_VCALL0(mmu.lazy_mode.enter);
> }
I've just looked into it. It isn't making any difference. But I think I
should include it in the code. It must be helpful for hypervisors etc.

> 
> The other thing is I think you're flushing tlb outside pgtable lock in your
> new code.  IIUC that's racy, see:
> 
> commit 6ce64428d62026a10cb5d80138ff2f90cc21d367
> Author: Nadav Amit <namit@vmware.com>
> Date:   Fri Mar 12 21:08:17 2021 -0800
> 
>     mm/userfaultfd: fix memory corruption due to writeprotect
> 
> So you may want to put it at least into pgtable lock critical section, or
> IIUC you can also do inc_tlb_flush_pending() then dec_tlb_flush_pending()
> just like __tlb_gather_mmu(), to make sure do_wp_page() will properly flush
> the page when unluckily hit some of the page.
Good point. I'll release the page table lock after tlb flushing. I've just
added it to the next WIP v16.

> 
> That's also the spot (the flush_tlb_page() in 6ce64428d) that made me think
> on whether it caused the slowness on writting to those pages.  But it
> really depends on your test program, e.g. if it's a single threaded I don't
> think it'll trigger because when writting mm_tlb_flush_pending() should
> start to return 0 already, so the tlb should logically not be needed.  If
> you want maybe you can double check that.
> 
> So in short, I had a feeling that the new PM_SCAN_OP_WP just misses
> something here and there so it's faster - it means even if it's faster it
> may also be prone to race conditions etc so we'd better figure it out..
The test program is multi-threaded. The performance numbers cannot be
reproduced with a single-threaded application.

> 
> Thanks,
>
Peter Xu May 24, 2023, 1:55 p.m. UTC | #8
On Wed, May 24, 2023 at 04:26:33PM +0500, Muhammad Usama Anjum wrote:
> On 5/24/23 12:43 AM, Peter Xu wrote:
> > Hi, Muhammad,
> > 
> > On Mon, May 22, 2023 at 04:26:07PM +0500, Muhammad Usama Anjum wrote:
> >> On 5/22/23 3:24 PM, Muhammad Usama Anjum wrote:
> >>> On 4/26/23 7:13 PM, Peter Xu wrote:
> >>>> Hi, Muhammad,
> >>>>
> >>>> On Wed, Apr 26, 2023 at 12:06:23PM +0500, Muhammad Usama Anjum wrote:
> >>>>> On 4/20/23 11:01 AM, Muhammad Usama Anjum wrote:
> >>>>>> +/* Supported flags */
> >>>>>> +#define PM_SCAN_OP_GET	(1 << 0)
> >>>>>> +#define PM_SCAN_OP_WP	(1 << 1)
> >>>>> We have only these flag options available in PAGEMAP_SCAN IOCTL.
> >>>>> PM_SCAN_OP_GET must always be specified for this IOCTL. PM_SCAN_OP_WP can
> >>>>> be specified as needed. But PM_SCAN_OP_WP cannot be specified without
> >>>>> PM_SCAN_OP_GET. (This was removed after you had asked me to not duplicate
> >>>>> functionality which can be achieved by UFFDIO_WRITEPROTECT.)
> >>>>>
> >>>>> 1) PM_SCAN_OP_GET | PM_SCAN_OP_WP
> >>>>> vs
> >>>>> 2) UFFDIO_WRITEPROTECT
> >>>>>
> >>>>> After removing the usage of uffd_wp_range() from the PAGEMAP_SCAN
> >>>>> IOCTL, we are getting really good performance, comparable to depending
> >>>>> on the SOFT_DIRTY flag in the PTE. But when we want to perform wp,
> >>>>> PM_SCAN_OP_GET | PM_SCAN_OP_WP is more desirable than
> >>>>> UFFDIO_WRITEPROTECT, performance- and behavior-wise.
> >>>>>
> >>>>> I've got results from someone else showing that UFFDIO_WRITEPROTECT
> >>>>> somehow blocks page faults, which PAGEMAP_IOCTL doesn't. I still need
> >>>>> to verify this as I don't have tests comparing them one-to-one.
> >>>>>
> >>>>> What are your thoughts about it? Have you thought about making
> >>>>> UFFDIO_WRITEPROTECT perform better?
> >>>>>
> >>>>> I'm sorry to mention the word "performance" here. Actually we want better
> >>>>> performance to emulate a Windows syscall. That is why we are adding this
> >>>>> functionality. So either we need to see what can be improved in
> >>>>> UFFDIO_WRITEPROTECT or can I please add only PM_SCAN_OP_WP back in
> >>>>> pagemap_ioctl?
> >>>>
> >>>> I'm fine if you want to add it back if it works for you.  Though before
> >>>> that, could you remind me why there can be a difference on performance?
> >>> I've looked at the code again and I think I've found something. Let's
> >>> look at the exact performance numbers:
> >>>
> >>> I've run 2 different tests. In the first test, UFFDIO_WRITEPROTECT is
> >>> used to engage WP. In the second test, PM_SCAN_OP_WP is used. I've
> >>> measured the average write time to the same memory which is being WP-ed
> >>> and the total execution time of these APIs:
> > 
> > What are the steps of the test?  Is it as simple as "writeprotect",
> > "unprotect", then write all pages in a single thread?
> > 
> > Is UFFDIO_WRITEPROTECT sent in one range covering all pages?
> > 
> > Maybe you can attach the test program here too.
> 
> I hadn't attached the test earlier as I thought you wouldn't be interested
> in running it. I've attached it now. The test has multiple

Thanks.  No plan to run it; I just want to make sure I understand why there
is such a difference.

> threads: one thread tries to get the status of the flags and reset them,
> while other threads write to that memory. In main(), we call the
> pagemap_scan ioctl to get the status of the flags and reset the memory
> area, while the memory is written in N threads.
> 
> I usually run the test as follows, where the memory area is 100000 pages:
> ./win2_linux 8 100000 1 1 0
> 
> I'm running tests on real hardware. The results are pretty consistent. I'm
> also testing only on x86_64. PM_SCAN_OP_WP wins every time compared to
> UFFDIO_WRITEPROTECT.

If it's a multi-threaded test, especially when the ioctl runs together with
the writers, then I'd assume it's caused by the writers frequently needing
to flush the tlb (when they write during UFFDIO_WRITEPROTECT); the flush
target could potentially also include the core running the main thread,
which is also trying to reprotect, because they run on the same mm.

This makes me think that your current test case probably is the worst case
of Nadav's patch 6ce64428d6 because (1) the UFFDIO_WRITEPROTECT covers a
super large range, and (2) there're a _lot_ of concurrent writers during
the ioctl, so all of them will need to trigger a tlb flush, and that tlb
flush will further slow down the ioctl sender.

That said, having minimal tlb flushes during ioctl(UFFDIO_WRITEPROTECT) is
sometimes the optimal case, so maybe it makes sense somewhere else where
there aren't that many concurrent writers. I'll need to rethink all of this
a bit to find out whether we can have a good way for both..

For now, if your workload is mostly exactly like your test case, maybe you
can have your pagemap version of the WP-only op there, making sure the tlb
flush is within the pgtable lock critical section (so you should be safe
even without Nadav's patch).  If so, I'd appreciate it if you could add a
comment somewhere about this difference between using pagemap WP-only and
ioctl(UFFDIO_WRITEPROTECT).  In short, functional-wise they should be the
same, with trivial performance differences as TBD (maybe one day we can
have a good approach for all and make them aligned again, but maybe that
also doesn't need to block your work).
Muhammad Usama Anjum May 24, 2023, 2:16 p.m. UTC | #9
On 5/24/23 6:55 PM, Peter Xu wrote:
...
>>> What is the steps of the test?  Is it as simple as "writeprotect",
>>> "unprotect", then write all pages in a single thread?
>>>
>>> Is UFFDIO_WRITEPROTECT sent in one range covering all pages?
>>>
>>> Maybe you can attach the test program here too.
>>
>> I'd not attached the test earlier as I thought that you wouldn't be
>> interested in running the test. I've attached it now. The test has multiple
> 
> Thanks.  No plan to run it, just to make sure I understand why such a
> difference.
> 
>> threads where one thread tries to get status of flags and reset them, while
>> other threads write to that memory. In main(), we call the pagemap_scan
>> ioctl to get status of flags and reset the memory area as well. While in N
>> threads, the memory is written.
>>
>> I usually run the test by following where memory area is of 100000 * pages:
>> ./win2_linux 8 100000 1 1 0
>>
>> I'm running tests on real hardware. The results are pretty consistent. I'm
>> also testing only on x86_64. PM_SCAN_OP_WP wins every time as compared to
>> UFFDIO_WRITEPROTECT.
> 
> If it's multi-threaded test especially when the ioctl runs together with
> the writers, then I'd assume it's caused by writers frequently need to
> flush tlb (when writes during UFFDIO_WRITEPROTECT), the flush target could
> potentially also include the core running the main thread who is also
> trying to reprotect because they run on the same mm.
> 
> This makes me think that your current test case probably is the worst case
> of Nadav's patch 6ce64428d6 because (1) the UFFDIO_WRITEPROTECT covers a
> super large range, and (2) there're a _lot_ of concurrent writers during
> the ioctl, so all of them will need to trigger a tlb flush, and that tlb
> flush will further slow down the ioctl sender.
> 
> While I think that's the optimal case sometimes, of having minimum tlb
> flush on the ioctl(UFFDIO_WRITEPROTECT), so maybe it makes sense somewhere
> else where concurrent writers are not that much. I'll need to rethink a bit
> on all these to find out whether we can have a good way for both..
> 
> For now, if your workload is mostly exactly like your test case, maybe you
> can have your pagemap version of WP-only op there, making sure tlb flush is
> within the pgtable lock critical section (so you should be safe even
> without Nadav's patch).  If so, I'd appreciate you can add some comment
> somewhere about such difference of using pagemap WP-only and
> ioctl(UFFDIO_WRITEPROTECT), though.  In short, functional-wise they should
> be the same, but trivial detail difference on performance as TBD (maybe one
> day we can have a good approach for all and make them aligned again, but
> maybe that also doesn't need to block your work).
Thank you for understanding what I've been trying to convey. We are going
to translate a Windows syscall to this new ioctl, so it is very difficult
to find out the exact use cases, as applications must be using that syscall
in several different ways. One thing is for sure: we want the best
performance possible, which we are getting by adding WP-only. I'll add it
and send v16. I think we are almost there.

>

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 38b19a757281..e119e13a9ba5 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -19,6 +19,7 @@ 
 #include <linux/shmem_fs.h>
 #include <linux/uaccess.h>
 #include <linux/pkeys.h>
+#include <linux/minmax.h>
 
 #include <asm/elf.h>
 #include <asm/tlb.h>
@@ -1767,11 +1768,491 @@  static int pagemap_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+#define PM_SCAN_FOUND_MAX_PAGES	(1)
+#define PM_SCAN_BITS_ALL	(PAGE_IS_WRITTEN | PAGE_IS_FILE |	\
+				 PAGE_IS_PRESENT | PAGE_IS_SWAPPED)
+#define PM_SCAN_OPS		(PM_SCAN_OP_GET | PM_SCAN_OP_WP)
+#define PM_SCAN_DO_UFFD_WP(a)	(a->flags & PM_SCAN_OP_WP)
+#define PM_SCAN_BITMAP(wt, file, present, swap)	\
+	((wt) | ((file) << 1) | ((present) << 2) | ((swap) << 3))
+
+struct pagemap_scan_private {
+	struct page_region *vec;
+	struct page_region cur;
+	unsigned long vec_len, vec_index;
+	unsigned int max_pages, found_pages, flags;
+	unsigned long required_mask, anyof_mask, excluded_mask, return_mask;
+};
+
+static inline bool is_pte_uffd_wp(pte_t pte)
+{
+	return (pte_present(pte) && pte_uffd_wp(pte)) ||
+	       pte_swp_uffd_wp_any(pte);
+}
+
+static inline void make_uffd_wp_pte(struct vm_area_struct *vma,
+				    unsigned long addr, pte_t *pte)
+{
+	pte_t ptent = *pte;
+
+	if (pte_present(ptent)) {
+		pte_t old_pte;
+
+		old_pte = ptep_modify_prot_start(vma, addr, pte);
+		ptent = pte_mkuffd_wp(ptent);
+		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
+	} else if (is_swap_pte(ptent)) {
+		ptent = pte_swp_mkuffd_wp(ptent);
+		set_pte_at(vma->vm_mm, addr, pte, ptent);
+	}
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline bool is_pmd_uffd_wp(pmd_t pmd)
+{
+	return (pmd_present(pmd) && pmd_uffd_wp(pmd)) ||
+	       (is_swap_pmd(pmd) && pmd_swp_uffd_wp(pmd));
+}
+
+static inline void make_uffd_wp_pmd(struct vm_area_struct *vma,
+				    unsigned long addr, pmd_t *pmdp)
+{
+	pmd_t old, pmd = *pmdp;
+
+	if (pmd_present(pmd)) {
+		old = pmdp_invalidate_ad(vma, addr, pmdp);
+		pmd = pmd_mkuffd_wp(old);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+		pmd = pmd_swp_mkuffd_wp(pmd);
+		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
+	}
+}
+#endif
+
+#ifdef CONFIG_HUGETLB_PAGE
+static inline bool is_huge_pte_uffd_wp(pte_t pte)
+{
+	return ((pte_present(pte) && huge_pte_uffd_wp(pte)) ||
+		pte_swp_uffd_wp_any(pte));
+}
+
+static inline void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t *ptep,
+					 pte_t ptent)
+{
+	pte_t old_pte;
+
+	if (!huge_pte_none(ptent)) {
+		old_pte = huge_ptep_modify_prot_start(vma, addr, ptep);
+		ptent = huge_pte_mkuffd_wp(old_pte);
+		ptep_modify_prot_commit(vma, addr, ptep, old_pte, ptent);
+	} else {
+		set_huge_pte_at(vma->vm_mm, addr, ptep,
+				make_pte_marker(PTE_MARKER_UFFD_WP));
+	}
+}
+#endif
+
+static inline bool pagemap_scan_check_page_written(struct pagemap_scan_private *p)
+{
+	return (p->required_mask | p->anyof_mask | p->excluded_mask) &
+	       PAGE_IS_WRITTEN;
+}
+
+static int pagemap_scan_test_walk(unsigned long start, unsigned long end,
+				  struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+
+	if (pagemap_scan_check_page_written(p) && (!userfaultfd_wp(vma) ||
+	    !userfaultfd_wp_async(vma)))
+		return -EPERM;
+
+	if (vma->vm_flags & VM_PFNMAP)
+		return 1;
+
+	return 0;
+}
+
+static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
+			       struct pagemap_scan_private *p,
+			       unsigned long addr, unsigned int n_pages)
+{
+	unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
+	struct page_region *cur = &p->cur;
+
+	if (!n_pages)
+		return -EINVAL;
+
+	if ((p->required_mask & bitmap) != p->required_mask)
+		return 0;
+	if (p->anyof_mask && !(p->anyof_mask & bitmap))
+		return 0;
+	if (p->excluded_mask & bitmap)
+		return 0;
+
+	bitmap &= p->return_mask;
+	if (!bitmap)
+		return 0;
+
+	if (cur->bitmap == bitmap &&
+	    cur->start + cur->len * PAGE_SIZE == addr) {
+		cur->len += n_pages;
+		p->found_pages += n_pages;
+
+		if (p->max_pages && (p->found_pages == p->max_pages))
+			return PM_SCAN_FOUND_MAX_PAGES;
+
+		return 0;
+	}
+
+	/*
+	 * All data is copied to cur first. When more data is found, we push
+	 * cur to vec and copy new data to cur. The vec_index represents the
+	 * current index of vec array. We add 1 to the vec_index while
+	 * performing checks to account for data in cur.
+	 */
+	if (p->vec_index && (p->vec_index + 1) >= p->vec_len)
+		return -ENOSPC;
+
+	if (cur->len) {
+		memcpy(&p->vec[p->vec_index], cur, sizeof(*p->vec));
+		p->vec_index++;
+	}
+
+	cur->start = addr;
+	cur->len = n_pages;
+	cur->bitmap = bitmap;
+	p->found_pages += n_pages;
+
+	if (p->max_pages && (p->found_pages == p->max_pages))
+		return PM_SCAN_FOUND_MAX_PAGES;
+
+	return 0;
+}
+
+static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
+				  unsigned long end, struct mm_walk *walk)
+{
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	unsigned long addr = end;
+	pte_t *pte, *orig_pte;
+	spinlock_t *ptl;
+	bool is_written;
+	int ret = 0;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		unsigned long n_pages = (end - start)/PAGE_SIZE;
+
+		if (p->max_pages && n_pages > p->max_pages - p->found_pages)
+			n_pages = p->max_pages - p->found_pages;
+
+		is_written = !is_pmd_uffd_wp(*pmd);
+
+		/*
+		 * Break the huge page into small pages if the WP operation
+		 * needs to be performed on only a portion of the huge page.
+		 */
+		if (is_written && PM_SCAN_DO_UFFD_WP(p) &&
+		    n_pages < HPAGE_SIZE/PAGE_SIZE) {
+			spin_unlock(ptl);
+			split_huge_pmd(vma, pmd, start);
+			goto process_smaller_pages;
+		}
+
+		ret = pagemap_scan_output(is_written, vma->vm_file,
+					  pmd_present(*pmd), is_swap_pmd(*pmd),
+					  p, start, n_pages);
+
+		if (ret >= 0 && is_written && PM_SCAN_DO_UFFD_WP(p))
+			make_uffd_wp_pmd(vma, addr, pmd);
+
+		spin_unlock(ptl);
+
+		if (PM_SCAN_DO_UFFD_WP(p))
+			flush_tlb_range(vma, start, end);
+
+		return ret;
+	}
+process_smaller_pages:
+	if (pmd_trans_unstable(pmd))
+		return 0;
+#endif
+
+	orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
+	for (addr = start; addr < end && !ret; pte++, addr += PAGE_SIZE) {
+		is_written = !is_pte_uffd_wp(*pte);
+
+		ret = pagemap_scan_output(is_written, vma->vm_file,
+					  pte_present(*pte), is_swap_pte(*pte),
+					  p, addr, 1);
+
+		if (ret >= 0 && is_written && PM_SCAN_DO_UFFD_WP(p))
+			make_uffd_wp_pte(vma, addr, pte);
+	}
+	pte_unmap_unlock(orig_pte, ptl);
+
+	if (PM_SCAN_DO_UFFD_WP(p))
+		flush_tlb_range(vma, start, addr);
+
+	cond_resched();
+	return ret;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int pagemap_scan_hugetlb_entry(pte_t *ptep, unsigned long hmask,
+				      unsigned long start, unsigned long end,
+				      struct mm_walk *walk)
+{
+	unsigned long n_pages = (end - start)/PAGE_SIZE;
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct hstate *h = hstate_vma(vma);
+	int ret = -EPERM;
+	spinlock_t *ptl;
+	bool is_written;
+	pte_t pte;
+
+	if (p->max_pages && n_pages > p->max_pages - p->found_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	if (PM_SCAN_DO_UFFD_WP(p)) {
+		i_mmap_lock_write(vma->vm_file->f_mapping);
+		ptl = huge_pte_lock(h, vma->vm_mm, ptep);
+	}
+
+	pte = huge_ptep_get(ptep);
+	is_written = !is_huge_pte_uffd_wp(pte);
+
+	/*
+	 * Partial hugetlb page clear isn't supported
+	 */
+	if (is_written && PM_SCAN_DO_UFFD_WP(p) &&
+	    n_pages < HPAGE_SIZE/PAGE_SIZE)
+		goto unlock_and_return;
+
+	ret = pagemap_scan_output(is_written, vma->vm_file, pte_present(pte),
+				  is_swap_pte(pte), p, start, n_pages);
+	if (ret < 0)
+		goto unlock_and_return;
+
+	if (is_written && PM_SCAN_DO_UFFD_WP(p)) {
+		make_uffd_wp_huge_pte(vma, start, ptep, pte);
+		flush_hugetlb_tlb_range(vma, start, end);
+	}
+
+unlock_and_return:
+	if (PM_SCAN_DO_UFFD_WP(p)) {
+		spin_unlock(ptl);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	}
+
+	return ret;
+}
+#else
+#define pagemap_scan_hugetlb_entry NULL
+#endif
+
+static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
+				 int depth, struct mm_walk *walk)
+{
+	unsigned long n_pages = (end - addr)/PAGE_SIZE;
+	struct pagemap_scan_private *p = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	int ret = 0;
+
+	if (!vma)
+		return 0;
+
+	if (p->max_pages && p->found_pages + n_pages > p->max_pages)
+		n_pages = p->max_pages - p->found_pages;
+
+	ret = pagemap_scan_output(false, vma->vm_file, false, false, p, addr,
+				  n_pages);
+	return ret;
+}
+
+static const struct mm_walk_ops pagemap_scan_ops = {
+	.test_walk = pagemap_scan_test_walk,
+	.pmd_entry = pagemap_scan_pmd_entry,
+	.pte_hole = pagemap_scan_pte_hole,
+	.hugetlb_entry = pagemap_scan_hugetlb_entry,
+};
+
+static int pagemap_scan_args_valid(struct pm_scan_arg *arg, unsigned long start,
+				   struct page_region __user *vec)
+{
+	/* Detect illegal size, flags, len and masks */
+	if (arg->size != sizeof(struct pm_scan_arg))
+		return -EINVAL;
+	if (arg->flags & ~PM_SCAN_OPS)
+		return -EINVAL;
+	if (!arg->len)
+		return -EINVAL;
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
+	     arg->return_mask) & ~PM_SCAN_BITS_ALL)
+		return -EINVAL;
+	if (!arg->required_mask && !arg->anyof_mask &&
+	    !arg->excluded_mask)
+		return -EINVAL;
+	if (!arg->return_mask)
+		return -EINVAL;
+
+	/* Validate memory ranges */
+	if (!(arg->flags & PM_SCAN_OP_GET))
+		return -EINVAL;
+	if (!arg->vec)
+		return -EINVAL;
+	if (arg->vec_len == 0)
+		return -EINVAL;
+
+	if (!IS_ALIGNED(start, PAGE_SIZE))
+		return -EINVAL;
+	if (!access_ok((void __user *)start, arg->len))
+		return -EFAULT;
+
+	if (!PM_SCAN_DO_UFFD_WP(arg))
+		return 0;
+
+	if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask) &
+	    ~PAGE_IS_WRITTEN)
+		return -EINVAL;
+
+	return 0;
+}
+
+static long do_pagemap_scan(struct mm_struct *mm,
+			   struct pm_scan_arg __user *uarg)
+{
+	unsigned long start, end, walk_start, walk_end;
+	unsigned long empty_slots, vec_index = 0;
+	struct mmu_notifier_range range;
+	struct page_region __user *vec;
+	struct pagemap_scan_private p;
+	struct pm_scan_arg arg;
+	int ret = 0;
+
+	if (copy_from_user(&arg, uarg, sizeof(arg)))
+		return -EFAULT;
+
+	start = untagged_addr((unsigned long)arg.start);
+	vec = (struct page_region *)untagged_addr((unsigned long)arg.vec);
+
+	ret = pagemap_scan_args_valid(&arg, start, vec);
+	if (ret)
+		return ret;
+
+	end = start + arg.len;
+	p.max_pages = arg.max_pages;
+	p.found_pages = 0;
+	p.flags = arg.flags;
+	p.required_mask = arg.required_mask;
+	p.anyof_mask = arg.anyof_mask;
+	p.excluded_mask = arg.excluded_mask;
+	p.return_mask = arg.return_mask;
+	p.cur.len = 0;
+	p.cur.start = 0;
+	p.vec = NULL;
+	p.vec_len = PAGEMAP_WALK_SIZE >> PAGE_SHIFT;
+
+	/*
+	 * Allocate a smaller buffer to get output from inside the page walk
+	 * functions and walk the page range in PAGEMAP_WALK_SIZE chunks. As
+	 * we want to return the output to the user in compact form, where no
+	 * two adjacent regions have the same flags, store the latest element
+	 * in p.cur between walks and copy p.cur to the user buffer at the
+	 * end of the walk.
+	 */
+	p.vec = kmalloc_array(p.vec_len, sizeof(*p.vec), GFP_KERNEL);
+	if (!p.vec)
+		return -ENOMEM;
+
+	if (p.flags & PM_SCAN_OP_WP) {
+		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_VMA, 0,
+					mm, start, end);
+		mmu_notifier_invalidate_range_start(&range);
+	}
+
+	walk_start = walk_end = start;
+	while (walk_end < end && !ret) {
+		p.vec_index = 0;
+
+		empty_slots = arg.vec_len - vec_index;
+		p.vec_len = min(p.vec_len, empty_slots);
+
+		walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
+		if (walk_end > end)
+			walk_end = end;
+
+		ret = mmap_read_lock_killable(mm);
+		if (ret)
+			goto free_data;
+		ret = walk_page_range(mm, walk_start, walk_end,
+				      &pagemap_scan_ops, &p);
+		mmap_read_unlock(mm);
+
+		if (ret && ret != -ENOSPC && ret != PM_SCAN_FOUND_MAX_PAGES)
+			goto free_data;
+
+		walk_start = walk_end;
+		if (p.vec_index) {
+			if (copy_to_user(&vec[vec_index], p.vec,
+					 p.vec_index * sizeof(*p.vec))) {
+				/*
+				 * Return error even though the OP succeeded
+				 */
+				ret = -EFAULT;
+				goto free_data;
+			}
+			vec_index += p.vec_index;
+		}
+	}
+
+	if (p.flags & PM_SCAN_OP_WP)
+		mmu_notifier_invalidate_range_end(&range);
+
+	if (p.cur.len) {
+		if (copy_to_user(&vec[vec_index], &p.cur, sizeof(*p.vec))) {
+			ret = -EFAULT;
+			goto free_data;
+		}
+		vec_index++;
+	}
+
+	ret = vec_index;
+
+free_data:
+	kfree(p.vec);
+	return ret;
+}
+
+static long do_pagemap_cmd(struct file *file, unsigned int cmd,
+			   unsigned long arg)
+{
+	struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
+	struct mm_struct *mm = file->private_data;
+
+	switch (cmd) {
+	case PAGEMAP_SCAN:
+		return do_pagemap_scan(mm, uarg);
+
+	default:
+		return -EINVAL;
+	}
+}
+
 const struct file_operations proc_pagemap_operations = {
 	.llseek		= mem_lseek, /* borrow this */
 	.read		= pagemap_read,
 	.open		= pagemap_open,
 	.release	= pagemap_release,
+	.unlocked_ioctl = do_pagemap_cmd,
+	.compat_ioctl	= do_pagemap_cmd,
 };
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index b7b56871029c..47879c38ce2f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -305,4 +305,57 @@  typedef int __bitwise __kernel_rwf_t;
 #define RWF_SUPPORTED	(RWF_HIPRI | RWF_DSYNC | RWF_SYNC | RWF_NOWAIT |\
 			 RWF_APPEND)
 
+/* Pagemap ioctl */
+#define PAGEMAP_SCAN	_IOWR('f', 16, struct pm_scan_arg)
+
+/* Bits are set in the bitmap of the page_region and masks in pm_scan_args */
+#define PAGE_IS_WRITTEN		(1 << 0)
+#define PAGE_IS_FILE		(1 << 1)
+#define PAGE_IS_PRESENT		(1 << 2)
+#define PAGE_IS_SWAPPED		(1 << 3)
+
+/*
+ * struct page_region - Page region with bitmap flags
+ * @start:	Start of the region
+ * @len:	Length of the region in pages
+ * @bitmap:	Bits set for the region
+ */
+struct page_region {
+	__u64 start;
+	__u64 len;
+	__u64 bitmap;
+};
+
+/*
+ * struct pm_scan_arg - Pagemap ioctl argument
+ * @size:		Size of the structure
+ * @flags:		Flags for the IOCTL
+ * @start:		Starting address of the region
+ * @len:		Length of the region (All the pages in this length are included)
+ * @vec:		Address of page_region struct array for output
+ * @vec_len:		Length of the page_region struct array
+ * @max_pages:		Optional max return pages
+ * @required_mask:	Required mask - All of these bits have to be set in the PTE
+ * @anyof_mask:		Any mask - Any of these bits are set in the PTE
+ * @excluded_mask:	Exclude mask - None of these bits are set in the PTE
+ * @return_mask:	Bits that are to be reported in page_region
+ */
+struct pm_scan_arg {
+	__u64 size;
+	__u64 flags;
+	__u64 start;
+	__u64 len;
+	__u64 vec;
+	__u64 vec_len;
+	__u64 max_pages;
+	__u64 required_mask;
+	__u64 anyof_mask;
+	__u64 excluded_mask;
+	__u64 return_mask;
+};
+
+/* Supported flags */
+#define PM_SCAN_OP_GET	(1 << 0)
+#define PM_SCAN_OP_WP	(1 << 1)
+
 #endif /* _UAPI_LINUX_FS_H */