diff mbox series

[v7,1/4] userfaultfd: Add UFFD WP Async support

Message ID 20230109064519.3555250-2-usama.anjum@collabora.com (mailing list archive)
State New
Headers show
Series Implement IOCTL to get and/or the clear info about PTEs | expand

Commit Message

Muhammad Usama Anjum Jan. 9, 2023, 6:45 a.m. UTC
Add new WP Async mode (UFFDIO_WRITEPROTECT_MODE_ASYNC_WP) which resolves
the page faults on its own. It can be used to track that which pages have
been written to from the time the pages were write protected. It is very
efficient way to track the changes as uffd is by nature pte/pmd based.

UFFD WP (UFFDIO_WRITEPROTECT_MODE_WP) sends the page faults to the
userspace where the pages which have been written-to can be tracked. But
it is not efficient. This is why this async version is being added.
After setting the WP Async, the pages which have been written to can be
found in the pagemap file or information can be obtained from the
PAGEMAP_IOCTL (see next patches).

Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
---
 fs/userfaultfd.c                 | 150 +++++++++++++++++--------------
 include/uapi/linux/userfaultfd.h |   6 ++
 2 files changed, 90 insertions(+), 66 deletions(-)

Comments

Peter Xu Jan. 18, 2023, 4:54 p.m. UTC | #1
Hi, Muhammad,

On Mon, Jan 09, 2023 at 11:45:16AM +0500, Muhammad Usama Anjum wrote:
> Add new WP Async mode (UFFDIO_WRITEPROTECT_MODE_ASYNC_WP) which resolves
> the page faults on its own. It can be used to track that which pages have
> been written to from the time the pages were write protected. It is very
> efficient way to track the changes as uffd is by nature pte/pmd based.
> 
> UFFD WP (UFFDIO_WRITEPROTECT_MODE_WP) sends the page faults to the
> userspace where the pages which have been written-to can be tracked. But
> it is not efficient. This is why this async version is being added.
> After setting the WP Async, the pages which have been written to can be
> found in the pagemap file or information can be obtained from the
> PAGEMAP_IOCTL (see next patches).
> 
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> ---
>  fs/userfaultfd.c                 | 150 +++++++++++++++++--------------
>  include/uapi/linux/userfaultfd.h |   6 ++
>  2 files changed, 90 insertions(+), 66 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 15a5bf765d43..be5e10d15058 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -69,6 +69,7 @@ struct userfaultfd_ctx {
>  	unsigned int features;
>  	/* released */
>  	bool released;
> +	bool async;

Let's just make it a feature flag,

  UFFD_FEATURE_WP_ASYNC

>  	/* memory mappings are changing because of non-cooperative event */
>  	atomic_t mmap_changing;
>  	/* mm with one ore more vmas attached to this userfaultfd_ctx */
> @@ -497,80 +498,93 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>  
>  	/* take the reference before dropping the mmap_lock */
>  	userfaultfd_ctx_get(ctx);
> +	if (ctx->async) {

Firstly, please consider not touching the existing code/indent as much as
what this patch did.  Hopefully we can keep the major part of sync uffd be
there with its git log, it also helps reviewing your code.  You can add the
async block before that, handle the fault and return just earlier.

And, I think this is a bit too late because we're going to return with
VM_FAULT_RETRY here, while maybe we don't need to retry at all here because
we're going to resolve the page fault immediately.

I assume you added this because you wanted userfaultfd_ctx_get() to make
sure the uffd context will not go away from under us, but it's not needed
if we're still holding the mmap read lock.  I'd expect for async mode we
don't really need to release it at all.

> +		// Resolve page fault of this page

Please use "/* ... */" as that's the common pattern of commenting in the
Linux kernel, at least what I see in mm/.

> +		unsigned long addr = (ctx->features & UFFD_FEATURE_EXACT_ADDRESS) ?
> +				      vmf->real_address : vmf->address;
> +		struct vm_area_struct *dst_vma = find_vma(ctx->mm, addr);
> +		size_t s = PAGE_SIZE;

This is weird - if we want async uffd-wp, let's consider huge page from the
1st day.

> +
> +		if (dst_vma->vm_flags & VM_HUGEPAGE) {

VM_HUGEPAGE is only a hint.  It doesn't mean this page is always a huge
page.  For anon, we can have thp wr-protected as a whole, not happening for
!anon because we'll split already.

For anon, if a write happens to a thp being uffd-wp-ed, we'll keep that pmd
wr-protected and report the uffd message.  The pmd split happens when the
user invokes UFFDIO_WRITEPROTECT on the small page.  I think it'll stop
working for async uffd-wp because we're going to resolve the page faults
right away.

So for async uffd-wp (note: this will be different from hugetlb), you may
want to consider having a pre-requisite patch to change wp_huge_pmd()
behavior: rather than calling handle_userfault(), IIUC you can also just
fallback to the split path right below (__split_huge_pmd) so the thp will
split now even before the uffd message is generated.

I think it should be transparent to the user and it'll start working for
you with async uffd-wp here, because it means when reaching
handle_userfault, it should not be possible to have thp at all since they
should have all split up.

> +			s = HPAGE_SIZE;
> +			addr &= HPAGE_MASK;
> +		}
>  
> -	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> -	uwq.wq.private = current;
> -	uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
> -				reason, ctx->features);
> -	uwq.ctx = ctx;
> -	uwq.waken = false;
> -
> -	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> +		ret = mwriteprotect_range(ctx->mm, addr, s, false, &ctx->mmap_changing);

This is an overkill - we're pretty sure it's a single page, no need to call
a range function here.

> +	} else {
> +		init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> +		uwq.wq.private = current;
> +		uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
> +					reason, ctx->features);
> +		uwq.ctx = ctx;
> +		uwq.waken = false;
>  
> -        /*
> -         * Take the vma lock now, in order to safely call
> -         * userfaultfd_huge_must_wait() later. Since acquiring the
> -         * (sleepable) vma lock can modify the current task state, that
> -         * must be before explicitly calling set_current_state().
> -         */
> -	if (is_vm_hugetlb_page(vma))
> -		hugetlb_vma_lock_read(vma);
> +		blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>  
> -	spin_lock_irq(&ctx->fault_pending_wqh.lock);
> -	/*
> -	 * After the __add_wait_queue the uwq is visible to userland
> -	 * through poll/read().
> -	 */
> -	__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
> -	/*
> -	 * The smp_mb() after __set_current_state prevents the reads
> -	 * following the spin_unlock to happen before the list_add in
> -	 * __add_wait_queue.
> -	 */
> -	set_current_state(blocking_state);
> -	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
> +		/*
> +		 * Take the vma lock now, in order to safely call
> +		 * userfaultfd_huge_must_wait() later. Since acquiring the
> +		 * (sleepable) vma lock can modify the current task state, that
> +		 * must be before explicitly calling set_current_state().
> +		 */
> +		if (is_vm_hugetlb_page(vma))
> +			hugetlb_vma_lock_read(vma);
>  
> -	if (!is_vm_hugetlb_page(vma))
> -		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
> -						  reason);
> -	else
> -		must_wait = userfaultfd_huge_must_wait(ctx, vma,
> -						       vmf->address,
> -						       vmf->flags, reason);
> -	if (is_vm_hugetlb_page(vma))
> -		hugetlb_vma_unlock_read(vma);
> -	mmap_read_unlock(mm);
> +		spin_lock_irq(&ctx->fault_pending_wqh.lock);
> +		/*
> +		 * After the __add_wait_queue the uwq is visible to userland
> +		 * through poll/read().
> +		 */
> +		__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
> +		/*
> +		 * The smp_mb() after __set_current_state prevents the reads
> +		 * following the spin_unlock to happen before the list_add in
> +		 * __add_wait_queue.
> +		 */
> +		set_current_state(blocking_state);
> +		spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>  
> -	if (likely(must_wait && !READ_ONCE(ctx->released))) {
> -		wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> -		schedule();
> -	}
> +		if (!is_vm_hugetlb_page(vma))
> +			must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
> +							  reason);
> +		else
> +			must_wait = userfaultfd_huge_must_wait(ctx, vma,
> +							       vmf->address,
> +							       vmf->flags, reason);
> +		if (is_vm_hugetlb_page(vma))
> +			hugetlb_vma_unlock_read(vma);
> +		mmap_read_unlock(mm);
> +
> +		if (likely(must_wait && !READ_ONCE(ctx->released))) {
> +			wake_up_poll(&ctx->fd_wqh, EPOLLIN);
> +			schedule();
> +		}
>  
> -	__set_current_state(TASK_RUNNING);
> +		__set_current_state(TASK_RUNNING);
>  
> -	/*
> -	 * Here we race with the list_del; list_add in
> -	 * userfaultfd_ctx_read(), however because we don't ever run
> -	 * list_del_init() to refile across the two lists, the prev
> -	 * and next pointers will never point to self. list_add also
> -	 * would never let any of the two pointers to point to
> -	 * self. So list_empty_careful won't risk to see both pointers
> -	 * pointing to self at any time during the list refile. The
> -	 * only case where list_del_init() is called is the full
> -	 * removal in the wake function and there we don't re-list_add
> -	 * and it's fine not to block on the spinlock. The uwq on this
> -	 * kernel stack can be released after the list_del_init.
> -	 */
> -	if (!list_empty_careful(&uwq.wq.entry)) {
> -		spin_lock_irq(&ctx->fault_pending_wqh.lock);
>  		/*
> -		 * No need of list_del_init(), the uwq on the stack
> -		 * will be freed shortly anyway.
> +		 * Here we race with the list_del; list_add in
> +		 * userfaultfd_ctx_read(), however because we don't ever run
> +		 * list_del_init() to refile across the two lists, the prev
> +		 * and next pointers will never point to self. list_add also
> +		 * would never let any of the two pointers to point to
> +		 * self. So list_empty_careful won't risk to see both pointers
> +		 * pointing to self at any time during the list refile. The
> +		 * only case where list_del_init() is called is the full
> +		 * removal in the wake function and there we don't re-list_add
> +		 * and it's fine not to block on the spinlock. The uwq on this
> +		 * kernel stack can be released after the list_del_init.
>  		 */
> -		list_del(&uwq.wq.entry);
> -		spin_unlock_irq(&ctx->fault_pending_wqh.lock);
> +		if (!list_empty_careful(&uwq.wq.entry)) {
> +			spin_lock_irq(&ctx->fault_pending_wqh.lock);
> +			/*
> +			 * No need of list_del_init(), the uwq on the stack
> +			 * will be freed shortly anyway.
> +			 */
> +			list_del(&uwq.wq.entry);
> +			spin_unlock_irq(&ctx->fault_pending_wqh.lock);
> +		}
>  	}
> -
>  	/*
>  	 * ctx may go away after this if the userfault pseudo fd is
>  	 * already released.
> @@ -1861,11 +1875,14 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  		return ret;
>  
>  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
> -			       UFFDIO_WRITEPROTECT_MODE_WP))
> +			       UFFDIO_WRITEPROTECT_MODE_WP |
> +			       UFFDIO_WRITEPROTECT_MODE_ASYNC_WP))
>  		return -EINVAL;
>  
> -	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
> +	mode_wp = uffdio_wp.mode & (UFFDIO_WRITEPROTECT_MODE_WP |
> +				    UFFDIO_WRITEPROTECT_MODE_ASYNC_WP);
>  	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
> +	ctx->async = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ASYNC_WP;

Please no..  ctx attributes shouldn't be easily changed by a single ioctl.

I suggest we have a new feature bit as I mentioned above (say,
UFFD_FEATURE_WP_ASYNC), set it once with UFFDIO_API and it should apply to
the whole lifecycle of this uffd handle.  That flag should (something I can
quickly think of):

  - Have effect only if the uffd will be registered with WP mode (of
    course) or ignored in any other modes,

  - Should fail any attempts of UFFDIO_WRITEPROTECT with wp=false on this
    uffd handle because with async faults no page fault resolution needed
    from userspace,

  - Should apply to any region registered with this uffd ctx, so it's
    exclusively used with sync uffd-wp mode.

Then when the app wants to wr-protect in async mode, it simply goes ahead
with UFFDIO_WRITEPROTECT(wp=true), it'll happen exactly the same as when it
was sync mode.  It's only the pf handling procedure that's different (along
with how the fault is reported - rather than as a message but it'll be
consolidated into the soft-dirty bit).

>  
>  	if (mode_wp && mode_dontwake)
>  		return -EINVAL;
> @@ -2126,6 +2143,7 @@ static int new_userfaultfd(int flags)
>  	ctx->flags = flags;
>  	ctx->features = 0;
>  	ctx->released = false;
> +	ctx->async = false;
>  	atomic_set(&ctx->mmap_changing, 0);
>  	ctx->mm = current->mm;
>  	/* prevent the mm struct to be freed */
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index 005e5e306266..b89665653861 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -284,6 +284,11 @@ struct uffdio_writeprotect {
>   * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
>   * any wait thread after the operation succeeds.
>   *
> + * UFFDIO_WRITEPROTECT_MODE_ASYNC_WP: set the flag to write protect a
> + * range, the flag is unset automatically when the page is written.
> + * This is used to track which pages have been written to from the
> + * time the memory was write protected.
> + *
>   * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
>   * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
>   * protection (WP=0) in response to a page fault wakes the faulting
> @@ -291,6 +296,7 @@ struct uffdio_writeprotect {
>   */
>  #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
>  #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> +#define UFFDIO_WRITEPROTECT_MODE_ASYNC_WP	((__u64)1<<2)
>  	__u64 mode;
>  };
>  
> -- 
> 2.30.2
>
Muhammad Usama Anjum Jan. 19, 2023, 3:09 p.m. UTC | #2
Hi Peter,

Thank you so much for reviewing.

On 1/18/23 9:54 PM, Peter Xu wrote:
> Hi, Muhammad,
> 
> On Mon, Jan 09, 2023 at 11:45:16AM +0500, Muhammad Usama Anjum wrote:
>> Add new WP Async mode (UFFDIO_WRITEPROTECT_MODE_ASYNC_WP) which resolves
>> the page faults on its own. It can be used to track that which pages have
>> been written to from the time the pages were write protected. It is very
>> efficient way to track the changes as uffd is by nature pte/pmd based.
>>
>> UFFD WP (UFFDIO_WRITEPROTECT_MODE_WP) sends the page faults to the
>> userspace where the pages which have been written-to can be tracked. But
>> it is not efficient. This is why this async version is being added.
>> After setting the WP Async, the pages which have been written to can be
>> found in the pagemap file or information can be obtained from the
>> PAGEMAP_IOCTL (see next patches).
>>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
>> ---
>>  fs/userfaultfd.c                 | 150 +++++++++++++++++--------------
>>  include/uapi/linux/userfaultfd.h |   6 ++
>>  2 files changed, 90 insertions(+), 66 deletions(-)
>>
>> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
>> index 15a5bf765d43..be5e10d15058 100644
>> --- a/fs/userfaultfd.c
>> +++ b/fs/userfaultfd.c
>> @@ -69,6 +69,7 @@ struct userfaultfd_ctx {
>>  	unsigned int features;
>>  	/* released */
>>  	bool released;
>> +	bool async;
> 
> Let's just make it a feature flag,
> 
>   UFFD_FEATURE_WP_ASYNC
This would really make things easier. Thank you so much for suggesting it.

> 
>>  	/* memory mappings are changing because of non-cooperative event */
>>  	atomic_t mmap_changing;
>>  	/* mm with one ore more vmas attached to this userfaultfd_ctx */
>> @@ -497,80 +498,93 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>  
>>  	/* take the reference before dropping the mmap_lock */
>>  	userfaultfd_ctx_get(ctx);
>> +	if (ctx->async) {
> 
> Firstly, please consider not touching the existing code/indent as much as
> what this patch did.  Hopefully we can keep the major part of sync uffd be
> there with its git log, it also helps reviewing your code.  You can add the
> async block before that, handle the fault and return just earlier.
This is possible. Will do in next revision.

> 
> And, I think this is a bit too late because we're going to return with
> VM_FAULT_RETRY here, while maybe we don't need to retry at all here because
> we're going to resolve the page fault immediately.
> 
> I assume you added this because you wanted userfaultfd_ctx_get() to make
> sure the uffd context will not go away from under us, but it's not needed
> if we're still holding the mmap read lock.  I'd expect for async mode we
> don't really need to release it at all.
I'll have to check the what should be returned here. We should return
something which shows that the fault has been resolved.

> 
>> +		// Resolve page fault of this page
> 
> Please use "/* ... */" as that's the common pattern of commenting in the
> Linux kernel, at least what I see in mm/.
Will do.

> 
>> +		unsigned long addr = (ctx->features & UFFD_FEATURE_EXACT_ADDRESS) ?
>> +				      vmf->real_address : vmf->address;
>> +		struct vm_area_struct *dst_vma = find_vma(ctx->mm, addr);
>> +		size_t s = PAGE_SIZE;
> 
> This is weird - if we want async uffd-wp, let's consider huge page from the
> 1st day.
> 
>> +
>> +		if (dst_vma->vm_flags & VM_HUGEPAGE) {
> 
> VM_HUGEPAGE is only a hint.  It doesn't mean this page is always a huge
> page.  For anon, we can have thp wr-protected as a whole, not happening for
> !anon because we'll split already.
> 
> For anon, if a write happens to a thp being uffd-wp-ed, we'll keep that pmd
> wr-protected and report the uffd message.  The pmd split happens when the
> user invokes UFFDIO_WRITEPROTECT on the small page.  I think it'll stop
> working for async uffd-wp because we're going to resolve the page faults
> right away.
> 
> So for async uffd-wp (note: this will be different from hugetlb), you may
> want to consider having a pre-requisite patch to change wp_huge_pmd()
> behavior: rather than calling handle_userfault(), IIUC you can also just
> fallback to the split path right below (__split_huge_pmd) so the thp will
> split now even before the uffd message is generated.
I'll make the changes and make this. I wasn't aware that the thp is being
broken in the UFFD WP. At this time, I'm not sure if thp will be handled by
handle_userfault() in one go. Probably it will as the length is stored in
the vmf.

> 
> I think it should be transparent to the user and it'll start working for
> you with async uffd-wp here, because it means when reaching
> handle_userfault, it should not be possible to have thp at all since they
> should have all split up.
> 
>> +			s = HPAGE_SIZE;
>> +			addr &= HPAGE_MASK;
>> +		}
>>  
>> -	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
>> -	uwq.wq.private = current;
>> -	uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
>> -				reason, ctx->features);
>> -	uwq.ctx = ctx;
>> -	uwq.waken = false;
>> -
>> -	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>> +		ret = mwriteprotect_range(ctx->mm, addr, s, false, &ctx->mmap_changing);
> 
> This is an overkill - we're pretty sure it's a single page, no need to call
> a range function here.
Probably change_pte_range() should be used here to directly remove the WP here?

> 
>> +	} else {
>> +		init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
>> +		uwq.wq.private = current;
>> +		uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
>> +					reason, ctx->features);
>> +		uwq.ctx = ctx;
>> +		uwq.waken = false;
>>  
>> -        /*
>> -         * Take the vma lock now, in order to safely call
>> -         * userfaultfd_huge_must_wait() later. Since acquiring the
>> -         * (sleepable) vma lock can modify the current task state, that
>> -         * must be before explicitly calling set_current_state().
>> -         */
>> -	if (is_vm_hugetlb_page(vma))
>> -		hugetlb_vma_lock_read(vma);
>> +		blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>>  
>> -	spin_lock_irq(&ctx->fault_pending_wqh.lock);
>> -	/*
>> -	 * After the __add_wait_queue the uwq is visible to userland
>> -	 * through poll/read().
>> -	 */
>> -	__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
>> -	/*
>> -	 * The smp_mb() after __set_current_state prevents the reads
>> -	 * following the spin_unlock to happen before the list_add in
>> -	 * __add_wait_queue.
>> -	 */
>> -	set_current_state(blocking_state);
>> -	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>> +		/*
>> +		 * Take the vma lock now, in order to safely call
>> +		 * userfaultfd_huge_must_wait() later. Since acquiring the
>> +		 * (sleepable) vma lock can modify the current task state, that
>> +		 * must be before explicitly calling set_current_state().
>> +		 */
>> +		if (is_vm_hugetlb_page(vma))
>> +			hugetlb_vma_lock_read(vma);
>>  
>> -	if (!is_vm_hugetlb_page(vma))
>> -		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
>> -						  reason);
>> -	else
>> -		must_wait = userfaultfd_huge_must_wait(ctx, vma,
>> -						       vmf->address,
>> -						       vmf->flags, reason);
>> -	if (is_vm_hugetlb_page(vma))
>> -		hugetlb_vma_unlock_read(vma);
>> -	mmap_read_unlock(mm);
>> +		spin_lock_irq(&ctx->fault_pending_wqh.lock);
>> +		/*
>> +		 * After the __add_wait_queue the uwq is visible to userland
>> +		 * through poll/read().
>> +		 */
>> +		__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
>> +		/*
>> +		 * The smp_mb() after __set_current_state prevents the reads
>> +		 * following the spin_unlock to happen before the list_add in
>> +		 * __add_wait_queue.
>> +		 */
>> +		set_current_state(blocking_state);
>> +		spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>>  
>> -	if (likely(must_wait && !READ_ONCE(ctx->released))) {
>> -		wake_up_poll(&ctx->fd_wqh, EPOLLIN);
>> -		schedule();
>> -	}
>> +		if (!is_vm_hugetlb_page(vma))
>> +			must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
>> +							  reason);
>> +		else
>> +			must_wait = userfaultfd_huge_must_wait(ctx, vma,
>> +							       vmf->address,
>> +							       vmf->flags, reason);
>> +		if (is_vm_hugetlb_page(vma))
>> +			hugetlb_vma_unlock_read(vma);
>> +		mmap_read_unlock(mm);
>> +
>> +		if (likely(must_wait && !READ_ONCE(ctx->released))) {
>> +			wake_up_poll(&ctx->fd_wqh, EPOLLIN);
>> +			schedule();
>> +		}
>>  
>> -	__set_current_state(TASK_RUNNING);
>> +		__set_current_state(TASK_RUNNING);
>>  
>> -	/*
>> -	 * Here we race with the list_del; list_add in
>> -	 * userfaultfd_ctx_read(), however because we don't ever run
>> -	 * list_del_init() to refile across the two lists, the prev
>> -	 * and next pointers will never point to self. list_add also
>> -	 * would never let any of the two pointers to point to
>> -	 * self. So list_empty_careful won't risk to see both pointers
>> -	 * pointing to self at any time during the list refile. The
>> -	 * only case where list_del_init() is called is the full
>> -	 * removal in the wake function and there we don't re-list_add
>> -	 * and it's fine not to block on the spinlock. The uwq on this
>> -	 * kernel stack can be released after the list_del_init.
>> -	 */
>> -	if (!list_empty_careful(&uwq.wq.entry)) {
>> -		spin_lock_irq(&ctx->fault_pending_wqh.lock);
>>  		/*
>> -		 * No need of list_del_init(), the uwq on the stack
>> -		 * will be freed shortly anyway.
>> +		 * Here we race with the list_del; list_add in
>> +		 * userfaultfd_ctx_read(), however because we don't ever run
>> +		 * list_del_init() to refile across the two lists, the prev
>> +		 * and next pointers will never point to self. list_add also
>> +		 * would never let any of the two pointers to point to
>> +		 * self. So list_empty_careful won't risk to see both pointers
>> +		 * pointing to self at any time during the list refile. The
>> +		 * only case where list_del_init() is called is the full
>> +		 * removal in the wake function and there we don't re-list_add
>> +		 * and it's fine not to block on the spinlock. The uwq on this
>> +		 * kernel stack can be released after the list_del_init.
>>  		 */
>> -		list_del(&uwq.wq.entry);
>> -		spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>> +		if (!list_empty_careful(&uwq.wq.entry)) {
>> +			spin_lock_irq(&ctx->fault_pending_wqh.lock);
>> +			/*
>> +			 * No need of list_del_init(), the uwq on the stack
>> +			 * will be freed shortly anyway.
>> +			 */
>> +			list_del(&uwq.wq.entry);
>> +			spin_unlock_irq(&ctx->fault_pending_wqh.lock);
>> +		}
>>  	}
>> -
>>  	/*
>>  	 * ctx may go away after this if the userfault pseudo fd is
>>  	 * already released.
>> @@ -1861,11 +1875,14 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>>  		return ret;
>>  
>>  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
>> -			       UFFDIO_WRITEPROTECT_MODE_WP))
>> +			       UFFDIO_WRITEPROTECT_MODE_WP |
>> +			       UFFDIO_WRITEPROTECT_MODE_ASYNC_WP))
>>  		return -EINVAL;
>>  
>> -	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
>> +	mode_wp = uffdio_wp.mode & (UFFDIO_WRITEPROTECT_MODE_WP |
>> +				    UFFDIO_WRITEPROTECT_MODE_ASYNC_WP);
>>  	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
>> +	ctx->async = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ASYNC_WP;
> 
> Please no..  ctx attributes shouldn't be easily changed by a single ioctl.
> 
> I suggest we have a new feature bit as I mentioned above (say,
> UFFD_FEATURE_WP_ASYNC), set it once with UFFDIO_API and it should apply to
> the whole lifecycle of this uffd handle.  That flag should (something I can
> quickly think of):
> 
>   - Have effect only if the uffd will be registered with WP mode (of
>     course) or ignored in any other modes,
> 
>   - Should fail any attempts of UFFDIO_WRITEPROTECT with wp=false on this
>     uffd handle because with async faults no page fault resolution needed
>     from userspace,
> 
>   - Should apply to any region registered with this uffd ctx, so it's
>     exclusively used with sync uffd-wp mode.
All of these are necesary and must be done to consolidate the interface of
UFFD. Agreed!

> 
> Then when the app wants to wr-protect in async mode, it simply goes ahead
> with UFFDIO_WRITEPROTECT(wp=true), it'll happen exactly the same as when it
> was sync mode.  It's only the pf handling procedure that's different (along
> with how the fault is reported - rather than as a message but it'll be
> consolidated into the soft-dirty bit).
PF handling will resovle the fault after un-setting the _PAGE_*_UFFD_WP on
the page. I'm not changing the soft-dirty bit. It is too delicate (if you
get the joke).

> 
>>  
>>  	if (mode_wp && mode_dontwake)
>>  		return -EINVAL;
>> @@ -2126,6 +2143,7 @@ static int new_userfaultfd(int flags)
>>  	ctx->flags = flags;
>>  	ctx->features = 0;
>>  	ctx->released = false;
>> +	ctx->async = false;
>>  	atomic_set(&ctx->mmap_changing, 0);
>>  	ctx->mm = current->mm;
>>  	/* prevent the mm struct to be freed */
>> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
>> index 005e5e306266..b89665653861 100644
>> --- a/include/uapi/linux/userfaultfd.h
>> +++ b/include/uapi/linux/userfaultfd.h
>> @@ -284,6 +284,11 @@ struct uffdio_writeprotect {
>>   * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
>>   * any wait thread after the operation succeeds.
>>   *
>> + * UFFDIO_WRITEPROTECT_MODE_ASYNC_WP: set the flag to write protect a
>> + * range, the flag is unset automatically when the page is written.
>> + * This is used to track which pages have been written to from the
>> + * time the memory was write protected.
>> + *
>>   * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
>>   * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
>>   * protection (WP=0) in response to a page fault wakes the faulting
>> @@ -291,6 +296,7 @@ struct uffdio_writeprotect {
>>   */
>>  #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
>>  #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
>> +#define UFFDIO_WRITEPROTECT_MODE_ASYNC_WP	((__u64)1<<2)
>>  	__u64 mode;
>>  };
>>  
>> -- 
>> 2.30.2
>>
> 

I should have added Suggested-by: Peter Xy <peterx@redhat.com> to this
patch. I'll add in the next revision if you don't object.

I've started working on next revision. I'll reply to other highly valuable
review emails a bit later.
Peter Xu Jan. 19, 2023, 4:35 p.m. UTC | #3
On Thu, Jan 19, 2023 at 08:09:52PM +0500, Muhammad Usama Anjum wrote:

[...]

> >> @@ -497,80 +498,93 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> >>  
> >>  	/* take the reference before dropping the mmap_lock */
> >>  	userfaultfd_ctx_get(ctx);
> >> +	if (ctx->async) {
> > 
> > Firstly, please consider not touching the existing code/indent as much as
> > what this patch did.  Hopefully we can keep the major part of sync uffd be
> > there with its git log, it also helps reviewing your code.  You can add the
> > async block before that, handle the fault and return just earlier.
> This is possible. Will do in next revision.
> 
> > 
> > And, I think this is a bit too late because we're going to return with
> > VM_FAULT_RETRY here, while maybe we don't need to retry at all here because
> > we're going to resolve the page fault immediately.
> > 
> > I assume you added this because you wanted userfaultfd_ctx_get() to make
> > sure the uffd context will not go away from under us, but it's not needed
> > if we're still holding the mmap read lock.  I'd expect for async mode we
> > don't really need to release it at all.
> I'll have to check the what should be returned here. We should return
> something which shows that the fault has been resolved.

VM_FAULT_NOPAGE may be the best to describe it, but I guess it shouldn't
have a difference here if to just return zero.  And, I guess you don't even
need to worry on the retval here because I think you can leverage do_wp_page.
More below.

> 
> > 
> >> +		// Resolve page fault of this page
> > 
> > Please use "/* ... */" as that's the common pattern of commenting in the
> > Linux kernel, at least what I see in mm/.
> Will do.
> 
> > 
> >> +		unsigned long addr = (ctx->features & UFFD_FEATURE_EXACT_ADDRESS) ?
> >> +				      vmf->real_address : vmf->address;
> >> +		struct vm_area_struct *dst_vma = find_vma(ctx->mm, addr);
> >> +		size_t s = PAGE_SIZE;
> > 
> > This is weird - if we want async uffd-wp, let's consider huge page from the
> > 1st day.
> > 
> >> +
> >> +		if (dst_vma->vm_flags & VM_HUGEPAGE) {
> > 
> > VM_HUGEPAGE is only a hint.  It doesn't mean this page is always a huge
> > page.  For anon, we can have thp wr-protected as a whole, not happening for
> > !anon because we'll split already.
> > 
> > For anon, if a write happens to a thp being uffd-wp-ed, we'll keep that pmd
> > wr-protected and report the uffd message.  The pmd split happens when the
> > user invokes UFFDIO_WRITEPROTECT on the small page.  I think it'll stop
> > working for async uffd-wp because we're going to resolve the page faults
> > right away.
> > 
> > So for async uffd-wp (note: this will be different from hugetlb), you may
> > want to consider having a pre-requisite patch to change wp_huge_pmd()
> > behavior: rather than calling handle_userfault(), IIUC you can also just
> > fallback to the split path right below (__split_huge_pmd) so the thp will
> > split now even before the uffd message is generated.
> I'll make the changes and make this. I wasn't aware that the thp is being
> broken in the UFFD WP. At this time, I'm not sure if thp will be handled by
> handle_userfault() in one go. Probably it will as the length is stored in
> the vmf.

Yes I think THP can actually be handled in one go with uffd-wp anon (even
if vmf doesn't store any length because page fault is about address only
not length, afaict).  E.g. thp firstly get wr-protected in thp size, then
when unprotect the user app sends UFFDIO_WRITEPROTECT(wp=false) with a
range covering the whole thp.

But AFAIU that should be quite rare because most uffd-wp scenarios are
latency sensitive, resolving page faults in large chunk definitely enlarges
that.  It could happen though when it's not resolving an immediate page
fault, so it could happen in the background.

So after a second thought, a safer approach is we only go to the split path
if async is enabled, in wp_huge_pmd().  Then it doesn't need to be a
pre-requisite patch too, it can be part of the major patch to implement the
uffd-wp async mode.

> 
> > 
> > I think it should be transparent to the user and it'll start working for
> > you with async uffd-wp here, because it means when reaching
> > handle_userfault, it should not be possible to have thp at all since they
> > should have all split up.
> > 
> >> +			s = HPAGE_SIZE;
> >> +			addr &= HPAGE_MASK;
> >> +		}
> >>  
> >> -	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> >> -	uwq.wq.private = current;
> >> -	uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
> >> -				reason, ctx->features);
> >> -	uwq.ctx = ctx;
> >> -	uwq.waken = false;
> >> -
> >> -	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> >> +		ret = mwriteprotect_range(ctx->mm, addr, s, false, &ctx->mmap_changing);
> > 
> > This is an overkill - we're pretty sure it's a single page, no need to call
> > a range function here.
> Probably change_pte_range() should be used here to directly remove the WP here?

Here we can persue the best performance, or we can also persue the easist
way to implement.  I think the best we can have is we don't release either
the mmap read lock _and_ the pgtable lock, so we resolve the page fault
completely here.  But that requires more code changes.

So far an probably intermediate (and very easy to implement) solution is:

(1) Remap the pte (vmf->pte) and retake the lock (vmf->ptl).  Note: you
    need to move the chunk to be before mmap read lock released first,
    because we'll need that to make sure pgtable lock and the pgtable page
    being still exist at the first place.

(2) If *vmf->pte != vmf->orig_pte, it means the pgtable changed, retry
    (with VM_FAULT_NOPAGE).  We must have orig_pte set btw in this path.

(2) Remove the uffd-wp bit if it's set (and it must be set, because we
    checked again on orig_pte with pgtable lock held).

(3) Invoke do_wp_page() again with the same vmf.

This will focus the resolution on the single page and resolve CoW in one
shot if needed.  We may need to redo the map/lock of pte* but I suppose it
won't hurt a lot if we just modified the fields anyway, so we can leave
that for later.

[...]

> > Then when the app wants to wr-protect in async mode, it simply goes ahead
> > with UFFDIO_WRITEPROTECT(wp=true), it'll happen exactly the same as when it
> > was sync mode.  It's only the pf handling procedure that's different (along
> > with how the fault is reported - rather than as a message but it'll be
> > consolidated into the soft-dirty bit).
> PF handling will resovle the fault after un-setting the _PAGE_*_UFFD_WP on
> the page. I'm not changing the soft-dirty bit. It is too delicate (if you
> get the joke).

It's unfortunate that the old soft-dirty solution didn't go through easily.
Soft-dirty still covers something that uffd-wp cannot do right now, e.g. on
tracking mostly any type of pte mappings.  Uffd-wp can so far only track
fully ram backed pages like shmem or hugetlb for files but not any random
page cache.  Hopefully it still works at least for your use case, or it's
time to rethink otherwise.

> 
> > 
> >>  
> >>  	if (mode_wp && mode_dontwake)
> >>  		return -EINVAL;
> >> @@ -2126,6 +2143,7 @@ static int new_userfaultfd(int flags)
> >>  	ctx->flags = flags;
> >>  	ctx->features = 0;
> >>  	ctx->released = false;
> >> +	ctx->async = false;
> >>  	atomic_set(&ctx->mmap_changing, 0);
> >>  	ctx->mm = current->mm;
> >>  	/* prevent the mm struct to be freed */
> >> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> >> index 005e5e306266..b89665653861 100644
> >> --- a/include/uapi/linux/userfaultfd.h
> >> +++ b/include/uapi/linux/userfaultfd.h
> >> @@ -284,6 +284,11 @@ struct uffdio_writeprotect {
> >>   * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
> >>   * any wait thread after the operation succeeds.
> >>   *
> >> + * UFFDIO_WRITEPROTECT_MODE_ASYNC_WP: set the flag to write protect a
> >> + * range, the flag is unset automatically when the page is written.
> >> + * This is used to track which pages have been written to from the
> >> + * time the memory was write protected.
> >> + *
> >>   * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
> >>   * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
> >>   * protection (WP=0) in response to a page fault wakes the faulting
> >> @@ -291,6 +296,7 @@ struct uffdio_writeprotect {
> >>   */
> >>  #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
> >>  #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> >> +#define UFFDIO_WRITEPROTECT_MODE_ASYNC_WP	((__u64)1<<2)
> >>  	__u64 mode;
> >>  };
> >>  
> >> -- 
> >> 2.30.2
> >>
> > 
> 
> I should have added Suggested-by: Peter Xy <peterx@redhat.com> to this
> patch. I'll add in the next revision if you don't object.

I'm fine with it.  If so, please do s/Xy/Xu/.

> 
> I've started working on next revision. I'll reply to other highly valuable
> review emails a bit later.

Thanks,
Peter Xu Jan. 20, 2023, 2:53 p.m. UTC | #4
On Thu, Jan 19, 2023 at 11:35:39AM -0500, Peter Xu wrote:
> On Thu, Jan 19, 2023 at 08:09:52PM +0500, Muhammad Usama Anjum wrote:
> 
> [...]
> 
> > >> @@ -497,80 +498,93 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
> > >>  
> > >>  	/* take the reference before dropping the mmap_lock */
> > >>  	userfaultfd_ctx_get(ctx);
> > >> +	if (ctx->async) {
> > > 
> > > Firstly, please consider not touching the existing code/indent as much as
> > > what this patch did.  Hopefully we can keep the major part of sync uffd be
> > > there with its git log, it also helps reviewing your code.  You can add the
> > > async block before that, handle the fault and return just earlier.
> > This is possible. Will do in next revision.
> > 
> > > 
> > > And, I think this is a bit too late because we're going to return with
> > > VM_FAULT_RETRY here, while maybe we don't need to retry at all here because
> > > we're going to resolve the page fault immediately.
> > > 
> > > I assume you added this because you wanted userfaultfd_ctx_get() to make
> > > sure the uffd context will not go away from under us, but it's not needed
> > > if we're still holding the mmap read lock.  I'd expect for async mode we
> > > don't really need to release it at all.
> > I'll have to check the what should be returned here. We should return
> > something which shows that the fault has been resolved.
> 
> VM_FAULT_NOPAGE may be the best to describe it, but I guess it shouldn't
> have a difference here if to just return zero.  And, I guess you don't even
> need to worry on the retval here because I think you can leverage do_wp_page.
> More below.
> 
> > 
> > > 
> > >> +		// Resolve page fault of this page
> > > 
> > > Please use "/* ... */" as that's the common pattern of commenting in the
> > > Linux kernel, at least what I see in mm/.
> > Will do.
> > 
> > > 
> > >> +		unsigned long addr = (ctx->features & UFFD_FEATURE_EXACT_ADDRESS) ?
> > >> +				      vmf->real_address : vmf->address;
> > >> +		struct vm_area_struct *dst_vma = find_vma(ctx->mm, addr);
> > >> +		size_t s = PAGE_SIZE;
> > > 
> > > This is weird - if we want async uffd-wp, let's consider huge page from the
> > > 1st day.
> > > 
> > >> +
> > >> +		if (dst_vma->vm_flags & VM_HUGEPAGE) {
> > > 
> > > VM_HUGEPAGE is only a hint.  It doesn't mean this page is always a huge
> > > page.  For anon, we can have thp wr-protected as a whole, not happening for
> > > !anon because we'll split already.
> > > 
> > > For anon, if a write happens to a thp being uffd-wp-ed, we'll keep that pmd
> > > wr-protected and report the uffd message.  The pmd split happens when the
> > > user invokes UFFDIO_WRITEPROTECT on the small page.  I think it'll stop
> > > working for async uffd-wp because we're going to resolve the page faults
> > > right away.
> > > 
> > > So for async uffd-wp (note: this will be different from hugetlb), you may
> > > want to consider having a pre-requisite patch to change wp_huge_pmd()
> > > behavior: rather than calling handle_userfault(), IIUC you can also just
> > > fallback to the split path right below (__split_huge_pmd) so the thp will
> > > split now even before the uffd message is generated.
> > I'll make the changes and make this. I wasn't aware that the thp is being
> > broken in the UFFD WP. At this time, I'm not sure if thp will be handled by
> > handle_userfault() in one go. Probably it will as the length is stored in
> > the vmf.
> 
> Yes I think THP can actually be handled in one go with uffd-wp anon (even
> if vmf doesn't store any length because page fault is about address only
> not length, afaict).  E.g. thp firstly get wr-protected in thp size, then
> when unprotect the user app sends UFFDIO_WRITEPROTECT(wp=false) with a
> range covering the whole thp.
> 
> But AFAIU that should be quite rare because most uffd-wp scenarios are
> latency sensitive, resolving page faults in large chunk definitely enlarges
> that.  It could happen though when it's not resolving an immediate page
> fault, so it could happen in the background.
> 
> So after a second thought, a safer approach is we only go to the split path
> if async is enabled, in wp_huge_pmd().  Then it doesn't need to be a
> pre-requisite patch too, it can be part of the major patch to implement the
> uffd-wp async mode.
> 
> > 
> > > 
> > > I think it should be transparent to the user and it'll start working for
> > > you with async uffd-wp here, because it means when reaching
> > > handle_userfault, it should not be possible to have thp at all since they
> > > should have all split up.
> > > 
> > >> +			s = HPAGE_SIZE;
> > >> +			addr &= HPAGE_MASK;
> > >> +		}
> > >>  
> > >> -	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
> > >> -	uwq.wq.private = current;
> > >> -	uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
> > >> -				reason, ctx->features);
> > >> -	uwq.ctx = ctx;
> > >> -	uwq.waken = false;
> > >> -
> > >> -	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
> > >> +		ret = mwriteprotect_range(ctx->mm, addr, s, false, &ctx->mmap_changing);
> > > 
> > > This is an overkill - we're pretty sure it's a single page, no need to call
> > > a range function here.
> > Probably change_pte_range() should be used here to directly remove the WP here?
> 
> Here we can persue the best performance, or we can also persue the easist
> way to implement.  I think the best we can have is we don't release either
> the mmap read lock _and_ the pgtable lock, so we resolve the page fault
> completely here.  But that requires more code changes.
> 
> So far an probably intermediate (and very easy to implement) solution is:
> 
> (1) Remap the pte (vmf->pte) and retake the lock (vmf->ptl).  Note: you
>     need to move the chunk to be before mmap read lock released first,
>     because we'll need that to make sure pgtable lock and the pgtable page
>     being still exist at the first place.
> 
> (2) If *vmf->pte != vmf->orig_pte, it means the pgtable changed, retry
>     (with VM_FAULT_NOPAGE).  We must have orig_pte set btw in this path.
> 
> (2) Remove the uffd-wp bit if it's set (and it must be set, because we
>     checked again on orig_pte with pgtable lock held).
> 
> (3) Invoke do_wp_page() again with the same vmf.
> 
> This will focus the resolution on the single page and resolve CoW in one
> shot if needed.  We may need to redo the map/lock of pte* but I suppose it
> won't hurt a lot if we just modified the fields anyway, so we can leave
> that for later.

I just noticed it's actually quite straigtforward to just not fall into
handle_userfault at all.  It can be as simple as:

---8<---
diff --git a/mm/memory.c b/mm/memory.c
index 4000e9f017e0..09aab434654c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3351,8 +3351,20 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)

        if (likely(!unshare)) {
                if (userfaultfd_pte_wp(vma, *vmf->pte)) {
-                       pte_unmap_unlock(vmf->pte, vmf->ptl);
-                       return handle_userfault(vmf, VM_UFFD_WP);
+                       if (userfaultfd_uffd_wp_async(vma)) {
+                               /*
+                                * Nothing needed (cache flush, TLB
+                                * invalidations, etc.) because we're only
+                                * removing the uffd-wp bit, which is
+                                * completely invisible to the user.
+                                * This falls through to possible CoW.
+                                */
+                               set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
+                                          pte_clear_uffd_wp(*vmf->pte));
+                       } else {
+                               pte_unmap_unlock(vmf->pte, vmf->ptl);
+                               return handle_userfault(vmf, VM_UFFD_WP);
+                       }
                }
---8<---

Similar thing will be needed for hugetlb if that'll be supported.

One thing worth mention is, I think for async wp it doesn't need to be
restricted by UFFD_USER_MODE_ONLY, because comparing to the sync messages
it has no risk of being utilized for malicious purposes.

> 
> [...]
> 
> > > Then when the app wants to wr-protect in async mode, it simply goes ahead
> > > with UFFDIO_WRITEPROTECT(wp=true), it'll happen exactly the same as when it
> > > was sync mode.  It's only the pf handling procedure that's different (along
> > > with how the fault is reported - rather than as a message but it'll be
> > > consolidated into the soft-dirty bit).
> > PF handling will resovle the fault after un-setting the _PAGE_*_UFFD_WP on
> > the page. I'm not changing the soft-dirty bit. It is too delicate (if you
> > get the joke).
> 
> It's unfortunate that the old soft-dirty solution didn't go through easily.
> Soft-dirty still covers something that uffd-wp cannot do right now, e.g. on
> tracking mostly any type of pte mappings.  Uffd-wp can so far only track
> fully ram backed pages like shmem or hugetlb for files but not any random
> page cache.  Hopefully it still works at least for your use case, or it's
> time to rethink otherwise.
> 
> > 
> > > 
> > >>  
> > >>  	if (mode_wp && mode_dontwake)
> > >>  		return -EINVAL;
> > >> @@ -2126,6 +2143,7 @@ static int new_userfaultfd(int flags)
> > >>  	ctx->flags = flags;
> > >>  	ctx->features = 0;
> > >>  	ctx->released = false;
> > >> +	ctx->async = false;
> > >>  	atomic_set(&ctx->mmap_changing, 0);
> > >>  	ctx->mm = current->mm;
> > >>  	/* prevent the mm struct to be freed */
> > >> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> > >> index 005e5e306266..b89665653861 100644
> > >> --- a/include/uapi/linux/userfaultfd.h
> > >> +++ b/include/uapi/linux/userfaultfd.h
> > >> @@ -284,6 +284,11 @@ struct uffdio_writeprotect {
> > >>   * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
> > >>   * any wait thread after the operation succeeds.
> > >>   *
> > >> + * UFFDIO_WRITEPROTECT_MODE_ASYNC_WP: set the flag to write protect a
> > >> + * range, the flag is unset automatically when the page is written.
> > >> + * This is used to track which pages have been written to from the
> > >> + * time the memory was write protected.
> > >> + *
> > >>   * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
> > >>   * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
> > >>   * protection (WP=0) in response to a page fault wakes the faulting
> > >> @@ -291,6 +296,7 @@ struct uffdio_writeprotect {
> > >>   */
> > >>  #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
> > >>  #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
> > >> +#define UFFDIO_WRITEPROTECT_MODE_ASYNC_WP	((__u64)1<<2)
> > >>  	__u64 mode;
> > >>  };
> > >>  
> > >> -- 
> > >> 2.30.2
> > >>
> > > 
> > 
> > I should have added Suggested-by: Peter Xy <peterx@redhat.com> to this
> > patch. I'll add in the next revision if you don't object.
> 
> I'm fine with it.  If so, please do s/Xy/Xu/.
> 
> > 
> > I've started working on next revision. I'll reply to other highly valuable
> > review emails a bit later.
> 
> Thanks,
> 
> -- 
> Peter Xu
Muhammad Usama Anjum Jan. 23, 2023, 10:11 a.m. UTC | #5
On 1/20/23 7:53 PM, Peter Xu wrote:
> On Thu, Jan 19, 2023 at 11:35:39AM -0500, Peter Xu wrote:
>> On Thu, Jan 19, 2023 at 08:09:52PM +0500, Muhammad Usama Anjum wrote:
>>
>> [...]
>>
>>>>> @@ -497,80 +498,93 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>>>>>  
>>>>>  	/* take the reference before dropping the mmap_lock */
>>>>>  	userfaultfd_ctx_get(ctx);
>>>>> +	if (ctx->async) {
>>>>
>>>> Firstly, please consider not touching the existing code/indent as much as
>>>> what this patch did.  Hopefully we can keep the major part of sync uffd be
>>>> there with its git log, it also helps reviewing your code.  You can add the
>>>> async block before that, handle the fault and return just earlier.
>>> This is possible. Will do in next revision.
>>>
>>>>
>>>> And, I think this is a bit too late because we're going to return with
>>>> VM_FAULT_RETRY here, while maybe we don't need to retry at all here because
>>>> we're going to resolve the page fault immediately.
>>>>
>>>> I assume you added this because you wanted userfaultfd_ctx_get() to make
>>>> sure the uffd context will not go away from under us, but it's not needed
>>>> if we're still holding the mmap read lock.  I'd expect for async mode we
>>>> don't really need to release it at all.
>>> I'll have to check the what should be returned here. We should return
>>> something which shows that the fault has been resolved.
>>
>> VM_FAULT_NOPAGE may be the best to describe it, but I guess it shouldn't
>> have a difference here if to just return zero.  And, I guess you don't even
>> need to worry on the retval here because I think you can leverage do_wp_page.
>> More below.
Yeah, this has been changed with your below patch.

>>
>>>
>>>>
>>>>> +		// Resolve page fault of this page
>>>>
>>>> Please use "/* ... */" as that's the common pattern of commenting in the
>>>> Linux kernel, at least what I see in mm/.
>>> Will do.
>>>
>>>>
>>>>> +		unsigned long addr = (ctx->features & UFFD_FEATURE_EXACT_ADDRESS) ?
>>>>> +				      vmf->real_address : vmf->address;
>>>>> +		struct vm_area_struct *dst_vma = find_vma(ctx->mm, addr);
>>>>> +		size_t s = PAGE_SIZE;
>>>>
>>>> This is weird - if we want async uffd-wp, let's consider huge page from the
>>>> 1st day.
>>>>
>>>>> +
>>>>> +		if (dst_vma->vm_flags & VM_HUGEPAGE) {
>>>>
>>>> VM_HUGEPAGE is only a hint.  It doesn't mean this page is always a huge
>>>> page.  For anon, we can have thp wr-protected as a whole, not happening for
>>>> !anon because we'll split already.
>>>>
>>>> For anon, if a write happens to a thp being uffd-wp-ed, we'll keep that pmd
>>>> wr-protected and report the uffd message.  The pmd split happens when the
>>>> user invokes UFFDIO_WRITEPROTECT on the small page.  I think it'll stop
>>>> working for async uffd-wp because we're going to resolve the page faults
>>>> right away.
>>>>
>>>> So for async uffd-wp (note: this will be different from hugetlb), you may
>>>> want to consider having a pre-requisite patch to change wp_huge_pmd()
>>>> behavior: rather than calling handle_userfault(), IIUC you can also just
>>>> fallback to the split path right below (__split_huge_pmd) so the thp will
>>>> split now even before the uffd message is generated.
>>> I'll make the changes and make this. I wasn't aware that the thp is being
>>> broken in the UFFD WP. At this time, I'm not sure if thp will be handled by
>>> handle_userfault() in one go. Probably it will as the length is stored in
>>> the vmf.
>>
>> Yes I think THP can actually be handled in one go with uffd-wp anon (even
>> if vmf doesn't store any length because page fault is about address only
>> not length, afaict).  E.g. thp firstly get wr-protected in thp size, then
>> when unprotect the user app sends UFFDIO_WRITEPROTECT(wp=false) with a
>> range covering the whole thp.
>>
>> But AFAIU that should be quite rare because most uffd-wp scenarios are
>> latency sensitive, resolving page faults in large chunk definitely enlarges
>> that.  It could happen though when it's not resolving an immediate page
>> fault, so it could happen in the background.
>>
>> So after a second thought, a safer approach is we only go to the split path
>> if async is enabled, in wp_huge_pmd().  Then it doesn't need to be a
>> pre-requisite patch too, it can be part of the major patch to implement the
>> uffd-wp async mode.
>>
>>>
>>>>
>>>> I think it should be transparent to the user and it'll start working for
>>>> you with async uffd-wp here, because it means when reaching
>>>> handle_userfault, it should not be possible to have thp at all since they
>>>> should have all split up.
>>>>
>>>>> +			s = HPAGE_SIZE;
>>>>> +			addr &= HPAGE_MASK;
>>>>> +		}
>>>>>  
>>>>> -	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
>>>>> -	uwq.wq.private = current;
>>>>> -	uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
>>>>> -				reason, ctx->features);
>>>>> -	uwq.ctx = ctx;
>>>>> -	uwq.waken = false;
>>>>> -
>>>>> -	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
>>>>> +		ret = mwriteprotect_range(ctx->mm, addr, s, false, &ctx->mmap_changing);
>>>>
>>>> This is an overkill - we're pretty sure it's a single page, no need to call
>>>> a range function here.
>>> Probably change_pte_range() should be used here to directly remove the WP here?
>>
>> Here we can persue the best performance, or we can also persue the easist
>> way to implement.  I think the best we can have is we don't release either
>> the mmap read lock _and_ the pgtable lock, so we resolve the page fault
>> completely here.  But that requires more code changes.
>>
>> So far an probably intermediate (and very easy to implement) solution is:
>>
>> (1) Remap the pte (vmf->pte) and retake the lock (vmf->ptl).  Note: you
>>     need to move the chunk to be before mmap read lock released first,
>>     because we'll need that to make sure pgtable lock and the pgtable page
>>     being still exist at the first place.
>>
>> (2) If *vmf->pte != vmf->orig_pte, it means the pgtable changed, retry
>>     (with VM_FAULT_NOPAGE).  We must have orig_pte set btw in this path.
>>
>> (2) Remove the uffd-wp bit if it's set (and it must be set, because we
>>     checked again on orig_pte with pgtable lock held).
>>
>> (3) Invoke do_wp_page() again with the same vmf.
>>
>> This will focus the resolution on the single page and resolve CoW in one
>> shot if needed.  We may need to redo the map/lock of pte* but I suppose it
>> won't hurt a lot if we just modified the fields anyway, so we can leave
>> that for later.
> 
> I just noticed it's actually quite straigtforward to just not fall into
> handle_userfault at all.  It can be as simple as:
> 
> ---8<---
> diff --git a/mm/memory.c b/mm/memory.c
> index 4000e9f017e0..09aab434654c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3351,8 +3351,20 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> 
>         if (likely(!unshare)) {
>                 if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> -                       pte_unmap_unlock(vmf->pte, vmf->ptl);
> -                       return handle_userfault(vmf, VM_UFFD_WP);
> +                       if (userfaultfd_uffd_wp_async(vma)) {
> +                               /*
> +                                * Nothing needed (cache flush, TLB
> +                                * invalidations, etc.) because we're only
> +                                * removing the uffd-wp bit, which is
> +                                * completely invisible to the user.
> +                                * This falls through to possible CoW.
> +                                */
> +                               set_pte_at(vma->vm_mm, vmf->address, vmf->pte,
> +                                          pte_clear_uffd_wp(*vmf->pte));
> +                       } else {
> +                               pte_unmap_unlock(vmf->pte, vmf->ptl);
> +                               return handle_userfault(vmf, VM_UFFD_WP);
> +                       }
>                 }
> ---8<---
> 
> Similar thing will be needed for hugetlb if that'll be supported.
At the moment, hugetlb support is required. Thank you so much for sending
this up. This has made things simple. I've replicated this same patch in
wp_huge_pmd() for huge pages.

> 
> One thing worth mention is, I think for async wp it doesn't need to be
> restricted by UFFD_USER_MODE_ONLY, because comparing to the sync messages
> it has no risk of being utilized for malicious purposes.
I think with updated handling path updated in do_wp_page() and
wp_huge_pmd() in version, UFFD_USER_MODE_ONLY will not affect us.

> 
>>
>> [...]
>>
>>>> Then when the app wants to wr-protect in async mode, it simply goes ahead
>>>> with UFFDIO_WRITEPROTECT(wp=true), it'll happen exactly the same as when it
>>>> was sync mode.  It's only the pf handling procedure that's different (along
>>>> with how the fault is reported - rather than as a message but it'll be
>>>> consolidated into the soft-dirty bit).
>>> PF handling will resovle the fault after un-setting the _PAGE_*_UFFD_WP on
>>> the page. I'm not changing the soft-dirty bit. It is too delicate (if you
>>> get the joke).
>>
>> It's unfortunate that the old soft-dirty solution didn't go through easily.
>> Soft-dirty still covers something that uffd-wp cannot do right now, e.g. on
>> tracking mostly any type of pte mappings.  Uffd-wp can so far only track
>> fully ram backed pages like shmem or hugetlb for files but not any random
>> page cache.  Hopefully it still works at least for your use case, or it's
>> time to rethink otherwise.
>>
>>>
>>>>
>>>>>  
>>>>>  	if (mode_wp && mode_dontwake)
>>>>>  		return -EINVAL;
>>>>> @@ -2126,6 +2143,7 @@ static int new_userfaultfd(int flags)
>>>>>  	ctx->flags = flags;
>>>>>  	ctx->features = 0;
>>>>>  	ctx->released = false;
>>>>> +	ctx->async = false;
>>>>>  	atomic_set(&ctx->mmap_changing, 0);
>>>>>  	ctx->mm = current->mm;
>>>>>  	/* prevent the mm struct to be freed */
>>>>> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
>>>>> index 005e5e306266..b89665653861 100644
>>>>> --- a/include/uapi/linux/userfaultfd.h
>>>>> +++ b/include/uapi/linux/userfaultfd.h
>>>>> @@ -284,6 +284,11 @@ struct uffdio_writeprotect {
>>>>>   * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
>>>>>   * any wait thread after the operation succeeds.
>>>>>   *
>>>>> + * UFFDIO_WRITEPROTECT_MODE_ASYNC_WP: set the flag to write protect a
>>>>> + * range, the flag is unset automatically when the page is written.
>>>>> + * This is used to track which pages have been written to from the
>>>>> + * time the memory was write protected.
>>>>> + *
>>>>>   * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
>>>>>   * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
>>>>>   * protection (WP=0) in response to a page fault wakes the faulting
>>>>> @@ -291,6 +296,7 @@ struct uffdio_writeprotect {
>>>>>   */
>>>>>  #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
>>>>>  #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
>>>>> +#define UFFDIO_WRITEPROTECT_MODE_ASYNC_WP	((__u64)1<<2)
>>>>>  	__u64 mode;
>>>>>  };
>>>>>  
>>>>> -- 
>>>>> 2.30.2
>>>>>
>>>>
>>>
>>> I should have added Suggested-by: Peter Xy <peterx@redhat.com> to this
>>> patch. I'll add in the next revision if you don't object.
>>
>> I'm fine with it.  If so, please do s/Xy/Xu/.
Sorry,

Suggested-by: Peter Xu <peterx@redhat.com>

>>
>>>
>>> I've started working on next revision. I'll reply to other highly valuable
>>> review emails a bit later.
>>
>> Thanks,
>>
>> -- 
>> Peter Xu
>
Peter Xu Jan. 24, 2023, 5:26 p.m. UTC | #6
On Mon, Jan 23, 2023 at 03:11:20PM +0500, Muhammad Usama Anjum wrote:
> > One thing worth mention is, I think for async wp it doesn't need to be
> > restricted by UFFD_USER_MODE_ONLY, because comparing to the sync messages
> > it has no risk of being utilized for malicious purposes.
> I think with updated handling path updated in do_wp_page() and
> wp_huge_pmd() in version, UFFD_USER_MODE_ONLY will not affect us.

This is more or less a comment for the design, the new code should work (by
bypassing handle_userfaultfd(), where this bit was checked).

We'll need an man page update if this feature will be merged [1], and if so
it'll need to be updated for the UFFD_USER_MODE_ONLY section regarding to
async uffd-wp support too.  I think that can also be worked out after the
series being accepted first, so just a heads up.

[1] https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
Muhammad Usama Anjum Jan. 25, 2023, 12:18 p.m. UTC | #7
On 1/24/23 10:26 PM, Peter Xu wrote:
> On Mon, Jan 23, 2023 at 03:11:20PM +0500, Muhammad Usama Anjum wrote:
>>> One thing worth mention is, I think for async wp it doesn't need to be
>>> restricted by UFFD_USER_MODE_ONLY, because comparing to the sync messages
>>> it has no risk of being utilized for malicious purposes.
>> I think with updated handling path updated in do_wp_page() and
>> wp_huge_pmd() in version, UFFD_USER_MODE_ONLY will not affect us.
> 
> This is more or less a comment for the design, the new code should work (by
> bypassing handle_userfaultfd(), where this bit was checked).
> 
> We'll need an man page update if this feature will be merged [1], and if so
> it'll need to be updated for the UFFD_USER_MODE_ONLY section regarding to
> async uffd-wp support too.  I think that can also be worked out after the
> series being accepted first, so just a heads up.
> 
> [1] https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git
Thank you for explaining it. Definitely, we'll update the man pages after
the change has been merged. I've added it to my notes.>
diff mbox series

Patch

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 15a5bf765d43..be5e10d15058 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -69,6 +69,7 @@  struct userfaultfd_ctx {
 	unsigned int features;
 	/* released */
 	bool released;
+	bool async;
 	/* memory mappings are changing because of non-cooperative event */
 	atomic_t mmap_changing;
 	/* mm with one ore more vmas attached to this userfaultfd_ctx */
@@ -497,80 +498,93 @@  vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 
 	/* take the reference before dropping the mmap_lock */
 	userfaultfd_ctx_get(ctx);
+	if (ctx->async) {
+		// Resolve page fault of this page
+		unsigned long addr = (ctx->features & UFFD_FEATURE_EXACT_ADDRESS) ?
+				      vmf->real_address : vmf->address;
+		struct vm_area_struct *dst_vma = find_vma(ctx->mm, addr);
+		size_t s = PAGE_SIZE;
+
+		if (dst_vma->vm_flags & VM_HUGEPAGE) {
+			s = HPAGE_SIZE;
+			addr &= HPAGE_MASK;
+		}
 
-	init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
-	uwq.wq.private = current;
-	uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
-				reason, ctx->features);
-	uwq.ctx = ctx;
-	uwq.waken = false;
-
-	blocking_state = userfaultfd_get_blocking_state(vmf->flags);
+		ret = mwriteprotect_range(ctx->mm, addr, s, false, &ctx->mmap_changing);
+	} else {
+		init_waitqueue_func_entry(&uwq.wq, userfaultfd_wake_function);
+		uwq.wq.private = current;
+		uwq.msg = userfault_msg(vmf->address, vmf->real_address, vmf->flags,
+					reason, ctx->features);
+		uwq.ctx = ctx;
+		uwq.waken = false;
 
-        /*
-         * Take the vma lock now, in order to safely call
-         * userfaultfd_huge_must_wait() later. Since acquiring the
-         * (sleepable) vma lock can modify the current task state, that
-         * must be before explicitly calling set_current_state().
-         */
-	if (is_vm_hugetlb_page(vma))
-		hugetlb_vma_lock_read(vma);
+		blocking_state = userfaultfd_get_blocking_state(vmf->flags);
 
-	spin_lock_irq(&ctx->fault_pending_wqh.lock);
-	/*
-	 * After the __add_wait_queue the uwq is visible to userland
-	 * through poll/read().
-	 */
-	__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
-	/*
-	 * The smp_mb() after __set_current_state prevents the reads
-	 * following the spin_unlock to happen before the list_add in
-	 * __add_wait_queue.
-	 */
-	set_current_state(blocking_state);
-	spin_unlock_irq(&ctx->fault_pending_wqh.lock);
+		/*
+		 * Take the vma lock now, in order to safely call
+		 * userfaultfd_huge_must_wait() later. Since acquiring the
+		 * (sleepable) vma lock can modify the current task state, that
+		 * must be before explicitly calling set_current_state().
+		 */
+		if (is_vm_hugetlb_page(vma))
+			hugetlb_vma_lock_read(vma);
 
-	if (!is_vm_hugetlb_page(vma))
-		must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
-						  reason);
-	else
-		must_wait = userfaultfd_huge_must_wait(ctx, vma,
-						       vmf->address,
-						       vmf->flags, reason);
-	if (is_vm_hugetlb_page(vma))
-		hugetlb_vma_unlock_read(vma);
-	mmap_read_unlock(mm);
+		spin_lock_irq(&ctx->fault_pending_wqh.lock);
+		/*
+		 * After the __add_wait_queue the uwq is visible to userland
+		 * through poll/read().
+		 */
+		__add_wait_queue(&ctx->fault_pending_wqh, &uwq.wq);
+		/*
+		 * The smp_mb() after __set_current_state prevents the reads
+		 * following the spin_unlock to happen before the list_add in
+		 * __add_wait_queue.
+		 */
+		set_current_state(blocking_state);
+		spin_unlock_irq(&ctx->fault_pending_wqh.lock);
 
-	if (likely(must_wait && !READ_ONCE(ctx->released))) {
-		wake_up_poll(&ctx->fd_wqh, EPOLLIN);
-		schedule();
-	}
+		if (!is_vm_hugetlb_page(vma))
+			must_wait = userfaultfd_must_wait(ctx, vmf->address, vmf->flags,
+							  reason);
+		else
+			must_wait = userfaultfd_huge_must_wait(ctx, vma,
+							       vmf->address,
+							       vmf->flags, reason);
+		if (is_vm_hugetlb_page(vma))
+			hugetlb_vma_unlock_read(vma);
+		mmap_read_unlock(mm);
+
+		if (likely(must_wait && !READ_ONCE(ctx->released))) {
+			wake_up_poll(&ctx->fd_wqh, EPOLLIN);
+			schedule();
+		}
 
-	__set_current_state(TASK_RUNNING);
+		__set_current_state(TASK_RUNNING);
 
-	/*
-	 * Here we race with the list_del; list_add in
-	 * userfaultfd_ctx_read(), however because we don't ever run
-	 * list_del_init() to refile across the two lists, the prev
-	 * and next pointers will never point to self. list_add also
-	 * would never let any of the two pointers to point to
-	 * self. So list_empty_careful won't risk to see both pointers
-	 * pointing to self at any time during the list refile. The
-	 * only case where list_del_init() is called is the full
-	 * removal in the wake function and there we don't re-list_add
-	 * and it's fine not to block on the spinlock. The uwq on this
-	 * kernel stack can be released after the list_del_init.
-	 */
-	if (!list_empty_careful(&uwq.wq.entry)) {
-		spin_lock_irq(&ctx->fault_pending_wqh.lock);
 		/*
-		 * No need of list_del_init(), the uwq on the stack
-		 * will be freed shortly anyway.
+		 * Here we race with the list_del; list_add in
+		 * userfaultfd_ctx_read(), however because we don't ever run
+		 * list_del_init() to refile across the two lists, the prev
+		 * and next pointers will never point to self. list_add also
+		 * would never let any of the two pointers to point to
+		 * self. So list_empty_careful won't risk to see both pointers
+		 * pointing to self at any time during the list refile. The
+		 * only case where list_del_init() is called is the full
+		 * removal in the wake function and there we don't re-list_add
+		 * and it's fine not to block on the spinlock. The uwq on this
+		 * kernel stack can be released after the list_del_init.
 		 */
-		list_del(&uwq.wq.entry);
-		spin_unlock_irq(&ctx->fault_pending_wqh.lock);
+		if (!list_empty_careful(&uwq.wq.entry)) {
+			spin_lock_irq(&ctx->fault_pending_wqh.lock);
+			/*
+			 * No need of list_del_init(), the uwq on the stack
+			 * will be freed shortly anyway.
+			 */
+			list_del(&uwq.wq.entry);
+			spin_unlock_irq(&ctx->fault_pending_wqh.lock);
+		}
 	}
-
 	/*
 	 * ctx may go away after this if the userfault pseudo fd is
 	 * already released.
@@ -1861,11 +1875,14 @@  static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 		return ret;
 
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
-			       UFFDIO_WRITEPROTECT_MODE_WP))
+			       UFFDIO_WRITEPROTECT_MODE_WP |
+			       UFFDIO_WRITEPROTECT_MODE_ASYNC_WP))
 		return -EINVAL;
 
-	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
+	mode_wp = uffdio_wp.mode & (UFFDIO_WRITEPROTECT_MODE_WP |
+				    UFFDIO_WRITEPROTECT_MODE_ASYNC_WP);
 	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
+	ctx->async = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ASYNC_WP;
 
 	if (mode_wp && mode_dontwake)
 		return -EINVAL;
@@ -2126,6 +2143,7 @@  static int new_userfaultfd(int flags)
 	ctx->flags = flags;
 	ctx->features = 0;
 	ctx->released = false;
+	ctx->async = false;
 	atomic_set(&ctx->mmap_changing, 0);
 	ctx->mm = current->mm;
 	/* prevent the mm struct to be freed */
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 005e5e306266..b89665653861 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -284,6 +284,11 @@  struct uffdio_writeprotect {
  * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
  * any wait thread after the operation succeeds.
  *
+ * UFFDIO_WRITEPROTECT_MODE_ASYNC_WP: set the flag to write protect a
+ * range, the flag is unset automatically when the page is written.
+ * This is used to track which pages have been written to from the
+ * time the memory was write protected.
+ *
  * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
  * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
  * protection (WP=0) in response to a page fault wakes the faulting
@@ -291,6 +296,7 @@  struct uffdio_writeprotect {
  */
 #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
 #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
+#define UFFDIO_WRITEPROTECT_MODE_ASYNC_WP	((__u64)1<<2)
 	__u64 mode;
 };