diff mbox series

[RFC,v2,3/5] userfaultfd: introduce write-likely mode for copy/wp operations

Message ID 20220619233449.181323-4-namit@vmware.com (mailing list archive)
State New
Headers show
Series userfaultfd: support access/write hints | expand

Commit Message

Nadav Amit June 19, 2022, 11:34 p.m. UTC
From: Nadav Amit <namit@vmware.com>

Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
However, setting read-only PTEs as dirty can have several undesired
implications.

First, setting read-only PTEs as dirty, can cause these PTEs to become
writable during mprotect() syscall. See in change_pte_range():

	/* Avoid taking write faults for known dirty pages */
	if (dirty_accountable && pte_dirty(ptent) &&
			(pte_soft_dirty(ptent) ||
			 !(vma->vm_flags & VM_SOFTDIRTY))) {
		ptent = pte_mkwrite(ptent);
	}

Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
See try_to_unmap_one():

	/*
	 * Page is dirty. Flush the TLB if a writable entry
	 * potentially exists to avoid CPU writes after IO
	 * starts and then write it out here.
	 */
	try_to_unmap_flush_dirty();

Similarly batching TLB flushed might be prevented in zap_pte_range():

	if (!PageAnon(page)) {
		if (pte_dirty(ptent)) {
			force_flush = 1;
			set_page_dirty(page);
		}
	...

In general, setting a PTE as dirty seems for read-only entries might be
dangerous. It should be reminded the dirty-COW vulnerability mitigation
also relies on the dirty bit being set only after COW (although it does
not appear to apply to userfaultfd).

To summarize, setting the dirty bit for read-only PTEs is dangerous. But
even if we only consider writable pages, always setting the dirty bit or
always leaving it clear, does not seem as the best policy. Leaving the
bit clear introduces overhead on the first write-access to set the bit.
Setting the bit for pages the are eventually not written to can require
more TLB flushes.

Let the userfaultfd users control whether PTEs are marked as dirty or
clean. Introduce UFFDIO_COPY_MODE_WRITE and
UFFDIO_COPY_MODE_WRITE_LIKELY and UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY
to enable userspace to indicate whether pages are likely to be written
and set the dirty-bit if they are likely to be written.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 fs/userfaultfd.c                 | 22 ++++++++++++++--------
 include/linux/userfaultfd_k.h    |  1 +
 include/uapi/linux/userfaultfd.h | 27 +++++++++++++++++++--------
 mm/hugetlb.c                     |  3 +++
 mm/shmem.c                       |  3 +++
 mm/userfaultfd.c                 | 11 +++++++++--
 6 files changed, 49 insertions(+), 18 deletions(-)

Comments

Peter Xu June 21, 2022, 4:38 p.m. UTC | #1
Hi, Nadav,

On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
> From: Nadav Amit <namit@vmware.com>
> 
> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
> However, setting read-only PTEs as dirty can have several undesired
> implications.
> 
> First, setting read-only PTEs as dirty, can cause these PTEs to become
> writable during mprotect() syscall. See in change_pte_range():
> 
> 	/* Avoid taking write faults for known dirty pages */
> 	if (dirty_accountable && pte_dirty(ptent) &&
> 			(pte_soft_dirty(ptent) ||
> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
> 		ptent = pte_mkwrite(ptent);
> 	}

IMHO this is not really the direct reason to add the feature to "allow user
to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
IIUC what's missing is the pte_uffd_wp() check that I should have added in
the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..

But since this is fixed by David's patch to optimize mprotect() altogether
which checks pte_uffd_wp() (and afaict that's only needed after the
shmem+hugetlb patchset, not before), so I think we're safe now with all the
branches.

So IMHO we don't need to mention this as it's kind of misleading.  It'll be
welcomed if you want to recover the pte_dirty behavior in
mfill_atomic_install_pte() but probably this is not the right patch for it?

> 
> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
> See try_to_unmap_one():
> 
> 	/*
> 	 * Page is dirty. Flush the TLB if a writable entry
> 	 * potentially exists to avoid CPU writes after IO
> 	 * starts and then write it out here.
> 	 */
> 	try_to_unmap_flush_dirty();
> 
> Similarly batching TLB flushed might be prevented in zap_pte_range():
> 
> 	if (!PageAnon(page)) {
> 		if (pte_dirty(ptent)) {
> 			force_flush = 1;
> 			set_page_dirty(page);
> 		}
> 	...

I still keep the pure question here (which I asked in the private reply to
you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
here.  I'll rewrite what I have in the private email to you here again that
I think should be the write thing to do here..

        if (!PageAnon(page)) {
                if (pte_dirty(ptent)) {
                        set_page_dirty(page);
                        if (pte_write(ptent))
                                force_flush = 1;
                }
        }

I also mentioned the other example of doing mprotect(PROT_READ) upon a
dirty pte can also create a pte with dirty=1 and write=0 which should be
the same condition we have here with uffd, afaict. So it's at least not a
solo problem for uffd, and I still think if we treat this tlb flush issue
as a perf bug we should consider fixing up the tlb flush code instead
because otherwise the "mprotect(PROT_READ) upon dirty pte" will also be
able to hit it.

Meanwhile, I'll have the same comment that this is not helping any reviewer
to understand why we need to grant user ability to conditionally set dirty
bit from uABI, so I think it's better to drop this paragraph too for this
patch alone.

> 
> In general, setting a PTE as dirty seems for read-only entries might be
> dangerous. It should be reminded the dirty-COW vulnerability mitigation
> also relies on the dirty bit being set only after COW (although it does
> not appear to apply to userfaultfd).
> 
> To summarize, setting the dirty bit for read-only PTEs is dangerous. But
> even if we only consider writable pages, always setting the dirty bit or
> always leaving it clear, does not seem as the best policy. Leaving the
> bit clear introduces overhead on the first write-access to set the bit.
> Setting the bit for pages the are eventually not written to can require
> more TLB flushes.

And IMHO only this paragraph is the real and proper reasoning for this
patch..

> 
> Let the userfaultfd users control whether PTEs are marked as dirty or
> clean. Introduce UFFDIO_COPY_MODE_WRITE and
> UFFDIO_COPY_MODE_WRITE_LIKELY and UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY
> to enable userspace to indicate whether pages are likely to be written
> and set the dirty-bit if they are likely to be written.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Mike Rapoport <rppt@linux.ibm.com>
> Signed-off-by: Nadav Amit <namit@vmware.com>
> ---
>  fs/userfaultfd.c                 | 22 ++++++++++++++--------
>  include/linux/userfaultfd_k.h    |  1 +
>  include/uapi/linux/userfaultfd.h | 27 +++++++++++++++++++--------
>  mm/hugetlb.c                     |  3 +++
>  mm/shmem.c                       |  3 +++
>  mm/userfaultfd.c                 | 11 +++++++++--
>  6 files changed, 49 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 35a8c4347c54..a56983b594d5 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -1700,7 +1700,7 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	struct uffdio_copy uffdio_copy;
>  	struct uffdio_copy __user *user_uffdio_copy;
>  	struct userfaultfd_wake_range range;
> -	bool mode_wp, mode_access_likely;
> +	bool mode_wp, mode_access_likely, mode_write_likely;
>  	uffd_flags_t uffd_flags;
>  
>  	user_uffdio_copy = (struct uffdio_copy __user *) arg;
> @@ -1727,14 +1727,17 @@ static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
>  	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
>  		goto out;
>  	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
> -				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
> +				 UFFDIO_COPY_MODE_ACCESS_LIKELY|
> +				 UFFDIO_COPY_MODE_WRITE_LIKELY))
>  		goto out;
>  
>  	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
>  	mode_access_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY;
> +	mode_write_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_WRITE_LIKELY;
>  
>  	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
> -		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
> +		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
> +		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
>  
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
> @@ -1819,7 +1822,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  	struct uffdio_writeprotect uffdio_wp;
>  	struct uffdio_writeprotect __user *user_uffdio_wp;
>  	struct userfaultfd_wake_range range;
> -	bool mode_wp, mode_dontwake, mode_access_likely;
> +	bool mode_wp, mode_dontwake, mode_access_likely, mode_write_likely;
>  	uffd_flags_t uffd_flags;
>  
>  	if (atomic_read(&ctx->mmap_changing))
> @@ -1838,18 +1841,21 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
>  
>  	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
>  			       UFFDIO_WRITEPROTECT_MODE_WP |
> -			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
> +			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY |
> +			       UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY))
>  		return -EINVAL;
>  
>  	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
>  	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
>  	mode_access_likely = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY;
> +	mode_write_likely = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY;
>  
>  	if (mode_wp && mode_dontwake)
>  		return -EINVAL;
>  
>  	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
> -		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
> +		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
> +		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
>  
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>  	    uffdio_continue.range.start) {
>  		goto out;
>  	}
> -	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
> +	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
>  		goto out;
>  
> -	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
> +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;

Setting dirty by default for CONTINUE may make some sense (unlike young),
at least to keep the old behavior of the code where the pte was
unconditionally set.

But there's another thought a real CONTINUE user modifies the page cache
elsewhere so logically the PageDirty should be set elsewhere already
anyways, in most cases when the userapp is updating the page cache with
another mapping before installing pgtables for this mapping.  Then this
dirty is not required, it seems.

If we're going to export this to uABI, I'm wondering maybe it'll be nicer
to not apply either young or dirty bit for CONTINUE, because fundamentally
losing dirty bit doesn't sound risky, meanwhile the user app should have
the best knowledge of what to do (whether page was requested or it's just
during a pre-faulting in the background).

Axel may have some better thoughts.

>  
>  	if (mmget_not_zero(ctx->mm)) {
>  		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index e6ac165ec044..261a3fa750d0 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -59,6 +59,7 @@ typedef unsigned int __bitwise uffd_flags_t;
>  
>  #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
>  #define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
> +#define UFFD_FLAGS_WRITE_LIKELY		((__force uffd_flags_t)BIT(2))
>  
>  extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  				    struct vm_area_struct *dst_vma,
> diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
> index d9c8ce9ba777..6ad93a13282e 100644
> --- a/include/uapi/linux/userfaultfd.h
> +++ b/include/uapi/linux/userfaultfd.h
> @@ -267,12 +267,20 @@ struct uffdio_copy {
>  	 */
>  	__s64 copy;
>  	/*
> -	 * UFFDIO_COPY_MODE_ACCESS_LIKELY will set the mapped page as young.
> -	 * This can reduce the time that the first access to the page takes.
> -	 * Yet, if set opportunistically to memory that is not used, it might
> -	 * extend the time before the unused memory pages are reclaimed.
> +	 * UFFDIO_COPY_MODE_ACCESS_LIKELY indicates that the memory is likely to
> +	 * be accessed in the near future, in contrast to memory that is
> +	 * opportunistically copied and might not be accessed. The kernel will
> +	 * act accordingly, for instance by setting the access-bit in the PTE to
> +	 * reduce the access time to the page.
> +	 *
> +	 * UFFDIO_COPY_MODE_WRITE_LIKELY indicates that the memory is likely to
> +	 * be written to. The kernel will act accordingly, for instance by
> +	 * setting the dirty-bit in the PTE to reduce the write time to the
> +	 * page. This flag will be silently ignored if UFFDIO_COPY_MODE_WP is
> +	 * set.
>  	 */
> -#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<3)
> +#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<2)
> +#define UFFDIO_COPY_MODE_WRITE_LIKELY		((__u64)1<<3)
>  };
>  
>  struct uffdio_zeropage {
> @@ -297,9 +305,11 @@ struct uffdio_writeprotect {
>   * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
>   * any wait thread after the operation succeeds.
>   *
> - * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to mark the modified
> - * memory as young, which can reduce the time that the first access
> - * to the page takes.
> + * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to indicate the memory
> + * is likely to be accessed in the near future.
> + *
> + * UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY: set the flag to indicate that the
> + * memory is likely to be written to in the near future.
>   *
>   * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
>   * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
> @@ -309,6 +319,7 @@ struct uffdio_writeprotect {
>  #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
>  #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
>  #define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
> +#define UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY	((__u64)1<<3)
>  	__u64 mode;
>  };
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2beff8a4bf7c..46814fc7762f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5962,6 +5962,9 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		*pagep = NULL;
>  	}
>  
> +	/* The PTE is not marked as dirty unconditionally */
> +	SetPageDirty(page);
> +
>  	/*
>  	 * The memory barrier inside __SetPageUptodate makes sure that
>  	 * preceding stores to the page contents become visible before
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 89c775275bae..7488cd186c32 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2404,6 +2404,9 @@ int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
>  	VM_BUG_ON(PageSwapBacked(page));
>  	__SetPageLocked(page);
>  	__SetPageSwapBacked(page);
> +
> +	/* The PTE is not marked as dirty unconditionally */
> +	SetPageDirty(page);
>  	__SetPageUptodate(page);
>  
>  	ret = -EFAULT;
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 140c8d3e946e..3172158d8faa 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -70,7 +70,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  	pgoff_t offset, max_off;
>  
>  	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> -	_dst_pte = pte_mkdirty(_dst_pte);
>  	if (page_in_cache && !vm_shared)
>  		writable = false;
>  
> @@ -85,13 +84,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  
>  	if (writable)
>  		_dst_pte = pte_mkwrite(_dst_pte);
> -	else
> +	else {
>  		/*
>  		 * We need this to make sure write bit removed; as mk_pte()
>  		 * could return a pte with write bit set.
>  		 */
>  		_dst_pte = pte_wrprotect(_dst_pte);
>  
> +		/* Marking RO entries as dirty can mess with other code */
> +		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
> +			_dst_pte = pte_mkdirty(_dst_pte);

Hmm.. what about "writable=true" ones?

> +	}
> +
>  	if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY)
>  		_dst_pte = pte_mkyoung(_dst_pte);
>  
> @@ -180,6 +184,9 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		*pagep = NULL;
>  	}
>  
> +	/* The PTE is not marked as dirty unconditionally */
> +	SetPageDirty(page);
> +
>  	/*
>  	 * The memory barrier inside __SetPageUptodate makes sure that
>  	 * preceding stores to the page contents become visible before
> -- 
> 2.25.1
> 
>
Nadav Amit June 21, 2022, 5:14 p.m. UTC | #2
On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:

> Hi, Nadav,
> 
> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
>> From: Nadav Amit <namit@vmware.com>
>> 
>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
>> However, setting read-only PTEs as dirty can have several undesired
>> implications.
>> 
>> First, setting read-only PTEs as dirty, can cause these PTEs to become
>> writable during mprotect() syscall. See in change_pte_range():
>> 
>> 	/* Avoid taking write faults for known dirty pages */
>> 	if (dirty_accountable && pte_dirty(ptent) &&
>> 			(pte_soft_dirty(ptent) ||
>> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
>> 		ptent = pte_mkwrite(ptent);
>> 	}
> 
> IMHO this is not really the direct reason to add the feature to "allow user
> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
> IIUC what's missing is the pte_uffd_wp() check that I should have added in
> the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
> 
> But since this is fixed by David's patch to optimize mprotect() altogether
> which checks pte_uffd_wp() (and afaict that's only needed after the
> shmem+hugetlb patchset, not before), so I think we're safe now with all the
> branches.
> 
> So IMHO we don't need to mention this as it's kind of misleading. It'll be
> welcomed if you want to recover the pte_dirty behavior in
> mfill_atomic_install_pte() but probably this is not the right patch for it?

Sorry, I will remove it from the commit log.

I am not sure whether there should be an additional patch to revert (or
partially revert) 9ae0f87d009ca and to be backported. What do you say?

>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
>> See try_to_unmap_one():
>> 
>> 	/*
>> 	 * Page is dirty. Flush the TLB if a writable entry
>> 	 * potentially exists to avoid CPU writes after IO
>> 	 * starts and then write it out here.
>> 	 */
>> 	try_to_unmap_flush_dirty();
>> 
>> Similarly batching TLB flushed might be prevented in zap_pte_range():
>> 
>> 	if (!PageAnon(page)) {
>> 		if (pte_dirty(ptent)) {
>> 			force_flush = 1;
>> 			set_page_dirty(page);
>> 		}
>> 	...
> 
> I still keep the pure question here (which I asked in the private reply to
> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
> here. I'll rewrite what I have in the private email to you here again that
> I think should be the write thing to do here..
> 
> if (!PageAnon(page)) {
> if (pte_dirty(ptent)) {
> set_page_dirty(page);
> if (pte_write(ptent))
> force_flush = 1;
> }
> }

The “Second” part regards a perf issue, not a correctness issue. As I told
you offline, I think if makes sense to look both on pte_dirty() and
pte_write(), but I am afraid of other architectures than x86, which I know.
After our discussion, I looked at other architectures, and it does look
safe. So I will add an additional patch for that (with your suggested-by).

> I also mentioned the other example of doing mprotect(PROT_READ) upon a
> dirty pte can also create a pte with dirty=1 and write=0 which should be
> the same condition we have here with uffd, afaict. So it's at least not a
> solo problem for uffd, and I still think if we treat this tlb flush issue
> as a perf bug we should consider fixing up the tlb flush code instead
> because otherwise the "mprotect(PROT_READ) upon dirty pte" will also be
> able to hit it.
> 
> Meanwhile, I'll have the same comment that this is not helping any reviewer
> to understand why we need to grant user ability to conditionally set dirty
> bit from uABI, so I think it's better to drop this paragraph too for this
> patch alone.

Ok. Point taken.

> 
>> In general, setting a PTE as dirty seems for read-only entries might be
>> dangerous. It should be reminded the dirty-COW vulnerability mitigation
>> also relies on the dirty bit being set only after COW (although it does
>> not appear to apply to userfaultfd).
>> 
>> To summarize, setting the dirty bit for read-only PTEs is dangerous. But
>> even if we only consider writable pages, always setting the dirty bit or
>> always leaving it clear, does not seem as the best policy. Leaving the
>> bit clear introduces overhead on the first write-access to set the bit.
>> Setting the bit for pages the are eventually not written to can require
>> more TLB flushes.
> 
> And IMHO only this paragraph is the real and proper reasoning for this
> patch..

Will do.

> 
>> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
>> 	 uffdio_continue.range.start) {
>> 		goto out;
>> 	}
>> -	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
>> +	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
>> 		goto out;
>> 
>> -	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
>> +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
> 
> Setting dirty by default for CONTINUE may make some sense (unlike young),
> at least to keep the old behavior of the code where the pte was
> unconditionally set.
> 
> But there's another thought a real CONTINUE user modifies the page cache
> elsewhere so logically the PageDirty should be set elsewhere already
> anyways, in most cases when the userapp is updating the page cache with
> another mapping before installing pgtables for this mapping. Then this
> dirty is not required, it seems.
> 
> If we're going to export this to uABI, I'm wondering maybe it'll be nicer
> to not apply either young or dirty bit for CONTINUE, because fundamentally
> losing dirty bit doesn't sound risky, meanwhile the user app should have
> the best knowledge of what to do (whether page was requested or it's just
> during a pre-faulting in the background).
> 
> Axel may have some better thoughts.

I think that perhaps I did not communicate well enough the reason it makes
sense to set the dirty-bit.

Setting the access-bit and dirty-bit takes quite some time. So regardless of
whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
and later you access the page - you are going to pay ~550 cycles for
nothing.

For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2

If you want me to allow userspace to provide a hint, let me know.

>> @@ -85,13 +84,18 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>> 
>> 	if (writable)
>> 		_dst_pte = pte_mkwrite(_dst_pte);
>> -	else
>> +	else {
>> 		/*
>> 		 * We need this to make sure write bit removed; as mk_pte()
>> 		 * could return a pte with write bit set.
>> 		 */
>> 		_dst_pte = pte_wrprotect(_dst_pte);
>> 
>> +		/* Marking RO entries as dirty can mess with other code */
>> +		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
>> +			_dst_pte = pte_mkdirty(_dst_pte);
> 
> Hmm.. what about "writable=true" ones?

Oh boy, I reverted the condition… :(

Thanks for noticing. I’ll fix it.
Peter Xu June 21, 2022, 6:10 p.m. UTC | #3
On Tue, Jun 21, 2022 at 10:14:00AM -0700, Nadav Amit wrote:
> On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> > Hi, Nadav,
> > 
> > On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >> 
> >> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
> >> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
> >> However, setting read-only PTEs as dirty can have several undesired
> >> implications.
> >> 
> >> First, setting read-only PTEs as dirty, can cause these PTEs to become
> >> writable during mprotect() syscall. See in change_pte_range():
> >> 
> >> 	/* Avoid taking write faults for known dirty pages */
> >> 	if (dirty_accountable && pte_dirty(ptent) &&
> >> 			(pte_soft_dirty(ptent) ||
> >> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
> >> 		ptent = pte_mkwrite(ptent);
> >> 	}
> > 
> > IMHO this is not really the direct reason to add the feature to "allow user
> > to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
> > IIUC what's missing is the pte_uffd_wp() check that I should have added in
> > the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
> > 
> > But since this is fixed by David's patch to optimize mprotect() altogether
> > which checks pte_uffd_wp() (and afaict that's only needed after the
> > shmem+hugetlb patchset, not before), so I think we're safe now with all the
> > branches.
> > 
> > So IMHO we don't need to mention this as it's kind of misleading. It'll be
> > welcomed if you want to recover the pte_dirty behavior in
> > mfill_atomic_install_pte() but probably this is not the right patch for it?
> 
> Sorry, I will remove it from the commit log.
> 
> I am not sure whether there should be an additional patch to revert (or
> partially revert) 9ae0f87d009ca and to be backported. What do you say?

I think we discussed it offlist, I'd not bother but I don't have a strong
opinion.  Please feel free to go with your preference.

Just to mention that it might not be a clean revert since at that time we
don't have continue and uffd-wp on files, so we may need to add the
complete check back if we're going to make it.  Please also do proper swap
tests for both anon and shmem with things like memcg, and when you posted
the patches I can do some tests too.

> 
> >> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
> >> See try_to_unmap_one():
> >> 
> >> 	/*
> >> 	 * Page is dirty. Flush the TLB if a writable entry
> >> 	 * potentially exists to avoid CPU writes after IO
> >> 	 * starts and then write it out here.
> >> 	 */
> >> 	try_to_unmap_flush_dirty();
> >> 
> >> Similarly batching TLB flushed might be prevented in zap_pte_range():
> >> 
> >> 	if (!PageAnon(page)) {
> >> 		if (pte_dirty(ptent)) {
> >> 			force_flush = 1;
> >> 			set_page_dirty(page);
> >> 		}
> >> 	...
> > 
> > I still keep the pure question here (which I asked in the private reply to
> > you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
> > here. I'll rewrite what I have in the private email to you here again that
> > I think should be the write thing to do here..
> > 
> > if (!PageAnon(page)) {
> > if (pte_dirty(ptent)) {
> > set_page_dirty(page);
> > if (pte_write(ptent))
> > force_flush = 1;
> > }
> > }
> 
> The “Second” part regards a perf issue, not a correctness issue. As I told
> you offline, I think if makes sense to look both on pte_dirty() and
> pte_write(), but I am afraid of other architectures than x86, which I know.

I don't really see why this logic is arch-specific.  I meant, as long as
pte_write() returns a value that means "this page is writable", I still
think it's making sense.

> After our discussion, I looked at other architectures, and it does look
> safe. So I will add an additional patch for that (with your suggested-by).

Only if you agree with what I thought, thanks.  And that could be a
separate patch for sure out of this one even if to come.

> 
> > I also mentioned the other example of doing mprotect(PROT_READ) upon a
> > dirty pte can also create a pte with dirty=1 and write=0 which should be
> > the same condition we have here with uffd, afaict. So it's at least not a
> > solo problem for uffd, and I still think if we treat this tlb flush issue
> > as a perf bug we should consider fixing up the tlb flush code instead
> > because otherwise the "mprotect(PROT_READ) upon dirty pte" will also be
> > able to hit it.
> > 
> > Meanwhile, I'll have the same comment that this is not helping any reviewer
> > to understand why we need to grant user ability to conditionally set dirty
> > bit from uABI, so I think it's better to drop this paragraph too for this
> > patch alone.
> 
> Ok. Point taken.
> 
> > 
> >> In general, setting a PTE as dirty seems for read-only entries might be
> >> dangerous. It should be reminded the dirty-COW vulnerability mitigation
> >> also relies on the dirty bit being set only after COW (although it does
> >> not appear to apply to userfaultfd).
> >> 
> >> To summarize, setting the dirty bit for read-only PTEs is dangerous. But
> >> even if we only consider writable pages, always setting the dirty bit or
> >> always leaving it clear, does not seem as the best policy. Leaving the
> >> bit clear introduces overhead on the first write-access to set the bit.
> >> Setting the bit for pages the are eventually not written to can require
> >> more TLB flushes.
> > 
> > And IMHO only this paragraph is the real and proper reasoning for this
> > patch..
> 
> Will do.
> 
> > 
> >> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
> >> 	 uffdio_continue.range.start) {
> >> 		goto out;
> >> 	}
> >> -	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
> >> +	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
> >> 		goto out;
> >> 
> >> -	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
> >> +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
> > 
> > Setting dirty by default for CONTINUE may make some sense (unlike young),
> > at least to keep the old behavior of the code where the pte was
> > unconditionally set.
> > 
> > But there's another thought a real CONTINUE user modifies the page cache
> > elsewhere so logically the PageDirty should be set elsewhere already
> > anyways, in most cases when the userapp is updating the page cache with
> > another mapping before installing pgtables for this mapping. Then this
> > dirty is not required, it seems.
> > 
> > If we're going to export this to uABI, I'm wondering maybe it'll be nicer
> > to not apply either young or dirty bit for CONTINUE, because fundamentally
> > losing dirty bit doesn't sound risky, meanwhile the user app should have
> > the best knowledge of what to do (whether page was requested or it's just
> > during a pre-faulting in the background).
> > 
> > Axel may have some better thoughts.
> 
> I think that perhaps I did not communicate well enough the reason it makes
> sense to set the dirty-bit.
> 
> Setting the access-bit and dirty-bit takes quite some time. So regardless of
> whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
> and later you access the page - you are going to pay ~550 cycles for
> nothing.
> 
> For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2
> 
> If you want me to allow userspace to provide a hint, let me know.

You did communicate well, so probably it's me that didn't. :)

Yes I think it'll be nicer to allow userspace provide the same hint for
UFFDIO_CONTINUE, the reason as I explained in the other thread on the young
bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my
current qemu impl [1]) proactively be used to install pgtables even if
they're code pages and they may not be accessed in the near future.  So
IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, from
this point of view.

There's just still some differences on young/dirty here so I'm not sure
whether the idea applies to both, but I think at least it applies to young
bit (ACCESS_LIKELY).

[1] https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51d9582

Thanks,
Nadav Amit June 21, 2022, 6:30 p.m. UTC | #4
On Jun 21, 2022, at 11:10 AM, Peter Xu <peterx@redhat.com> wrote:

> On Tue, Jun 21, 2022 at 10:14:00AM -0700, Nadav Amit wrote:
>> On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:
>> 
>>> Hi, Nadav,
>>> 
>>> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> 
>>>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
>>>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
>>>> However, setting read-only PTEs as dirty can have several undesired
>>>> implications.
>>>> 
>>>> First, setting read-only PTEs as dirty, can cause these PTEs to become
>>>> writable during mprotect() syscall. See in change_pte_range():
>>>> 
>>>> 	/* Avoid taking write faults for known dirty pages */
>>>> 	if (dirty_accountable && pte_dirty(ptent) &&
>>>> 			(pte_soft_dirty(ptent) ||
>>>> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
>>>> 		ptent = pte_mkwrite(ptent);
>>>> 	}
>>> 
>>> IMHO this is not really the direct reason to add the feature to "allow user
>>> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
>>> IIUC what's missing is the pte_uffd_wp() check that I should have added in
>>> the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
>>> 
>>> But since this is fixed by David's patch to optimize mprotect() altogether
>>> which checks pte_uffd_wp() (and afaict that's only needed after the
>>> shmem+hugetlb patchset, not before), so I think we're safe now with all the
>>> branches.
>>> 
>>> So IMHO we don't need to mention this as it's kind of misleading. It'll be
>>> welcomed if you want to recover the pte_dirty behavior in
>>> mfill_atomic_install_pte() but probably this is not the right patch for it?
>> 
>> Sorry, I will remove it from the commit log.
>> 
>> I am not sure whether there should be an additional patch to revert (or
>> partially revert) 9ae0f87d009ca and to be backported. What do you say?
> 
> I think we discussed it offlist, I'd not bother but I don't have a strong
> opinion. Please feel free to go with your preference.
> 
> Just to mention that it might not be a clean revert since at that time we
> don't have continue and uffd-wp on files, so we may need to add the
> complete check back if we're going to make it. Please also do proper swap
> tests for both anon and shmem with things like memcg, and when you posted
> the patches I can do some tests too.

That’s what I was afraid of. It moves it down in my priority list...

>>>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
>>>> See try_to_unmap_one():
>>>> 
>>>> 	/*
>>>> 	 * Page is dirty. Flush the TLB if a writable entry
>>>> 	 * potentially exists to avoid CPU writes after IO
>>>> 	 * starts and then write it out here.
>>>> 	 */
>>>> 	try_to_unmap_flush_dirty();
>>>> 
>>>> Similarly batching TLB flushed might be prevented in zap_pte_range():
>>>> 
>>>> 	if (!PageAnon(page)) {
>>>> 		if (pte_dirty(ptent)) {
>>>> 			force_flush = 1;
>>>> 			set_page_dirty(page);
>>>> 		}
>>>> 	...
>>> 
>>> I still keep the pure question here (which I asked in the private reply to
>>> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
>>> here. I'll rewrite what I have in the private email to you here again that
>>> I think should be the write thing to do here..
>>> 
>>> if (!PageAnon(page)) {
>>> if (pte_dirty(ptent)) {
>>> set_page_dirty(page);
>>> if (pte_write(ptent))
>>> force_flush = 1;
>>> }
>>> }
>> 
>> The “Second” part regards a perf issue, not a correctness issue. As I told
>> you offline, I think if makes sense to look both on pte_dirty() and
>> pte_write(), but I am afraid of other architectures than x86, which I know.
> 
> I don't really see why this logic is arch-specific. I meant, as long as
> pte_write() returns a value that means "this page is writable", I still
> think it's making sense.
> 
>> After our discussion, I looked at other architectures, and it does look
>> safe. So I will add an additional patch for that (with your suggested-by).
> 
> Only if you agree with what I thought, thanks. And that could be a
> separate patch for sure out of this one even if to come.

I do agree. I did not quite understand your second sentence. I guess you
meant not to combine it into this patchset (or at least that’s what I
thought and I attribute it to you).

>> 
>> 
>> I think that perhaps I did not communicate well enough the reason it makes
>> sense to set the dirty-bit.
>> 
>> Setting the access-bit and dirty-bit takes quite some time. So regardless of
>> whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
>> and later you access the page - you are going to pay ~550 cycles for
>> nothing.
>> 
>> For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2
>> 
>> If you want me to allow userspace to provide a hint, let me know.
> 
> You did communicate well, so probably it's me that didn't. :)
> 
> Yes I think it'll be nicer to allow userspace provide the same hint for
> UFFDIO_CONTINUE, the reason as I explained in the other thread on the young
> bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my
> current qemu impl [1]) proactively be used to install pgtables even if
> they're code pages and they may not be accessed in the near future. So
> IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, from
> this point of view.
> 
> There's just still some differences on young/dirty here so I'm not sure
> whether the idea applies to both, but I think at least it applies to young
> bit (ACCESS_LIKELY).
> 
> [1] https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51d9582

We have a saying in Hebrew: “When you explain slowly, I understand quickly”.
Thanks for explaining. I will add hints for UFFDIO_CONTINUE as well.
Peter Xu June 21, 2022, 6:43 p.m. UTC | #5
On Tue, Jun 21, 2022 at 11:30:52AM -0700, Nadav Amit wrote:
> > Only if you agree with what I thought, thanks. And that could be a
> > separate patch for sure out of this one even if to come.
> 
> I do agree. I did not quite understand your second sentence. I guess you
> meant not to combine it into this patchset (or at least that’s what I
> thought and I attribute it to you).

Feel free to not attribute anything to me, even no need on a suggested-by.
But if you plan to post for sure I'd be glad if you copy me, I may be very
willing to either provide a r-b or read about what's missed there on the
thought if that isn't work like that.

So the only purpose of the 2nd sentense is to make sure it shouldn't block
your current series (you explained why you're hurry and catching time), and
that's all I wanted to express.

> 
> >> 
> >> 
> >> I think that perhaps I did not communicate well enough the reason it makes
> >> sense to set the dirty-bit.
> >> 
> >> Setting the access-bit and dirty-bit takes quite some time. So regardless of
> >> whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
> >> and later you access the page - you are going to pay ~550 cycles for
> >> nothing.
> >> 
> >> For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2
> >> 
> >> If you want me to allow userspace to provide a hint, let me know.
> > 
> > You did communicate well, so probably it's me that didn't. :)
> > 
> > Yes I think it'll be nicer to allow userspace provide the same hint for
> > UFFDIO_CONTINUE, the reason as I explained in the other thread on the young
> > bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my
> > current qemu impl [1]) proactively be used to install pgtables even if
> > they're code pages and they may not be accessed in the near future. So
> > IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, from
> > this point of view.
> > 
> > There's just still some differences on young/dirty here so I'm not sure
> > whether the idea applies to both, but I think at least it applies to young
> > bit (ACCESS_LIKELY).
> > 
> > [1] https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51d9582
> 
> We have a saying in Hebrew: “When you explain slowly, I understand quickly”.
> Thanks for explaining. I will add hints for UFFDIO_CONTINUE as well.

I'm always happy to learn these interesting idioms, and I'm glad we seem to
be on the same page now.  Thanks!
diff mbox series

Patch

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 35a8c4347c54..a56983b594d5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1700,7 +1700,7 @@  static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	struct uffdio_copy uffdio_copy;
 	struct uffdio_copy __user *user_uffdio_copy;
 	struct userfaultfd_wake_range range;
-	bool mode_wp, mode_access_likely;
+	bool mode_wp, mode_access_likely, mode_write_likely;
 	uffd_flags_t uffd_flags;
 
 	user_uffdio_copy = (struct uffdio_copy __user *) arg;
@@ -1727,14 +1727,17 @@  static int userfaultfd_copy(struct userfaultfd_ctx *ctx,
 	if (uffdio_copy.src + uffdio_copy.len <= uffdio_copy.src)
 		goto out;
 	if (uffdio_copy.mode & ~(UFFDIO_COPY_MODE_DONTWAKE|UFFDIO_COPY_MODE_WP|
-				 UFFDIO_COPY_MODE_ACCESS_LIKELY))
+				 UFFDIO_COPY_MODE_ACCESS_LIKELY|
+				 UFFDIO_COPY_MODE_WRITE_LIKELY))
 		goto out;
 
 	mode_wp = uffdio_copy.mode & UFFDIO_COPY_MODE_WP;
 	mode_access_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_ACCESS_LIKELY;
+	mode_write_likely = uffdio_copy.mode & UFFDIO_COPY_MODE_WRITE_LIKELY;
 
 	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
-		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
+		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
+		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_atomic(ctx->mm, uffdio_copy.dst, uffdio_copy.src,
@@ -1819,7 +1822,7 @@  static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 	struct uffdio_writeprotect uffdio_wp;
 	struct uffdio_writeprotect __user *user_uffdio_wp;
 	struct userfaultfd_wake_range range;
-	bool mode_wp, mode_dontwake, mode_access_likely;
+	bool mode_wp, mode_dontwake, mode_access_likely, mode_write_likely;
 	uffd_flags_t uffd_flags;
 
 	if (atomic_read(&ctx->mmap_changing))
@@ -1838,18 +1841,21 @@  static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
 
 	if (uffdio_wp.mode & ~(UFFDIO_WRITEPROTECT_MODE_DONTWAKE |
 			       UFFDIO_WRITEPROTECT_MODE_WP |
-			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY))
+			       UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY |
+			       UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY))
 		return -EINVAL;
 
 	mode_wp = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WP;
 	mode_dontwake = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_DONTWAKE;
 	mode_access_likely = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY;
+	mode_write_likely = uffdio_wp.mode & UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY;
 
 	if (mode_wp && mode_dontwake)
 		return -EINVAL;
 
 	uffd_flags = (mode_wp ? UFFD_FLAGS_WP : 0) |
-		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0);
+		     (mode_access_likely ? UFFD_FLAGS_ACCESS_LIKELY : 0) |
+		     (mode_write_likely ? UFFD_FLAGS_WRITE_LIKELY : 0);
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
@@ -1902,10 +1908,10 @@  static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	    uffdio_continue.range.start) {
 		goto out;
 	}
-	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
+	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
 		goto out;
 
-	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
+	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
 
 	if (mmget_not_zero(ctx->mm)) {
 		ret = mcopy_continue(ctx->mm, uffdio_continue.range.start,
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index e6ac165ec044..261a3fa750d0 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -59,6 +59,7 @@  typedef unsigned int __bitwise uffd_flags_t;
 
 #define UFFD_FLAGS_WP			((__force uffd_flags_t)BIT(0))
 #define UFFD_FLAGS_ACCESS_LIKELY	((__force uffd_flags_t)BIT(1))
+#define UFFD_FLAGS_WRITE_LIKELY		((__force uffd_flags_t)BIT(2))
 
 extern int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 				    struct vm_area_struct *dst_vma,
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index d9c8ce9ba777..6ad93a13282e 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -267,12 +267,20 @@  struct uffdio_copy {
 	 */
 	__s64 copy;
 	/*
-	 * UFFDIO_COPY_MODE_ACCESS_LIKELY will set the mapped page as young.
-	 * This can reduce the time that the first access to the page takes.
-	 * Yet, if set opportunistically to memory that is not used, it might
-	 * extend the time before the unused memory pages are reclaimed.
+	 * UFFDIO_COPY_MODE_ACCESS_LIKELY indicates that the memory is likely to
+	 * be accessed in the near future, in contrast to memory that is
+	 * opportunistically copied and might not be accessed. The kernel will
+	 * act accordingly, for instance by setting the access-bit in the PTE to
+	 * reduce the access time to the page.
+	 *
+	 * UFFDIO_COPY_MODE_WRITE_LIKELY indicates that the memory is likely to
+	 * be written to. The kernel will act accordingly, for instance by
+	 * setting the dirty-bit in the PTE to reduce the write time to the
+	 * page. This flag will be silently ignored if UFFDIO_COPY_MODE_WP is
+	 * set.
 	 */
-#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<3)
+#define UFFDIO_COPY_MODE_ACCESS_LIKELY		((__u64)1<<2)
+#define UFFDIO_COPY_MODE_WRITE_LIKELY		((__u64)1<<3)
 };
 
 struct uffdio_zeropage {
@@ -297,9 +305,11 @@  struct uffdio_writeprotect {
  * UFFDIO_WRITEPROTECT_MODE_DONTWAKE: set the flag to avoid waking up
  * any wait thread after the operation succeeds.
  *
- * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to mark the modified
- * memory as young, which can reduce the time that the first access
- * to the page takes.
+ * UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY: set the flag to indicate the memory
+ * is likely to be accessed in the near future.
+ *
+ * UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY: set the flag to indicate that the
+ * memory is likely to be written to in the near future.
  *
  * NOTE: Write protecting a region (WP=1) is unrelated to page faults,
  * therefore DONTWAKE flag is meaningless with WP=1.  Removing write
@@ -309,6 +319,7 @@  struct uffdio_writeprotect {
 #define UFFDIO_WRITEPROTECT_MODE_WP		((__u64)1<<0)
 #define UFFDIO_WRITEPROTECT_MODE_DONTWAKE	((__u64)1<<1)
 #define UFFDIO_WRITEPROTECT_MODE_ACCESS_LIKELY	((__u64)1<<2)
+#define UFFDIO_WRITEPROTECT_MODE_WRITE_LIKELY	((__u64)1<<3)
 	__u64 mode;
 };
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2beff8a4bf7c..46814fc7762f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5962,6 +5962,9 @@  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * preceding stores to the page contents become visible before
diff --git a/mm/shmem.c b/mm/shmem.c
index 89c775275bae..7488cd186c32 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2404,6 +2404,9 @@  int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
 	VM_BUG_ON(PageSwapBacked(page));
 	__SetPageLocked(page);
 	__SetPageSwapBacked(page);
+
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
 	__SetPageUptodate(page);
 
 	ret = -EFAULT;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 140c8d3e946e..3172158d8faa 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -70,7 +70,6 @@  int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	pgoff_t offset, max_off;
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
 		writable = false;
 
@@ -85,13 +84,18 @@  int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 
 	if (writable)
 		_dst_pte = pte_mkwrite(_dst_pte);
-	else
+	else {
 		/*
 		 * We need this to make sure write bit removed; as mk_pte()
 		 * could return a pte with write bit set.
 		 */
 		_dst_pte = pte_wrprotect(_dst_pte);
 
+		/* Marking RO entries as dirty can mess with other code */
+		if (uffd_flags & UFFD_FLAGS_WRITE_LIKELY)
+			_dst_pte = pte_mkdirty(_dst_pte);
+	}
+
 	if (uffd_flags & UFFD_FLAGS_ACCESS_LIKELY)
 		_dst_pte = pte_mkyoung(_dst_pte);
 
@@ -180,6 +184,9 @@  static int mcopy_atomic_pte(struct mm_struct *dst_mm,
 		*pagep = NULL;
 	}
 
+	/* The PTE is not marked as dirty unconditionally */
+	SetPageDirty(page);
+
 	/*
 	 * The memory barrier inside __SetPageUptodate makes sure that
 	 * preceding stores to the page contents become visible before