diff mbox series

[v3] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTNUE behavior

Message ID 20210329234131.304999-1-axelrasmussen@google.com (mailing list archive)
State New, archived
Headers show
Series [v3] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTNUE behavior | expand

Commit Message

Axel Rasmussen March 29, 2021, 11:41 p.m. UTC
Previously, we shared too much of the code with COPY and ZEROPAGE, so we
manipulated things in various invalid ways:

- Previously, we unconditionally called shmem_inode_acct_block. In the
  continue case, we're looking up an existing page which would have been
  accounted for properly when it was allocated. So doing it twice
  results in double-counting, and eventually leaking.

- Previously, we made the pte writable whenever the VMA was writable.
  However, for continue, consider this case:

  1. A tmpfs file was created
  2. The non-UFFD-registered side mmap()-s with MAP_SHARED
  3. The UFFD-registered side mmap()-s with MAP_PRIVATE

  In this case, even though the UFFD-registered VMA may be writable, we
  still want CoW behavior. So, check for this case and don't make the
  pte writable.

- The initial pgoff / max_off check isn't necessary, so we can skip past
  it. The second one seems likely to be unnecessary too, but keep it
  just in case. Modify both checks to use pgoff, as offset is equivalent
  and not needed.

- Previously, we unconditionally called ClearPageDirty() in the error
  path. In the continue case though, since this is an existing page, it
  might have already been dirty before we started touching it. It's very
  problematic to clear the bit incorrectly, but not a problem to leave
  it - so, just omit the ClearPageDirty() entirely.

- Previously, we unconditionally removed the page from the page cache in
  the error path. But in the continue case, we didn't add it - it was
  already there because the page is present in some second
  (non-UFFD-registered) mapping. So, removing it is invalid.

Because the error handling issues are easy to exercise in the selftest,
make a small modification there to do so.

Finally, refactor shmem_mcopy_atomic_pte a bit. By this point, we've
added a lot of "if (!is_continue)"-s everywhere. It's cleaner to just
check for that mode first thing, and then "goto" down to where the parts
we actually want are. This leaves the code in between cleaner.

Changes since v2:
- Drop the ClearPageDirty() entirely, instead of trying to remember the
  old value.
- Modify both pgoff / max_off checks to use pgoff. It's equivalent to
  offset, but offset wasn't initialized until the first check (which
  we're skipping).
- Keep the second pgoff / max_off check in the continue case.

Changes since v1:
- Refactor to skip ahead with goto, instead of adding several more
  "if (!is_continue)".
- Fix unconditional ClearPageDirty().
- Don't pte_mkwrite() when is_continue && !VM_SHARED.

Fixes: 00da60b9d0a0 ("userfaultfd: support minor fault handling for shmem")
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 mm/shmem.c                               | 60 +++++++++++++-----------
 tools/testing/selftests/vm/userfaultfd.c | 12 +++++
 2 files changed, 44 insertions(+), 28 deletions(-)

Comments

Peter Xu March 30, 2021, 8:55 p.m. UTC | #1
On Mon, Mar 29, 2021 at 04:41:31PM -0700, Axel Rasmussen wrote:
> Previously, we shared too much of the code with COPY and ZEROPAGE, so we
> manipulated things in various invalid ways:
> 
> - Previously, we unconditionally called shmem_inode_acct_block. In the
>   continue case, we're looking up an existing page which would have been
>   accounted for properly when it was allocated. So doing it twice
>   results in double-counting, and eventually leaking.
> 
> - Previously, we made the pte writable whenever the VMA was writable.
>   However, for continue, consider this case:
> 
>   1. A tmpfs file was created
>   2. The non-UFFD-registered side mmap()-s with MAP_SHARED
>   3. The UFFD-registered side mmap()-s with MAP_PRIVATE
> 
>   In this case, even though the UFFD-registered VMA may be writable, we
>   still want CoW behavior. So, check for this case and don't make the
>   pte writable.
> 
> - The initial pgoff / max_off check isn't necessary, so we can skip past
>   it. The second one seems likely to be unnecessary too, but keep it
>   just in case. Modify both checks to use pgoff, as offset is equivalent
>   and not needed.
> 
> - Previously, we unconditionally called ClearPageDirty() in the error
>   path. In the continue case though, since this is an existing page, it
>   might have already been dirty before we started touching it. It's very
>   problematic to clear the bit incorrectly, but not a problem to leave
>   it - so, just omit the ClearPageDirty() entirely.
> 
> - Previously, we unconditionally removed the page from the page cache in
>   the error path. But in the continue case, we didn't add it - it was
>   already there because the page is present in some second
>   (non-UFFD-registered) mapping. So, removing it is invalid.
> 
> Because the error handling issues are easy to exercise in the selftest,
> make a small modification there to do so.
> 
> Finally, refactor shmem_mcopy_atomic_pte a bit. By this point, we've
> added a lot of "if (!is_continue)"-s everywhere. It's cleaner to just
> check for that mode first thing, and then "goto" down to where the parts
> we actually want are. This leaves the code in between cleaner.
> 
> Changes since v2:
> - Drop the ClearPageDirty() entirely, instead of trying to remember the
>   old value.
> - Modify both pgoff / max_off checks to use pgoff. It's equivalent to
>   offset, but offset wasn't initialized until the first check (which
>   we're skipping).
> - Keep the second pgoff / max_off check in the continue case.
> 
> Changes since v1:
> - Refactor to skip ahead with goto, instead of adding several more
>   "if (!is_continue)".
> - Fix unconditional ClearPageDirty().
> - Don't pte_mkwrite() when is_continue && !VM_SHARED.
> 
> Fixes: 00da60b9d0a0 ("userfaultfd: support minor fault handling for shmem")
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>  mm/shmem.c                               | 60 +++++++++++++-----------
>  tools/testing/selftests/vm/userfaultfd.c | 12 +++++
>  2 files changed, 44 insertions(+), 28 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d2e0e81b7d2e..fbcce850a16e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2377,18 +2377,22 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  	struct page *page;
>  	pte_t _dst_pte, *dst_pte;
>  	int ret;
> -	pgoff_t offset, max_off;
> -
> -	ret = -ENOMEM;
> -	if (!shmem_inode_acct_block(inode, 1))
> -		goto out;
> +	pgoff_t max_off;
> +	int writable;

Nit: can be bool.

[...]

> +install_ptes:
>  	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> -	if (dst_vma->vm_flags & VM_WRITE)
> +	/* For CONTINUE on a non-shared VMA, don't pte_mkwrite for CoW. */
> +	writable = is_continue && !(dst_vma->vm_flags & VM_SHARED)
> +		? 0
> +		: dst_vma->vm_flags & VM_WRITE;

Nit: this code is slightly hard to read..  I'd slightly prefer "if
(is_continue)...".  But more below.

> +	if (writable)
>  		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
>  	else {
>  		/*
> @@ -2455,7 +2458,7 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  
>  	ret = -EFAULT;
>  	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> -	if (unlikely(offset >= max_off))
> +	if (unlikely(pgoff >= max_off))
>  		goto out_release_unlock;
>  
>  	ret = -EEXIST;
> @@ -2485,13 +2488,14 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
>  	return ret;
>  out_release_unlock:
>  	pte_unmap_unlock(dst_pte, ptl);
> -	ClearPageDirty(page);
> -	delete_from_page_cache(page);
> +	if (!is_continue)
> +		delete_from_page_cache(page);
>  out_release:
>  	unlock_page(page);
>  	put_page(page);
>  out_unacct_blocks:
> -	shmem_inode_unacct_blocks(inode, 1);
> +	if (!is_continue)
> +		shmem_inode_unacct_blocks(inode, 1);

If you see we still have tons of "if (!is_continue)".  Those are the places
error prone.. even if not in this patch, could be in the patch when this
function got changed again.

Sorry to say this a bit late: how about introduce a helper to install the pte?
Pesudo code:

int shmem_install_uffd_pte(..., bool writable)
{
	...
	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
	if (dst_vma->vm_flags & VM_WRITE)
		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
	else
		set_page_dirty(page);

	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
	if (!pte_none(*dst_pte)) {
		pte_unmap_unlock(dst_pte, ptl);
		return -EEXIST;
	}

	inc_mm_counter(dst_mm, mm_counter_file(page));
	page_add_file_rmap(page, false);
	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

	/* No need to invalidate - it was non-present before */
	update_mmu_cache(dst_vma, dst_addr, dst_pte);
	pte_unmap_unlock(dst_pte, ptl);
	return 0;
}

Then at the entry of shmem_mcopy_atomic_pte():

	if (is_continue) {
		page = find_lock_page(mapping, pgoff);
		if (!page)
                    return -EFAULT;
                ret = shmem_install_uffd_pte(...,
                        is_continue && !(dst_vma->vm_flags & VM_SHARED));
                unlock_page(page);
                if (ret)
                    put_page(page);
                return ret;
        }

Do you think this would be cleaner?
Axel Rasmussen March 30, 2021, 11:30 p.m. UTC | #2
On Tue, Mar 30, 2021 at 1:55 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Mar 29, 2021 at 04:41:31PM -0700, Axel Rasmussen wrote:
> > Previously, we shared too much of the code with COPY and ZEROPAGE, so we
> > manipulated things in various invalid ways:
> >
> > - Previously, we unconditionally called shmem_inode_acct_block. In the
> >   continue case, we're looking up an existing page which would have been
> >   accounted for properly when it was allocated. So doing it twice
> >   results in double-counting, and eventually leaking.
> >
> > - Previously, we made the pte writable whenever the VMA was writable.
> >   However, for continue, consider this case:
> >
> >   1. A tmpfs file was created
> >   2. The non-UFFD-registered side mmap()-s with MAP_SHARED
> >   3. The UFFD-registered side mmap()-s with MAP_PRIVATE
> >
> >   In this case, even though the UFFD-registered VMA may be writable, we
> >   still want CoW behavior. So, check for this case and don't make the
> >   pte writable.
> >
> > - The initial pgoff / max_off check isn't necessary, so we can skip past
> >   it. The second one seems likely to be unnecessary too, but keep it
> >   just in case. Modify both checks to use pgoff, as offset is equivalent
> >   and not needed.
> >
> > - Previously, we unconditionally called ClearPageDirty() in the error
> >   path. In the continue case though, since this is an existing page, it
> >   might have already been dirty before we started touching it. It's very
> >   problematic to clear the bit incorrectly, but not a problem to leave
> >   it - so, just omit the ClearPageDirty() entirely.
> >
> > - Previously, we unconditionally removed the page from the page cache in
> >   the error path. But in the continue case, we didn't add it - it was
> >   already there because the page is present in some second
> >   (non-UFFD-registered) mapping. So, removing it is invalid.
> >
> > Because the error handling issues are easy to exercise in the selftest,
> > make a small modification there to do so.
> >
> > Finally, refactor shmem_mcopy_atomic_pte a bit. By this point, we've
> > added a lot of "if (!is_continue)"-s everywhere. It's cleaner to just
> > check for that mode first thing, and then "goto" down to where the parts
> > we actually want are. This leaves the code in between cleaner.
> >
> > Changes since v2:
> > - Drop the ClearPageDirty() entirely, instead of trying to remember the
> >   old value.
> > - Modify both pgoff / max_off checks to use pgoff. It's equivalent to
> >   offset, but offset wasn't initialized until the first check (which
> >   we're skipping).
> > - Keep the second pgoff / max_off check in the continue case.
> >
> > Changes since v1:
> > - Refactor to skip ahead with goto, instead of adding several more
> >   "if (!is_continue)".
> > - Fix unconditional ClearPageDirty().
> > - Don't pte_mkwrite() when is_continue && !VM_SHARED.
> >
> > Fixes: 00da60b9d0a0 ("userfaultfd: support minor fault handling for shmem")
> > Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> > ---
> >  mm/shmem.c                               | 60 +++++++++++++-----------
> >  tools/testing/selftests/vm/userfaultfd.c | 12 +++++
> >  2 files changed, 44 insertions(+), 28 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index d2e0e81b7d2e..fbcce850a16e 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2377,18 +2377,22 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >       struct page *page;
> >       pte_t _dst_pte, *dst_pte;
> >       int ret;
> > -     pgoff_t offset, max_off;
> > -
> > -     ret = -ENOMEM;
> > -     if (!shmem_inode_acct_block(inode, 1))
> > -             goto out;
> > +     pgoff_t max_off;
> > +     int writable;
>
> Nit: can be bool.
>
> [...]
>
> > +install_ptes:
> >       _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > -     if (dst_vma->vm_flags & VM_WRITE)
> > +     /* For CONTINUE on a non-shared VMA, don't pte_mkwrite for CoW. */
> > +     writable = is_continue && !(dst_vma->vm_flags & VM_SHARED)
> > +             ? 0
> > +             : dst_vma->vm_flags & VM_WRITE;
>
> Nit: this code is slightly hard to read..  I'd slightly prefer "if
> (is_continue)...".  But more below.
>
> > +     if (writable)
> >               _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> >       else {
> >               /*
> > @@ -2455,7 +2458,7 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >
> >       ret = -EFAULT;
> >       max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > -     if (unlikely(offset >= max_off))
> > +     if (unlikely(pgoff >= max_off))
> >               goto out_release_unlock;
> >
> >       ret = -EEXIST;
> > @@ -2485,13 +2488,14 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >       return ret;
> >  out_release_unlock:
> >       pte_unmap_unlock(dst_pte, ptl);
> > -     ClearPageDirty(page);
> > -     delete_from_page_cache(page);
> > +     if (!is_continue)
> > +             delete_from_page_cache(page);
> >  out_release:
> >       unlock_page(page);
> >       put_page(page);
> >  out_unacct_blocks:
> > -     shmem_inode_unacct_blocks(inode, 1);
> > +     if (!is_continue)
> > +             shmem_inode_unacct_blocks(inode, 1);
>
> If you see we still have tons of "if (!is_continue)".  Those are the places
> error prone.. even if not in this patch, could be in the patch when this
> function got changed again.
>
> Sorry to say this a bit late: how about introduce a helper to install the pte?

No worries. :)

> Pesudo code:
>
> int shmem_install_uffd_pte(..., bool writable)
> {
>         ...
>         _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
>         if (dst_vma->vm_flags & VM_WRITE)
>                 _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
>         else
>                 set_page_dirty(page);
>
>         dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
>         if (!pte_none(*dst_pte)) {
>                 pte_unmap_unlock(dst_pte, ptl);
>                 return -EEXIST;
>         }
>
>         inc_mm_counter(dst_mm, mm_counter_file(page));
>         page_add_file_rmap(page, false);
>         set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
>
>         /* No need to invalidate - it was non-present before */
>         update_mmu_cache(dst_vma, dst_addr, dst_pte);
>         pte_unmap_unlock(dst_pte, ptl);
>         return 0;
> }
>
> Then at the entry of shmem_mcopy_atomic_pte():
>
>         if (is_continue) {
>                 page = find_lock_page(mapping, pgoff);
>                 if (!page)
>                     return -EFAULT;
>                 ret = shmem_install_uffd_pte(...,
>                         is_continue && !(dst_vma->vm_flags & VM_SHARED));
>                 unlock_page(page);
>                 if (ret)
>                     put_page(page);
>                 return ret;
>         }
>
> Do you think this would be cleaner?

Yes, a refactor like that is promising. It's hard to say for certain
without actually looking at the result - I'll spend some time tomorrow
on a few options, and send along the cleanest version I come up with.

Thanks for all the feedback and advice on this feature, Peter!

>
> --
> Peter Xu
>
Peter Xu March 31, 2021, 12:54 p.m. UTC | #3
Axel,

On Tue, Mar 30, 2021 at 04:30:13PM -0700, Axel Rasmussen wrote:
> Yes, a refactor like that is promising. It's hard to say for certain
> without actually looking at the result - I'll spend some time tomorrow
> on a few options, and send along the cleanest version I come up with.

Before you move onto a new version...  See this commit:

5b51072e97d5 ("userfaultfd: shmem: allocate anonymous memory for MAP_PRIVATE shmem", 2018-11-30)

I found it when I was thinking why not move the whole continue logic directly
into mfill_atomic_pte(), if we can have the pte installation helper, because
that's all we need.

So previously I got the semantics a bit mixed up: for private shmem mappings,
UFFDIO_COPY won't fill in page cache at all, but it's all private.  We keep the
page cache empty even after UFFDIO_COPY for a private mapping.

UFFDIO_CONTINUE is slightly different, since we _know_ the page cache is
there..  So I'm thinking maybe you need to handle the continue request in
mfill_atomic_pte() before the VM_SHARED check so as to cover both cases.
diff mbox series

Patch

diff --git a/mm/shmem.c b/mm/shmem.c
index d2e0e81b7d2e..fbcce850a16e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2377,18 +2377,22 @@  int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	struct page *page;
 	pte_t _dst_pte, *dst_pte;
 	int ret;
-	pgoff_t offset, max_off;
-
-	ret = -ENOMEM;
-	if (!shmem_inode_acct_block(inode, 1))
-		goto out;
+	pgoff_t max_off;
+	int writable;
 
 	if (is_continue) {
 		ret = -EFAULT;
 		page = find_lock_page(mapping, pgoff);
 		if (!page)
-			goto out_unacct_blocks;
-	} else if (!*pagep) {
+			goto out;
+		goto install_ptes;
+	}
+
+	ret = -ENOMEM;
+	if (!shmem_inode_acct_block(inode, 1))
+		goto out;
+
+	if (!*pagep) {
 		page = shmem_alloc_page(gfp, info, pgoff);
 		if (!page)
 			goto out_unacct_blocks;
@@ -2415,30 +2419,29 @@  int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 		*pagep = NULL;
 	}
 
-	if (!is_continue) {
-		VM_BUG_ON(PageSwapBacked(page));
-		VM_BUG_ON(PageLocked(page));
-		__SetPageLocked(page);
-		__SetPageSwapBacked(page);
-		__SetPageUptodate(page);
-	}
+	VM_BUG_ON(PageSwapBacked(page));
+	VM_BUG_ON(PageLocked(page));
+	__SetPageLocked(page);
+	__SetPageSwapBacked(page);
+	__SetPageUptodate(page);
 
 	ret = -EFAULT;
-	offset = linear_page_index(dst_vma, dst_addr);
 	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-	if (unlikely(offset >= max_off))
+	if (unlikely(pgoff >= max_off))
 		goto out_release;
 
-	/* If page wasn't already in the page cache, add it. */
-	if (!is_continue) {
-		ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
-					      gfp & GFP_RECLAIM_MASK, dst_mm);
-		if (ret)
-			goto out_release;
-	}
+	ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
+				      gfp & GFP_RECLAIM_MASK, dst_mm);
+	if (ret)
+		goto out_release;
 
+install_ptes:
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
-	if (dst_vma->vm_flags & VM_WRITE)
+	/* For CONTINUE on a non-shared VMA, don't pte_mkwrite for CoW. */
+	writable = is_continue && !(dst_vma->vm_flags & VM_SHARED)
+		? 0
+		: dst_vma->vm_flags & VM_WRITE;
+	if (writable)
 		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
 	else {
 		/*
@@ -2455,7 +2458,7 @@  int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 
 	ret = -EFAULT;
 	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
-	if (unlikely(offset >= max_off))
+	if (unlikely(pgoff >= max_off))
 		goto out_release_unlock;
 
 	ret = -EEXIST;
@@ -2485,13 +2488,14 @@  int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	return ret;
 out_release_unlock:
 	pte_unmap_unlock(dst_pte, ptl);
-	ClearPageDirty(page);
-	delete_from_page_cache(page);
+	if (!is_continue)
+		delete_from_page_cache(page);
 out_release:
 	unlock_page(page);
 	put_page(page);
 out_unacct_blocks:
-	shmem_inode_unacct_blocks(inode, 1);
+	if (!is_continue)
+		shmem_inode_unacct_blocks(inode, 1);
 	goto out;
 }
 #endif /* CONFIG_USERFAULTFD */
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index f6c86b036d0f..d8541a59dae5 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -485,6 +485,7 @@  static void wp_range(int ufd, __u64 start, __u64 len, bool wp)
 static void continue_range(int ufd, __u64 start, __u64 len)
 {
 	struct uffdio_continue req;
+	int ret;
 
 	req.range.start = start;
 	req.range.len = len;
@@ -493,6 +494,17 @@  static void continue_range(int ufd, __u64 start, __u64 len)
 	if (ioctl(ufd, UFFDIO_CONTINUE, &req))
 		err("UFFDIO_CONTINUE failed for address 0x%" PRIx64,
 		    (uint64_t)start);
+
+	/*
+	 * Error handling within the kernel for continue is subtly different
+	 * from copy or zeropage, so it may be a source of bugs. Trigger an
+	 * error (-EEXIST) on purpose, to verify doing so doesn't cause a BUG.
+	 */
+	req.mapped = 0;
+	ret = ioctl(ufd, UFFDIO_CONTINUE, &req);
+	if (ret >= 0 || req.mapped != -EEXIST)
+		err("failed to exercise UFFDIO_CONTINUE error handling, ret=%d, mapped=%" PRId64,
+		    ret, req.mapped);
 }
 
 static void *locking_thread(void *arg)