diff mbox series

mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY

Message ID 20210513234309.366727-1-almasrymina@google.com (mailing list archive)
State New
Headers show
Series mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY | expand

Commit Message

Mina Almasry May 13, 2021, 11:43 p.m. UTC
When hugetlb_mcopy_atomic_pte() is called with:
- mode==MCOPY_ATOMIC_NORMAL and,
- we already have a page in the page cache corresponding to the
associated address,

We will allocate a huge page from the reserves, and then fail to insert it
into the cache and return -EEXIST. In this case, we need to return -EEXIST
without allocating a new page as the page already exists in the cache.
Allocating the extra page causes the resv_huge_pages to underflow temporarily
until the extra page is freed.

To fix this we check if a page exists in the cache, and allocate it and
insert it in the cache immediately while holding the lock. After that we
copy the contents into the page.

As a side effect of this, pages may exist in the cache for which the
copy failed and for these pages PageUptodate(page) == false. Modify code
that query the cache to handle this correctly.

Tested using:
./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 \
/mnt/huge

Test passes, and dmesg shows no underflow warnings.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org

---
 fs/hugetlbfs/inode.c |   2 +-
 mm/hugetlb.c         | 109 +++++++++++++++++++++++--------------------
 2 files changed, 60 insertions(+), 51 deletions(-)

--
2.31.1.751.gd2f1c929bd-goog

Comments

Mina Almasry May 13, 2021, 11:49 p.m. UTC | #1
On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote:
>
> When hugetlb_mcopy_atomic_pte() is called with:
> - mode==MCOPY_ATOMIC_NORMAL and,
> - we already have a page in the page cache corresponding to the
> associated address,
>
> We will allocate a huge page from the reserves, and then fail to insert it
> into the cache and return -EEXIST. In this case, we need to return -EEXIST
> without allocating a new page as the page already exists in the cache.
> Allocating the extra page causes the resv_huge_pages to underflow temporarily
> until the extra page is freed.
>
> To fix this we check if a page exists in the cache, and allocate it and
> insert it in the cache immediately while holding the lock. After that we
> copy the contents into the page.
>
> As a side effect of this, pages may exist in the cache for which the
> copy failed and for these pages PageUptodate(page) == false. Modify code
> that query the cache to handle this correctly.
>

To be honest, I'm not sure I've done this bit correctly. Please take a
look and let me know what you think. It may be too overly complicated
to have !PageUptodate() pages in the cache and ask the rest of the
code to handle that edge case correctly, but I'm not sure how else to
fix this issue.

> Tested using:
> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 \
> /mnt/huge
>
> Test passes, and dmesg shows no underflow warnings.
>
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
>
> ---
>  fs/hugetlbfs/inode.c |   2 +-
>  mm/hugetlb.c         | 109 +++++++++++++++++++++++--------------------
>  2 files changed, 60 insertions(+), 51 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a2a42335e8fd..cc027c335242 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -346,7 +346,7 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
>
>                 /* Find the page */
>                 page = find_lock_page(mapping, index);
> -               if (unlikely(page == NULL)) {
> +               if (unlikely(page == NULL || !PageUptodate(page))) {
>                         /*
>                          * We have a HOLE, zero out the user-buffer for the
>                          * length of the hole or request.
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 629aa4c2259c..a5a5fbf7ac25 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4543,7 +4543,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>
>  retry:
>         page = find_lock_page(mapping, idx);
> -       if (!page) {
> +       if (!page || !PageUptodate(page)) {
>                 /* Check for page in userfault range */
>                 if (userfaultfd_missing(vma)) {
>                         ret = hugetlb_handle_userfault(vma, mapping, idx,
> @@ -4552,26 +4552,30 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>                         goto out;
>                 }
>
> -               page = alloc_huge_page(vma, haddr, 0);
> -               if (IS_ERR(page)) {
> -                       /*
> -                        * Returning error will result in faulting task being
> -                        * sent SIGBUS.  The hugetlb fault mutex prevents two
> -                        * tasks from racing to fault in the same page which
> -                        * could result in false unable to allocate errors.
> -                        * Page migration does not take the fault mutex, but
> -                        * does a clear then write of pte's under page table
> -                        * lock.  Page fault code could race with migration,
> -                        * notice the clear pte and try to allocate a page
> -                        * here.  Before returning error, get ptl and make
> -                        * sure there really is no pte entry.
> -                        */
> -                       ptl = huge_pte_lock(h, mm, ptep);
> -                       ret = 0;
> -                       if (huge_pte_none(huge_ptep_get(ptep)))
> -                               ret = vmf_error(PTR_ERR(page));
> -                       spin_unlock(ptl);
> -                       goto out;
> +               if (!page) {
> +                       page = alloc_huge_page(vma, haddr, 0);
> +                       if (IS_ERR(page)) {
> +                               /*
> +                                * Returning error will result in faulting task
> +                                * being sent SIGBUS.  The hugetlb fault mutex
> +                                * prevents two tasks from racing to fault in
> +                                * the same page which could result in false
> +                                * unable to allocate errors.  Page migration
> +                                * does not take the fault mutex, but does a
> +                                * clear then write of pte's under page table
> +                                * lock.  Page fault code could race with
> +                                * migration, notice the clear pte and try to
> +                                * allocate a page here.  Before returning
> +                                * error, get ptl and make sure there really is
> +                                * no pte entry.
> +                                */
> +                               ptl = huge_pte_lock(h, mm, ptep);
> +                               ret = 0;
> +                               if (huge_pte_none(huge_ptep_get(ptep)))
> +                                       ret = vmf_error(PTR_ERR(page));
> +                               spin_unlock(ptl);
> +                               goto out;
> +                       }
>                 }
>                 clear_huge_page(page, address, pages_per_huge_page(h));
>                 __SetPageUptodate(page);
> @@ -4868,31 +4872,55 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>                             struct page **pagep)
>  {
>         bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> -       struct address_space *mapping;
> -       pgoff_t idx;
> +       struct hstate *h = hstate_vma(dst_vma);
> +       struct address_space *mapping = dst_vma->vm_file->f_mapping;
> +       pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
>         unsigned long size;
>         int vm_shared = dst_vma->vm_flags & VM_SHARED;
> -       struct hstate *h = hstate_vma(dst_vma);
>         pte_t _dst_pte;
>         spinlock_t *ptl;
> -       int ret;
> +       int ret = -ENOMEM;
>         struct page *page;
>         int writable;
>
> -       mapping = dst_vma->vm_file->f_mapping;
> -       idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> -
>         if (is_continue) {
>                 ret = -EFAULT;
> -               page = find_lock_page(mapping, idx);
> -               if (!page)
> +               page = hugetlbfs_pagecache_page(h, dst_vma, dst_addr);
> +               if (!page) {
> +                       ret = -ENOMEM;
>                         goto out;
> +               }
>         } else if (!*pagep) {
> -               ret = -ENOMEM;
> +               /* If a page already exists, then it's UFFDIO_COPY for
> +                * a non-missing case. Return -EEXIST.
> +                */
> +               if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
> +                       ret = -EEXIST;
> +                       goto out;
> +               }
> +
>                 page = alloc_huge_page(dst_vma, dst_addr, 0);
>                 if (IS_ERR(page))
>                         goto out;
>
> +               /* Add shared, newly allocated pages to the page cache. */
> +               if (vm_shared) {
> +                       size = i_size_read(mapping->host) >> huge_page_shift(h);
> +                       ret = -EFAULT;
> +                       if (idx >= size)
> +                               goto out;
> +
> +                       /*
> +                        * Serialization between remove_inode_hugepages() and
> +                        * huge_add_to_page_cache() below happens through the
> +                        * hugetlb_fault_mutex_table that here must be hold by
> +                        * the caller.
> +                        */
> +                       ret = huge_add_to_page_cache(page, mapping, idx);
> +                       if (ret)
> +                               goto out;
> +               }
> +
>                 ret = copy_huge_page_from_user(page,
>                                                 (const void __user *) src_addr,
>                                                 pages_per_huge_page(h), false);
> @@ -4916,24 +4944,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>          */
>         __SetPageUptodate(page);
>
> -       /* Add shared, newly allocated pages to the page cache. */
> -       if (vm_shared && !is_continue) {
> -               size = i_size_read(mapping->host) >> huge_page_shift(h);
> -               ret = -EFAULT;
> -               if (idx >= size)
> -                       goto out_release_nounlock;
> -
> -               /*
> -                * Serialization between remove_inode_hugepages() and
> -                * huge_add_to_page_cache() below happens through the
> -                * hugetlb_fault_mutex_table that here must be hold by
> -                * the caller.
> -                */
> -               ret = huge_add_to_page_cache(page, mapping, idx);
> -               if (ret)
> -                       goto out_release_nounlock;
> -       }
> -
>         ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
>         spin_lock(ptl);
>
> @@ -4994,7 +5004,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>         spin_unlock(ptl);
>         if (vm_shared || is_continue)
>                 unlock_page(page);
> -out_release_nounlock:
>         put_page(page);
>         goto out;
>  }
> --
> 2.31.1.751.gd2f1c929bd-goog
Mike Kravetz May 14, 2021, 12:14 a.m. UTC | #2
On 5/13/21 4:49 PM, Mina Almasry wrote:
> On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote:
>>
>> When hugetlb_mcopy_atomic_pte() is called with:
>> - mode==MCOPY_ATOMIC_NORMAL and,
>> - we already have a page in the page cache corresponding to the
>> associated address,
>>
>> We will allocate a huge page from the reserves, and then fail to insert it
>> into the cache and return -EEXIST. In this case, we need to return -EEXIST
>> without allocating a new page as the page already exists in the cache.
>> Allocating the extra page causes the resv_huge_pages to underflow temporarily
>> until the extra page is freed.
>>
>> To fix this we check if a page exists in the cache, and allocate it and
>> insert it in the cache immediately while holding the lock. After that we
>> copy the contents into the page.
>>
>> As a side effect of this, pages may exist in the cache for which the
>> copy failed and for these pages PageUptodate(page) == false. Modify code
>> that query the cache to handle this correctly.
>>
> 
> To be honest, I'm not sure I've done this bit correctly. Please take a
> look and let me know what you think. It may be too overly complicated
> to have !PageUptodate() pages in the cache and ask the rest of the
> code to handle that edge case correctly, but I'm not sure how else to
> fix this issue.
> 

I think you just moved the underflow from hugetlb_mcopy_atomic_pte to
hugetlb_no_page.  Why?

Consider the case where there is only one reserve left and someone does
the MCOPY_ATOMIC_NORMAL for the address.  We will allocate the page and
consume the reserve (reserve count == 0) and insert the page into the
cache.  Now, if the copy_huge_page_from_user fails we must drop the
locks/fault mutex to do the copy.  While locks are dropped, someone
faults on the address and ends up in hugetlb_no_page.  The page is in
the cache but not up to date, so we go down the allocate new page path
and will decrement the reserve count again to cause underflow.

How about this approach?
- Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte
  that you added.  That will catch the race where the page was added to
  the cache before entering the routine.
- With the above check in place, we only need to worry about the case
  where copy_huge_page_from_user fails and we must drop locks.  In this
  case we:
  - Free the page previously allocated.
  - Allocate a 'temporary' huge page without consuming reserves.  I'm
    thinking of something similar to page migration.
  - Drop the locks and let the copy_huge_page_from_user be done to the
    temporary page.
  - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the
    *pagep case) we need to once again check
    hugetlbfs_pagecache_present.
  - We then try to allocate the huge page which will consume the
    reserve.  If successful, copy contents of temporary page to newly
    allocated page.  Free temporary page.

There may be issues with this, and I have not given it deep thought.  It
does abuse the temporary huge page concept, but perhaps no more than
page migration.  Things do slow down if the extra page allocation and
copy is required, but that would only be the case if copy_huge_page_from_user
needs to be done without locks.  Not sure, but hoping that is rare.
Mina Almasry May 14, 2021, 12:23 a.m. UTC | #3
On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 5/13/21 4:49 PM, Mina Almasry wrote:
> > On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote:
> >>
> >> When hugetlb_mcopy_atomic_pte() is called with:
> >> - mode==MCOPY_ATOMIC_NORMAL and,
> >> - we already have a page in the page cache corresponding to the
> >> associated address,
> >>
> >> We will allocate a huge page from the reserves, and then fail to insert it
> >> into the cache and return -EEXIST. In this case, we need to return -EEXIST
> >> without allocating a new page as the page already exists in the cache.
> >> Allocating the extra page causes the resv_huge_pages to underflow temporarily
> >> until the extra page is freed.
> >>
> >> To fix this we check if a page exists in the cache, and allocate it and
> >> insert it in the cache immediately while holding the lock. After that we
> >> copy the contents into the page.
> >>
> >> As a side effect of this, pages may exist in the cache for which the
> >> copy failed and for these pages PageUptodate(page) == false. Modify code
> >> that query the cache to handle this correctly.
> >>
> >
> > To be honest, I'm not sure I've done this bit correctly. Please take a
> > look and let me know what you think. It may be too overly complicated
> > to have !PageUptodate() pages in the cache and ask the rest of the
> > code to handle that edge case correctly, but I'm not sure how else to
> > fix this issue.
> >
>
> I think you just moved the underflow from hugetlb_mcopy_atomic_pte to
> hugetlb_no_page.  Why?
>
> Consider the case where there is only one reserve left and someone does
> the MCOPY_ATOMIC_NORMAL for the address.  We will allocate the page and
> consume the reserve (reserve count == 0) and insert the page into the
> cache.  Now, if the copy_huge_page_from_user fails we must drop the
> locks/fault mutex to do the copy.  While locks are dropped, someone
> faults on the address and ends up in hugetlb_no_page.  The page is in
> the cache but not up to date, so we go down the allocate new page path
> and will decrement the reserve count again to cause underflow.
>

For what it's worth, I think I fixed the underflow with this patch,
not moved it. I added a check in hugetlb_no_page() such that if we
find a page in the cache with !PageUptodate(page), we will reuse that
page instead of allocating a new one and decrementing the count again.
Running the test with the WARN_ONCE_ON locally shows no warnings
again.

> How about this approach?

I'll give it a shot for sure. FWIW on first glance it looks more
complicated that what I have here, but my guess I'm not doing the
!PageUptodate() handling correctly and that's why it seems this
solution is simpler. I'll give it a shot though.

> - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte
>   that you added.  That will catch the race where the page was added to
>   the cache before entering the routine.
> - With the above check in place, we only need to worry about the case
>   where copy_huge_page_from_user fails and we must drop locks.  In this
>   case we:
>   - Free the page previously allocated.
>   - Allocate a 'temporary' huge page without consuming reserves.  I'm
>     thinking of something similar to page migration.
>   - Drop the locks and let the copy_huge_page_from_user be done to the
>     temporary page.
>   - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the
>     *pagep case) we need to once again check
>     hugetlbfs_pagecache_present.
>   - We then try to allocate the huge page which will consume the
>     reserve.  If successful, copy contents of temporary page to newly
>     allocated page.  Free temporary page.
>
> There may be issues with this, and I have not given it deep thought.  It
> does abuse the temporary huge page concept, but perhaps no more than
> page migration.  Things do slow down if the extra page allocation and
> copy is required, but that would only be the case if copy_huge_page_from_user
> needs to be done without locks.  Not sure, but hoping that is rare.
> --
> Mike Kravetz
Mike Kravetz May 14, 2021, 4:02 a.m. UTC | #4
On 5/13/21 5:23 PM, Mina Almasry wrote:
> On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> On 5/13/21 4:49 PM, Mina Almasry wrote:
>>> On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote:
>>>>
>>>> When hugetlb_mcopy_atomic_pte() is called with:
>>>> - mode==MCOPY_ATOMIC_NORMAL and,
>>>> - we already have a page in the page cache corresponding to the
>>>> associated address,
>>>>
>>>> We will allocate a huge page from the reserves, and then fail to insert it
>>>> into the cache and return -EEXIST. In this case, we need to return -EEXIST
>>>> without allocating a new page as the page already exists in the cache.
>>>> Allocating the extra page causes the resv_huge_pages to underflow temporarily
>>>> until the extra page is freed.
>>>>
>>>> To fix this we check if a page exists in the cache, and allocate it and
>>>> insert it in the cache immediately while holding the lock. After that we
>>>> copy the contents into the page.
>>>>
>>>> As a side effect of this, pages may exist in the cache for which the
>>>> copy failed and for these pages PageUptodate(page) == false. Modify code
>>>> that query the cache to handle this correctly.
>>>>
>>>
>>> To be honest, I'm not sure I've done this bit correctly. Please take a
>>> look and let me know what you think. It may be too overly complicated
>>> to have !PageUptodate() pages in the cache and ask the rest of the
>>> code to handle that edge case correctly, but I'm not sure how else to
>>> fix this issue.
>>>
>>
>> I think you just moved the underflow from hugetlb_mcopy_atomic_pte to
>> hugetlb_no_page.  Why?
>>
>> Consider the case where there is only one reserve left and someone does
>> the MCOPY_ATOMIC_NORMAL for the address.  We will allocate the page and
>> consume the reserve (reserve count == 0) and insert the page into the
>> cache.  Now, if the copy_huge_page_from_user fails we must drop the
>> locks/fault mutex to do the copy.  While locks are dropped, someone
>> faults on the address and ends up in hugetlb_no_page.  The page is in
>> the cache but not up to date, so we go down the allocate new page path
>> and will decrement the reserve count again to cause underflow.
>>
> 
> For what it's worth, I think I fixed the underflow with this patch,
> not moved it. I added a check in hugetlb_no_page() such that if we
> find a page in the cache with !PageUptodate(page), we will reuse that
> page instead of allocating a new one and decrementing the count again.
> Running the test with the WARN_ONCE_ON locally shows no warnings
> again.

My bad, sorry I missed that.

I am also concerned with the semantics of this approach and what happens
when a fault races with the userfaultfd copy.  Previously I asked Peter
if we could/should use a page found in the cache for the copy.  His
answer was as follows:

 AFAICT that's the expected behavior, and it need to be like that so as to avoid
 silent data corruption (if the page cache existed, it means the page is not
 "missing" at all, then it does not suite for a UFFDIO_COPY as it's only used
 for uffd page missing case).

The interesting thing in your approach is that there was no page in the
cache before the start of the userfaultfd copy.  Yet, because we drop
the locks we could race with a fault that would normally add a page to
the cache.  Unless I am reading your patch incorrectly again, the
!PageUptodate(page) path in hugetlb_no_page is going to clear the page
in the cache.  This clearing would race with the copy to page done by
the userfaultfd code.

> 
>> How about this approach?
> 
> I'll give it a shot for sure. FWIW on first glance it looks more
> complicated that what I have here, but my guess I'm not doing the
> !PageUptodate() handling correctly and that's why it seems this
> solution is simpler. I'll give it a shot though.

Ok, I will look into this more as well.  Unfortunately, there does not
appear to be a simple elegant solution.
Peter Xu May 14, 2021, 12:31 p.m. UTC | #5
Hi, Mike,

On Thu, May 13, 2021 at 09:02:15PM -0700, Mike Kravetz wrote:

[...]

> I am also concerned with the semantics of this approach and what happens
> when a fault races with the userfaultfd copy.  Previously I asked Peter
> if we could/should use a page found in the cache for the copy.  His
> answer was as follows:
> 
>  AFAICT that's the expected behavior, and it need to be like that so as to avoid
>  silent data corruption (if the page cache existed, it means the page is not
>  "missing" at all, then it does not suite for a UFFDIO_COPY as it's only used
>  for uffd page missing case).

I didn't follow the rest discussion in depth yet... but just to mention that
the above answer was for the question whether we can "update the page in the
page cache", rather than "use a page found in the page cache".

I think reuse the page should be fine, however it'll definitely break existing
user interface (as it'll expect -EEXIST for now - we have kselftest covers
that), meanwhile I don't see why the -EEXIST bothers a lot: it still tells the
user that this page was filled in already.  Normally it was filled in by
another UFFDIO_COPY (as we could have multiple uffd service threads) along with
a valid pte, then this userspace thread can simply skip this message as it
means the event has been handled by some other servicing thread.

(This also reminded me that there won't be a chance of UFFDIO_COPY race on page
 no page fault at least, since no page fault will always go into the uffd
 missing handling rather than filling in the page cache for a VM_UFFD_MISSING
 vma; while mmap read lock should guarantee VM_UFFD_MISSING be persistent)

Thanks,
Mike Kravetz May 14, 2021, 5:56 p.m. UTC | #6
On 5/14/21 5:31 AM, Peter Xu wrote:
> Hi, Mike,
> 
> On Thu, May 13, 2021 at 09:02:15PM -0700, Mike Kravetz wrote:
> 
> [...]
> 
>> I am also concerned with the semantics of this approach and what happens
>> when a fault races with the userfaultfd copy.  Previously I asked Peter
>> if we could/should use a page found in the cache for the copy.  His
>> answer was as follows:
>>
>>  AFAICT that's the expected behavior, and it need to be like that so as to avoid
>>  silent data corruption (if the page cache existed, it means the page is not
>>  "missing" at all, then it does not suite for a UFFDIO_COPY as it's only used
>>  for uffd page missing case).
> 
> I didn't follow the rest discussion in depth yet... but just to mention that
> the above answer was for the question whether we can "update the page in the
> page cache", rather than "use a page found in the page cache".
> 
> I think reuse the page should be fine, however it'll definitely break existing
> user interface (as it'll expect -EEXIST for now - we have kselftest covers
> that), meanwhile I don't see why the -EEXIST bothers a lot: it still tells the
> user that this page was filled in already.  Normally it was filled in by
> another UFFDIO_COPY (as we could have multiple uffd service threads) along with
> a valid pte, then this userspace thread can simply skip this message as it
> means the event has been handled by some other servicing thread.
> 
> (This also reminded me that there won't be a chance of UFFDIO_COPY race on page
>  no page fault at least, since no page fault will always go into the uffd
>  missing handling rather than filling in the page cache for a VM_UFFD_MISSING
>  vma; while mmap read lock should guarantee VM_UFFD_MISSING be persistent)

Perhaps I am missing something.

Since this is a shared mapping, can we not have a 'regular' mapping to
the same range that is uffd registered?  And, that regular mappings could
fault and race with the uffd copy code?
Axel Rasmussen May 14, 2021, 6:30 p.m. UTC | #7
It's complicated and would take some more time for me to be certain,
but after looking for half an hour or so this morning, I agree with
Mike that such a race is possible.

That is, we may back out into the retry path, and drop mmap_lock, and
leave a situation where a page is in the cache, but we have
!PageUptodate(). hugetlb_mcopy_atomic_pte clearly handles the
VM_SHARED case, so I don't see a reason why there can't be another
(non-userfaultfd-registered) mapping. If it were faulted at the right
time, it seems like such a fault would indeed zero the page, and then
the UFFDIO_COPY retry (once it acquired the lock again) would try to
reuse it.

On Fri, May 14, 2021 at 10:56 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 5/14/21 5:31 AM, Peter Xu wrote:
> > Hi, Mike,
> >
> > On Thu, May 13, 2021 at 09:02:15PM -0700, Mike Kravetz wrote:
> >
> > [...]
> >
> >> I am also concerned with the semantics of this approach and what happens
> >> when a fault races with the userfaultfd copy.  Previously I asked Peter
> >> if we could/should use a page found in the cache for the copy.  His
> >> answer was as follows:
> >>
> >>  AFAICT that's the expected behavior, and it need to be like that so as to avoid
> >>  silent data corruption (if the page cache existed, it means the page is not
> >>  "missing" at all, then it does not suite for a UFFDIO_COPY as it's only used
> >>  for uffd page missing case).
> >
> > I didn't follow the rest discussion in depth yet... but just to mention that
> > the above answer was for the question whether we can "update the page in the
> > page cache", rather than "use a page found in the page cache".
> >
> > I think reuse the page should be fine, however it'll definitely break existing
> > user interface (as it'll expect -EEXIST for now - we have kselftest covers
> > that), meanwhile I don't see why the -EEXIST bothers a lot: it still tells the
> > user that this page was filled in already.  Normally it was filled in by
> > another UFFDIO_COPY (as we could have multiple uffd service threads) along with
> > a valid pte, then this userspace thread can simply skip this message as it
> > means the event has been handled by some other servicing thread.
> >
> > (This also reminded me that there won't be a chance of UFFDIO_COPY race on page
> >  no page fault at least, since no page fault will always go into the uffd
> >  missing handling rather than filling in the page cache for a VM_UFFD_MISSING
> >  vma; while mmap read lock should guarantee VM_UFFD_MISSING be persistent)
>
> Perhaps I am missing something.
>
> Since this is a shared mapping, can we not have a 'regular' mapping to
> the same range that is uffd registered?  And, that regular mappings could
> fault and race with the uffd copy code?
>
> --
> Mike Kravetz
Peter Xu May 14, 2021, 7:16 p.m. UTC | #8
On Fri, May 14, 2021 at 10:56:36AM -0700, Mike Kravetz wrote:
> Since this is a shared mapping, can we not have a 'regular' mapping to
> the same range that is uffd registered?  And, that regular mappings could
> fault and race with the uffd copy code?

Yeah we can..  Please ignore the paragraph in the parenthesis.  Sorry.
Mina Almasry May 20, 2021, 7:18 p.m. UTC | #9
On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 5/13/21 4:49 PM, Mina Almasry wrote:
> > On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote:
> >>
> >> When hugetlb_mcopy_atomic_pte() is called with:
> >> - mode==MCOPY_ATOMIC_NORMAL and,
> >> - we already have a page in the page cache corresponding to the
> >> associated address,
> >>
> >> We will allocate a huge page from the reserves, and then fail to insert it
> >> into the cache and return -EEXIST. In this case, we need to return -EEXIST
> >> without allocating a new page as the page already exists in the cache.
> >> Allocating the extra page causes the resv_huge_pages to underflow temporarily
> >> until the extra page is freed.
> >>
> >> To fix this we check if a page exists in the cache, and allocate it and
> >> insert it in the cache immediately while holding the lock. After that we
> >> copy the contents into the page.
> >>
> >> As a side effect of this, pages may exist in the cache for which the
> >> copy failed and for these pages PageUptodate(page) == false. Modify code
> >> that query the cache to handle this correctly.
> >>
> >
> > To be honest, I'm not sure I've done this bit correctly. Please take a
> > look and let me know what you think. It may be too overly complicated
> > to have !PageUptodate() pages in the cache and ask the rest of the
> > code to handle that edge case correctly, but I'm not sure how else to
> > fix this issue.
> >
>
> I think you just moved the underflow from hugetlb_mcopy_atomic_pte to
> hugetlb_no_page.  Why?
>
> Consider the case where there is only one reserve left and someone does
> the MCOPY_ATOMIC_NORMAL for the address.  We will allocate the page and
> consume the reserve (reserve count == 0) and insert the page into the
> cache.  Now, if the copy_huge_page_from_user fails we must drop the
> locks/fault mutex to do the copy.  While locks are dropped, someone
> faults on the address and ends up in hugetlb_no_page.  The page is in
> the cache but not up to date, so we go down the allocate new page path
> and will decrement the reserve count again to cause underflow.
>
> How about this approach?
> - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte
>   that you added.  That will catch the race where the page was added to
>   the cache before entering the routine.
> - With the above check in place, we only need to worry about the case
>   where copy_huge_page_from_user fails and we must drop locks.  In this
>   case we:
>   - Free the page previously allocated.
>   - Allocate a 'temporary' huge page without consuming reserves.  I'm
>     thinking of something similar to page migration.
>   - Drop the locks and let the copy_huge_page_from_user be done to the
>     temporary page.
>   - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the
>     *pagep case) we need to once again check
>     hugetlbfs_pagecache_present.
>   - We then try to allocate the huge page which will consume the
>     reserve.  If successful, copy contents of temporary page to newly
>     allocated page.  Free temporary page.
>
> There may be issues with this, and I have not given it deep thought.  It
> does abuse the temporary huge page concept, but perhaps no more than
> page migration.  Things do slow down if the extra page allocation and
> copy is required, but that would only be the case if copy_huge_page_from_user
> needs to be done without locks.  Not sure, but hoping that is rare.

Just following up this a bit: I've implemented this approach locally,
and with it it's passing the test as-is. When I hack the code such
that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into
this edge case, which causes resv_huge_pages to underflow again (this
time permemantly):

- hugetlb_no_page() is called on an index and a page is allocated and
inserted into the cache consuming the reservation.
- remove_huge_page() is called on this index and the page is removed from cache.
- hugetlb_mcopy_atomic_pte() is called on this index, we do not find
the page in the cache and we trigger this code patch and the copy
fails.
- The allocations in this code path seem to double consume the
reservation and resv_huge_pages underflows.

I'm looking at this edge case to understand why a prior
remove_huge_page() causes my code to underflow resv_huge_pages.

> --
> Mike Kravetz
Mina Almasry May 20, 2021, 7:21 p.m. UTC | #10
On Thu, May 20, 2021 at 12:18 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 5/13/21 4:49 PM, Mina Almasry wrote:
> > > On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote:
> > >>
> > >> When hugetlb_mcopy_atomic_pte() is called with:
> > >> - mode==MCOPY_ATOMIC_NORMAL and,
> > >> - we already have a page in the page cache corresponding to the
> > >> associated address,
> > >>
> > >> We will allocate a huge page from the reserves, and then fail to insert it
> > >> into the cache and return -EEXIST. In this case, we need to return -EEXIST
> > >> without allocating a new page as the page already exists in the cache.
> > >> Allocating the extra page causes the resv_huge_pages to underflow temporarily
> > >> until the extra page is freed.
> > >>
> > >> To fix this we check if a page exists in the cache, and allocate it and
> > >> insert it in the cache immediately while holding the lock. After that we
> > >> copy the contents into the page.
> > >>
> > >> As a side effect of this, pages may exist in the cache for which the
> > >> copy failed and for these pages PageUptodate(page) == false. Modify code
> > >> that query the cache to handle this correctly.
> > >>
> > >
> > > To be honest, I'm not sure I've done this bit correctly. Please take a
> > > look and let me know what you think. It may be too overly complicated
> > > to have !PageUptodate() pages in the cache and ask the rest of the
> > > code to handle that edge case correctly, but I'm not sure how else to
> > > fix this issue.
> > >
> >
> > I think you just moved the underflow from hugetlb_mcopy_atomic_pte to
> > hugetlb_no_page.  Why?
> >
> > Consider the case where there is only one reserve left and someone does
> > the MCOPY_ATOMIC_NORMAL for the address.  We will allocate the page and
> > consume the reserve (reserve count == 0) and insert the page into the
> > cache.  Now, if the copy_huge_page_from_user fails we must drop the
> > locks/fault mutex to do the copy.  While locks are dropped, someone
> > faults on the address and ends up in hugetlb_no_page.  The page is in
> > the cache but not up to date, so we go down the allocate new page path
> > and will decrement the reserve count again to cause underflow.
> >
> > How about this approach?
> > - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte
> >   that you added.  That will catch the race where the page was added to
> >   the cache before entering the routine.
> > - With the above check in place, we only need to worry about the case
> >   where copy_huge_page_from_user fails and we must drop locks.  In this
> >   case we:
> >   - Free the page previously allocated.
> >   - Allocate a 'temporary' huge page without consuming reserves.  I'm
> >     thinking of something similar to page migration.
> >   - Drop the locks and let the copy_huge_page_from_user be done to the
> >     temporary page.
> >   - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the
> >     *pagep case) we need to once again check
> >     hugetlbfs_pagecache_present.
> >   - We then try to allocate the huge page which will consume the
> >     reserve.  If successful, copy contents of temporary page to newly
> >     allocated page.  Free temporary page.
> >
> > There may be issues with this, and I have not given it deep thought.  It
> > does abuse the temporary huge page concept, but perhaps no more than
> > page migration.  Things do slow down if the extra page allocation and
> > copy is required, but that would only be the case if copy_huge_page_from_user
> > needs to be done without locks.  Not sure, but hoping that is rare.
>
> Just following up this a bit: I've implemented this approach locally,
> and with it it's passing the test as-is. When I hack the code such
> that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into
> this edge case, which causes resv_huge_pages to underflow again (this
> time permemantly):
>
> - hugetlb_no_page() is called on an index and a page is allocated and
> inserted into the cache consuming the reservation.
> - remove_huge_page() is called on this index and the page is removed from cache.
> - hugetlb_mcopy_atomic_pte() is called on this index, we do not find
> the page in the cache and we trigger this code patch and the copy
> fails.
> - The allocations in this code path seem to double consume the
> reservation and resv_huge_pages underflows.
>
> I'm looking at this edge case to understand why a prior
> remove_huge_page() causes my code to underflow resv_huge_pages.
>

I should also mention, without a prior remove_huge_page() this code
path works fine, so it seems the fact that the reservation is consumed
before causes trouble, but I'm not sure why yet.

> > --
> > Mike Kravetz
Mike Kravetz May 20, 2021, 8 p.m. UTC | #11
On 5/20/21 12:21 PM, Mina Almasry wrote:
> On Thu, May 20, 2021 at 12:18 PM Mina Almasry <almasrymina@google.com> wrote:
>>
>> On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>> How about this approach?
>>> - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte
>>>   that you added.  That will catch the race where the page was added to
>>>   the cache before entering the routine.
>>> - With the above check in place, we only need to worry about the case
>>>   where copy_huge_page_from_user fails and we must drop locks.  In this
>>>   case we:
>>>   - Free the page previously allocated.
>>>   - Allocate a 'temporary' huge page without consuming reserves.  I'm
>>>     thinking of something similar to page migration.
>>>   - Drop the locks and let the copy_huge_page_from_user be done to the
>>>     temporary page.
>>>   - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the
>>>     *pagep case) we need to once again check
>>>     hugetlbfs_pagecache_present.
>>>   - We then try to allocate the huge page which will consume the
>>>     reserve.  If successful, copy contents of temporary page to newly
>>>     allocated page.  Free temporary page.
>>>
>>> There may be issues with this, and I have not given it deep thought.  It
>>> does abuse the temporary huge page concept, but perhaps no more than
>>> page migration.  Things do slow down if the extra page allocation and
>>> copy is required, but that would only be the case if copy_huge_page_from_user
>>> needs to be done without locks.  Not sure, but hoping that is rare.
>>
>> Just following up this a bit: I've implemented this approach locally,
>> and with it it's passing the test as-is. When I hack the code such
>> that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into
>> this edge case, which causes resv_huge_pages to underflow again (this
>> time permemantly):
>>
>> - hugetlb_no_page() is called on an index and a page is allocated and
>> inserted into the cache consuming the reservation.
>> - remove_huge_page() is called on this index and the page is removed from cache.
>> - hugetlb_mcopy_atomic_pte() is called on this index, we do not find
>> the page in the cache and we trigger this code patch and the copy
>> fails.
>> - The allocations in this code path seem to double consume the
>> reservation and resv_huge_pages underflows.
>>
>> I'm looking at this edge case to understand why a prior
>> remove_huge_page() causes my code to underflow resv_huge_pages.
>>
> 
> I should also mention, without a prior remove_huge_page() this code
> path works fine, so it seems the fact that the reservation is consumed
> before causes trouble, but I'm not sure why yet.
> 

Hi Mina,

How about quickly posting the code?  I may be able to provide more
suggestions if I can see the actual code.
Mina Almasry May 20, 2021, 8:31 p.m. UTC | #12
On Thu, May 20, 2021 at 1:00 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 5/20/21 12:21 PM, Mina Almasry wrote:
> > On Thu, May 20, 2021 at 12:18 PM Mina Almasry <almasrymina@google.com> wrote:
> >>
> >> On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>>
> >>> How about this approach?
> >>> - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte
> >>>   that you added.  That will catch the race where the page was added to
> >>>   the cache before entering the routine.
> >>> - With the above check in place, we only need to worry about the case
> >>>   where copy_huge_page_from_user fails and we must drop locks.  In this
> >>>   case we:
> >>>   - Free the page previously allocated.
> >>>   - Allocate a 'temporary' huge page without consuming reserves.  I'm
> >>>     thinking of something similar to page migration.
> >>>   - Drop the locks and let the copy_huge_page_from_user be done to the
> >>>     temporary page.
> >>>   - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the
> >>>     *pagep case) we need to once again check
> >>>     hugetlbfs_pagecache_present.
> >>>   - We then try to allocate the huge page which will consume the
> >>>     reserve.  If successful, copy contents of temporary page to newly
> >>>     allocated page.  Free temporary page.
> >>>
> >>> There may be issues with this, and I have not given it deep thought.  It
> >>> does abuse the temporary huge page concept, but perhaps no more than
> >>> page migration.  Things do slow down if the extra page allocation and
> >>> copy is required, but that would only be the case if copy_huge_page_from_user
> >>> needs to be done without locks.  Not sure, but hoping that is rare.
> >>
> >> Just following up this a bit: I've implemented this approach locally,
> >> and with it it's passing the test as-is. When I hack the code such
> >> that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into
> >> this edge case, which causes resv_huge_pages to underflow again (this
> >> time permemantly):
> >>
> >> - hugetlb_no_page() is called on an index and a page is allocated and
> >> inserted into the cache consuming the reservation.
> >> - remove_huge_page() is called on this index and the page is removed from cache.
> >> - hugetlb_mcopy_atomic_pte() is called on this index, we do not find
> >> the page in the cache and we trigger this code patch and the copy
> >> fails.
> >> - The allocations in this code path seem to double consume the
> >> reservation and resv_huge_pages underflows.
> >>
> >> I'm looking at this edge case to understand why a prior
> >> remove_huge_page() causes my code to underflow resv_huge_pages.
> >>
> >
> > I should also mention, without a prior remove_huge_page() this code
> > path works fine, so it seems the fact that the reservation is consumed
> > before causes trouble, but I'm not sure why yet.
> >
>
> Hi Mina,
>
> How about quickly posting the code?  I may be able to provide more
> suggestions if I can see the actual code.

Sure thing, attached my patch so far. It's quite messy with prints
everywhere and VM_BUG_ON() in error paths that I'm not handling yet.
I've also hacked the code so that the hugetlb_mcopy_atomic_pte() copy
always fails so I exercise that code path.

> --
> Mike Kravetz
Mina Almasry May 21, 2021, 2:05 a.m. UTC | #13
On Thu, May 20, 2021 at 1:31 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Thu, May 20, 2021 at 1:00 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 5/20/21 12:21 PM, Mina Almasry wrote:
> > > On Thu, May 20, 2021 at 12:18 PM Mina Almasry <almasrymina@google.com> wrote:
> > >>
> > >> On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >>>
> > >>> How about this approach?
> > >>> - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte
> > >>>   that you added.  That will catch the race where the page was added to
> > >>>   the cache before entering the routine.
> > >>> - With the above check in place, we only need to worry about the case
> > >>>   where copy_huge_page_from_user fails and we must drop locks.  In this
> > >>>   case we:
> > >>>   - Free the page previously allocated.
> > >>>   - Allocate a 'temporary' huge page without consuming reserves.  I'm
> > >>>     thinking of something similar to page migration.
> > >>>   - Drop the locks and let the copy_huge_page_from_user be done to the
> > >>>     temporary page.
> > >>>   - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the
> > >>>     *pagep case) we need to once again check
> > >>>     hugetlbfs_pagecache_present.
> > >>>   - We then try to allocate the huge page which will consume the
> > >>>     reserve.  If successful, copy contents of temporary page to newly
> > >>>     allocated page.  Free temporary page.
> > >>>
> > >>> There may be issues with this, and I have not given it deep thought.  It
> > >>> does abuse the temporary huge page concept, but perhaps no more than
> > >>> page migration.  Things do slow down if the extra page allocation and
> > >>> copy is required, but that would only be the case if copy_huge_page_from_user
> > >>> needs to be done without locks.  Not sure, but hoping that is rare.
> > >>
> > >> Just following up this a bit: I've implemented this approach locally,
> > >> and with it it's passing the test as-is. When I hack the code such
> > >> that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into
> > >> this edge case, which causes resv_huge_pages to underflow again (this
> > >> time permemantly):
> > >>
> > >> - hugetlb_no_page() is called on an index and a page is allocated and
> > >> inserted into the cache consuming the reservation.
> > >> - remove_huge_page() is called on this index and the page is removed from cache.
> > >> - hugetlb_mcopy_atomic_pte() is called on this index, we do not find
> > >> the page in the cache and we trigger this code patch and the copy
> > >> fails.
> > >> - The allocations in this code path seem to double consume the
> > >> reservation and resv_huge_pages underflows.
> > >>
> > >> I'm looking at this edge case to understand why a prior
> > >> remove_huge_page() causes my code to underflow resv_huge_pages.
> > >>
> > >
> > > I should also mention, without a prior remove_huge_page() this code
> > > path works fine, so it seems the fact that the reservation is consumed
> > > before causes trouble, but I'm not sure why yet.
> > >
> >
> > Hi Mina,
> >
> > How about quickly posting the code?  I may be able to provide more
> > suggestions if I can see the actual code.
>
> Sure thing, attached my patch so far. It's quite messy with prints
> everywhere and VM_BUG_ON() in error paths that I'm not handling yet.
> I've also hacked the code so that the hugetlb_mcopy_atomic_pte() copy
> always fails so I exercise that code path.
>

Of course right after I send out my patch, I figure out what's wrong.
It turns out freeing the allocated page when the copy fails is
extremely complicated (or complicated to figure out why it's broken).
Turns out I need to:

restore_page_on_error()
if (!HPageRestoreReserves(page)) {
    hugetlb_unreserve_pages(mapping, idx, idx + 1, 1);
}
put_page(page);

This is because even though we always allocate asking for a page from
the reserves, the page may not come from the reserves if the VM
doesn't have reservation for this index (which is the case if the page
has been allocated by hugetlb_no_page() and then removed with
remove_huge_page()). So, we need to correctly handle both cases.

Sorry for the noise but this was hard to track down. Patch should be
incoming soon unless I run into other issues with a closer look.

> > --
> > Mike Kravetz
diff mbox series

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a2a42335e8fd..cc027c335242 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -346,7 +346,7 @@  static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to)

 		/* Find the page */
 		page = find_lock_page(mapping, index);
-		if (unlikely(page == NULL)) {
+		if (unlikely(page == NULL || !PageUptodate(page))) {
 			/*
 			 * We have a HOLE, zero out the user-buffer for the
 			 * length of the hole or request.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 629aa4c2259c..a5a5fbf7ac25 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4543,7 +4543,7 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,

 retry:
 	page = find_lock_page(mapping, idx);
-	if (!page) {
+	if (!page || !PageUptodate(page)) {
 		/* Check for page in userfault range */
 		if (userfaultfd_missing(vma)) {
 			ret = hugetlb_handle_userfault(vma, mapping, idx,
@@ -4552,26 +4552,30 @@  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			goto out;
 		}

-		page = alloc_huge_page(vma, haddr, 0);
-		if (IS_ERR(page)) {
-			/*
-			 * Returning error will result in faulting task being
-			 * sent SIGBUS.  The hugetlb fault mutex prevents two
-			 * tasks from racing to fault in the same page which
-			 * could result in false unable to allocate errors.
-			 * Page migration does not take the fault mutex, but
-			 * does a clear then write of pte's under page table
-			 * lock.  Page fault code could race with migration,
-			 * notice the clear pte and try to allocate a page
-			 * here.  Before returning error, get ptl and make
-			 * sure there really is no pte entry.
-			 */
-			ptl = huge_pte_lock(h, mm, ptep);
-			ret = 0;
-			if (huge_pte_none(huge_ptep_get(ptep)))
-				ret = vmf_error(PTR_ERR(page));
-			spin_unlock(ptl);
-			goto out;
+		if (!page) {
+			page = alloc_huge_page(vma, haddr, 0);
+			if (IS_ERR(page)) {
+				/*
+				 * Returning error will result in faulting task
+				 * being sent SIGBUS.  The hugetlb fault mutex
+				 * prevents two tasks from racing to fault in
+				 * the same page which could result in false
+				 * unable to allocate errors.  Page migration
+				 * does not take the fault mutex, but does a
+				 * clear then write of pte's under page table
+				 * lock.  Page fault code could race with
+				 * migration, notice the clear pte and try to
+				 * allocate a page here.  Before returning
+				 * error, get ptl and make sure there really is
+				 * no pte entry.
+				 */
+				ptl = huge_pte_lock(h, mm, ptep);
+				ret = 0;
+				if (huge_pte_none(huge_ptep_get(ptep)))
+					ret = vmf_error(PTR_ERR(page));
+				spin_unlock(ptl);
+				goto out;
+			}
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
@@ -4868,31 +4872,55 @@  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct page **pagep)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
-	struct address_space *mapping;
-	pgoff_t idx;
+	struct hstate *h = hstate_vma(dst_vma);
+	struct address_space *mapping = dst_vma->vm_file->f_mapping;
+	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
 	unsigned long size;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
-	struct hstate *h = hstate_vma(dst_vma);
 	pte_t _dst_pte;
 	spinlock_t *ptl;
-	int ret;
+	int ret = -ENOMEM;
 	struct page *page;
 	int writable;

-	mapping = dst_vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
-
 	if (is_continue) {
 		ret = -EFAULT;
-		page = find_lock_page(mapping, idx);
-		if (!page)
+		page = hugetlbfs_pagecache_page(h, dst_vma, dst_addr);
+		if (!page) {
+			ret = -ENOMEM;
 			goto out;
+		}
 	} else if (!*pagep) {
-		ret = -ENOMEM;
+		/* If a page already exists, then it's UFFDIO_COPY for
+		 * a non-missing case. Return -EEXIST.
+		 */
+		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			ret = -EEXIST;
+			goto out;
+		}
+
 		page = alloc_huge_page(dst_vma, dst_addr, 0);
 		if (IS_ERR(page))
 			goto out;

+		/* Add shared, newly allocated pages to the page cache. */
+		if (vm_shared) {
+			size = i_size_read(mapping->host) >> huge_page_shift(h);
+			ret = -EFAULT;
+			if (idx >= size)
+				goto out;
+
+			/*
+			 * Serialization between remove_inode_hugepages() and
+			 * huge_add_to_page_cache() below happens through the
+			 * hugetlb_fault_mutex_table that here must be hold by
+			 * the caller.
+			 */
+			ret = huge_add_to_page_cache(page, mapping, idx);
+			if (ret)
+				goto out;
+		}
+
 		ret = copy_huge_page_from_user(page,
 						(const void __user *) src_addr,
 						pages_per_huge_page(h), false);
@@ -4916,24 +4944,6 @@  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 */
 	__SetPageUptodate(page);

-	/* Add shared, newly allocated pages to the page cache. */
-	if (vm_shared && !is_continue) {
-		size = i_size_read(mapping->host) >> huge_page_shift(h);
-		ret = -EFAULT;
-		if (idx >= size)
-			goto out_release_nounlock;
-
-		/*
-		 * Serialization between remove_inode_hugepages() and
-		 * huge_add_to_page_cache() below happens through the
-		 * hugetlb_fault_mutex_table that here must be hold by
-		 * the caller.
-		 */
-		ret = huge_add_to_page_cache(page, mapping, idx);
-		if (ret)
-			goto out_release_nounlock;
-	}
-
 	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
 	spin_lock(ptl);

@@ -4994,7 +5004,6 @@  int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	spin_unlock(ptl);
 	if (vm_shared || is_continue)
 		unlock_page(page);
-out_release_nounlock:
 	put_page(page);
 	goto out;
 }