[v3] mm/gup: Allow real explicit breaking of COW

Message ID 20200811183950.10603-1-peterx@redhat.com (mailing list archive)
State New, archived

Commit Message

Peter Xu Aug. 11, 2020, 6:39 p.m. UTC
Starting from commit 17839856fd58 ("gup: document and work around "COW can
break either way" issue", 2020-06-02), explicit copy-on-write behavior is
enforced for private gup pages even if the access is read-only.  This is
achieved by always passing FOLL_WRITE to emulate a write.

That should fix the COW issue we were facing; however, the above commit can
also break userfaultfd-wp and applications like umapsort [1,2].

The general routine of a umap-like program is: the userspace library manages
page allocations, and it evicts the least recently used pages from memory to
external storage (e.g., file systems).  Below are the general steps the uffd
service thread takes to evict an in-memory page when the page pool is full:

  (1) UFFDIO_WRITEPROTECT with mode=WP on some to-be-evicted page P, so that
      further writes to page P will block (keep page P clean)
  (2) Copy page P to external storage (e.g. file system)
  (3) MADV_DONTNEED to evict page P

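For illustration, a minimal sketch of these steps in the uffd service thread
(error handling omitted; uffd, p, page_size and store_fd are assumed to be
set up elsewhere as the userfaultfd, the page address, the page size and the
backing file descriptor):

  struct uffdio_writeprotect wp = {
          .range = { .start = (unsigned long)p, .len = page_size },
          .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
  };
  ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);  /* (1) write-protect page P */
  write(store_fd, p, page_size);          /* (2) dump P; a read gup in the kernel */
  madvise(p, page_size, MADV_DONTNEED);   /* (3) evict page P */
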
Here step (1) makes sure that the page to dump is always up-to-date, so that
the page snapshot in the file system is consistent with the one that was in
memory.  However, with commit 17839856fd58, step (2) can potentially hang
itself, because e.g. if we use write() on a file system fd to dump the page
data, that becomes a read gup request in the file system driver to read the
page content, and the read gup is then turned into a write gup due to the
newly enforced COW behavior.  This write gup further triggers
handle_userfault() and hangs the uffd service thread itself.

The problem would also go away if we replaced the write() to the file system
with a memory write to an mmaped region in the userspace library, because
normal page faults do not enforce COW; only gup is affected.  However, we
cannot forbid users from using write() or any other form of kernel-level
read gup.

One solution is actually already mentioned in commit 17839856fd58, which is
to provide explicit BREAK_COW semantics for enforced COW.  Then we can still
use FAULT_FLAG_WRITE to distinguish a "real write request" from an "enforced
COW (read) request".

With enforced COW, we also need to inherit the UFFD_WP bit during COW,
because now COW can happen with UFFD_WP enabled (previously it could not).

While at it, rename the variable in __handle_mm_fault() from "dirty" to
"cow" to better suit its functionality.

[1] https://github.com/LLNL/umap-apps/blob/develop/src/umapsort/umapsort.cpp
[2] https://github.com/LLNL/umap

CC: Marty Mcfadden <mcfadden8@llnl.gov>
CC: Maya B. Gokhale <gokhale2@llnl.gov>
CC: Andrea Arcangeli <aarcange@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Jann Horn <jannh@google.com>
CC: Christoph Hellwig <hch@lst.de>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Kirill Shutemov <kirill@shutemov.name>
CC: Jan Kara <jack@suse.cz>
Fixes: 17839856fd58 ("gup: document and work around "COW can break either way" issue")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v3:
- inherit UFFD_WP bit for COW too
- take care of huge page cases
- more comments
v2:
- apply FAULT_FLAG_BREAK_COW correctly when FOLL_BREAK_COW [Christoph]
- removed comments above do_wp_page which seems redundant
---
 include/linux/mm.h |  3 +++
 mm/gup.c           |  6 ++++--
 mm/huge_memory.c   | 12 +++++++++++-
 mm/memory.c        | 39 +++++++++++++++++++++++++++++++--------
 4 files changed, 49 insertions(+), 11 deletions(-)

Comments

Jann Horn Aug. 11, 2020, 7:07 p.m. UTC | #1
On Tue, Aug 11, 2020 at 8:39 PM Peter Xu <peterx@redhat.com> wrote:
[...]
> diff --git a/mm/memory.c b/mm/memory.c
[...]
> +                        * copied due to enfornced COW.  When it happens, we

(typo here and in the huge_memory version)

[...]
> diff --git a/mm/gup.c b/mm/gup.c
> index d8a33dd1430d..c33e84ab9c36 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -870,6 +870,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>                 return -ENOENT;
>         if (*flags & FOLL_WRITE)
>                 fault_flags |= FAULT_FLAG_WRITE;
> +       if (*flags & FOLL_BREAK_COW)
> +               fault_flags |= FAULT_FLAG_BREAK_COW;
>         if (*flags & FOLL_REMOTE)
>                 fault_flags |= FAULT_FLAG_REMOTE;
>         if (locked)
> @@ -1076,7 +1078,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>                         }
>                         if (is_vm_hugetlb_page(vma)) {
>                                 if (should_force_cow_break(vma, foll_flags))
> -                                       foll_flags |= FOLL_WRITE;
> +                                       foll_flags |= FOLL_BREAK_COW;

How does this interact with the FOLL_WRITE check in follow_page_pte()?
If we want the COW to be broken, follow_page_pte() would have to treat
FOLL_BREAK_COW similarly to FOLL_WRITE, right?
Linus Torvalds Aug. 11, 2020, 7:24 p.m. UTC | #2
On Tue, Aug 11, 2020 at 11:40 AM Peter Xu <peterx@redhat.com> wrote:
>
> index 206f52b36ffb..c88f773d03af 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1296,7 +1296,17 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
>         if (reuse_swap_page(page, NULL)) {
>                 pmd_t entry;
>                 entry = pmd_mkyoung(orig_pmd);
> -               entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> +               entry = pmd_mkdirty(entry);
> +               if (pmd_uffd_wp(orig_pmd))
> +                       /*
> +                        * This can happen when an uffd-wp protected page is
> +                        * copied due to enfornced COW.  When it happens, we
> +                        * need to keep the uffd-wp bit even after COW, and
> +                        * make sure write bit is kept cleared.
> +                        */
> +                       entry = pmd_mkuffd_wp(pmd_wrprotect(entry));
> +               else
> +                       entry = maybe_pmd_mkwrite(entry, vma);
>                 if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
>                         update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
>                 unlock_page(page);
> diff --git a/mm/memory.c b/mm/memory.c
> index c39a13b09602..b27b555a9df8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2706,7 +2706,17 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>                 flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>                 entry = mk_pte(new_page, vma->vm_page_prot);
>                 entry = pte_sw_mkyoung(entry);
> -               entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +               entry = pte_mkdirty(entry);
> +               if (pte_uffd_wp(vmf->orig_pte))
> +                       /*
> +                        * This can happen when an uffd-wp protected page is
> +                        * copied due to enfornced COW.  When it happens, we
> +                        * need to keep the uffd-wp bit even after COW, and
> +                        * make sure write bit is kept cleared.
> +                        */
> +                       entry = pte_mkuffd_wp(pte_wrprotect(entry));
> +               else
> +                       entry = maybe_mkwrite(entry, vma);
>                 /*
>                  * Clear the pte entry and flush it first, before updating the
>                  * pte with the new entry. This will avoid a race condition

I think this needs to be cleaned up some way. I realize it's not an
exact duplicate (pmd vs pte), but this code is illegible.

Maybe just making it a helper inline function (well, two separate
ones) with the comment above the function would resolve my "this is
very ugly" concerns.
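
For example, an untested sketch for the PTE case (the helper name is made up
here; the pmd version would be its twin):

        /*
         * Sketch only: keep the uffd-wp bit and the write protection when
         * the COW copy was forced on a uffd-wp protected page, otherwise
         * make the new entry writable as usual.
         */
        static inline pte_t cow_mkpte(pte_t entry, struct vm_area_struct *vma,
                                      bool uffd_wp)
        {
                entry = pte_mkdirty(entry);
                if (uffd_wp)
                        return pte_mkuffd_wp(pte_wrprotect(entry));
                return maybe_mkwrite(entry, vma);
        }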


> @@ -2900,7 +2910,13 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  {
>         struct vm_area_struct *vma = vmf->vma;
>
> -       if (userfaultfd_pte_wp(vma, *vmf->pte)) {
> +       /*
> +        * Userfaultfd-wp only cares about real writes.  E.g., enforced COW for
> +        * read does not count.  When that happens, we will do the COW with the
> +        * UFFD_WP bit inherited from the original PTE/PMD.
> +        */
> +       if ((vmf->flags & FAULT_FLAG_WRITE) &&
> +           userfaultfd_pte_wp(vma, *vmf->pte)) {
>                 pte_unmap_unlock(vmf->pte, vmf->ptl);
>                 return handle_userfault(vmf, VM_UFFD_WP);
>         }
> @@ -4117,7 +4133,14 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
>  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
>  {
>         if (vma_is_anonymous(vmf->vma)) {
> -               if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
> +               /*
> +                * Userfaultfd-wp only cares about real writes.  E.g., enforced
> +                * COW for read does not count.  When that happens, we will do
> +                * the COW with the UFFD_WP bit inherited from the original
> +                * PTE/PMD.
> +                */
> +               if ((vmf->flags & FAULT_FLAG_WRITE) &&
> +                   userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
>                         return handle_userfault(vmf, VM_UFFD_WP);

Here again the comment placement could be improved. Particularly in
the do_wp_page() case, we have a big and somewhat complex function,
and this duplicated boiler-plate makes me worry.

Making it a helper function with a comment above would again I think
make it more legible.
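
E.g. something like this untested sketch (name made up), plus a pmd twin
built on userfaultfd_huge_pmd_wp():

        /*
         * Sketch only: uffd-wp should only trigger for real writes; an
         * enforced-COW read fault falls through to the copy path and
         * inherits the uffd-wp bit there instead.
         */
        static inline bool wp_fault_is_userfaultfd_wp(struct vm_fault *vmf)
        {
                return (vmf->flags & FAULT_FLAG_WRITE) &&
                       userfaultfd_pte_wp(vmf->vma, *vmf->pte);
        }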

And I think Jann is on the money wrt the follow_page_pte() issue.

I think you broke COW break there entirely.

That was one of the reasons I did just that "make it use FOLL_WRITE"
originally, because it meant that we couldn't have any subtle places
we'd missed.

Now I wonder if there's any other case of FOLL_WRITE that is missing.

            Linus
Peter Xu Aug. 11, 2020, 8:02 p.m. UTC | #3
On Tue, Aug 11, 2020 at 09:07:17PM +0200, Jann Horn wrote:
> On Tue, Aug 11, 2020 at 8:39 PM Peter Xu <peterx@redhat.com> wrote:
> [...]
> > diff --git a/mm/memory.c b/mm/memory.c
> [...]
> > +                        * copied due to enfornced COW.  When it happens, we
> 
> (typo here and in the huge_memory version)

Right..

> 
> [...]
> > diff --git a/mm/gup.c b/mm/gup.c
> > index d8a33dd1430d..c33e84ab9c36 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -870,6 +870,8 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
> >                 return -ENOENT;
> >         if (*flags & FOLL_WRITE)
> >                 fault_flags |= FAULT_FLAG_WRITE;
> > +       if (*flags & FOLL_BREAK_COW)
> > +               fault_flags |= FAULT_FLAG_BREAK_COW;
> >         if (*flags & FOLL_REMOTE)
> >                 fault_flags |= FAULT_FLAG_REMOTE;
> >         if (locked)
> > @@ -1076,7 +1078,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >                         }
> >                         if (is_vm_hugetlb_page(vma)) {
> >                                 if (should_force_cow_break(vma, foll_flags))
> > -                                       foll_flags |= FOLL_WRITE;
> > +                                       foll_flags |= FOLL_BREAK_COW;
> 
> How does this interact with the FOLL_WRITE check in follow_page_pte()?
> If we want the COW to be broken, follow_page_pte() would have to treat
> FOLL_BREAK_COW similarly to FOLL_WRITE, right?

Good point...  I did check follow_page_mask() to confirm FOLL_COW will still
be set correctly after applying the patch, but I missed the FOLL_WRITE part.

Does below look good to you?

diff --git a/mm/gup.c b/mm/gup.c
index 9d1f44b01165..f4f2a69c6fe7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -439,7 +439,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
        }
        if ((flags & FOLL_NUMA) && pte_protnone(pte))
                goto no_page;
-       if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
+       if ((flags & (FOLL_WRITE | FOLL_BREAK_COW)) &&
+           !can_follow_write_pte(pte, flags)) {
                pte_unmap_unlock(ptep, ptl);
                return NULL;
        }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4f192efef37c..edbd42c9d576 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1340,7 +1340,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 
        assert_spin_locked(pmd_lockptr(mm, pmd));
 
-       if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
+       if (flags & (FOLL_WRITE | FOLL_BREAK_COW) &&
+           !can_follow_write_pmd(*pmd, flags))
                goto out;
 
        /* Avoid dumping huge zero page */

Thanks,
Linus Torvalds Aug. 11, 2020, 8:06 p.m. UTC | #4

On Tue, Aug 11, 2020 at 12:24 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Now I wonder if there's any other case of FOLL_WRITE that is missing.

Actually, now I wonder if we really should have tried to handle the
wrong-way cow reuse case some other way entirely.

When discussing this wrong-way-COW issue originally I looked at just doing

        struct page *page = vmf->page;

        if (page_count(page) != 1)
                goto copy;
        if (!trylock_page(page))
                goto copy;
        if (page_mapcount(page) != 1 && page_count(page) != 1) {
                unlock_page(page);
                goto copy;
        }
        /* Ok, we've got the only map reference, and the only
         *  page count reference, and the page is locked,
         * it's dark out, and we're wearing sunglasses. Hit it.
         */
        wp_page_reuse(vmf);
        unlock_page(page);
        return VM_FAULT_WRITE;

at the top of the PageAnon() case in do_wp_page(), and be entirely done with it.

Make the rule be that we *only* re-use the page if there is no
question whatsoever that we're the only possible owner of it.
Anything else at all - whether they be GUP users, other processes,
KSM, hugepage collapsing, whatever: don't even try.

That would possibly cause a lot of extra copies, but our rules for
"can we re-use this page" are just crazy complicated. And now the
"minimal" thing of just always breaking COW ends up causing
complications of its own.

IOW, maybe all of this falls under "yeah, we have historical reasons
for all of it, but it's just not worth the pain".

We do a _lot_ of complex stuff just to check whether we could possibly
share the page.

Maybe trying to reuse the page just isn't worth it?

Adding Andrea to the cc (although he probably sees this through
linux-mm anyway). Maybe he can go "naah, that will be horribly bad,
because..."

Then we could get rid of all the FAULT_FORCE_COW games again entirely,
and people can point fingers at me and laugh behind my back because of
what a bad idea it was.

                    Linus
Jann Horn Aug. 11, 2020, 8:22 p.m. UTC | #5
On Tue, Aug 11, 2020 at 10:03 PM Peter Xu <peterx@redhat.com> wrote:
> On Tue, Aug 11, 2020 at 09:07:17PM +0200, Jann Horn wrote:
> > On Tue, Aug 11, 2020 at 8:39 PM Peter Xu <peterx@redhat.com> wrote:
[...]
> > > @@ -1076,7 +1078,7 @@ static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > >                         }
> > >                         if (is_vm_hugetlb_page(vma)) {
> > >                                 if (should_force_cow_break(vma, foll_flags))
> > > -                                       foll_flags |= FOLL_WRITE;
> > > +                                       foll_flags |= FOLL_BREAK_COW;
> >
> > How does this interact with the FOLL_WRITE check in follow_page_pte()?
> > If we want the COW to be broken, follow_page_pte() would have to treat
> > FOLL_BREAK_COW similarly to FOLL_WRITE, right?
>
> Good point...  I did check follow_page_mask() to confirm FOLL_COW will still
> be set correctly after applying the patch, but I missed the FOLL_WRITE part.
>
> Does below look good to you?
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 9d1f44b01165..f4f2a69c6fe7 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -439,7 +439,8 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>         }
>         if ((flags & FOLL_NUMA) && pte_protnone(pte))
>                 goto no_page;
> -       if ((flags & FOLL_WRITE) && !can_follow_write_pte(pte, flags)) {
> +       if ((flags & (FOLL_WRITE | FOLL_BREAK_COW)) &&
> +           !can_follow_write_pte(pte, flags)) {
>                 pte_unmap_unlock(ptep, ptl);
>                 return NULL;
>         }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4f192efef37c..edbd42c9d576 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1340,7 +1340,8 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>
>         assert_spin_locked(pmd_lockptr(mm, pmd));
>
> -       if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
> +       if (flags & (FOLL_WRITE | FOLL_BREAK_COW) &&
> +           !can_follow_write_pmd(*pmd, flags))
>                 goto out;
>
>         /* Avoid dumping huge zero page */

Well, I don't see anything immediately wrong with it, at least. Not
that that means much...

Although in addition to the normal-page path and the transhuge path,
you'll probably also have to make the same change in the hugetlb path.
I guess you may have to grep through all the uses of FOLL_WRITE, as
Linus suggested, to see if there are any other missing spots.
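
(For the hugetlb path that would presumably be where follow_hugetlb_page()
keys off FOLL_WRITE to decide whether to fault - roughly the below, though I
haven't checked the details:

        if (absent || ((flags & (FOLL_WRITE | FOLL_BREAK_COW)) &&
                       !huge_pte_write(huge_ptep_get(pte)))) {
                /* ...fault the page in as before... */
        }
)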
Linus Torvalds Aug. 11, 2020, 8:46 p.m. UTC | #6
On Tue, Aug 11, 2020 at 1:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Maybe trying to reuse the page just isn't worth it?

Well, the attached patch boots, and hasn't slowed kernel compiles
down. But it may do horrible things due to leaving swap cache pages
and KSM pages to be reaped by the memory scanner, instead of being
reused.

I wouldn't notice, I have too much memory in this machine anyway.

It might have positive side effects too, of course. Not waiting for
the page lock in the page fault case could be a big win on some loads.
The do_wp_page() path was one of the paths to the page lock that caused the
nasty latency spikes (I'm not sure it was a dominant one, but it was
up there).

So maybe it is worth running some test loads on. And while this patch
doesn't do it, applying this should mean that you can just revert all
the COW games entirely, and we can remove the should_force_cow_break()
from the GUP paths.

(Also - if this actually works, we can get rid of reuse_ksm_page(),
this was the only user)

                 Linus
Peter Xu Aug. 11, 2020, 9:23 p.m. UTC | #7
On Tue, Aug 11, 2020 at 10:22:00PM +0200, Jann Horn wrote:
> On Tue, Aug 11, 2020 at 10:03 PM Peter Xu <peterx@redhat.com> wrote:
> > On Tue, Aug 11, 2020 at 09:07:17PM +0200, Jann Horn wrote:
> > > On Tue, Aug 11, 2020 at 8:39 PM Peter Xu <peterx@redhat.com> wrote:
> [...]
> 
> Well, I don't see anything immediately wrong with it, at least. Not
> that that means much...
> 
> Although in addition to the normal-page path and the transhuge path,
> you'll probably also have to make the same change in the hugetlb path.
> I guess you may have to grep through all the uses of FOLL_WRITE, as
> Linus suggested, to see if there are any other missing spots.

Right.  Moreover, I feel like the enforced COW is not completely done in the
hugetlbfs code even without this patch.  E.g., it lacks the proper return of
VM_FAULT_WRITE in hugetlb_cow(), and also the further conversion logic to
translate that into FOLL_COW (which, iiuc, should probably be done in
follow_hugetlb_page()).

I quickly went over the other FOLL_WRITE references and didn't see any other
suspicious spots besides hugetlb (I only looked at the places that can be
called by __get_user_pages(), though; that's the only place we set
FOLL_BREAK_COW after all).  Meanwhile, we always skip the fast gup paths when
strictly breaking COW, so those seem OK too.

Thanks,
Peter Xu Aug. 11, 2020, 9:42 p.m. UTC | #8
On Tue, Aug 11, 2020 at 01:46:05PM -0700, Linus Torvalds wrote:
> On Tue, Aug 11, 2020 at 1:06 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Maybe trying to reuse the page just isn't worth it?
> 
> Well, the attached patch boots, and hasn't slowed kernel compiles
> down. But it may do horrible things due to leaving swap cache pages
> and KSM pages to be reaped by the memory scanner, instead of being
> reused.
> 
> I wouldn't notice, I have too much memory in this machine anyway.
> 
> It might have positive side effects too, of course. Not waiting for
> the page lock in the page fault case could be a big win on some loads.
> The do_wp_page() path was one of the paths to the page lock that caused the
> nasty latency spikes (I'm not sure it was a dominant one, but it was
> up there).
> 
> So maybe it is worth running some test loads on. And while this patch
> doesn't do it, applying this should mean that you can just revert all
> the COW games entirely, and we can remove the should_force_cow_break()
> from the GUP paths.
> 
> (Also - if this actually works, we can get rid of reuse_ksm_page(),
> this was the only user)

I don't know the reuse refactoring patch well enough (it at least looks
functionally correct), but... IMHO we still need the enforced COW logic
whether or not we refactor the page reuse logic, am I right?

Example:

  - Processes A & B share a private anonymous page P0

  - Process A does a read get_user_pages() on page P0

  - Process A (e.g., another thread of process A, or as long as process A still
    holds the page P0 somehow) writes to page P0, which triggers COW, so for
    process A the page P0 is replaced by P1 with identical content

Then process A still keeps a reference to page P0 that potentially belongs to
process B or others?

Thanks,
Linus Torvalds Aug. 11, 2020, 11:10 p.m. UTC | #9
On Tue, Aug 11, 2020 at 2:43 PM Peter Xu <peterx@redhat.com> wrote:
>
> I don't know the reuse refactoring patch well enough (it at least looks
> functionally correct), but... IMHO we still need the enforced COW logic
> whether or not we refactor the page reuse logic, am I right?
>
> Example:
>
>   - Processes A & B share a private anonymous page P0
>
>   - Process A does a read get_user_pages() on page P0
>
>   - Process A (e.g., another thread of process A, or as long as process A still
>     holds the page P0 somehow) writes to page P0, which triggers COW, so for
>     process A the page P0 is replaced by P1 with identical content
>
> Then process A still keeps a reference to page P0 that potentially belongs to
> process B or others?

The COW from process A will indeed keep a reference to page P0 (for
whatever nefarious kernel use it did the GUP for). And yes, that page
is still mapped into process B.

HOWEVER.

Since the GUP will be incrementing the reference count of said page,
the actual problem has gone away. Because the GUP copy won't be
modifying the page (it's a read lookup), and as long as process B only
reads from the page, we're happily sharing a read-only page.

And if process B now starts writing to it, the "page_count()" check at
fault time will trigger, and process B will do a COW copy.

So now we'll have three copies of the page: the original one is being
kept for the GUP, and both A and B did their COW copies in their page
tables.

And that's exactly what we wanted - they are all now containing
different data, after all.

The problem with the *current* code is that we don't actually look at
the page count at all, only the mapping count, so the GUP reference
count is basically invisible.

And the reason we don't look too closely at the page count is that
there's a lot of incidental things that can affect it, like the whole
KSM reference, the swap cache reference, other GUP users etc etc. So
we historically have tried to maximize the amount of sharing we can
do.

But that "maximize sharing" is really complicated.

That's the big change of that simplification patch - it's basically
saying that "whenever _anything_ else has a reference to that page,
we'll just copy and not even try to share".

             Linus
Peter Xu Aug. 20, 2020, 9:54 p.m. UTC | #10
On Tue, Aug 11, 2020 at 04:10:57PM -0700, Linus Torvalds wrote:
> [...]

Sorry for the late reply, and thanks for the explanations.  They definitely
helped me understand.

So, which way should we go?

I kind of prefer the new suggestion of removing code rather than adding new
code.  I definitely don't know enough about its side effects, especially
performance-wise on either KSM or swap, but... IIUC the worst case is that
we'll get some perf report later on, and it also seems not hard to revert
the patch later if we want.

Thanks,
Linus Torvalds Aug. 20, 2020, 10:01 p.m. UTC | #11
On Thu, Aug 20, 2020 at 2:54 PM Peter Xu <peterx@redhat.com> wrote:
>
> I kind of prefer the new suggestion of removing code rather than adding new
> code.  I definitely don't know enough about its side effects, especially
> performance-wise on either KSM or swap, but... IIUC the worst case is that
> we'll get some perf report later on, and it also seems not hard to revert
> the patch later if we want.

Well, would you be willing to try this patch out?

After you apply that patch, you should be able to remove the
should_force_cow_break() games entirely, because a write to the page
should always break cow towards the writer if there are any GUP users
around (put another way: away from the GUP).

However, to make the test meaningful, it really should do some swap testing too.

            Linus
Peter Xu Aug. 21, 2020, 2:34 a.m. UTC | #12
Yep.  I'll run some tests and update soon.

Thanks,
Peter

On Thu., Aug. 20, 2020, 18:01 Linus Torvalds <torvalds@linux-foundation.org>
wrote:

> On Thu, Aug 20, 2020 at 2:54 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > I kind of prefer the new suggestion of removing code rather than adding new
> > code.  I definitely don't know enough about its side effects, especially
> > performance-wise on either KSM or swap, but... IIUC the worst case is that
> > we'll get some perf report later on, and it also seems not hard to revert
> > the patch later if we want.
>
> Well, would you be willing to try this patch out?
>
> After you apply that patch, you should be able to remove the
> should_force_cow_break() games entirely, because a write to the page
> should always break cow towards the writer if there are any GUP users
> around (put another way: away from the GUP).
>
> However, to make the test meaningful, it really should do some swap
> testing too.
>
>             Linus
>
Jan Kara Aug. 21, 2020, 10:13 a.m. UTC | #13
On Thu 20-08-20 15:01:00, Linus Torvalds wrote:
> [...]
> 
> Well, would you be willing to try this patch out?
> 
> After you apply that patch, you should be able to remove the
> should_force_cow_break() games entirely, because a write to the page
> should always break cow towards the writer if there are any GUP users
> around (put another way: away from the GUP).
> 
> However, to make the test meaningful, it really should do some swap testing too.
> 
>             Linus

> From f41082844ea82ad1278e167fe6e973fa4efc974a Mon Sep 17 00:00:00 2001
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Date: Tue, 11 Aug 2020 14:23:04 -0700
> Subject: [PATCH] Trial do_wp_page() simplification
> 
> How about we just make sure we're the only possible valid user of the
> page before we bother to reuse it?
> 
> Simplify, simplify, simplify.
> 
> And get rid of the nasty serialization on the page lock at the same time.
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>  mm/memory.c | 58 +++++++++++++++--------------------------------------
>  1 file changed, 16 insertions(+), 42 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 602f4283122f..a43004dd2ff6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2927,50 +2927,24 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>  	 * not dirty accountable.
>  	 */
>  	if (PageAnon(vmf->page)) {
> -		int total_map_swapcount;
> -		if (PageKsm(vmf->page) && (PageSwapCache(vmf->page) ||
> -					   page_count(vmf->page) != 1))
> +		struct page *page = vmf->page;
> +
> +		if (page_count(page) != 1)
> +			goto copy;
> +		if (!trylock_page(page))
> +			goto copy;
> +		if (page_mapcount(page) != 1 && page_count(page) != 1) {

So this condition looks strange to me... Did you mean:

		if (page_mapcount(page) != 1 || page_count(page) != 1)

? Because page mapcount is >= 1 here, and if mapcount is > 1, then page_count
is certainly greater than 1 as well...

> +			unlock_page(page);
>  			goto copy;
> -		if (!trylock_page(vmf->page)) {
> -			get_page(vmf->page);
> -			pte_unmap_unlock(vmf->pte, vmf->ptl);
> -			lock_page(vmf->page);
> -			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> -					vmf->address, &vmf->ptl);
> -			if (!pte_same(*vmf->pte, vmf->orig_pte)) {
> -				update_mmu_tlb(vma, vmf->address, vmf->pte);
> -				unlock_page(vmf->page);
> -				pte_unmap_unlock(vmf->pte, vmf->ptl);
> -				put_page(vmf->page);
> -				return 0;
> -			}
> -			put_page(vmf->page);
> -		}
> -		if (PageKsm(vmf->page)) {

Also I know nothing about KSM but looking at reuse_ksm_page() I can see it
plays some tricks with page index & mapping even for pages with page_count
== 1 so you cannot just drop those bits AFAICT.

> -			bool reused = reuse_ksm_page(vmf->page, vmf->vma,
> -						     vmf->address);
> -			unlock_page(vmf->page);
> -			if (!reused)
> -				goto copy;
> -			wp_page_reuse(vmf);
> -			return VM_FAULT_WRITE;
> -		}
> -		if (reuse_swap_page(vmf->page, &total_map_swapcount)) {

Also I'm not sure if dropping this is safe for THP - reuse_swap_page()
seems to be a misnomer and seems to do also some THP handling. In
particular a comment before page_trans_huge_mapcount() states that
page_mapcount() isn't fully accurate for THP and page_trans_huge_mapcount()
should be used instead for checking whether a copy is needed on write fault.

									Honza
> -			if (total_map_swapcount == 1) {
> -				/*
> -				 * The page is all ours. Move it to
> -				 * our anon_vma so the rmap code will
> -				 * not search our parent or siblings.
> -				 * Protected against the rmap code by
> -				 * the page lock.
> -				 */
> -				page_move_anon_rmap(vmf->page, vma);
> -			}
> -			unlock_page(vmf->page);
> -			wp_page_reuse(vmf);
> -			return VM_FAULT_WRITE;
>  		}
> -		unlock_page(vmf->page);
> +		/*
> +		 * Ok, we've got the only map reference, and the only
> +		 * page count reference, and the page is locked,
> +		 * it's dark out, and we're wearing sunglasses. Hit it.
> +		 */
> +		wp_page_reuse(vmf);
> +		unlock_page(page);
> +		return VM_FAULT_WRITE;
>  	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
>  					(VM_WRITE|VM_SHARED))) {
>  		return wp_page_shared(vmf);
> -- 
> 2.28.0.218.gc12ef3d349
>
Linus Torvalds Aug. 21, 2020, 12:27 p.m. UTC | #14
On Fri, Aug 21, 2020 at 3:13 AM Jan Kara <jack@suse.cz> wrote:
>
> > +             if (page_mapcount(page) != 1 && page_count(page) != 1) {
>
> So this condition looks strange to me... Did you mean:
>
>                 if (page_mapcount(page) != 1 || page_count(page) != 1)

Duh. Yes.

> > -             if (PageKsm(vmf->page)) {
>
> Also I know nothing about KSM but looking at reuse_ksm_page() I can see it
> plays some tricks with page index & mapping even for pages with page_count
> == 1 so you cannot just drop those bits AFAICT.

Yeah, I wasn't really sure what we want to do.

In an optimal world, I was thinking that we'd actually do exactly what
we do at munmap time.

Which is not to get the page lock at all. Just look at what
zap_pte_range() does for a page when it unmaps it:

                        page_remove_rmap(page, false);

and that's it. No games.

And guess what? That "page_remove_rmap()" is what wp_page_copy() already does.

So I really think *all* of these games we play are complete garbage
and completely wrong.

Because the zap_page_range() path is a *lot* more common than the WP
path, and triggers for every single page when we do munmap or exit or
whatever.

So why would WP need to do anything else for correctness? Absolutely
no reason I can see.

> Also I'm not sure if dropping this is safe for THP - reuse_swap_page()
> seems to be a misnomer and seems to do also some THP handling.

Again, I think that's a bogus argument.

Because this all is actually not the common path at all, and the thing
is, the common path does none of these odd games.

I really think this COW handling magic is just legacy garbage because
people have carried it along forever and everybody is worried about
it. The fact is, the "copy" case is always safe, because all it does
is basically the same as zap_page_range() does, with just adding a new
page instead.

And it's also possible I'm missing something.

But yes. That '&&' should very much be a '||'. What can I say? Oops.

              Linus
Jan Kara Aug. 21, 2020, 3:47 p.m. UTC | #15
On Fri 21-08-20 05:27:40, Linus Torvalds wrote:
> On Fri, Aug 21, 2020 at 3:13 AM Jan Kara <jack@suse.cz> wrote:
> >
> > > +             if (page_mapcount(page) != 1 && page_count(page) != 1) {
> >
> > So this condition looks strange to me... Did you mean:
> >
> >                 if (page_mapcount(page) != 1 || page_count(page) != 1)
> 
> Duh. Yes.
> 
> > > -             if (PageKsm(vmf->page)) {
> >
> > Also I know nothing about KSM but looking at reuse_ksm_page() I can see it
> > plays some tricks with page index & mapping even for pages with page_count
> > == 1 so you cannot just drop those bits AFAICT.
> 
> Yeah, I wasn't really sure what we want to do.
> 
> In an optimal world, I was thinking that we'd actually do exactly what
> we do at munmap time.
> 
> Which is not to get the page lock at all. Just look at what
> zap_pte_range() does for a page when it unmaps it:
> 
>                         page_remove_rmap(page, false);
> 
> and that's it. No games.
> 
> And guess what? That "page_remove_rmap()" is what wp_page_copy() already
> does.

I was more concerned about the case where you decide to writeably map (i.e.
wp_page_reuse() path) a PageKsm() page. That path does not touch
page->mapping in your code AFAICS. And AFAIU from the code in mm/ksm.c, you
are not supposed to writeably map PageKsm() pages without changing
page->mapping (which also effectively makes PageKsm() return false), but I
don't see anything in your code that would achieve that, because the KSM code
references a page without being accounted in page_count() for $reasons (see
the comment before get_ksm_page()) and instead plays tricks with validating
cookies in page->mapping...

> So I really think *all* of these games we play are complete garbage
> and completely wrong.
> 
> Because the zap_page_range() path is a *lot* more common than the WP
> path, and triggers for every single page when we do munmap or exit or
> whatever.
> 
> So why would WP need to do anything else for correctness? Absolutely
> no reason I can see.
> 
> > Also I'm not sure if dropping this is safe for THP - reuse_swap_page()
> > seems to be a misnomer and seems to do also some THP handling.
> 
> Again, I think that's a bogus argument.
> 
> Because this all is actually not the common path at all, and the thing
> is, the common path does none of these odd games.
> 
> I really think this COW handling magic is just legacy garbage because
> people have carried it along forever and everybody is worried about
> it. The fact is, the "copy" case is always safe, because all it does
> is basically the same as zap_page_range() does, with just adding a new
> page instead.

And also here I was more concerned that the page_mapcount != 1 || page_count !=
1 check could actually be a weaker check than what reuse_swap_page() does.
So the old code could decide to copy while your new code would decide to go
the wp_page_reuse() path. And for this case I don't see how your "but the
unmap path is simple" argument would apply...

								Honza
Linus Torvalds Aug. 21, 2020, 5 p.m. UTC | #16
On Fri, Aug 21, 2020 at 8:48 AM Jan Kara <jack@suse.cz> wrote:
>
> I was more concerned about the case where you decide to writeably map (i.e.
> wp_page_reuse() path) a PageKsm() page.

Yeah, so I think what I do is stricter than what we used to do - any
KSM page will never be re-used, simply because the KSM part will have
incremented the page count.

So as far as I can tell, with that patch we will never ever share
except for the "I really am the _only_ user of the page, there are no
KSM or swap cache pages" case.

That's the whole point of the patch. Get rid of all the games. If
there is *any* possible other use - be it KSM or swap cache or
*anything*, we don't try to re-use it.

> And also here I was more concerned that the page_mapcount != 1 || page_count !=
> 1 check could actually be a weaker check than what reuse_swap_page() does.

If that is the case, then yes, that would be a problem.

But really, if page_count() == 1, then we're the only possible thing
that holds that page. Nothing else can have a reference to it - by
definition.

And if page_count() != 1, we will not share. Ever. We'll just do what
zap_page_range() does - unmap the old page and do the
page_remove_rmap().

The only small worry would be the race between releasing the page
table lock - when we allocate a new page - and somebody coming in and
doing something magical to that page. But that's where holding the
page lock comes in.

Plus that part isn't anything my patch changes.

               Linus
Peter Xu Aug. 21, 2020, 6:08 p.m. UTC | #17
On Fri, Aug 21, 2020 at 10:00:59AM -0700, Linus Torvalds wrote:
> On Fri, Aug 21, 2020 at 8:48 AM Jan Kara <jack@suse.cz> wrote:
> >
> > I was more concerned about the case where you decide to writeably map (i.e.
> > wp_page_reuse() path) a PageKsm() page.
> 
> Yeah, so I think what I do is stricter than what we used to do - any
> KSM page will never be re-used, simply because the KSM part will have
> incremented the page count.

IIUC, Jan wanted to point out the fact that KSM didn't increase page count for
stable pages (reasons are above get_ksm_page()).

> 
> So as far as I can tell, with that patch we will never ever share
> except for the "I really am the _only_ user of the page, there are no
> KSM or swap cache pages" case.
> 
> That's the whole point of the patch. Get rid of all the games. If
> there is *any* possible other use - be it KSM or swap cache or
> *anything*, we don't try to re-use it.
> 
> > And also here I was more concerned that page_mapcount != 1 || page_count !=
> > 1 check could be actually a weaker check than what reuse_swap_page() does.
> 
> If that is the case, then yes, that would be a problem.
> 
> But really, if page_count() == 1, then we're the only possible thing
> that holds that page. Nothing else can have a reference to it - by
> definition.

Do we still at least need to check the swap count if PageSwapCache(page)?

Besides a complete cleanup, I'm now wondering whether we should use a smaller
change like the one below.  IMHO we can still simplify the KSM special case
before taking the page lock.  Since KSM pages should be rare in general, it
seems not worth it to make every single COW check against them:

diff --git a/mm/memory.c b/mm/memory.c
index 602f4283122f..b852d393bcc7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2928,9 +2928,6 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
         */
        if (PageAnon(vmf->page)) {
                int total_map_swapcount;
-               if (PageKsm(vmf->page) && (PageSwapCache(vmf->page) ||
-                                          page_count(vmf->page) != 1))
-                       goto copy;
                if (!trylock_page(vmf->page)) {
                        get_page(vmf->page);
                        pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -2946,6 +2943,10 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
                        }
                        put_page(vmf->page);
                }
+               if (page_count(vmf->page) != 1) {
+                       unlock_page(vmf->page);
+                       goto copy;
+               }
                if (PageKsm(vmf->page)) {
                        bool reused = reuse_ksm_page(vmf->page, vmf->vma,
                                                     vmf->address);

So we check page_count() (which covers KSM or normal pages) after we've got the
page lock, while we keep all the rest.  It's also safe to remove the
PageSwapCache() && PageKsm() condition, because reuse_ksm_page() later on
will check PageSwapCache() again.

Thanks,
Linus Torvalds Aug. 21, 2020, 6:23 p.m. UTC | #18
On Fri, Aug 21, 2020 at 11:08 AM Peter Xu <peterx@redhat.com> wrote:
>
> IIUC, Jan wanted to point out the fact that KSM didn't increase page count for
> stable pages (reasons are above get_ksm_page()).

Ouch.

> Do we still at least need to check the swap count if PageSwapCache(page)?

No. Because a PageSwapCache() page should be a perfectly normal page
cache thing. It will increment the page count if it's active.

That PageKsm() thing that *doesn't* increment the page count does look
worrisome, but

> So we check page_count() (which covers KSM or normal pages) after we've got the
> page lock, while we keep all the rest.

Why would we keep the rest?

The actual thing I would really want to get rid of is the page lock,
in fact. We shouldn't need it in this path, and it's the most
expensive part of it all. But that's also why I did the page count
test optimistically unlocked - because if page_count is 1, then we
really shouldn't contend with anything else, so hopefully the
(currently quite expensive) page locking is actually a non-issue once
you get there.

But the PageKsm() page_count() issue I didn't even realize. That worries me.

                Linus
Linus Torvalds Aug. 21, 2020, 7:05 p.m. UTC | #19
On Fri, Aug 21, 2020 at 11:23 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> But the PageKsm() page_count() issue I didn't even realize. That worries me.

Well, the fix is simple, although I don't love the magic PageKsm
semantics that hide the KSM reference from the page count.

But since (a) a Ksm page is presumably normally shared (ie things like
all zeroes) and (b) copying should always be safe, just do that.

The case we *used* to have with trying to reuse the KSM page seems
like it's not just adding complexity, it's optimizing for entirely the
wrong case.

Check both before and after getting the page lock, for the same reason
we do it for the page count.
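
A minimal sketch of that pattern (illustrative only, not the attached
patch): treat KSM and shared pages as always-copy, with the cheap tests
run unlocked first and then repeated under the page lock:

	if (PageKsm(page) || page_count(page) != 1)
		goto copy;		/* unlocked fast path */
	if (!trylock_page(page))
		goto copy;		/* lock contended: copying is cheaper */
	if (PageKsm(page) || page_count(page) != 1) {
		unlock_page(page);	/* something changed meanwhile */
		goto copy;
	}
	/* Sole owner, not KSM: safe to reuse the page in place. */
	unlock_page(page);
	wp_page_reuse(vmf);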

The logic there matches the "reuse swap page", but while that old
logic may have made sense 20 years ago, the swap cache case should be
*so* rare these days that it feels completely pointless to try to
reuse it.

Aggressively doing a new allocation, copy, and freeing the old swap
cache page is quite possibly cheaper than taking the page lock anyway,
but more importantly, it's not a case that should normally trigger in
the first place.

That said, looking at this code again, I get the feeling that the
mapcount check is pointless.

Afaik, page_count() should always be at least as large as page_mapcount(),
so if mapcount is > 1, then we'd have caught it with the page_count()
check.

Hmm? Am I possibly missing some other subtle special case?
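
(For illustration, that invariant as a debug assertion; a sketch, not
taken from any patch in this thread:)

	/*
	 * Each pte/pmd mapping of the page holds one reference, so
	 * page_mapcount(page) <= page_count(page) always holds.  Thus
	 * mapcount > 1 implies page_count() > 1, and the existing
	 * page_count() != 1 test subsumes the mapcount test.
	 */
	VM_BUG_ON_PAGE(page_mapcount(page) > page_count(page), page);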

Are there any THP issues? Again, doing the copy should always be the
safe thing to do, and since we get the page lock for the reuse case I
think we're ok on that front.

What else possible special cases could we hit?

                Linus
Linus Torvalds Aug. 21, 2020, 7:06 p.m. UTC | #20
On Fri, Aug 21, 2020 at 12:05 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Well, the fix is simple, although I don't love the magic PageKsm
> semantics that hide the KSM reference from the page count.
>
> But since (a) a Ksm page is presumably normally shared (ie things like
> all zeroes) and (b) copying should always be safe, just do that.

I meant to attach the patch that did that, but didn't. Here is the
obvious modified version.

             Linus
Peter Xu Aug. 21, 2020, 7:31 p.m. UTC | #21
On Fri, Aug 21, 2020 at 11:23:31AM -0700, Linus Torvalds wrote:
> On Fri, Aug 21, 2020 at 11:08 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > IIUC, Jan wanted to point out the fact that KSM didn't increase page count for
> > stable pages (reasons are above get_ksm_page()).
> 
> Ouch.
> 
> > Do we still at least need to check the swap count if PageSwapCache(page)?
> 
> No. Because a PageSwapCache() page should be a perfectly normal page
> cache thing. It will increment the page count if it's active.
> 
> That PageKsm() thing that *doesn't* increment the page count does look
> worrisome, though.
> 
> > So we check page_count() (which covers KSM or normal pages) after we've got the
> > page lock, while we keep all the rest.
> 
> Why would we keep the rest?
> 
> The actual thing I would really want to get rid of is the page lock,
> in fact. We shouldn't need it in this path, and it's the most
> expensive part of it all. But that's also why I did the page count
> test optimistically unlocked - because if page_count is 1, then we
> really shouldn't contend with anything else, so hopefully the
> (currently quite expensive) page locking is actually a non-issue once
> you get there.
> 
> But the PageKsm() page_count() issue I didn't even realize. That worries me.

That's definitely tricky...  Though IMHO that's another KSM topic that we
might want to look into later, besides the current effort to fix up the COW
breaking issue for gup.  For now it shouldn't be hard for us, as long as we
always do COW for KSM pages.  However, do we really want to revert the whole
logic of
52d1e606ee73?  Asking because it still seems to be an improvement to me (after
all we'll need to look after KSM pages here).  So it seems still good to keep.

Thanks,
Linus Torvalds Aug. 21, 2020, 7:42 p.m. UTC | #22
On Fri, Aug 21, 2020 at 12:31 PM Peter Xu <peterx@redhat.com> wrote:
>
> However, do we really want to revert the whole logic of
> 52d1e606ee73?  Asking because it still seems to be an improvement to me (after
> all we'll need to look after KSM pages here).  So it seems still good to keep.

Does anybody have numbers for it?

I'd rather simplify and get rid of the locking that has been
problematic, and then re-introduce limited cases with actual numbers.

Right now that commit has no real argument for it except for "do what
we do for swap cache". And since we're getting rid of the swap cache
special case, I'd say that commit 52d1e606ee73 argues for getting rid
of the KSM special case too.

Honestly, I'd expect that if KSM is effective, it's for pages that
really *are* shared. If you get a lot of "write fault on the last
copy" and that ends up being a problem, I think that says more about
the KSM issue than it says about the write fault.

               Linus

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6a82f9bccd7..a1f5c92b44cb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -409,6 +409,7 @@  extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_BREAK_COW: Do COW explicitly for the fault (even for read).
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -439,6 +440,7 @@  extern pgprot_t protection_map[16];
 #define FAULT_FLAG_REMOTE			0x80
 #define FAULT_FLAG_INSTRUCTION  		0x100
 #define FAULT_FLAG_INTERRUPTIBLE		0x200
+#define FAULT_FLAG_BREAK_COW			0x400
 
 /*
  * The default fault flags that should be used by most of the
@@ -2756,6 +2758,7 @@  struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_BREAK_COW  0x100000 /* request for explicit COW (even for read) */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
diff --git a/mm/gup.c b/mm/gup.c
index d8a33dd1430d..c33e84ab9c36 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -870,6 +870,8 @@  static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 		return -ENOENT;
 	if (*flags & FOLL_WRITE)
 		fault_flags |= FAULT_FLAG_WRITE;
+	if (*flags & FOLL_BREAK_COW)
+		fault_flags |= FAULT_FLAG_BREAK_COW;
 	if (*flags & FOLL_REMOTE)
 		fault_flags |= FAULT_FLAG_REMOTE;
 	if (locked)
@@ -1076,7 +1078,7 @@  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			}
 			if (is_vm_hugetlb_page(vma)) {
 				if (should_force_cow_break(vma, foll_flags))
-					foll_flags |= FOLL_WRITE;
+					foll_flags |= FOLL_BREAK_COW;
 				i = follow_hugetlb_page(mm, vma, pages, vmas,
 						&start, &nr_pages, i,
 						foll_flags, locked);
@@ -1095,7 +1097,7 @@  static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 		}
 
 		if (should_force_cow_break(vma, foll_flags))
-			foll_flags |= FOLL_WRITE;
+			foll_flags |= FOLL_BREAK_COW;
 
 retry:
 		/*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 206f52b36ffb..c88f773d03af 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1296,7 +1296,17 @@  vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	if (reuse_swap_page(page, NULL)) {
 		pmd_t entry;
 		entry = pmd_mkyoung(orig_pmd);
-		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+		entry = pmd_mkdirty(entry);
+		if (pmd_uffd_wp(orig_pmd))
+			/*
+			 * This can happen when an uffd-wp protected page is
+			 * copied due to enforced COW.  When it happens, we
+			 * need to keep the uffd-wp bit even after COW, and
+			 * make sure write bit is kept cleared.
+			 */
+			entry = pmd_mkuffd_wp(pmd_wrprotect(entry));
+		else
+			entry = maybe_pmd_mkwrite(entry, vma);
 		if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
 			update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
 		unlock_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index c39a13b09602..b27b555a9df8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2706,7 +2706,17 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = pte_sw_mkyoung(entry);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		entry = pte_mkdirty(entry);
+		if (pte_uffd_wp(vmf->orig_pte))
+			/*
+			 * This can happen when an uffd-wp protected page is
+			 * copied due to enforced COW.  When it happens, we
+			 * need to keep the uffd-wp bit even after COW, and
+			 * make sure write bit is kept cleared.
+			 */
+			entry = pte_mkuffd_wp(pte_wrprotect(entry));
+		else
+			entry = maybe_mkwrite(entry, vma);
 		/*
 		 * Clear the pte entry and flush it first, before updating the
 		 * pte with the new entry. This will avoid a race condition
@@ -2900,7 +2910,13 @@  static vm_fault_t do_wp_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 
-	if (userfaultfd_pte_wp(vma, *vmf->pte)) {
+	/*
+	 * Userfaultfd-wp only cares about real writes.  E.g., enforced COW for
+	 * read does not count.  When that happens, we will do the COW with the
+	 * UFFD_WP bit inherited from the original PTE/PMD.
+	 */
+	if ((vmf->flags & FAULT_FLAG_WRITE) &&
+	    userfaultfd_pte_wp(vma, *vmf->pte)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return handle_userfault(vmf, VM_UFFD_WP);
 	}
@@ -3290,7 +3306,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		put_page(swapcache);
 	}
 
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		ret |= do_wp_page(vmf);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
@@ -4117,7 +4133,14 @@  static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	if (vma_is_anonymous(vmf->vma)) {
-		if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
+		/*
+		 * Userfaultfd-wp only cares about real writes.  E.g., enforced
+		 * COW for read does not count.  When that happens, we will do
+		 * the COW with the UFFD_WP bit inherited from the original
+		 * PTE/PMD.
+		 */
+		if ((vmf->flags & FAULT_FLAG_WRITE) &&
+		    userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
 		return do_huge_pmd_wp_page(vmf, orig_pmd);
 	}
@@ -4241,7 +4264,7 @@  static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
 		goto unlock;
 	}
-	if (vmf->flags & FAULT_FLAG_WRITE) {
+	if (vmf->flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW)) {
 		if (!pte_write(entry))
 			return do_wp_page(vmf);
 		entry = pte_mkdirty(entry);
@@ -4281,7 +4304,7 @@  static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.pgoff = linear_page_index(vma, address),
 		.gfp_mask = __get_fault_gfp_mask(vma),
 	};
-	unsigned int dirty = flags & FAULT_FLAG_WRITE;
+	bool cow = flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW);
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4308,7 +4331,7 @@  static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
 			/* NUMA case for anonymous PUDs would go here */
 
-			if (dirty && !pud_write(orig_pud)) {
+			if (cow && !pud_write(orig_pud)) {
 				ret = wp_huge_pud(&vmf, orig_pud);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
@@ -4346,7 +4369,7 @@  static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);
 
-			if (dirty && !pmd_write(orig_pmd)) {
+			if (cow && !pmd_write(orig_pmd)) {
 				ret = wp_huge_pmd(&vmf, orig_pmd);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
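
For reference, the semantics the patch introduces can be condensed into two
predicates (a sketch with hypothetical helper names, not code from the
patch):

	/* Does this fault need to break COW, i.e. copy the page? */
	static inline bool fault_breaks_cow(unsigned int flags)
	{
		return flags & (FAULT_FLAG_WRITE | FAULT_FLAG_BREAK_COW);
	}

	/*
	 * Should userfaultfd-wp intercept this fault?  Only a real write
	 * counts; an enforced-COW read still copies the page but inherits
	 * the uffd-wp bit and stays write-protected, so the next real
	 * write is still caught by handle_userfault().
	 */
	static inline bool fault_is_real_write(unsigned int flags)
	{
		return flags & FAULT_FLAG_WRITE;
	}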