mbox series

[v2,00/24] userfaultfd-wp: Support shmem and hugetlbfs

Message ID 20210427161317.50682-1-peterx@redhat.com (mailing list archive)
Headers show
Series userfaultfd-wp: Support shmem and hugetlbfs | expand

Message

Peter Xu April 27, 2021, 4:12 p.m. UTC
This is v2 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a
kernel full feature, as it only supports anonymous before this series.

This patchset is based mostly on tag v5.12-rc8-mmots-2021-04-21-23-08, however
I dropped some patches there otherwise I'll get some weird build errors.  I
think I should have kept most of the -mm-next materials, so hopefully when
Andrew gets a new -mm tree, this series can be applied cleanly upon it.  I'll
respin otherwise.

The whole series can also be found online [1].

The major change from v1->v2 is that I dropped the FAULT_FLAG_UFFD_WP flag in
previous v1, as I think it's not needed and checking of the flag:

  vmf->flags & FAULT_FLAG_UFFD_WP

Can be replaced safely by:

  pte_swp_uffd_wp_special(vmf->orig_pte)

So we don't need to introduce yet another flag, and the code actually even
shrinked a few more lines, which is good.

Also per discussion with Mike, I dropped the READ emulation when a thread
faulted on a uffd-wp special swap pte, as I'll keep the WRITE fault as is in
this series.

Nothing big really changed otherwise.  Full changelog listed below.

v2:
- Add R-bs
- Added patch "mm/hugetlb: Drop __unmap_hugepage_range definition from
  hugetlb.h" as noticed/suggested by Mike Kravets
- Fix commit message of patch "hugetlb/userfaultfd: Only drop uffd-wp special
  pte if required" [MikeK]
- Removing comments for fields in zap_details since they're either incorrect or
  not helping [Matthew]
- Rephrase commit message in patch "hugetlb/userfaultfd: Take care of
  UFFDIO_COPY_MODE_WP" to explain better on why set dirty bit for UFFDIO_COPY
  in hugetlbfs [MikeK]
- Don't emulate READ for uffd-wp-special on both shmem & hugetlbfs.
- Drop FAULT_FLAG_UFFD_WP flag, by checking vmf->orig_pte directly against
  pte_swp_uffd_wp_special()
- Fix race condition of page fault handling on uffd-wp-special [Mike]

About Swap Special PTE
======================

In short, the so-called "swap special pte" in this patchset is a new type of
pte that doesn't exist in the past, but it got used initially in this series in
file-backed memories.  It is used to persist information even if the ptes got
dropped meanwhile when the page cache still existed.  For example, when
splitting a file-backed huge pmd, we could be simply dropping the pmd entry
then wait until another fault coming.  It's okay in the past since all
information in the pte can be retained from the page cache when the next page
fault triggers.  However in this case, uffd-wp is per-pte information which
cannot be kept in page cache, so that information needs to be maintained
somehow still in the pgtable entry, even if the pgtable entry is going to be
dropped.  Here instead of replacing with a none entry, we used the "swap
special pte".  Then when the next page fault triggers, we can observe orig_pte
to retain this information.

I'm copy-pasting some commit message from the patch "mm/swap: Introduce the
idea of special swap ptes", where it tried to explain this pte in another angle:

    We used to have special swap entries, like migration entries, hw-poison
    entries, device private entries, etc.

    Those "special swap entries" reside in the range that they need to be at least
    swap entries first, and their types are decided by swp_type(entry).

    This patch introduces another idea called "special swap ptes".

    It's very easy to get confused against "special swap entries", but a speical
    swap pte should never contain a swap entry at all.  It means, it's illegal to
    call pte_to_swp_entry() upon a special swap pte.

    Make the uffd-wp special pte to be the first special swap pte.

    Before this patch, is_swap_pte()==true means one of the below:

       (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
             example, when an anonymous page got swapped out.

       (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
             example, a migration entry, a hw-poison entry, etc.

    After this patch, is_swap_pte()==true means one of the below, where case (b) is
    added:

     (a) The pte contains a swap entry.

       (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
             example, when an anonymous page got swapped out.

       (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
             example, a migration entry, a hw-poison entry, etc.

     (b) The pte does not contain a swap entry at all (so it cannot be passed
         into pte_to_swp_entry()).  For example, uffd-wp special swap pte.

Hugetlbfs needs similar thing because it's also file-backed.  I directly reused
the same special pte there, though the shmem/hugetlb change on supporting this
new pte is different since they don't share code path a lot.

Patch layout
============

Part (1): Shmem support, this is where the special swap pte is introduced.
Some zap rework is needed within the process:

  shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  mm: Clear vmf->pte after pte_unmap_same() returns
  mm/userfaultfd: Introduce special pte for unmapped file-backed mem
  mm/swap: Introduce the idea of special swap ptes
  shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
  mm: Drop first_index/last_index in zap_details
  mm: Introduce zap_details.zap_flags
  mm: Introduce ZAP_FLAG_SKIP_SWAP
  mm: Pass zap_flags into unmap_mapping_pages()
  shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
  shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
  shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps
  shmem/userfaultfd: Handle the left-overed special swap ptes
  shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()

Part (2): Hugetlb support, we need to disable huge pmd sharing for uffd-wp
because not compatible just like uffd minor mode.  The rest is the changes
required to teach hugetlbfs understand the special swap pte too that introduced
with the uffd-wp change:

  mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h
  hugetlb/userfaultfd: Hook page faults for uffd write protection
  hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
  hugetlb: Pass vma into huge_pte_alloc()
  hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
  mm/hugetlb: Introduce huge version of special swap pte helpers
  mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
  hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
  hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
  hugetlb/userfaultfd: Allow wr-protect none ptes
  hugetlb/userfaultfd: Only drop uffd-wp special pte if required

Part (3): Enable both features in code and test

  userfaultfd: Enable write protection for shmem & hugetlbfs
  userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs

Tests
=====

I've tested it using either userfaultfd kselftest program, but also with
umapsort [2] which should be even stricter.  Tested page swapping in/out during
umapsort.

If anyone would like to try umapsort, need to use an extremely hacked version
of umap library [3], because by default umap only supports anonymous.  So to
test it we need to build [3] then [2].

Any comment would be greatly welcomed.  Thanks,

[1] https://github.com/xzpeter/linux/tree/uffd-wp-shmem-hugetlbfs
[2] https://github.com/LLNL/umap-apps
[3] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs

Peter Xu (24):
  shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  mm: Clear vmf->pte after pte_unmap_same() returns
  mm/userfaultfd: Introduce special pte for unmapped file-backed mem
  mm/swap: Introduce the idea of special swap ptes
  shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
  mm: Drop first_index/last_index in zap_details
  mm: Introduce zap_details.zap_flags
  mm: Introduce ZAP_FLAG_SKIP_SWAP
  mm: Pass zap_flags into unmap_mapping_pages()
  shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
  shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
  shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on
    thps
  shmem/userfaultfd: Handle the left-overed special swap ptes
  shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()
  mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h
  hugetlb/userfaultfd: Hook page faults for uffd write protection
  hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
  mm/hugetlb: Introduce huge version of special swap pte helpers
  hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
  hugetlb/userfaultfd: Allow wr-protect none ptes
  hugetlb/userfaultfd: Only drop uffd-wp special pte if required
  mm/userfaultfd: Enable write protection for shmem & hugetlbfs
  userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs

 arch/arm64/kernel/mte.c                  |   2 +-
 arch/x86/include/asm/pgtable.h           |  28 +++
 fs/dax.c                                 |  10 +-
 fs/hugetlbfs/inode.c                     |  15 +-
 fs/proc/task_mmu.c                       |  14 +-
 fs/userfaultfd.c                         |  39 ++--
 include/asm-generic/hugetlb.h            |  10 +
 include/asm-generic/pgtable_uffd.h       |   3 +
 include/linux/hugetlb.h                  |  30 ++-
 include/linux/mm.h                       |  48 ++++-
 include/linux/mm_inline.h                |  43 +++++
 include/linux/shmem_fs.h                 |   5 +-
 include/linux/swapops.h                  |  39 +++-
 include/linux/userfaultfd_k.h            |  47 +++++
 include/uapi/linux/userfaultfd.h         |   7 +-
 mm/gup.c                                 |   2 +-
 mm/hmm.c                                 |   2 +-
 mm/hugetlb.c                             | 160 +++++++++++++---
 mm/khugepaged.c                          |  14 +-
 mm/madvise.c                             |   4 +-
 mm/memcontrol.c                          |   2 +-
 mm/memory.c                              | 226 +++++++++++++++++------
 mm/migrate.c                             |   4 +-
 mm/mincore.c                             |   2 +-
 mm/mprotect.c                            |  63 ++++++-
 mm/mremap.c                              |   2 +-
 mm/page_vma_mapped.c                     |   6 +-
 mm/rmap.c                                |   8 +
 mm/shmem.c                               |  40 +++-
 mm/swapfile.c                            |   2 +-
 mm/truncate.c                            |  17 +-
 mm/userfaultfd.c                         |  37 ++--
 tools/testing/selftests/vm/userfaultfd.c |   9 +-
 33 files changed, 753 insertions(+), 187 deletions(-)

Comments

Peter Xu May 12, 2021, 7 p.m. UTC | #1
On Tue, Apr 27, 2021 at 12:12:53PM -0400, Peter Xu wrote:
> This is v2 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a
> kernel full feature, as it only supports anonymous before this series.

Ping..

Thinking about a repost, as this series shouldn't be able to apply after we've
got more relevant patches into -mm.  E.g., the full minor fault, and also some
small stuff like pagemap, as we need one more patch to support shmem/hugetlbfs
too.

Hugh, haven't received any further comment from you on shmem side (or on the
general idea).  It would be great to still have some of your input.

Let me know if you prefer to read a fresh new version otherwise.

Thanks!
Hugh Dickins May 14, 2021, 7:07 a.m. UTC | #2
On Wed, 12 May 2021, Peter Xu wrote:
> On Tue, Apr 27, 2021 at 12:12:53PM -0400, Peter Xu wrote:
> > This is v2 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a
> > kernel full feature, as it only supports anonymous before this series.
> 
> Ping..
> 
> Thinking about a repost, as this series shouldn't be able to apply after we've
> got more relevant patches into -mm.  E.g., the full minor fault, and also some
> small stuff like pagemap, as we need one more patch to support shmem/hugetlbfs
> too.
> 
> Hugh, haven't received any further comment from you on shmem side (or on the
> general idea).  It would be great to still have some of your input.
> 
> Let me know if you prefer to read a fresh new version otherwise.

I am very sorry to let you down, Peter, repeatedly; but it is now very
clear that I shall *never* have time to review your patchset - I am too
slow, have too much else to attend to, and take too long each time to
sink myself deep enough into userfaultfd.

I realize that you're being considerate, and expecting no more than
a few comments from me, rather than asking for formal review; but it's
still too much for me to get into.

The only reason I was involved at all, was when you were wondering how
to handle the pagetable entries for shmem.  I suggested one encoding,
Andrea suggested slightly differently: Andrea's was more elegant (no
"swap type" required), and it looked like you went with his - good.

I wonder whether you noticed
https://lore.kernel.org/linux-mm/20210407084238.20443-2-apopple@nvidia.com/
which might interfere.  I've had no more time to look at that than yours,
so no opinion on it (and I don't know what happened to it after that).

Please keep uppermost in mind when modifying mm/shmem.c for userfaultfd,
the difference between shared and private; and be on guard against the
ways in which CONFIG_USERFAULTFD=y might open a door to abuse.

Thanks,
Hugh
Peter Xu May 14, 2021, 1:18 p.m. UTC | #3
Hugh,

On Fri, May 14, 2021 at 12:07:38AM -0700, Hugh Dickins wrote:
> On Wed, 12 May 2021, Peter Xu wrote:
> > On Tue, Apr 27, 2021 at 12:12:53PM -0400, Peter Xu wrote:
> > > This is v2 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a
> > > kernel full feature, as it only supports anonymous before this series.
> > 
> > Ping..
> > 
> > Thinking about a repost, as this series shouldn't be able to apply after we've
> > got more relevant patches into -mm.  E.g., the full minor fault, and also some
> > small stuff like pagemap, as we need one more patch to support shmem/hugetlbfs
> > too.
> > 
> > Hugh, haven't received any further comment from you on shmem side (or on the
> > general idea).  It would be great to still have some of your input.
> > 
> > Let me know if you prefer to read a fresh new version otherwise.
> 
> I am very sorry to let you down, Peter, repeatedly; but it is now very
> clear that I shall *never* have time to review your patchset - I am too
> slow, have too much else to attend to, and take too long each time to
> sink myself deep enough into userfaultfd.

Never mind!  It's just that I'm kind of obliged to ask for your opinion as you
contributed part of the idea while you are also the shmem maintainer. :) So
that's what I did before I start to bother Andrew (since I know Andrew is 100%
busy.. that's also why I tend to not ask Andrew for review pings as best as I
can for all my works; while Andrew can chim in anytime anyways as in the loop).

> 
> I realize that you're being considerate, and expecting no more than
> a few comments from me, rather than asking for formal review; but it's
> still too much for me to get into.

I'm actually even be prepared to receive a full-series NACK anytime. :) To me
it's more important to have the right direction first, as I didn't receive that
during RFC so I moved on, assuming no one thinks it wrong.  However it's indeed
true that you never let me down (as far as I see from the other discussions)
that you do very in-depth review to hunt down any single potential risks you
may have noticed even in an rare error path - that's just too attractive a
reviewer to all the patch writters!

> 
> The only reason I was involved at all, was when you were wondering how
> to handle the pagetable entries for shmem.  I suggested one encoding,
> Andrea suggested slightly differently: Andrea's was more elegant (no
> "swap type" required), and it looked like you went with his - good.
> 
> I wonder whether you noticed
> https://lore.kernel.org/linux-mm/20210407084238.20443-2-apopple@nvidia.com/
> which might interfere.  I've had no more time to look at that than yours,
> so no opinion on it (and I don't know what happened to it after that).

Thanks for the pointer.  Looks like there'll be some slight rebase work and
totally orthogonal on the ideas, then we'll see who will do the rebase (yeh
probably me :).

Hmm, meanwhile if that's the initial versions I might go and suggest a renaming
of pfn_swap_entry_to_page() to start with pte_swp_*() as it operates on swp pte
not a pfn.  However probably too late for a v8 series so I'll give up.  It also
has mentioned something like "special swap pte", hope that won't get confused
with what this series is proposing.  We'll see when it becomes a problem, so
far seems still okay.

> 
> Please keep uppermost in mind when modifying mm/shmem.c for userfaultfd,
> the difference between shared and private; and be on guard against the
> ways in which CONFIG_USERFAULTFD=y might open a door to abuse.

Will do.  Then I'll move this series on.

Re shared/private, let me mention one thing just in case for any use of peace
of mind: the most dangerous place for uffd-wp+shmem should be the
UFFDIO_WRITEPROTECT page resolving ioctl when we want to re-grant the write bit
to ptes if needed (for minor mode, the danger point is UFFDIO_CONTINUE
instead), however it should be even safer than UFFDIO_CONTINUE as
UFFDIO_WRITEPROTECT never grants the write bit for real but leave that all to
page fault handler (in change_pte_range()):

                } else if (uffd_wp_resolve) {
                        /*
                         * Leave the write bit to be handled
                         * by PF interrupt handler, then
                         * things like COW could be properly
                         * handled.
                         */
                        ptent = pte_clear_uffd_wp(ptent);
                }

While the newprot will never have the write bit either afaik, mwriteprotect_range():

		newprot = vm_get_page_prot(dst_vma->vm_flags);

The last risk is the dirty_accountable trick in change_pte_range(), but as you
analyzed in the other thread, userfaultfd never uses MM_CP_DIRTY_ACCT, so it
should be safe too.

Thanks,