Message ID | 20210715201422.211004-1-peterx@redhat.com (mailing list archive) |
---|---|
Series | userfaultfd-wp: Support shmem and hugetlbfs |
On 15.07.21 22:13, Peter Xu wrote:
> This is v5 of uffd-wp shmem & hugetlbfs support, which completes uffd-wp as a
> full feature. It's based on v5.14-rc1.
>
> I reposted the whole series majorly to trigger the syzbot tests again; sorry if
> it brings a bit of noise. Please let me know if there's easier way to trigger
> the syzbot test instead of reposting the whole series.
>
> Meanwhile, recently discussion around soft-dirty shows that soft-dirty may have
> similar requirement as uffd-wp on persisting the dirty information:
>
> https://lore.kernel.org/lkml/20210714152426.216217-1-tiberiu.georgescu@nutanix.com/
>
> Then the mechanism provided in this patchset may be suitable for soft-dirty too.
>
> The whole series can also be found online [1].
>
> v5 changelog:
> - Fix two issues spotted by syzbot
> - Compile test with (1) !USERFAULTFD, (2) USERFAULTFD && !USERFAULTFD_WP
>
> Previous versions:
>
> RFC: https://lore.kernel.org/lkml/20210115170907.24498-1-peterx@redhat.com/
> v1: https://lore.kernel.org/lkml/20210323004912.35132-1-peterx@redhat.com/
> v2: https://lore.kernel.org/lkml/20210427161317.50682-1-peterx@redhat.com/
> v3: https://lore.kernel.org/lkml/20210527201927.29586-1-peterx@redhat.com/
> v4: https://lore.kernel.org/lkml/20210714222117.47648-1-peterx@redhat.com/
>
> About Swap Special PTE
> ======================
>
> In short, the so-called "swap special pte" in this patchset is a new type of
> pte that doesn't exist in the past, but it got used initially in this series in
> file-backed memories. It is used to persist information even if the ptes got
> dropped meanwhile when the page cache still existed. For example, when
> splitting a file-backed huge pmd, we could be simply dropping the pmd entry
> then wait until another fault coming. It's okay in the past since all
> information in the pte can be retained from the page cache when the next page
> fault triggers.
> However in this case, uffd-wp is per-pte information which
> cannot be kept in page cache, so that information needs to be maintained
> somehow still in the pgtable entry, even if the pgtable entry is going to be
> dropped. Here instead of replacing with a none entry, we used the "swap
> special pte". Then when the next page fault triggers, we can observe orig_pte
> to retain this information.
>
> I'm copy-pasting some commit message from the patch "mm/swap: Introduce the
> idea of special swap ptes", where it tried to explain this pte in another angle:
>
> We used to have special swap entries, like migration entries, hw-poison
> entries, device private entries, etc.
>
> Those "special swap entries" reside in the range that they need to be at least
> swap entries first, and their types are decided by swp_type(entry).
>
> This patch introduces another idea called "special swap ptes".
>
> It's very easy to get confused against "special swap entries", but a special
> swap pte should never contain a swap entry at all. It means, it's illegal to
> call pte_to_swp_entry() upon a special swap pte.
>
> Make the uffd-wp special pte to be the first special swap pte.
>
> Before this patch, is_swap_pte()==true means one of the below:
>
>    (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
>          example, when an anonymous page got swapped out.
>
>    (a.2) The pte has a special swap entry (non_swap_entry()==true). For
>          example, a migration entry, a hw-poison entry, etc.
>
> After this patch, is_swap_pte()==true means one of the below, where case (b) is
> added:
>
>  (a) The pte contains a swap entry.
>
>    (a.1) The pte has a normal swap entry (non_swap_entry()==false). For
>          example, when an anonymous page got swapped out.
>
>    (a.2) The pte has a special swap entry (non_swap_entry()==true). For
>          example, a migration entry, a hw-poison entry, etc.
>
>  (b) The pte does not contain a swap entry at all (so it cannot be passed
>      into pte_to_swp_entry()).
>      For example, uffd-wp special swap pte.
>
> Hugetlbfs needs similar thing because it's also file-backed. I directly reused
> the same special pte there, though the shmem/hugetlb change on supporting this
> new pte is different since they don't share code path a lot.
>
> Patch layout
> ============
>
> Part (1): Shmem support, this is where the special swap pte is introduced.
> Some zap rework is needed within the process:
>
>   mm/shmem: Unconditionally set pte dirty in mfill_atomic_install_pte
>   shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
>   mm: Clear vmf->pte after pte_unmap_same() returns
>   mm/userfaultfd: Introduce special pte for unmapped file-backed mem
>   mm/swap: Introduce the idea of special swap ptes
>   shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
>   mm: Drop first_index/last_index in zap_details
>   mm: Introduce zap_details.zap_flags
>   mm: Introduce ZAP_FLAG_SKIP_SWAP
>   shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
>   shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
>   shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps
>   shmem/userfaultfd: Handle the left-overed special swap ptes
>   shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()
>
> Part (2): Hugetlb support, disable huge pmd sharing for uffd-wp patches have been
> merged.
> The rest is the changes required to teach hugetlbfs to understand the
> special swap pte too that was introduced with the uffd-wp change:
>
>   mm/hugetlb: Drop __unmap_hugepage_range definition from hugetlb.h
>   mm/hugetlb: Introduce huge pte version of uffd-wp helpers
>   hugetlb/userfaultfd: Hook page faults for uffd write protection
>   hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
>   hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
>   mm/hugetlb: Introduce huge version of special swap pte helpers
>   hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
>   hugetlb/userfaultfd: Allow wr-protect none ptes
>   hugetlb/userfaultfd: Only drop uffd-wp special pte if required
>
> Part (3): Enable both features in code and test (plus pagemap support)
>
>   mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
>   userfaultfd: Enable write protection for shmem & hugetlbfs
>   userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs
>
> Tests
> =====
>
> I've tested it using the userfaultfd kselftest program, and also with
> umapsort [2] which should be even stricter. Tested page swapping in/out during
> umapsort.
>
> If anyone would like to try umapsort, need to use an extremely hacked version
> of umap library [3], because by default umap only supports anonymous. So to
> test it we need to build [3] then [2].
>
> Any comment would be greatly welcomed. Thanks,

Hi Peter,

I just stumbled over copy_page_range() optimization

	/*
	 * Don't copy ptes where a page fault will fill them correctly.
	 * Fork becomes much lighter when there are big shared or private
	 * readonly mappings. The tradeoff is that copy_page_range is more
	 * efficient than faulting.
	 */
	if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
	    !src_vma->anon_vma)
		return 0;

IIUC, that means you'll not copy the WP bits for shmem and,
therefore, lose them during fork.
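
The is_swap_pte() case split in the quoted cover letter, (a.1), (a.2) and (b), can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: every toy_/TOY_ name, the bit layout, the threshold of 25 for "special" swap types, and the use of bit 63 as the uffd-wp special pte value are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: names and bit layout are hypothetical, not any real arch. */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_PTE_PRESENT         (1ULL << 0)
#define TOY_SWP_TYPE_SHIFT      1
#define TOY_SWP_TYPE_MASK       0x1fULL       /* 5 bits of swap type */
#define TOY_SWP_OFFSET_SHIFT    6
#define TOY_FIRST_SPECIAL_TYPE  25u           /* assumed: types >= 25 are special entries */
#define TOY_PTE_UFFD_WP_SPECIAL (1ULL << 63)  /* assumed: the single special swap pte value */

enum toy_pte_kind {
	TOY_KIND_NONE,
	TOY_KIND_PRESENT,
	TOY_KIND_SWAP_NORMAL,   /* case (a.1) */
	TOY_KIND_SWAP_SPECIAL,  /* case (a.2) */
	TOY_KIND_SPECIAL_PTE,   /* case (b): holds no swap entry at all */
};

/*
 * Build a non-present pte holding a swap entry. In this toy layout the
 * offset must be nonzero when type is 0 (otherwise the value equals a none
 * pte) and must not reach bit 63.
 */
toy_pte_t toy_mk_swap_pte(unsigned type, uint64_t offset)
{
	toy_pte_t pte;

	pte.val = (((uint64_t)type & TOY_SWP_TYPE_MASK) << TOY_SWP_TYPE_SHIFT) |
		  (offset << TOY_SWP_OFFSET_SHIFT);
	return pte;
}

/* Anything that is neither none nor present counts as a "swap pte". */
bool toy_is_swap_pte(toy_pte_t pte)
{
	return pte.val != 0 && !(pte.val & TOY_PTE_PRESENT);
}

bool toy_is_special_swap_pte(toy_pte_t pte)
{
	return pte.val == TOY_PTE_UFFD_WP_SPECIAL;
}

enum toy_pte_kind toy_classify(toy_pte_t pte)
{
	unsigned type;

	if (pte.val == 0)
		return TOY_KIND_NONE;
	if (pte.val & TOY_PTE_PRESENT)
		return TOY_KIND_PRESENT;
	/* Case (b) must be checked before decoding a swap entry. */
	if (toy_is_special_swap_pte(pte))
		return TOY_KIND_SPECIAL_PTE;
	type = (unsigned)((pte.val >> TOY_SWP_TYPE_SHIFT) & TOY_SWP_TYPE_MASK);
	return type >= TOY_FIRST_SPECIAL_TYPE ? TOY_KIND_SWAP_SPECIAL
					      : TOY_KIND_SWAP_NORMAL;
}
```

The ordering in toy_classify() mirrors the rule from the commit message: a special swap pte still satisfies the "is a swap pte" test, so it has to be filtered out before anything tries to decode a swap entry from it.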
On Mon, Jul 19, 2021 at 09:21:18PM +0200, David Hildenbrand wrote:
> Hi Peter,

Hi, David,

>
> I just stumbled over copy_page_range() optimization
>
> 	/*
> 	 * Don't copy ptes where a page fault will fill them correctly.
> 	 * Fork becomes much lighter when there are big shared or private
> 	 * readonly mappings. The tradeoff is that copy_page_range is more
> 	 * efficient than faulting.
> 	 */
> 	if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
> 	    !src_vma->anon_vma)
> 		return 0;
>
> IIUC, that means you'll not copy the WP bits for shmem and,
> therefore, lose them during fork.

Good point. I think the fix shouldn't be hard - we can also skip this
optimization if dst_vma->vm_flags has VM_UFFD_WP set (that means
UFFD_FEATURE_EVENT_FORK is enabled too).

But I'll check a bit into page copy later to make sure it works (maybe I'll add
a small test case too).

Thanks!
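
The proposed fix can be sketched as a predicate that decides whether fork may skip copying page tables for a VMA. This is a user-space illustration only: struct toy_vma, the TOY_VM_* flag values and toy_fork_can_skip_copy() are hypothetical names made up for the example, and the actual change to copy_page_range() may look different.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration; not the kernel's VM_* values. */
#define TOY_VM_HUGETLB  0x1UL
#define TOY_VM_PFNMAP   0x2UL
#define TOY_VM_MIXEDMAP 0x4UL
#define TOY_VM_UFFD_WP  0x8UL

struct toy_vma {
	unsigned long vm_flags;
	bool has_anon_vma;      /* stands in for vma->anon_vma != NULL */
};

/*
 * Return true when fork can leave the child's ptes empty and rely on page
 * faults to refill them. The first check is the addition discussed in the
 * reply: a uffd-wp destination VMA carries per-pte state that a later
 * fault cannot reconstruct from the page cache.
 */
bool toy_fork_can_skip_copy(const struct toy_vma *src,
			    const struct toy_vma *dst)
{
	if (dst->vm_flags & TOY_VM_UFFD_WP)
		return false;
	if (src->vm_flags & (TOY_VM_HUGETLB | TOY_VM_PFNMAP | TOY_VM_MIXEDMAP))
		return false;
	if (src->has_anon_vma)
		return false;
	return true;
}
```

With this shape, a plain shmem mapping still takes the fast path, while the same mapping with write protection registered on the child side is always copied.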
On Thu, Jul 15, 2021 at 04:13:56PM -0400, Peter Xu wrote:
> About Swap Special PTE
> ======================

I've got some more feedback regarding this series, either within review comments
or from other threads.

Hugh shared his concern that using such a type of pte level operation could make
things even worse:

https://lore.kernel.org/linux-mm/796cbb7-5a1c-1ba0-dde5-479aba8224f2@google.com/

Since most context is irrelevant, only quoting the p.s. section:

        p.s. Peter, unrelated to this particular bug, and should not divert from
        fixing it: but looking again at those swap encodings, and particularly
        the soft_dirty manipulations: they look very fragile. I think uffd_wp
        was wrong to follow that bad example, and your upcoming new encoding
        (that I have previously called elegant) takes it a worse step further.

Alistair shared his preference for continuing to use swp_entry to store this
extra information:

https://lore.kernel.org/linux-mm/5071185.SEdLSG93TQ@nvdebian/

So I'm trying to do some self introspection to see whether I was just too bold
in trying to introduce that pte idea: either I'm not the "suitable one" to
introduce it as it's indeed challenging, or maybe it's as simple as we don't
really need to worry about using up swap address space yet, at least for now
(currently worst case MAX_SWAPFILES=32-4-2-1=25).

I don't yet have a plan to think about Hugh's idea on further dropping the usage
of per-arch bits in swap ptes, e.g. _PAGE_SWP_SOFT_DIRTY or _PAGE_SWP_UFFD_WP. I
need more thoughts there.

But what I can still do is think about whether we can go back to swap entry
ptes for this series.

Originally I was afraid of wasting a whole type of swp entry just for one single
pte, so we came up with the idea (thanks again to Andrea and Hugh for proposing
it and for the discussions around it!). But did we just worry too much about
that while it comes from nothing?

So as time passes, there're indeed some more similar requirements coming, with
issues that look like what uffd-wp file-backed wanted to solve on pagemap;
they're:

  - PM_SWAP info missing when shmem page swapped out
  - PM_SOFT_DIRTY lost when shmem page swapped out

The 1st issue might be solved in another way, and it's still being discussed
here:

https://lore.kernel.org/linux-mm/YPmX7ZyDFRCuLXrh@t490s/

I don't see a good way to solve the 2nd issue (if we would like to solve it
first, though; I don't know whether that's intended to not be fixed for some
reason) without a solution similar to what we would like to apply to maintain
the uffd-wp bit, because they're all potentially issues around persisting pte
information for file-backed memories.

These requirements at least show that even if we introduce a new swp type (maybe
let's just call it SWP_PTE_MARKER) then uffd-wp won't be the only user, so
there're already potential users of more bits out of the entry.

In summary, I'm considering whether I should switch the special swap pte idea
back to the swp entry idea (safer, according to Hugh, also arch-independent,
according to Alistair). Before working on that, any early comment would be
greatly welcomed.

Thanks.
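
The swap-type budget quoted above (MAX_SWAPFILES=32-4-2-1=25) and the SWP_PTE_MARKER idea can be written out as a small model. Everything here is an assumption made for illustration: the toy_/TOY_ names, reading the 4, 2 and 1 as device-private, migration and hw-poison types, placing the marker type directly after the usable swapfile types, and the choice of marker bits. The thread does not specify any of this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only; every TOY_/toy_ name here is hypothetical. */
#define TOY_SWP_TYPE_BITS      5  /* 1 << 5 = 32 swap types in total */
#define TOY_SWP_DEVICE_NUM     4  /* assumed meaning of the "4" in the email */
#define TOY_SWP_MIGRATION_NUM  2  /* assumed meaning of the "2" */
#define TOY_SWP_HWPOISON_NUM   1  /* assumed meaning of the "1" */
#define TOY_SWP_PTE_MARKER_NUM 1  /* one extra type reserved for markers */

enum {
	/* 32 - 4 - 2 - 1, the figure quoted in the email */
	TOY_MAX_SWAPFILES_NOW = (1 << TOY_SWP_TYPE_BITS) - TOY_SWP_DEVICE_NUM -
				TOY_SWP_MIGRATION_NUM - TOY_SWP_HWPOISON_NUM,
	/* cost of the swp entry approach: one swapfile slot fewer */
	TOY_MAX_SWAPFILES_WITH_MARKER = TOY_MAX_SWAPFILES_NOW -
					TOY_SWP_PTE_MARKER_NUM,
	/* assumed placement: first type after the usable swapfile types */
	TOY_SWP_PTE_MARKER = TOY_MAX_SWAPFILES_WITH_MARKER,
};

typedef struct {
	unsigned type;
	uint64_t offset;
} toy_swp_entry_t;

/* The offset field of a marker entry is reused as a bitmask of users. */
#define TOY_MARKER_UFFD_WP    (1ULL << 0)
#define TOY_MARKER_SOFT_DIRTY (1ULL << 1)  /* hypothetical second user */

toy_swp_entry_t toy_make_marker(uint64_t bits)
{
	toy_swp_entry_t entry = { TOY_SWP_PTE_MARKER, bits };

	return entry;
}

bool toy_is_marker(toy_swp_entry_t entry)
{
	return entry.type == TOY_SWP_PTE_MARKER;
}

/* Returns the marker bits, or 0 when the entry is not a marker. */
uint64_t toy_marker_bits(toy_swp_entry_t entry)
{
	return toy_is_marker(entry) ? entry.offset : 0;
}
```

In this model, spending a whole swap type on the marker lowers the swapfile limit by one, and in exchange the entire offset field becomes available for per-pte bits, which is what would let uffd-wp and a soft-dirty style user share one type.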