From: Peter Xu
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: peterx@redhat.com, Nadav Amit, Hugh Dickins, David Hildenbrand,
    Axel Rasmussen, Matthew Wilcox, Alistair Popple, Mike Rapoport,
    Andrew Morton, Jerome Glisse, Mike Kravetz, "Kirill A. Shutemov",
    Andrea Arcangeli
Subject: [PATCH v7 00/23] userfaultfd-wp: Support shmem and hugetlbfs
Date: Fri, 4 Mar 2022 13:16:45 +0800
Message-Id: <20220304051708.86193-1-peterx@redhat.com>

This is v7 of the series to add shmem+hugetlbfs support for userfaultfd
write protection. It is based on linux-next tag next-20220301.

The whole tree can be found here for testing:

  https://github.com/xzpeter/linux/tree/uffd-wp-shmem-hugetlbfs

One tweak is needed: replace Nadav's v2 patch "userfaultfd: provide
unmasked address on page-fault" with v3 to unbreak hugetlb in general.

There aren't many changes compared to v6: mostly a rebase and a retest
to make sure nothing breaks. This version should also address the
comments from Alistair.

v7 changelog:
- Rebased to next-20220301
- Renamed s/is_pte_marker_uffd_wp/pte_marker_uffd_wp/, and added another
  helper, pte_marker_entry_uffd_wp, that operates on swp_entry_t, as
  suggested [Alistair]
- Dropped pte_unmap_same() in pte_marker_handle_uffd_wp() [Alistair]
- In finish_fault(), init vmf->orig_pte with pte_clear(), because some
  pte_none() ptes are not really all zeros, e.g. on xtensa and s390
  [Alistair]

Previous versions:

RFC: https://lore.kernel.org/lkml/20210115170907.24498-1-peterx@redhat.com/
v1:  https://lore.kernel.org/lkml/20210323004912.35132-1-peterx@redhat.com/
v2:  https://lore.kernel.org/lkml/20210427161317.50682-1-peterx@redhat.com/
v3:  https://lore.kernel.org/lkml/20210527201927.29586-1-peterx@redhat.com/
v4:  https://lore.kernel.org/lkml/20210714222117.47648-1-peterx@redhat.com/
v5:  https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/
v6:  https://lore.kernel.org/lkml/20211115075522.73795-1-peterx@redhat.com/

Overview
========

Userfaultfd-wp anonymous support was merged two years ago. Quite a few
applications have started to leverage this capability, either to take
snapshots of user-app memory or to implement fully user-controlled
swapping.

This series tries to complete the feature for uffd-wp so as to cover all
the RAM-based memory types. So far uffd-wp is the only mode still
missing this coverage; the other modes (uffd-missing & uffd-minor)
already have it.

One major reason to do so is that anonymous pages sometimes do not
satisfy the needs of applications, and there is a growing number of
shmem and hugetlbfs users, either for sharing purposes (e.g., sharing
guest memory between a hypervisor process and a device emulation
process, or shmem local live migration for upgrades) or for better
performance on TLB hits.

All this means that if a uffd-wp app wants to switch to any of these
memory types, it'll stop working. I think it's worthwhile to have the
kernel cover all these cases.

This series chooses to protect pages at the pte level, not the page
level. One major reason is safety.
I have no idea how we could make it safe if any uffd-privileged app
could wr-protect a page that any other application can use. It would
mean this app can potentially block any process for as long as it
wants.

The other reason is that it aligns very well not only with the anonymous
uffd-wp solution, but also with uffd as a whole. For example,
userfaultfd is fundamentally implemented on top of VMAs: we set flags on
VMAs to show the status of uffd tracking. A per-page protection
solution would cross that VMA-based foundation, and it could simply end
up too far away from what's called userfaultfd.

PTE markers
===========

The patchset is based on an idea called PTE markers. It was discussed
in one of the mm alignment sessions and first proposed in v6; this is
the 2nd version of the series that uses the PTE marker idea.

A PTE marker is a new type of swap entry that is only applicable to
file-backed memory such as shmem and hugetlbfs. It's used to persist
some pte-level information even after the original present ptes in the
pgtable are zapped.

Logically, pte markers can store more than uffd-wp information, but so
far only one bit is used, for uffd-wp. When a pte marker is installed
with the uffd-wp bit set, it means this pte is wr-protected by uffd.

This solves the problem where ptes mapping file-backed memory get zapped
for any reason (e.g. thp split, or swap out): we can still keep the
wr-protect information in the ptes. Then when a page fault triggers
again, we'll know the pte is wr-protected, so we can treat it the same
as a normal uffd wr-protected pte.

The extra information is encoded into the swap entry, or into swp_offset
to be explicit, with the swp_type being PTE_MARKER. So far uffd-wp uses
only one bit of the swap entry; the remaining bits of swp_offset are
still reserved for other purposes.

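To make the encoding above a bit more concrete, here is a small
userspace sketch. It is only a model, not the code from the series:
the toy_* names, the 5-bit type field, and the type number 27 are all
assumptions picked for illustration, and the real swap entry layout is
arch-dependent. It only shows the idea that the marker bits live in the
offset part of the entry while the type part says "this is a PTE
marker".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy swap entry: type in the top bits, offset in the remaining bits. */
typedef struct { uint64_t val; } toy_swp_entry_t;

#define TOY_SWP_TYPE_BITS   5                       /* assumed width */
#define TOY_SWP_TYPE_SHIFT  (64 - TOY_SWP_TYPE_BITS)
#define TOY_SWP_OFFSET_MASK ((UINT64_C(1) << TOY_SWP_TYPE_SHIFT) - 1)

/* Hypothetical type number standing in for the PTE_MARKER swp_type. */
#define TOY_SWP_PTE_MARKER  UINT64_C(27)

/* Only one marker bit is defined so far; the rest stay reserved. */
#define TOY_PTE_MARKER_UFFD_WP (UINT64_C(1) << 0)
#define TOY_PTE_MARKER_MASK    TOY_PTE_MARKER_UFFD_WP

static inline toy_swp_entry_t toy_swp_entry(uint64_t type, uint64_t offset)
{
	toy_swp_entry_t e = {
		(type << TOY_SWP_TYPE_SHIFT) | (offset & TOY_SWP_OFFSET_MASK)
	};
	return e;
}

static inline uint64_t toy_swp_type(toy_swp_entry_t e)
{
	return e.val >> TOY_SWP_TYPE_SHIFT;
}

static inline uint64_t toy_swp_offset(toy_swp_entry_t e)
{
	return e.val & TOY_SWP_OFFSET_MASK;
}

/* Build a marker entry: the marker bits go into the offset field. */
static inline toy_swp_entry_t toy_make_pte_marker_entry(uint64_t marker)
{
	assert((marker & ~TOY_PTE_MARKER_MASK) == 0); /* no reserved bits */
	return toy_swp_entry(TOY_SWP_PTE_MARKER, marker);
}

static inline bool toy_is_pte_marker_entry(toy_swp_entry_t e)
{
	return toy_swp_type(e) == TOY_SWP_PTE_MARKER;
}

/* True only for a marker entry that carries the uffd-wp bit. */
static inline bool toy_pte_marker_entry_uffd_wp(toy_swp_entry_t e)
{
	return toy_is_pte_marker_entry(e) &&
	       (toy_swp_offset(e) & TOY_PTE_MARKER_UFFD_WP) != 0;
}
```

In this model, an entry of any other type never reports uffd-wp even if
its offset happens to have bit 0 set, because the type check comes
first.
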
There are two configs to enable/disable PTE markers:

  CONFIG_PTE_MARKER
  CONFIG_PTE_MARKER_UFFD_WP

We can set !PTE_MARKER to completely disable all PTE markers, along with
uffd-wp support for file-backed memory. I made two configs so that we
can also enable PTE markers for other purposes while keeping file-backed
uffd-wp disabled.

At the end of the current series, I enable CONFIG_PTE_MARKER by default,
but that patch is standalone: if anyone worries about having it on by
default, we can also consider turning it off by dropping that one-liner
patch. So far I don't see a huge risk in doing so, so I kept that
patch.

In most cases, PTE markers should be treated as none ptes. That is
because, unlike most of the other swap entry types, PTE markers encode
no PFN or block offset information, only some extra well-defined bits
showing the status of the pte. These bits should only be used as extra
data when servicing an upcoming page fault, after which we behave as if
it's a none pte.

I did spend a lot of time going through all the pte_none() users this
time. It is indeed a challenge because there are a lot of them, and I
hope I didn't miss a single one that needs to take care of pte markers.
Luckily, I don't think they need to be considered in many cases, for
example: boot code, arch code (especially non-x86), kernel-only page
handling (e.g. CPA), or device driver code that deals with pure PFN
mappings.

I introduced pte_none_mostly() in this series for the cases where we
need to handle pte markers the same as none ptes; "mostly" is another
way to write "either a none pte or a pte marker".

I didn't change pte_none() itself to cover pte markers, for the reasons
below:

  - Only very few pte_none() callers need to handle pte markers. E.g.,
    kernel pages never require knowledge of pte markers. So we don't
    pollute the major use cases.

  - Unconditionally changing pte_none() semantics could confuse people,
    because pte_none() has existed for such a long time.

  - Unconditionally changing pte_none() semantics could make pte_none()
    slower, even though in many cases pte markers do not exist.

  - There are cases where we'd like to handle pte markers differently
    from pte_none(), so a full replacement is also impossible. E.g.
    khugepaged should still treat pte markers as normal swap ptes rather
    than none ptes, because pte markers always need a fault-in to merge
    the marker with a valid pte. Or the smap code needs to parse PTE
    markers, not none ptes.

Patch Layout
============

Introducing PTE marker and uffd-wp bit in PTE marker:

  mm: Introduce PTE_MARKER swap entry
  mm: Teach core mm about pte markers
  mm: Check against orig_pte for finish_fault()
  mm/uffd: PTE_MARKER_UFFD_WP

Adding support for shmem uffd-wp:

  mm/shmem: Take care of UFFDIO_COPY_MODE_WP
  mm/shmem: Handle uffd-wp special pte in page fault handler
  mm/shmem: Persist uffd-wp bit across zapping for file-backed
  mm/shmem: Allow uffd wr-protect none pte for file-backed mem
  mm/shmem: Allows file-back mem to be uffd wr-protected on thps
  mm/shmem: Handle uffd-wp during fork()

Adding support for hugetlbfs uffd-wp:

  mm/hugetlb: Introduce huge pte version of uffd-wp helpers
  mm/hugetlb: Hook page faults for uffd write protection
  mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
  mm/hugetlb: Handle UFFDIO_WRITEPROTECT
  mm/hugetlb: Handle pte markers in page faults
  mm/hugetlb: Allow uffd wr-protect none ptes
  mm/hugetlb: Only drop uffd-wp special pte if required
  mm/hugetlb: Handle uffd-wp during fork()

Misc handling on the rest mm for uffd-wp file-backed:

  mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
  mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs

Enabling of uffd-wp on file-backed memory:

  mm/uffd: Enable write protection for shmem & hugetlbfs
  mm: Enable PTE markers by default
  selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

Tests
=====

- Compile test on x86_64 and aarch64 on different configs
- Kernel selftests
- uffd-test [0]
- Umapsort [1,2] test for shmem/hugetlb, with swap on/off

Please review, thanks.

[0] https://github.com/xzpeter/clibs/tree/master/uffd-test
[1] https://github.com/xzpeter/umap-apps/tree/peter
[2] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs

Peter Xu (23):
  mm: Introduce PTE_MARKER swap entry
  mm: Teach core mm about pte markers
  mm: Check against orig_pte for finish_fault()
  mm/uffd: PTE_MARKER_UFFD_WP
  mm/shmem: Take care of UFFDIO_COPY_MODE_WP
  mm/shmem: Handle uffd-wp special pte in page fault handler
  mm/shmem: Persist uffd-wp bit across zapping for file-backed
  mm/shmem: Allow uffd wr-protect none pte for file-backed mem
  mm/shmem: Allows file-back mem to be uffd wr-protected on thps
  mm/shmem: Handle uffd-wp during fork()
  mm/hugetlb: Introduce huge pte version of uffd-wp helpers
  mm/hugetlb: Hook page faults for uffd write protection
  mm/hugetlb: Take care of UFFDIO_COPY_MODE_WP
  mm/hugetlb: Handle UFFDIO_WRITEPROTECT
  mm/hugetlb: Handle pte markers in page faults
  mm/hugetlb: Allow uffd wr-protect none ptes
  mm/hugetlb: Only drop uffd-wp special pte if required
  mm/hugetlb: Handle uffd-wp during fork()
  mm/khugepaged: Don't recycle vma pgtable if uffd-wp registered
  mm/pagemap: Recognize uffd-wp bit for shmem/hugetlbfs
  mm/uffd: Enable write protection for shmem & hugetlbfs
  mm: Enable PTE markers by default
  selftests/uffd: Enable uffd-wp for shmem/hugetlbfs

 arch/s390/include/asm/hugetlb.h          |  15 ++
 fs/hugetlbfs/inode.c                     |  15 +-
 fs/proc/task_mmu.c                       |   7 +
 fs/userfaultfd.c                         |  31 +---
 include/asm-generic/hugetlb.h            |  24 +++
 include/linux/hugetlb.h                  |  27 ++--
 include/linux/mm.h                       |  10 ++
 include/linux/mm_inline.h                |  43 +++++
 include/linux/shmem_fs.h                 |   4 +-
 include/linux/swap.h                     |  15 +-
 include/linux/swapops.h                  |  79 +++++++++
 include/linux/userfaultfd_k.h            |  71 ++++++++
 include/uapi/linux/userfaultfd.h         |  10 +-
 mm/Kconfig                               |  16 ++
 mm/filemap.c                             |   5 +
 mm/hmm.c                                 |   2 +-
 mm/hugetlb.c                             | 182 ++++++++++++++++-----
 mm/khugepaged.c                          |  14 +-
 mm/memcontrol.c                          |   8 +-
 mm/memory.c                              | 196 ++++++++++++++++++++---
 mm/mincore.c                             |   3 +-
 mm/mprotect.c                            |  75 ++++++++-
 mm/rmap.c                                |   8 +
 mm/shmem.c                               |   4 +-
 mm/userfaultfd.c                         |  54 +++++--
 tools/testing/selftests/vm/userfaultfd.c |   4 +-
 26 files changed, 795 insertions(+), 127 deletions(-)