From patchwork Mon Nov 15 08:00:34 2021
X-Patchwork-Submitter: Peter Xu
X-Patchwork-Id: 12618861
From: Peter Xu <peterx@redhat.com>
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Nadav Amit, peterx@redhat.com, Alistair Popple, Andrew Morton,
    Mike Kravetz, Mike Rapoport, Matthew Wilcox, Jerome Glisse,
    Axel Rasmussen, Kirill A. Shutemov, David Hildenbrand,
    Andrea Arcangeli, Hugh Dickins
Subject: [PATCH v6 08/23] mm/shmem: Allow uffd wr-protect none pte for file-backed mem
Date: Mon, 15 Nov 2021 16:00:34 +0800
Message-Id: <20211115080034.74526-1-peterx@redhat.com>
X-Mailer: git-send-email 2.32.0
In-Reply-To: <20211115075522.73795-1-peterx@redhat.com>
References: <20211115075522.73795-1-peterx@redhat.com>

File-backed memory differs from anonymous memory in that even if the pte
is missing, the data could still reside either in the file or in the
page/swap cache.  So when wr-protecting a pte, we need to consider none
ptes too.

We do that by installing the uffd-wp pte markers when necessary.  Then,
when there's a future write to the pte, the fault handler will take the
special path to first fault in the page as read-only, and then report to
the userfaultfd server with the wr-protect message.

On the other hand, when unprotecting a page, it's also possible that the
pte got unmapped but replaced by the special uffd-wp marker.  In that case
we need to be able to recover the uffd-wp pte marker into a none pte, so
that the next access to the page will fault in correctly, as usual.
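As an illustration of the intended userspace-visible behavior (not part of
this patch), the sketch below wr-protects a shmem range before any page has
been faulted in, i.e. while all ptes in the range are still none, and then
removes the protection again.  It assumes a kernel with the whole series
applied so that shmem ranges can be registered with UFFDIO_REGISTER_MODE_WP;
error handling is omitted for brevity.

/*
 * Illustration only, not part of this patch.  Assumes a kernel with this
 * series applied (uffd-wp on shmem); error handling omitted.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;		/* one 2M pmd worth of 4K pages */
	int memfd = memfd_create("uffd-wp-demo", 0);
	char *buf;
	int uffd;

	ftruncate(memfd, len);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, memfd, 0);

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)buf, .len = len },
		.mode  = UFFDIO_REGISTER_MODE_WP,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/*
	 * Nothing has been faulted in yet, so every pte in the range is
	 * none.  With this patch the request is still honored by installing
	 * PTE_MARKER_UFFD_WP markers (pre-populating pgtables as needed);
	 * previously change_protection() would silently skip empty entries.
	 */
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)buf, .len = len },
		.mode  = UFFDIO_WRITEPROTECT_MODE_WP,
	};
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	/*
	 * A write from another thread would now block and deliver a
	 * UFFD_EVENT_PAGEFAULT message with UFFD_PAGEFAULT_FLAG_WP set,
	 * readable via read(uffd, &msg, sizeof(struct uffd_msg)).
	 *
	 * Unprotecting (mode = 0) lets the kernel turn the marker back
	 * into a none pte, so the next access faults in as usual.
	 */
	wp.mode = 0;
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	munmap(buf, len);
	close(uffd);
	close(memfd);
	return 0;
}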
Special care needs to be taken throughout the change_protection_range()
process.  Since we now allow users to wr-protect a none pte, we need to be
able to pre-populate the page table entries if we see (!anonymous &&
MM_CP_UFFD_WP) requests; otherwise change_protection_range() will always
skip entries that do not exist in the pgtable.  For example, the pgtable
can be missing for a whole 2M pmd chunk while the page cache for that 2M
range exists.  When we want to wr-protect one 4K page within the 2M pmd
range, we need to pre-populate the pgtable and install the pte marker
showing that we want to get a message and block the thread when the page
cache of that 4K page is written to.  Without pre-populating the pmd,
change_protection() would simply skip that whole pmd.

Note that this patch only covers the small pages (pte level) and does not
yet cover transparent huge pages.  That will be done later, and this patch
is also a preparation for it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mprotect.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 890bc1f9ca24..be837c4dbc64 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -174,7 +175,16 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			if (pte_swp_uffd_wp(oldpte))
 				newpte = pte_swp_mkuffd_wp(newpte);
 		} else if (is_pte_marker_entry(entry)) {
-			/* Skip it, the same as none pte */
+			/*
+			 * If this is uffd-wp pte marker and we'd like
+			 * to unprotect it, drop it; the next page
+			 * fault will trigger without uffd trapping.
+			 */
+			if (uffd_wp_resolve &&
+			    (pte_marker_get(entry) & PTE_MARKER_UFFD_WP)) {
+				pte_clear(vma->vm_mm, addr, pte);
+				pages++;
+			}
 			continue;
 		} else {
 			newpte = oldpte;
@@ -189,6 +199,20 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				set_pte_at(vma->vm_mm, addr, pte, newpte);
 				pages++;
 			}
+		} else {
+			/* It must be an none page, or what else?.. */
+			WARN_ON_ONCE(!pte_none(oldpte));
+			if (unlikely(uffd_wp && !vma_is_anonymous(vma))) {
+				/*
+				 * For file-backed mem, we need to be able to
+				 * wr-protect a none pte, because even if the
+				 * pte is none, the page/swap cache could
+				 * exist.  Doing that by install a marker.
+				 */
+				set_pte_at(vma->vm_mm, addr, pte,
+					   make_pte_marker(PTE_MARKER_UFFD_WP));
+				pages++;
+			}
 		}
 	} while (pte++, addr += PAGE_SIZE, addr != end);
 	arch_leave_lazy_mmu_mode();
@@ -222,6 +246,39 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	return 0;
 }
 
+/* Return true if we're uffd wr-protecting file-backed memory, or false */
+static inline bool
+uffd_wp_protect_file(struct vm_area_struct *vma, unsigned long cp_flags)
+{
+	return (cp_flags & MM_CP_UFFD_WP) && !vma_is_anonymous(vma);
+}
+
+/*
+ * If wr-protecting the range for file-backed, populate pgtable for the case
+ * when pgtable is empty but page cache exists.  When {pte|pmd|...}_alloc()
+ * failed it means no memory, we don't have a better option but stop.
+ */
+#define change_pmd_prepare(vma, pmd, cp_flags)				\
+	do {								\
+		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+			if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd)))	\
+				break;					\
+		}							\
+	} while (0)
+/*
+ * This is the general pud/p4d/pgd version of change_pmd_prepare(). We need to
+ * have separate change_pmd_prepare() because pte_alloc() returns 0 on success,
+ * while {pmd|pud|p4d}_alloc() returns the valid pointer on success.
+ */
+#define change_prepare(vma, high, low, addr, cp_flags)			\
+	do {								\
+		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {	\
+			low##_t *p = low##_alloc(vma->vm_mm, high, addr); \
+			if (WARN_ON_ONCE(p == NULL))			\
+				break;					\
+		}							\
+	} while (0)
+
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
@@ -240,6 +297,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 
 		next = pmd_addr_end(addr, end);
 
+		change_pmd_prepare(vma, pmd, cp_flags);
 		/*
 		 * Automatic NUMA balancing walks the tables with mmap_lock
 		 * held for read. It's possible a parallel update to occur
@@ -305,6 +363,7 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
+		change_prepare(vma, pud, pmd, addr, cp_flags);
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
@@ -325,6 +384,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
 	p4d = p4d_offset(pgd, addr);
 	do {
 		next = p4d_addr_end(addr, end);
+		change_prepare(vma, p4d, pud, addr, cp_flags);
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
 		pages += change_pud_range(vma, p4d, addr, next, newprot,
@@ -350,6 +410,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	inc_tlb_flush_pending(mm);
 	do {
 		next = pgd_addr_end(addr, end);
+		change_prepare(vma, pgd, p4d, addr, cp_flags);
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
 		pages += change_p4d_range(vma, pgd, addr, next, newprot,
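As a side note for readers following the token pasting in change_prepare():
at the pud level, change_prepare(vma, pud, pmd, addr, cp_flags) expands
roughly to the snippet below.  This is an illustrative expansion only, not
part of the patch, and it obviously only compiles in kernel context.

	/*
	 * Illustrative expansion (not in the patch) of
	 * change_prepare(vma, pud, pmd, addr, cp_flags) in change_pud_range():
	 * low##_t becomes pmd_t and low##_alloc() becomes pmd_alloc(), which
	 * returns NULL on failure, whereas pte_alloc() used by
	 * change_pmd_prepare() returns non-zero on failure -- hence the two
	 * separate macros.
	 */
	do {
		if (unlikely(uffd_wp_protect_file(vma, cp_flags))) {
			pmd_t *p = pmd_alloc(vma->vm_mm, pud, addr);

			if (WARN_ON_ONCE(p == NULL))
				break;	/* allocation failed: stop preparing */
		}
	} while (0);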