From patchwork Fri Aug 2 15:55:15 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: David Hildenbrand <david@redhat.com>
X-Patchwork-Id: 13751719
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org, kvm@vger.kernel.org,
    linux-s390@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    David Hildenbrand, Andrew Morton, "Matthew Wilcox (Oracle)",
    Jonathan Corbet, Christian Borntraeger, Janosch Frank,
    Claudio Imbrenda, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
    Sven Schnelle, Gerald Schaefer
Subject: [PATCH v1 02/11] mm/pagewalk: introduce folio_walk_start() + folio_walk_end()
Date: Fri, 2 Aug 2024 17:55:15 +0200
Message-ID: <20240802155524.517137-3-david@redhat.com>
In-Reply-To: <20240802155524.517137-1-david@redhat.com>
References: <20240802155524.517137-1-david@redhat.com>

We want to get rid of follow_page(), and have a more reasonable way to
just look up a folio mapped at a certain address, perform some checks while
still under PTL, and then
only conditionally grab a folio reference if really required.

Further, we might want to get rid of some walk_page_range*() users that
really only want to temporarily look up a single folio at a single address.

So let's add a new page table walker that does exactly that and, similarly
to GUP, is also able to walk hugetlb VMAs.

Add folio_walk_end() as a macro for now: the compiler is not easy to please
with the pte_unmap()->kunmap_local().

Note that one difference between follow_page() and get_user_pages(1) is that
follow_page() will not trigger faults to get something mapped. So folio_walk
is at least currently not a replacement for get_user_pages(1), but could
likely be extended/reused to achieve something similar in the future.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/pagewalk.h |  58 +++++++++++
 mm/pagewalk.c            | 202 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 260 insertions(+)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 27cd1e59ccf7..f5eb5a32aeed 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -130,4 +130,62 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 		pgoff_t nr, const struct mm_walk_ops *ops,
 		void *private);
 
+typedef int __bitwise folio_walk_flags_t;
+
+/*
+ * Walk migration entries as well. Careful: a large folio might get split
+ * concurrently.
+ */
+#define FW_MIGRATION		((__force folio_walk_flags_t)BIT(0))
+
+/* Walk shared zeropages (small + huge) as well. */
+#define FW_ZEROPAGE		((__force folio_walk_flags_t)BIT(1))
+
+enum folio_walk_level {
+	FW_LEVEL_PTE,
+	FW_LEVEL_PMD,
+	FW_LEVEL_PUD,
+};
+
+/**
+ * struct folio_walk - folio_walk_start() / folio_walk_end() data
+ * @page:	exact folio page referenced (if applicable)
+ * @level:	page table level identifying the entry type
+ * @ptep:	pointer to the page table entry (FW_LEVEL_PTE).
+ * @pmdp:	pointer to the page table entry (FW_LEVEL_PMD).
+ * @pudp:	pointer to the page table entry (FW_LEVEL_PUD).
+ * @ptl:	pointer to the page table lock.
+ *
+ * (see folio_walk_start() documentation for more details)
+ */
+struct folio_walk {
+	/* public */
+	struct page *page;
+	enum folio_walk_level level;
+	union {
+		pte_t *ptep;
+		pud_t *pudp;
+		pmd_t *pmdp;
+	};
+	union {
+		pte_t pte;
+		pud_t pud;
+		pmd_t pmd;
+	};
+	/* private */
+	struct vm_area_struct *vma;
+	spinlock_t *ptl;
+};
+
+struct folio *folio_walk_start(struct folio_walk *fw,
+		struct vm_area_struct *vma, unsigned long addr,
+		folio_walk_flags_t flags);
+
+#define folio_walk_end(__fw, __vma) do { \
+	spin_unlock((__fw)->ptl); \
+	if (likely((__fw)->level == FW_LEVEL_PTE)) \
+		pte_unmap((__fw)->ptep); \
+	vma_pgtable_walk_end(__vma); \
+} while (0)
+
 #endif /* _LINUX_PAGEWALK_H */
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index ae2f08ce991b..cd79fb3b89e5 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -3,6 +3,8 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 #include <linux/hugetlb.h>
+#include <linux/swapops.h>
+#include <linux/huge_mm.h>
 
 /*
  * We want to know the real level where a entry is located ignoring any
@@ -654,3 +656,203 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 
 	return err;
 }
+
+/**
+ * folio_walk_start - walk the page tables to a folio
+ * @fw: filled with information on success.
+ * @vma: the VMA.
+ * @addr: the virtual address to use for the page table walk.
+ * @flags: flags modifying which folios to walk to.
+ *
+ * Walk the page tables using @addr in a given @vma to a mapped folio and
+ * return the folio, making sure that the page table entry referenced by
+ * @addr cannot change until folio_walk_end() is called.
+ *
+ * By default, this function returns only folios that are not special (e.g., not
+ * the zeropage) and never returns folios that are supposed to be ignored by the
+ * VM as documented by vm_normal_page(). If requested, zeropages will be
+ * returned as well.
+ *
+ * By default, this function only considers present page table entries.
+ * If requested, it will also consider migration entries.
+ *
+ * If this function returns NULL, it might either indicate "there is nothing" or
+ * "there is nothing suitable".
+ *
+ * On success, @fw is filled and the function returns the folio while the PTL
+ * is still held; folio_walk_end() must be called to clean up and release any
+ * held locks. The returned folio must *not* be used after the call to
+ * folio_walk_end(), unless a short-term folio reference is taken before
+ * that call.
+ *
+ * @fw->page will correspond to the page that is effectively referenced by
+ * @addr. However, for migration entries and shared zeropages @fw->page is
+ * set to NULL. Note that large folios might be mapped by multiple page table
+ * entries, and this function will always only look up a single entry as
+ * specified by @addr, which might or might not cover more than a single page of
+ * the returned folio.
+ *
+ * This function must *not* be used as a naive replacement for
+ * get_user_pages() / pin_user_pages(), especially not to perform DMA or
+ * to carelessly modify page content. This function may *only* be used to grab
+ * short-term folio references, never to grab long-term folio references.
+ *
+ * Using the page table entry pointers in @fw for reading or modifying the
+ * entry should be avoided where possible; however, there might be valid
+ * use cases.
+ *
+ * WARNING: Modifying page table entries in hugetlb VMAs requires a lot of care.
+ * For example, PMD page table sharing might require prior unsharing. Also,
+ * logical hugetlb entries might span multiple physical page table entries,
+ * which *must* be modified in a single operation (set_huge_pte_at(),
+ * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
+ * not correspond to the first physical entry of a logical hugetlb entry.
+ *
+ * The mmap lock must be held in read mode.
+ *
+ * Return: folio pointer on success, otherwise NULL.
+ */
+struct folio *folio_walk_start(struct folio_walk *fw,
+		struct vm_area_struct *vma, unsigned long addr,
+		folio_walk_flags_t flags)
+{
+	unsigned long entry_size;
+	bool expose_page = true;
+	struct page *page;
+	pud_t *pudp, pud;
+	pmd_t *pmdp, pmd;
+	pte_t *ptep, pte;
+	spinlock_t *ptl;
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+
+	mmap_assert_locked(vma->vm_mm);
+	vma_pgtable_walk_begin(vma);
+
+	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
+		goto not_found;
+
+	pgdp = pgd_offset(vma->vm_mm, addr);
+	if (pgd_none_or_clear_bad(pgdp))
+		goto not_found;
+
+	p4dp = p4d_offset(pgdp, addr);
+	if (p4d_none_or_clear_bad(p4dp))
+		goto not_found;
+
+	pudp = pud_offset(p4dp, addr);
+	pud = pudp_get(pudp);
+	if (pud_none(pud))
+		goto not_found;
+	if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && pud_leaf(pud)) {
+		ptl = pud_lock(vma->vm_mm, pudp);
+		pud = pudp_get(pudp);
+
+		entry_size = PUD_SIZE;
+		fw->level = FW_LEVEL_PUD;
+		fw->pudp = pudp;
+		fw->pud = pud;
+
+		if (!pud_present(pud) || pud_devmap(pud)) {
+			spin_unlock(ptl);
+			goto not_found;
+		} else if (!pud_leaf(pud)) {
+			spin_unlock(ptl);
+			goto pmd_table;
+		}
+		/*
+		 * TODO: vm_normal_page_pud() will be handy once we want to
+		 * support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs.
+		 */
+		page = pud_page(pud);
+		goto found;
+	}
+
+pmd_table:
+	VM_WARN_ON_ONCE(pud_leaf(*pudp));
+	pmdp = pmd_offset(pudp, addr);
+	pmd = pmdp_get_lockless(pmdp);
+	if (pmd_none(pmd))
+		goto not_found;
+	if (IS_ENABLED(CONFIG_PGTABLE_HAS_HUGE_LEAVES) && pmd_leaf(pmd)) {
+		ptl = pmd_lock(vma->vm_mm, pmdp);
+		pmd = pmdp_get(pmdp);
+
+		entry_size = PMD_SIZE;
+		fw->level = FW_LEVEL_PMD;
+		fw->pmdp = pmdp;
+		fw->pmd = pmd;
+
+		if (pmd_none(pmd)) {
+			spin_unlock(ptl);
+			goto not_found;
+		} else if (!pmd_leaf(pmd)) {
+			spin_unlock(ptl);
+			goto pte_table;
+		} else if (pmd_present(pmd)) {
+			page = vm_normal_page_pmd(vma, addr, pmd);
+			if (page) {
+				goto found;
+			} else if ((flags & FW_ZEROPAGE) &&
+				    is_huge_zero_pmd(pmd)) {
+				page = pfn_to_page(pmd_pfn(pmd));
+				expose_page = false;
+				goto found;
+			}
+		} else if ((flags & FW_MIGRATION) &&
+			   is_pmd_migration_entry(pmd)) {
+			swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+			page = pfn_swap_entry_to_page(entry);
+			expose_page = false;
+			goto found;
+		}
+		spin_unlock(ptl);
+		goto not_found;
+	}
+
+pte_table:
+	VM_WARN_ON_ONCE(pmd_leaf(pmdp_get_lockless(pmdp)));
+	ptep = pte_offset_map_lock(vma->vm_mm, pmdp, addr, &ptl);
+	if (!ptep)
+		goto not_found;
+	pte = ptep_get(ptep);
+
+	entry_size = PAGE_SIZE;
+	fw->level = FW_LEVEL_PTE;
+	fw->ptep = ptep;
+	fw->pte = pte;
+
+	if (pte_present(pte)) {
+		page = vm_normal_page(vma, addr, pte);
+		if (page)
+			goto found;
+		if ((flags & FW_ZEROPAGE) &&
+		    is_zero_pfn(pte_pfn(pte))) {
+			page = pfn_to_page(pte_pfn(pte));
+			expose_page = false;
+			goto found;
+		}
+	} else if (!pte_none(pte)) {
+		swp_entry_t entry = pte_to_swp_entry(pte);
+
+		if ((flags & FW_MIGRATION) &&
+		    is_migration_entry(entry)) {
+			page = pfn_swap_entry_to_page(entry);
+			expose_page = false;
+			goto found;
+		}
+	}
+	pte_unmap_unlock(ptep, ptl);
+not_found:
+	vma_pgtable_walk_end(vma);
+	return NULL;
+found:
+	if (expose_page)
+		/* Note: Offset from the mapped page, not the folio start. */
+		fw->page = nth_page(page, (addr & (entry_size - 1)) >> PAGE_SHIFT);
+	else
+		fw->page = NULL;
+	fw->ptl = ptl;
+	return page_folio(page);
+}
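
Editor's note, not part of the patch: to make the intended calling convention
concrete, below is a minimal usage sketch of the proposed API. The helper name
my_get_folio_at() is made up for illustration; the sketch only assumes the
folio_walk_start()/folio_walk_end() interface introduced above plus the
existing folio_try_get() helper.

/*
 * Usage sketch (illustrative only): look up the folio mapped at @addr in
 * @vma and take a short-term reference while the PTL is still held.
 * The helper name is hypothetical. Caller holds the mmap lock in read mode.
 */
static struct folio *my_get_folio_at(struct vm_area_struct *vma,
		unsigned long addr)
{
	struct folio_walk fw;
	struct folio *folio;

	/* Flags 0: only present, "normal" folios; no zeropages, no migration entries. */
	folio = folio_walk_start(&fw, vma, addr, 0);
	if (!folio)
		return NULL;

	/*
	 * The PTL is held here; fw.level, fw.page and the entry values may be
	 * inspected. Grab a short-term reference before dropping the lock if
	 * the folio is needed after folio_walk_end().
	 */
	if (!folio_try_get(folio))
		folio = NULL;

	folio_walk_end(&fw, vma);
	return folio;
}

If FW_MIGRATION or FW_ZEROPAGE were passed instead, the caller would
additionally have to deal with fw.page being NULL for migration entries and
shared zeropages, as documented in the kerneldoc above.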
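
Editor's note, not part of the patch: a second sketch for the "perform some
checks while still under PTL" case from the description, using only the entry
values stored in struct folio_walk and never taking a folio reference. Again,
the helper name is hypothetical.

/*
 * Usage sketch (illustrative only): test, under the PTL, whether the page
 * table entry mapping @addr is writable. The helper name is hypothetical.
 */
static bool my_addr_mapped_writable(struct vm_area_struct *vma,
		unsigned long addr)
{
	struct folio_walk fw;
	bool writable = false;

	if (!folio_walk_start(&fw, vma, addr, 0))
		return false;

	switch (fw.level) {
	case FW_LEVEL_PTE:
		writable = pte_write(fw.pte);
		break;
	case FW_LEVEL_PMD:
		writable = pmd_write(fw.pmd);
		break;
	case FW_LEVEL_PUD:
		writable = pud_write(fw.pud);
		break;
	}
	folio_walk_end(&fw, vma);
	return writable;
}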