From patchwork Thu Dec 7 03:09:45 2023
X-Patchwork-Submitter: Xu Yu <xuyu@linux.alibaba.com>
X-Patchwork-Id: 13482643
From: Xu Yu <xuyu@linux.alibaba.com>
To: linux-mm@kvack.org
Cc: david@redhat.com
Subject: [PATCH v2 1/2] mm/khugepaged: attempt to map anonymous pte-mapped THPs by pmds
Date: Thu, 7 Dec 2023 11:09:45 +0800
Message-Id: <0919956ecd2b7052fa308a93397fd1e85806e091.1701917546.git.xuyu@linux.alibaba.com>
X-Mailer: git-send-email 2.37.1
In the anonymous collapse path, khugepaged always collapses a pte-mapped
hugepage by allocating a new hugepage and copying into it. In some scenarios,
we can instead simply update the page tables that map an anonymous pte-mapped
THP, in the same way as for file/shmem-backed pte-mapped THPs, as done in
commit 58ac9a8993a1 ("mm/khugepaged: attempt to map file/shmem-backed
pte-mapped THPs by pmds").

The simplest scenario that satisfies the conditions, as David points out, is
when no subpage is PageAnonExclusive (so all PTEs must be R/O): in that case
we can collapse into a R/O PMD without further action. Let's start from this
simplest scenario.
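For readers skimming the diff, here is a minimal illustrative sketch of the
condition described above (not part of the patch; the helper name is made up,
and the actual code below additionally rejects uffd-wp PTEs and zone-device
pages): the in-place collapse is only attempted when every PTE maps the
expected subpage read-only and no subpage is PageAnonExclusive.

        /*
         * Illustrative sketch only (not the patch code): collapse in place
         * only when every PTE maps the expected subpage read-only and no
         * subpage is PageAnonExclusive; otherwise khugepaged falls back to
         * the usual allocate-and-copy collapse.
         */
        static bool anon_thp_can_collapse_in_place(struct vm_area_struct *vma,
                                                   struct page *hpage,
                                                   pte_t *pte, unsigned long addr)
        {
                int i;

                for (i = 0; i < HPAGE_PMD_NR; i++, pte++, addr += PAGE_SIZE) {
                        pte_t pteval = ptep_get(pte);

                        if (!pte_present(pteval) || pte_write(pteval))
                                return false;
                        if (vm_normal_page(vma, addr, pteval) != hpage + i)
                                return false;
                        if (PageAnonExclusive(hpage + i))
                                return false;
                }
                return true;
        }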
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
---
 mm/khugepaged.c | 214 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 214 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 88433cc25d8a..85c7a2ab44ce 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1237,6 +1237,197 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
         return result;
 }
 
+static struct page *find_lock_pte_mapped_page(struct vm_area_struct *vma,
+                                              unsigned long addr, pmd_t *pmd)
+{
+        pte_t *pte, pteval;
+        struct page *page = NULL;
+
+        pte = pte_offset_map(pmd, addr);
+        if (!pte)
+                return NULL;
+
+        pteval = ptep_get_lockless(pte);
+        if (pte_none(pteval) || !pte_present(pteval))
+                goto out;
+
+        page = vm_normal_page(vma, addr, pteval);
+        if (unlikely(!page) || unlikely(is_zone_device_page(page)))
+                goto out;
+
+        page = compound_head(page);
+
+        if (!trylock_page(page)) {
+                page = NULL;
+                goto out;
+        }
+
+        if (!get_page_unless_zero(page)) {
+                unlock_page(page);
+                page = NULL;
+                goto out;
+        }
+
+out:
+        pte_unmap(pte);
+        return page;
+}
+
+static int collapse_pte_mapped_anon_thp(struct mm_struct *mm,
+                                        struct vm_area_struct *vma,
+                                        unsigned long haddr, bool *mmap_locked,
+                                        struct collapse_control *cc)
+{
+        struct mmu_notifier_range range;
+        struct page *hpage;
+        pte_t *start_pte, *pte;
+        pmd_t *pmd, pmdval;
+        spinlock_t *pml, *ptl;
+        pgtable_t pgtable;
+        unsigned long addr;
+        int exclusive = 0;
+        bool writable = false;
+        int result, i;
+
+        /* Fast check before locking page if already PMD-mapped */
+        result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
+        if (result == SCAN_PMD_MAPPED)
+                return result;
+
+        hpage = find_lock_pte_mapped_page(vma, haddr, pmd);
+        if (!hpage)
+                return SCAN_PAGE_NULL;
+        if (!PageHead(hpage)) {
+                result = SCAN_FAIL;
+                goto drop_hpage;
+        }
+        if (compound_order(hpage) != HPAGE_PMD_ORDER) {
+                result = SCAN_PAGE_COMPOUND;
+                goto drop_hpage;
+        }
+
+        mmap_read_unlock(mm);
+        *mmap_locked = false;
+
+        /* Prevent all access to pagetables */
+        mmap_write_lock(mm);
+
+        result = hugepage_vma_revalidate(mm, haddr, true, &vma, cc);
+        if (result != SCAN_SUCCEED)
+                goto up_write;
+
+        result = check_pmd_still_valid(mm, haddr, pmd);
+        if (result != SCAN_SUCCEED)
+                goto up_write;
+
+        /* Recheck with mmap write lock */
+        result = SCAN_SUCCEED;
+        start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+        if (!start_pte)
+                goto drop_hpage;
+        for (i = 0, addr = haddr, pte = start_pte;
+             i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
+                struct page *page;
+                pte_t pteval = ptep_get(pte);
+
+                if (pte_none(pteval) || !pte_present(pteval)) {
+                        result = SCAN_PTE_NON_PRESENT;
+                        break;
+                }
+
+                if (pte_uffd_wp(pteval)) {
+                        result = SCAN_PTE_UFFD_WP;
+                        break;
+                }
+
+                if (pte_write(pteval))
+                        writable = true;
+
+                page = vm_normal_page(vma, addr, pteval);
+
+                if (unlikely(!page) || unlikely(is_zone_device_page(page))) {
+                        result = SCAN_PAGE_NULL;
+                        break;
+                }
+
+                if (hpage + i != page) {
+                        result = SCAN_FAIL;
+                        break;
+                }
+
+                if (PageAnonExclusive(page))
+                        exclusive++;
+        }
+        pte_unmap_unlock(start_pte, ptl);
+        if (result != SCAN_SUCCEED)
+                goto drop_hpage;
+
+        /*
+         * Case 1:
+         * No subpages are PageAnonExclusive (PTEs must be R/O), we can
+         * collapse into a R/O PMD without further action.
+         */
+        if (!(exclusive == 0 && !writable))
+                goto drop_hpage;
+
+        /* Collapse pmd entry */
+        vma_start_write(vma);
+        anon_vma_lock_write(vma->anon_vma);
+
+        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+                                haddr, haddr + HPAGE_PMD_SIZE);
+        mmu_notifier_invalidate_range_start(&range);
+
+        pml = pmd_lock(mm, pmd); /* probably unnecessary */
+        pmdval = pmdp_collapse_flush(vma, haddr, pmd);
+        spin_unlock(pml);
+        mmu_notifier_invalidate_range_end(&range);
+        tlb_remove_table_sync_one();
+
+        anon_vma_unlock_write(vma->anon_vma);
+
+        start_pte = pte_offset_map_lock(mm, &pmdval, haddr, &ptl);
+        if (!start_pte)
+                goto rollback;
+        for (i = 0, addr = haddr, pte = start_pte;
+             i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
+                struct page *page;
+                pte_t pteval = ptep_get(pte);
+
+                page = vm_normal_page(vma, addr, pteval);
+                page_remove_rmap(page, vma, false);
+        }
+        pte_unmap_unlock(start_pte, ptl);
+
+        /* Install pmd entry */
+        pgtable = pmd_pgtable(pmdval);
+        pmdval = mk_huge_pmd(hpage, vma->vm_page_prot);
+        spin_lock(pml);
+        page_add_anon_rmap(hpage, vma, haddr, RMAP_COMPOUND);
+        pgtable_trans_huge_deposit(mm, pmd, pgtable);
+        set_pmd_at(mm, haddr, pmd, pmdval);
+        update_mmu_cache_pmd(vma, haddr, pmd);
+        spin_unlock(pml);
+
+        result = SCAN_SUCCEED;
+        goto up_write;
+
+rollback:
+        spin_lock(pml);
+        pmd_populate(mm, pmd, pmd_pgtable(pmdval));
+        spin_unlock(pml);
+
+up_write:
+        mmap_write_unlock(mm);
+
+drop_hpage:
+        unlock_page(hpage);
+        put_page(hpage);
+
+        /* TODO: tracepoints */
+        return result;
+}
+
 static int hpage_collapse_scan_pmd(struct mm_struct *mm,
                                    struct vm_area_struct *vma,
                                    unsigned long address, bool *mmap_locked,
@@ -1251,6 +1442,8 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
         spinlock_t *ptl;
         int node = NUMA_NO_NODE, unmapped = 0;
         bool writable = false;
+        int exclusive = 0;
+        bool is_hpage = false;
 
         VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1333,8 +1526,14 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
                 }
         }
 
+        if (PageAnonExclusive(page))
+                exclusive++;
+
         page = compound_head(page);
 
+        if (compound_order(page) == HPAGE_PMD_ORDER)
+                is_hpage = true;
+
         /*
          * Record which node the original page is from and save this
          * information to cc->node_load[].
@@ -1396,7 +1595,22 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
         }
 out_unmap:
         pte_unmap_unlock(pte, ptl);
+
+        if (is_hpage && (exclusive == 0 && !writable)) {
+                int res;
+
+                res = collapse_pte_mapped_anon_thp(mm, vma, address,
+                                                   mmap_locked, cc);
+                if (res == SCAN_PMD_MAPPED || res == SCAN_SUCCEED) {
+                        result = res;
+                        goto out;
+                }
+
+        }
+
         if (result == SCAN_SUCCEED) {
+                if (!*mmap_locked)
+                        mmap_read_lock(mm);
                 result = collapse_huge_page(mm, address, referenced,
                                             unmapped, cc);
                 /* collapse_huge_page will return with the mmap_lock released */