From patchwork Fri Sep 29 11:44:12 2023 X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 13404134 From: Ryan Roberts To: Andrew Morton , Matthew Wilcox , Yin Fengwei , David Hildenbrand , Yu Zhao , Catalin Marinas , Anshuman Khandual , Yang Shi , "Huang, Ying" , Zi Yan , Luis Chamberlain , Itaru Kitayama , "Kirill A.
Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [PATCH v6 1/9] mm: Allow deferred splitting of arbitrary anon large folios Date: Fri, 29 Sep 2023 12:44:12 +0100 Message-Id: <20230929114421.3761121-2-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com> References: <20230929114421.3761121-1-ryan.roberts@arm.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230929_044439_685632_9275BC26 X-CRM114-Status: GOOD ( 13.16 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org In preparation for the introduction of large folios for anonymous memory, we would like to be able to split them when they have unmapped subpages, in order to free those unused pages under memory pressure. So remove the artificial requirement that the large folio needed to be at least PMD-sized. Reviewed-by: Yu Zhao Reviewed-by: Yin Fengwei Reviewed-by: Matthew Wilcox (Oracle) Reviewed-by: David Hildenbrand Signed-off-by: Ryan Roberts --- mm/rmap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 9f795b93cf40..8600bd029acf 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1446,11 +1446,11 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, __lruvec_stat_mod_folio(folio, idx, -nr); /* - * Queue anon THP for deferred split if at least one + * Queue anon large folio for deferred split if at least one * page of the folio is unmapped and at least one page * is still mapped. 
*/ - if (folio_test_pmd_mappable(folio) && folio_test_anon(folio)) + if (folio_test_large(folio) && folio_test_anon(folio)) if (!compound || nr < nr_pmdmapped) deferred_split_folio(folio); } From patchwork Fri Sep 29 11:44:13 2023 X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 13404135 From: Ryan Roberts To: Andrew Morton , Matthew Wilcox , Yin Fengwei , David Hildenbrand , Yu Zhao , Catalin Marinas , Anshuman Khandual , Yang Shi , "Huang, Ying" , Zi Yan , Luis Chamberlain , Itaru Kitayama , "Kirill A.
Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [PATCH v6 2/9] mm: Non-pmd-mappable, large folios for folio_add_new_anon_rmap() Date: Fri, 29 Sep 2023 12:44:13 +0100 Message-Id: <20230929114421.3761121-3-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com> References: <20230929114421.3761121-1-ryan.roberts@arm.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230929_044440_991204_B335BA6D X-CRM114-Status: GOOD ( 15.87 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org In preparation for anonymous large folio support, improve folio_add_new_anon_rmap() to allow a non-pmd-mappable, large folio to be passed to it. In this case, all contained pages are accounted using the order-0 folio (or base page) scheme. Reviewed-by: Yu Zhao Reviewed-by: Yin Fengwei Signed-off-by: Ryan Roberts --- mm/rmap.c | 27 ++++++++++++++++++++------- 1 file changed, 20 insertions(+), 7 deletions(-) diff --git a/mm/rmap.c b/mm/rmap.c index 8600bd029acf..106149690366 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1266,31 +1266,44 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, * This means the inc-and-test can be bypassed. * The folio does not have to be locked. * - * If the folio is large, it is accounted as a THP. As the folio + * If the folio is pmd-mappable, it is accounted as a THP. As the folio * is new, it's assumed to be mapped exclusively by a single process. 
*/ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, unsigned long address) { - int nr; + int nr = folio_nr_pages(folio); - VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma); + VM_BUG_ON_VMA(address < vma->vm_start || + address + (nr << PAGE_SHIFT) > vma->vm_end, vma); __folio_set_swapbacked(folio); - if (likely(!folio_test_pmd_mappable(folio))) { + if (likely(!folio_test_large(folio))) { /* increment count (starts at -1) */ atomic_set(&folio->_mapcount, 0); - nr = 1; + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); + } else if (!folio_test_pmd_mappable(folio)) { + int i; + + for (i = 0; i < nr; i++) { + struct page *page = folio_page(folio, i); + + /* increment count (starts at -1) */ + atomic_set(&page->_mapcount, 0); + __page_set_anon_rmap(folio, page, vma, + address + (i << PAGE_SHIFT), 1); + } + + atomic_set(&folio->_nr_pages_mapped, nr); } else { /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); atomic_set(&folio->_nr_pages_mapped, COMPOUND_MAPPED); - nr = folio_nr_pages(folio); + __page_set_anon_rmap(folio, &folio->page, vma, address, 1); __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr); } __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); - __page_set_anon_rmap(folio, &folio->page, vma, address, 1); } /** From patchwork Fri Sep 29 11:44:14 2023 X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 13404136 From: Ryan Roberts To: Andrew
Morton , Matthew Wilcox , Yin Fengwei , David Hildenbrand , Yu Zhao , Catalin Marinas , Anshuman Khandual , Yang Shi , "Huang, Ying" , Zi Yan , Luis Chamberlain , Itaru Kitayama , "Kirill A. Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [PATCH v6 3/9] mm: thp: Account pte-mapped anonymous THP usage Date: Fri, 29 Sep 2023 12:44:14 +0100 Message-Id: <20230929114421.3761121-4-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com> References: <20230929114421.3761121-1-ryan.roberts@arm.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230929_044443_793542_81F9F3E0 X-CRM114-Status: GOOD ( 31.25 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org Add accounting for pte-mapped anonymous transparent hugepages at various locations. This visibility will aid in debugging and tuning performance for the "small order" thp extension that will be added in a subsequent commit, where hugepages can be allocated which are large (greater than order-0) but smaller than PMD_ORDER. This new accounting follows a similar pattern to the existing NR_ANON_THPS, which measures pmd-mapped anonymous transparent hugepages. We account pte-mapped anonymous thp mappings per-page, where the page is mapped at least once via PTE and the page belongs to a large folio. So when a page belonging to a large folio is PTE-mapped for the first time, then we add 1 to NR_ANON_THPS_PTEMAPPED. And when a page belonging to a large folio is PTE-unmapped for the last time, then we remove 1 from NR_ANON_THPS_PTEMAPPED. /proc/meminfo: Introduce new "AnonHugePteMap" field, which reports the amount of memory (in KiB) mapped from large folios globally (similar to AnonHugePages field). /proc/vmstat: Introduce new "nr_anon_thp_pte" field, which reports the amount of memory (in pages) mapped from large folios globally (similar to nr_anon_transparent_hugepages field). /sys/devices/system/node/nodeX/meminfo Introduce new "AnonHugePteMap" field, which reports the amount of memory (in KiB) mapped from large folios per-node (similar to AnonHugePages field). show_mem (panic logger): Introduce new "anon_thp_pte" field, which reports the amount of memory (in KiB) mapped from large folios per-node (similar to anon_thp field). memory.stat (cgroup v1 and v2): Introduce new "anon_thp_pte" field, which reports the amount of memory (in bytes) mapped from large folios in the memcg (similar to rss_huge (v1) / anon_thp (v2) fields). /proc//smaps & /proc//smaps_rollup: Introduce new "AnonHugePteMap" field, which reports the amount of memory (in KiB) mapped from large folios within the vma/process (similar to AnonHugePages field). NOTE on charge migration: The new NR_ANON_THPS_PTEMAPPED charge is NOT moved between cgroups, even when the (v1) memory.move_charge_at_immigrate feature is enabled. That feature is marked deprecated and the current code does not attempt to move the NR_ANON_MAPPED charge for large PTE-mapped folios anyway (see comment in mem_cgroup_move_charge_pte_range()). 
If this code was enhanced to allow moving the NR_ANON_MAPPED charge for large PTE-mapped folios, we would also need to add support for moving the new NR_ANON_THPS_PTEMAPPED charge. This would likely get quite fiddly. Given the deprecation of memory.move_charge_at_immigrate, I assume it is not valuable to implement. NOTE on naming: Given the new small order anonymous thp feature will be exposed to user space as an extension to thp, I've opted to call the new counters after thp also (as aposed to "large"/"large folio"/etc.), so "huge" no longer strictly means PMD - one could argue hugetlb already breaks this rule anyway. I also did not want to risk breaking back compat by renaming/redefining the existing counters (which would have resulted in more consistent and clearer names). So the existing NR_ANON_THPS counters remain and continue to only refer to PMD-mapped THPs. And I've added new counters, which only refer to PTE-mapped THPs. Signed-off-by: Ryan Roberts --- Documentation/ABI/testing/procfs-smaps_rollup | 1 + Documentation/admin-guide/cgroup-v1/memory.rst | 5 ++++- Documentation/admin-guide/cgroup-v2.rst | 6 +++++- Documentation/admin-guide/mm/transhuge.rst | 11 +++++++---- Documentation/filesystems/proc.rst | 14 ++++++++++++-- drivers/base/node.c | 2 ++ fs/proc/meminfo.c | 2 ++ fs/proc/task_mmu.c | 4 ++++ include/linux/mmzone.h | 1 + mm/memcontrol.c | 8 ++++++++ mm/rmap.c | 11 +++++++++-- mm/show_mem.c | 2 ++ mm/vmstat.c | 1 + 13 files changed, 58 insertions(+), 10 deletions(-) diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup index b446a7154a1b..b50b3eda5a3f 100644 --- a/Documentation/ABI/testing/procfs-smaps_rollup +++ b/Documentation/ABI/testing/procfs-smaps_rollup @@ -34,6 +34,7 @@ Description: Anonymous: 68 kB LazyFree: 0 kB AnonHugePages: 0 kB + AnonHugePteMap: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 5f502bf68fbc..b7efc7531896 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -535,7 +535,10 @@ memory.stat file includes following statistics: cache # of bytes of page cache memory. rss # of bytes of anonymous and swap cache memory (includes transparent hugepages). - rss_huge # of bytes of anonymous transparent hugepages. + rss_huge # of bytes of anonymous transparent hugepages, mapped by + PMD. + anon_thp_pte # of bytes of anonymous transparent hugepages, mapped by + PTE. mapped_file # of bytes of mapped file (includes tmpfs/shmem) pgpgin # of charging events to the memory cgroup. The charging event happens each time a page is accounted as either mapped diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index b26b5274eaaf..48b961b8fc6d 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1421,7 +1421,11 @@ PAGE_SIZE multiple when read back. 
anon_thp Amount of memory used in anonymous mappings backed by - transparent hugepages + transparent hugepages, mapped by PMD + + anon_thp_pte + Amount of memory used in anonymous mappings backed by + transparent hugepages, mapped by PTE file_thp Amount of cached filesystem data backed by transparent diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index b0cc8243e093..ebda57850643 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -291,10 +291,13 @@ Monitoring usage ================ The number of anonymous transparent huge pages currently used by the -system is available by reading the AnonHugePages field in ``/proc/meminfo``. -To identify what applications are using anonymous transparent huge pages, -it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages fields -for each mapping. +system is available by reading the AnonHugePages and AnonHugePteMap +fields in ``/proc/meminfo``. To identify what applications are using +anonymous transparent huge pages, it is necessary to read +``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap +fields for each mapping. Note that in both cases, AnonHugePages refers +only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped +using PTEs. The number of file transparent huge pages mapped to userspace is available by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 2b59cff8be17..ccbb76a509f0 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -464,6 +464,7 @@ Memory Area, or VMA) there is a series of lines such as the following:: KSM: 0 kB LazyFree: 0 kB AnonHugePages: 0 kB + AnonHugePteMap: 0 kB ShmemPmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB @@ -511,7 +512,11 @@ pressure if the memory is clean. Please note that the printed value might be lower than the real value due to optimizations used in the current implementation. If this is not desirable please file a bug report. -"AnonHugePages" shows the amount of memory backed by transparent hugepage. +"AnonHugePages" shows the amount of memory backed by transparent hugepage, +mapped by PMD. + +"AnonHugePteMap" shows the amount of memory backed by transparent hugepage, +mapped by PTE. "ShmemPmdMapped" shows the amount of shared (shmem/tmpfs) memory backed by huge pages. @@ -1006,6 +1011,7 @@ Example output. You may not have all of these fields. EarlyMemtestBad: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 4149248 kB + AnonHugePteMap: 0 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB FileHugePages: 0 kB @@ -1165,7 +1171,11 @@ HardwareCorrupted The amount of RAM/memory in KB, the kernel identifies as corrupted. 
AnonHugePages - Non-file backed huge pages mapped into userspace page tables + Non-file backed huge pages mapped into userspace page tables by + PMD +AnonHugePteMap + Non-file backed huge pages mapped into userspace page tables by + PTE ShmemHugePages Memory used by shared memory (shmem) and tmpfs allocated with huge pages diff --git a/drivers/base/node.c b/drivers/base/node.c index 493d533f8375..08f1759387d2 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -443,6 +443,7 @@ static ssize_t node_read_meminfo(struct device *dev, "Node %d SUnreclaim: %8lu kB\n" #ifdef CONFIG_TRANSPARENT_HUGEPAGE "Node %d AnonHugePages: %8lu kB\n" + "Node %d AnonHugePteMap: %8lu kB\n" "Node %d ShmemHugePages: %8lu kB\n" "Node %d ShmemPmdMapped: %8lu kB\n" "Node %d FileHugePages: %8lu kB\n" @@ -475,6 +476,7 @@ static ssize_t node_read_meminfo(struct device *dev, #ifdef CONFIG_TRANSPARENT_HUGEPAGE , nid, K(node_page_state(pgdat, NR_ANON_THPS)), + nid, K(node_page_state(pgdat, NR_ANON_THPS_PTEMAPPED)), nid, K(node_page_state(pgdat, NR_SHMEM_THPS)), nid, K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)), nid, K(node_page_state(pgdat, NR_FILE_THPS)), diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 45af9a989d40..bac20cc60b6a 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -143,6 +143,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v) #ifdef CONFIG_TRANSPARENT_HUGEPAGE show_val_kb(m, "AnonHugePages: ", global_node_page_state(NR_ANON_THPS)); + show_val_kb(m, "AnonHugePteMap: ", + global_node_page_state(NR_ANON_THPS_PTEMAPPED)); show_val_kb(m, "ShmemHugePages: ", global_node_page_state(NR_SHMEM_THPS)); show_val_kb(m, "ShmemPmdMapped: ", diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 3dd5be96691b..7b5dad163533 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -392,6 +392,7 @@ struct mem_size_stats { unsigned long anonymous; unsigned long lazyfree; unsigned long anonymous_thp; + unsigned long anonymous_thp_pte; unsigned long shmem_thp; unsigned long file_thp; unsigned long swap; @@ -452,6 +453,8 @@ static void smaps_account(struct mem_size_stats *mss, struct page *page, mss->anonymous += size; if (!PageSwapBacked(page) && !dirty && !PageDirty(page)) mss->lazyfree += size; + if (!compound && PageTransCompound(page)) + mss->anonymous_thp_pte += size; } if (PageKsm(page)) @@ -833,6 +836,7 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss, SEQ_PUT_DEC(" kB\nKSM: ", mss->ksm); SEQ_PUT_DEC(" kB\nLazyFree: ", mss->lazyfree); SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp); + SEQ_PUT_DEC(" kB\nAnonHugePteMap: ", mss->anonymous_thp_pte); SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp); SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp); SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb); diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 4106fbc5b4b3..5032fc31c651 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -186,6 +186,7 @@ enum node_stat_item { NR_FILE_THPS, NR_FILE_PMDMAPPED, NR_ANON_THPS, + NR_ANON_THPS_PTEMAPPED, NR_VMSCAN_WRITE, NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */ NR_DIRTIED, /* page dirtyings since bootup */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d13dde2f8b56..07d8e0b55b0e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -809,6 +809,7 @@ void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, case NR_ANON_MAPPED: case NR_FILE_MAPPED: case NR_ANON_THPS: + case NR_ANON_THPS_PTEMAPPED: case 
NR_SHMEM_PMDMAPPED: case NR_FILE_PMDMAPPED: WARN_ON_ONCE(!in_task()); @@ -1512,6 +1513,7 @@ static const struct memory_stat memory_stats[] = { #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE { "anon_thp", NR_ANON_THPS }, + { "anon_thp_pte", NR_ANON_THPS_PTEMAPPED }, { "file_thp", NR_FILE_THPS }, { "shmem_thp", NR_SHMEM_THPS }, #endif @@ -4052,6 +4054,7 @@ static const unsigned int memcg1_stats[] = { NR_ANON_MAPPED, #ifdef CONFIG_TRANSPARENT_HUGEPAGE NR_ANON_THPS, + NR_ANON_THPS_PTEMAPPED, #endif NR_SHMEM, NR_FILE_MAPPED, @@ -4067,6 +4070,7 @@ static const char *const memcg1_stat_names[] = { "rss", #ifdef CONFIG_TRANSPARENT_HUGEPAGE "rss_huge", + "anon_thp_pte", #endif "shmem", "mapped_file", @@ -6259,6 +6263,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd, * can be done but it would be too convoluted so simply * ignore such a partial THP and keep it in original * memcg. There should be somebody mapping the head. + * This simplification also means that pte-mapped large + * folios are never migrated, which means we don't need + * to worry about migrating the NR_ANON_THPS_PTEMAPPED + * accounting. */ if (PageTransCompound(page)) goto put; diff --git a/mm/rmap.c b/mm/rmap.c index 106149690366..52dabee73023 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1205,7 +1205,7 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, { struct folio *folio = page_folio(page); atomic_t *mapped = &folio->_nr_pages_mapped; - int nr = 0, nr_pmdmapped = 0; + int nr = 0, nr_pmdmapped = 0, nr_lgmapped = 0; bool compound = flags & RMAP_COMPOUND; bool first = true; @@ -1214,6 +1214,7 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, first = atomic_inc_and_test(&page->_mapcount); nr = first; if (first && folio_test_large(folio)) { + nr_lgmapped = 1; nr = atomic_inc_return_relaxed(mapped); nr = (nr < COMPOUND_MAPPED); } @@ -1241,6 +1242,8 @@ void page_add_anon_rmap(struct page *page, struct vm_area_struct *vma, if (nr_pmdmapped) __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr_pmdmapped); + if (nr_lgmapped) + __lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, nr_lgmapped); if (nr) __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr); @@ -1295,6 +1298,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma, } atomic_set(&folio->_nr_pages_mapped, nr); + __lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, nr); } else { /* increment count (starts at -1) */ atomic_set(&folio->_entire_mapcount, 0); @@ -1405,7 +1409,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, { struct folio *folio = page_folio(page); atomic_t *mapped = &folio->_nr_pages_mapped; - int nr = 0, nr_pmdmapped = 0; + int nr = 0, nr_pmdmapped = 0, nr_lgmapped = 0; bool last; enum node_stat_item idx; @@ -1423,6 +1427,7 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, last = atomic_add_negative(-1, &page->_mapcount); nr = last; if (last && folio_test_large(folio)) { + nr_lgmapped = 1; nr = atomic_dec_return_relaxed(mapped); nr = (nr < COMPOUND_MAPPED); } @@ -1454,6 +1459,8 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma, idx = NR_FILE_PMDMAPPED; __lruvec_stat_mod_folio(folio, idx, -nr_pmdmapped); } + if (nr_lgmapped && folio_test_anon(folio)) + __lruvec_stat_mod_folio(folio, NR_ANON_THPS_PTEMAPPED, -nr_lgmapped); if (nr) { idx = folio_test_anon(folio) ? 
NR_ANON_MAPPED : NR_FILE_MAPPED; __lruvec_stat_mod_folio(folio, idx, -nr); diff --git a/mm/show_mem.c b/mm/show_mem.c index 4b888b18bdde..e648a815f0fb 100644 --- a/mm/show_mem.c +++ b/mm/show_mem.c @@ -254,6 +254,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z " shmem_thp:%lukB" " shmem_pmdmapped:%lukB" " anon_thp:%lukB" + " anon_thp_pte:%lukB" #endif " writeback_tmp:%lukB" " kernel_stack:%lukB" @@ -280,6 +281,7 @@ static void show_free_areas(unsigned int filter, nodemask_t *nodemask, int max_z K(node_page_state(pgdat, NR_SHMEM_THPS)), K(node_page_state(pgdat, NR_SHMEM_PMDMAPPED)), K(node_page_state(pgdat, NR_ANON_THPS)), + K(node_page_state(pgdat, NR_ANON_THPS_PTEMAPPED)), #endif K(node_page_state(pgdat, NR_WRITEBACK_TEMP)), node_page_state(pgdat, NR_KERNEL_STACK_KB), diff --git a/mm/vmstat.c b/mm/vmstat.c index 00e81e99c6ee..267de0e4ddca 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1224,6 +1224,7 @@ const char * const vmstat_text[] = { "nr_file_hugepages", "nr_file_pmdmapped", "nr_anon_transparent_hugepages", + "nr_anon_thp_pte", "nr_vmscan_write", "nr_vmscan_immediate_reclaim", "nr_dirtied", From patchwork Fri Sep 29 11:44:15 2023 X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 13404137 From: Ryan Roberts To: Andrew Morton , Matthew Wilcox , Yin Fengwei , David Hildenbrand , Yu Zhao , Catalin Marinas , Anshuman Khandual , Yang Shi , "Huang, Ying" , Zi Yan , Luis Chamberlain , Itaru
Kitayama , "Kirill A. Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [PATCH v6 4/9] mm: thp: Introduce anon_orders and anon_always_mask sysfs files Date: Fri, 29 Sep 2023 12:44:15 +0100 Message-Id: <20230929114421.3761121-5-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com> References: <20230929114421.3761121-1-ryan.roberts@arm.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230929_044445_323645_DAB7D9EB X-CRM114-Status: GOOD ( 36.42 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org In preparation for adding support for anonymous large folios that are smaller than the PMD-size, introduce 2 new sysfs files that will be used to control the new behaviours via the transparent_hugepage interface. For now, the kernel still only supports PMD-order anonymous THP, so when reading back anon_orders, it will reflect that. Therefore there are no behavioural changes intended here. The bulk of the change is implemented by converting transhuge_vma_suitable() and hugepage_vma_check() so that they take a bitfield of orders for which the user wants to determine support, and the functions filter out all the orders that can't be supported. If there is only 1 order set in the input then the output can continue to be treated like a boolean; this is the case for most call sites. The remainder is copied from Documentation/admin-guide/mm/transhuge.rst, as modified by this commit. See that file for further details. By default, allocation of anonymous THPs that are smaller than PMD-size is disabled. These smaller allocation orders can be enabled by writing an encoded set of orders as follows:: echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders Where an order refers to the number of pages in the large folio as 2^order, and where each order is encoded in the written value such that each set bit represents an enabled order; So setting bit-2 indicates that order-2 folios are in use, and order-2 means 2^2=4 pages (=16K if the page size is 4K). The example above enables order-9 (PMD-order) and order-3. By enabling multiple orders, allocation of each order will be attempted, highest to lowest, until a successful allocation is made. If the PMD-order is unset, then no PMD-sized THPs will be allocated. The kernel will ignore any orders that it does not support so read the file back to determine which orders are enabled:: cat /sys/kernel/mm/transparent_hugepage/anon_orders For some workloads it may be desirable to limit some THP orders to be used only for MADV_HUGEPAGE regions, while allowing others to be used always. For example, a workload may only benefit from PMD-sized THP in specific areas, but can take benefit of 32K sized THP more generally. In this case, THP can be enabled in ``madvise`` mode as normal, but specific orders can be configured to be allocated as if in ``always`` mode. 
The below example enables orders 9 and 3, with order-9 only applied to MADV_HUGEPAGE regions, and order-3 applied always:: echo madvise >/sys/kernel/mm/transparent_hugepage/enabled echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders echo 0x008 >/sys/kernel/mm/transparent_hugepage/anon_always_mask Signed-off-by: Ryan Roberts --- Documentation/admin-guide/mm/transhuge.rst | 74 ++++++++-- Documentation/filesystems/proc.rst | 6 +- fs/proc/task_mmu.c | 3 +- include/linux/huge_mm.h | 93 +++++++++--- mm/huge_memory.c | 164 ++++++++++++++++++--- mm/khugepaged.c | 18 ++- mm/memory.c | 6 +- mm/page_vma_mapped.c | 3 +- 8 files changed, 296 insertions(+), 71 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index ebda57850643..9f954e73a4ca 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -45,10 +45,22 @@ components: the two is using hugepages just because of the fact the TLB miss is going to run faster. +Furthermore, it is possible to configure THP to allocate large folios +to back anonymous memory, which are smaller than PMD-size (for example +16K, 32K, 64K, etc). These THPs continue to be PTE-mapped, but in many +cases can still provide the similar benefits to those outlined above: +Page faults are significantly reduced (by a factor of e.g. 4, 8, 16, +etc), but latency spikes are much less prominent because the size of +each page isn't as huge as the PMD-sized variant and there is less +memory to clear in each page fault. Some architectures also employ TLB +compression mechanisms to squeeze more entries in when a set of PTEs +are virtually and physically contiguous and approporiately aligned. In +this case, TLB misses will occur less often. + THP can be enabled system wide or restricted to certain tasks or even memory ranges inside task's address space. Unless THP is completely disabled, there is ``khugepaged`` daemon that scans memory and -collapses sequences of basic pages into huge pages. +collapses sequences of basic pages into PMD-sized huge pages. The THP behaviour is controlled via :ref:`sysfs ` interface and using madvise(2) and prctl(2) system calls. @@ -146,25 +158,69 @@ madvise never should be self-explanatory. -By default kernel tries to use huge zero page on read page fault to -anonymous mapping. It's possible to disable huge zero page by writing 0 -or enable it back by writing 1:: +By default kernel tries to use huge, PMD-mapped zero page on read page +fault to anonymous mapping. It's possible to disable huge zero page by +writing 0 or enable it back by writing 1:: echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page Some userspace (such as a test program, or an optimized memory allocation -library) may want to know the size (in bytes) of a transparent hugepage:: +library) may want to know the size (in bytes) of a PMD-mappable +transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size +By default, allocation of anonymous THPs that are smaller than +PMD-size is disabled. 
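The "highest to lowest" walk described above is driven by the first_order()/next_order() helpers this series relies on; their kernel definitions are not fully visible in this excerpt, so the following user-space sketch only mirrors their assumed semantics for illustration::

   #include <stdio.h>

   /* Highest set bit in the order bitfield; assumes orders != 0. */
   static int first_order(unsigned int orders)
   {
           return 31 - __builtin_clz(orders);
   }

   /* Clear the order just tried and return the next highest enabled order. */
   static int next_order(unsigned int *orders, int prev)
   {
           *orders &= ~(1u << prev);
           return *orders ? first_order(*orders) : 0;
   }

   int main(void)
   {
           unsigned int orders = 0x208;    /* orders 9 and 3 enabled */
           int order = first_order(orders);

           while (orders) {
                   printf("try an order-%d allocation\n", order);
                   /* A real caller would break out here on success. */
                   order = next_order(&orders, order);
           }
           return 0;
   }

A caller such as hugepage_vma_check(), or alloc_anon_folio() later in the series, breaks out of this loop at the first order that passes its check or allocates successfully.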
These smaller allocation orders can be enabled +by writing an encoded set of orders as follows:: + + echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders + +Where an order refers to the number of pages in the large folio as +2^order, and where each order is encoded in the written value such +that each set bit represents an enabled order; So setting bit-2 +indicates that order-2 folios are in use, and order-2 means 2^2=4 +pages (=16K if the page size is 4K). The example above enables order-9 +(PMD-order) and order-3. + +By enabling multiple orders, allocation of each order will be +attempted, highest to lowest, until a successful allocation is made. +If the PMD-order is unset, then no PMD-sized THPs will be allocated. + +The kernel will ignore any orders that it does not support so read the +file back to determine which orders are enabled:: + + cat /sys/kernel/mm/transparent_hugepage/anon_orders + +For some workloads it may be desirable to limit some THP orders to be +used only for MADV_HUGEPAGE regions, while allowing others to be used +always. For example, a workload may only benefit from PMD-sized THP in +specific areas, but can take benefit of 32K sized THP more generally. +In this case, THP can be enabled in ``madvise`` mode as normal, but +specific orders can be configured to be allocated as if in ``always`` +mode. The below example enables orders 9 and 3, with order-9 only +applied to MADV_HUGEPAGE regions, and order-3 applied always:: + + echo madvise >/sys/kernel/mm/transparent_hugepage/enabled + echo 0x208 >/sys/kernel/mm/transparent_hugepage/anon_orders + echo 0x008 >/sys/kernel/mm/transparent_hugepage/anon_always_mask + khugepaged will be automatically started when -transparent_hugepage/enabled is set to "always" or "madvise, and it'll -be automatically shutdown if it's set to "never". +transparent_hugepage/enabled is set to "always" or "madvise", +providing the PMD-order is enabled in +transparent_hugepage/anon_orders, and it'll be automatically shutdown +if it's set to "never" or the PMD-order is disabled in +transparent_hugepage/anon_orders. Khugepaged controls ------------------- +.. note:: + khugepaged currently only searches for opportunities to collapse to + PMD-sized THP and no attempt is made to collapse to smaller order + THP. + khugepaged runs usually at low frequency so while one may not want to invoke defrag algorithms synchronously during the page faults, it should be worth invoking defrag at least in khugepaged. However it's @@ -285,7 +341,7 @@ Need of application restart The transparent_hugepage/enabled values and tmpfs mount option only affect future behavior. So to make them effective you need to restart any application that could have been using hugepages. This also applies to the -regions registered in khugepaged. +regions registered in khugepaged, and transparent_hugepage/anon_orders. Monitoring usage ================ @@ -416,7 +472,7 @@ for huge pages. Optimizing the applications =========================== -To be guaranteed that the kernel will map a 2M page immediately in any +To be guaranteed that the kernel will map a thp immediately in any memory region, the mmap region has to be hugepage naturally aligned. posix_memalign() can provide that guarantee. 
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index ccbb76a509f0..72526f8bb658 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -533,9 +533,9 @@ replaced by copy-on-write) part of the underlying shmem object out on swap. does not take into account swapped out page of underlying shmem objects. "Locked" indicates whether the mapping is locked in memory or not. -"THPeligible" indicates whether the mapping is eligible for allocating THP -pages as well as the THP is PMD mappable or not - 1 if true, 0 otherwise. -It just shows the current status. +"THPeligible" indicates whether the mapping is eligible for allocating +naturally aligned THP pages of any currently enabled order. 1 if true, 0 +otherwise. It just shows the current status. "VmFlags" field deserves a separate description. This member represents the kernel flags associated with the particular virtual memory area in two letter diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 7b5dad163533..f978dce7f7ce 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -869,7 +869,8 @@ static int show_smap(struct seq_file *m, void *v) __show_smap(m, &mss, false); seq_printf(m, "THPeligible: %8u\n", - hugepage_vma_check(vma, vma->vm_flags, true, false, true)); + !!hugepage_vma_check(vma, vma->vm_flags, true, false, true, + THP_ORDERS_ALL)); if (arch_pkeys_enabled()) seq_printf(m, "ProtectionKey: %8u\n", vma_pkey(vma)); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index fa0350b0812a..2e7c338229a6 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -67,6 +67,21 @@ extern struct kobj_attribute shmem_enabled_attr; #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) #define HPAGE_PMD_NR (1<vm_start >> PAGE_SHIFT) - vma->vm_pgoff, - HPAGE_PMD_NR)) - return false; +static inline unsigned int transhuge_vma_suitable(struct vm_area_struct *vma, + unsigned long addr, unsigned int orders) +{ + int order; + + /* + * Iterate over orders, highest to lowest, removing orders that don't + * meet alignment requirements from the set. Exit loop at first order + * that meets requirements, since all lower orders must also meet + * requirements. 
+ */ + + order = first_order(orders); + + while (orders) { + unsigned long hpage_size = PAGE_SIZE << order; + unsigned long haddr = ALIGN_DOWN(addr, hpage_size); + + if (haddr >= vma->vm_start && + haddr + hpage_size <= vma->vm_end) { + if (!vma_is_anonymous(vma)) { + if (IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - + vma->vm_pgoff, + hpage_size >> PAGE_SHIFT)) + break; + } else + break; + } + + order = next_order(&orders, order); } - haddr = addr & HPAGE_PMD_MASK; - - if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) - return false; - return true; + return orders; } static inline bool file_thp_enabled(struct vm_area_struct *vma) @@ -130,8 +173,9 @@ static inline bool file_thp_enabled(struct vm_area_struct *vma) !inode_is_open_for_write(inode) && S_ISREG(inode->i_mode); } -bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, - bool smaps, bool in_pf, bool enforce_sysfs); +unsigned int hugepage_vma_check(struct vm_area_struct *vma, + unsigned long vm_flags, bool smaps, bool in_pf, + bool enforce_sysfs, unsigned int orders); #define transparent_hugepage_use_zero_page() \ (transparent_hugepage_flags & \ @@ -267,17 +311,18 @@ static inline bool folio_test_pmd_mappable(struct folio *folio) return false; } -static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, - unsigned long addr) +static inline unsigned int transhuge_vma_suitable(struct vm_area_struct *vma, + unsigned long addr, unsigned int orders) { - return false; + return 0; } -static inline bool hugepage_vma_check(struct vm_area_struct *vma, - unsigned long vm_flags, bool smaps, - bool in_pf, bool enforce_sysfs) +static inline unsigned int hugepage_vma_check(struct vm_area_struct *vma, + unsigned long vm_flags, bool smaps, + bool in_pf, bool enforce_sysfs, + unsigned int orders) { - return false; + return 0; } static inline void folio_prep_large_rmappable(struct folio *folio) {} diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 064fbd90822b..bcecce769017 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -70,12 +70,48 @@ static struct shrinker deferred_split_shrinker; static atomic_t huge_zero_refcount; struct page *huge_zero_page __read_mostly; unsigned long huge_zero_pfn __read_mostly = ~0UL; +unsigned int huge_anon_orders __read_mostly = BIT(PMD_ORDER); +static unsigned int huge_anon_always_mask __read_mostly; -bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, - bool smaps, bool in_pf, bool enforce_sysfs) +/** + * hugepage_vma_check - determine which hugepage orders can be applied to vma + * @vma: the vm area to check + * @vm_flags: use these vm_flags instead of vma->vm_flags + * @smaps: whether answer will be used for smaps file + * @in_pf: whether answer will be used by page fault handler + * @enforce_sysfs: whether sysfs config should be taken into account + * @orders: bitfield of all orders to consider + * + * Calculates the intersection of the requested hugepage orders and the allowed + * hugepage orders for the provided vma. Permitted orders are encoded as a set + * bit at the corresponding bit position (bit-2 corresponds to order-2, bit-3 + * corresponds to order-3, etc). Order-0 is never considered a hugepage order. + * + * Return: bitfield of orders allowed for hugepage in the vma. 0 if no hugepage + * orders are allowed. 
+ */ +unsigned int hugepage_vma_check(struct vm_area_struct *vma, + unsigned long vm_flags, bool smaps, bool in_pf, + bool enforce_sysfs, unsigned int orders) { + /* + * Fix up the orders mask; Supported orders for file vmas are static. + * Supported orders for anon vmas are configured dynamically - but only + * use the dynamic set if enforce_sysfs=true, otherwise use the full + * set. + */ + if (vma_is_anonymous(vma)) + orders &= enforce_sysfs ? READ_ONCE(huge_anon_orders) + : THP_ORDERS_ALL_ANON; + else + orders &= THP_ORDERS_ALL_FILE; + + /* No orders in the intersection. */ + if (!orders) + return 0; + if (!vma->vm_mm) /* vdso */ - return false; + return 0; /* * Explicitly disabled through madvise or prctl, or some @@ -84,16 +120,16 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, * */ if ((vm_flags & VM_NOHUGEPAGE) || test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags)) - return false; + return 0; /* * If the hardware/firmware marked hugepage support disabled. */ if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_UNSUPPORTED)) - return false; + return 0; /* khugepaged doesn't collapse DAX vma, but page fault is fine. */ if (vma_is_dax(vma)) - return in_pf; + return in_pf ? orders : 0; /* * Special VMA and hugetlb VMA. @@ -101,17 +137,29 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, * VM_MIXEDMAP set. */ if (vm_flags & VM_NO_KHUGEPAGED) - return false; + return 0; /* - * Check alignment for file vma and size for both file and anon vma. + * Check alignment for file vma and size for both file and anon vma by + * filtering out the unsuitable orders. * * Skip the check for page fault. Huge fault does the check in fault - * handlers. And this check is not suitable for huge PUD fault. + * handlers. */ - if (!in_pf && - !transhuge_vma_suitable(vma, (vma->vm_end - HPAGE_PMD_SIZE))) - return false; + if (!in_pf) { + int order = first_order(orders); + unsigned long addr; + + while (orders) { + addr = vma->vm_end - (PAGE_SIZE << order); + if (transhuge_vma_suitable(vma, addr, BIT(order))) + break; + order = next_order(&orders, order); + } + + if (!orders) + return 0; + } /* * Enabled via shmem mount options or sysfs settings. @@ -120,23 +168,35 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, */ if (!in_pf && shmem_file(vma->vm_file)) return shmem_is_huge(file_inode(vma->vm_file), vma->vm_pgoff, - !enforce_sysfs, vma->vm_mm, vm_flags); + !enforce_sysfs, vma->vm_mm, vm_flags) + ? orders : 0; /* Enforce sysfs THP requirements as necessary */ - if (enforce_sysfs && - (!hugepage_flags_enabled() || (!(vm_flags & VM_HUGEPAGE) && - !hugepage_flags_always()))) - return false; + if (enforce_sysfs) { + /* enabled=never. */ + if (!hugepage_flags_enabled()) + return 0; + + /* enabled=madvise without VM_HUGEPAGE. */ + if (!(vm_flags & VM_HUGEPAGE) && !hugepage_flags_always()) { + if (vma_is_anonymous(vma)) { + orders &= READ_ONCE(huge_anon_always_mask); + if (!orders) + return 0; + } else + return 0; + } + } /* Only regular file is valid */ if (!in_pf && file_thp_enabled(vma)) - return true; + return orders; if (!vma_is_anonymous(vma)) - return false; + return 0; if (vma_is_temporary_stack(vma)) - return false; + return 0; /* * THPeligible bit of smaps should show 1 for proper VMAs even @@ -146,9 +206,9 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags, * the first page fault. */ if (!vma->anon_vma) - return (smaps || in_pf); + return (smaps || in_pf) ? 
orders : 0; - return true; + return orders; } static bool get_huge_zero_page(void) @@ -391,11 +451,69 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj, static struct kobj_attribute hpage_pmd_size_attr = __ATTR_RO(hpage_pmd_size); +static ssize_t anon_orders_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "0x%08x\n", READ_ONCE(huge_anon_orders)); +} + +static ssize_t anon_orders_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err; + int ret = count; + unsigned int orders; + + err = kstrtouint(buf, 0, &orders); + if (err) + ret = -EINVAL; + + if (ret > 0) { + orders &= THP_ORDERS_ALL_ANON; + WRITE_ONCE(huge_anon_orders, orders); + + err = start_stop_khugepaged(); + if (err) + ret = err; + } + + return ret; +} + +static struct kobj_attribute anon_orders_attr = __ATTR_RW(anon_orders); + +static ssize_t anon_always_mask_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) +{ + return sysfs_emit(buf, "0x%08x\n", READ_ONCE(huge_anon_always_mask)); +} + +static ssize_t anon_always_mask_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) +{ + int err; + unsigned int always_mask; + + err = kstrtouint(buf, 0, &always_mask); + if (err) + return -EINVAL; + + WRITE_ONCE(huge_anon_always_mask, always_mask); + + return count; +} + +static struct kobj_attribute anon_always_mask_attr = __ATTR_RW(anon_always_mask); + static struct attribute *hugepage_attr[] = { &enabled_attr.attr, &defrag_attr.attr, &use_zero_page_attr.attr, &hpage_pmd_size_attr.attr, + &anon_orders_attr.attr, + &anon_always_mask_attr.attr, #ifdef CONFIG_SHMEM &shmem_enabled_attr.attr, #endif @@ -778,7 +896,7 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf) struct folio *folio; unsigned long haddr = vmf->address & HPAGE_PMD_MASK; - if (!transhuge_vma_suitable(vma, haddr)) + if (!transhuge_vma_suitable(vma, haddr, BIT(PMD_ORDER))) return VM_FAULT_FALLBACK; if (unlikely(anon_vma_prepare(vma))) return VM_FAULT_OOM; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 88433cc25d8a..2b5c0321d96b 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -446,7 +446,8 @@ void khugepaged_enter_vma(struct vm_area_struct *vma, { if (!test_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags) && hugepage_flags_enabled()) { - if (hugepage_vma_check(vma, vm_flags, false, false, true)) + if (hugepage_vma_check(vma, vm_flags, false, false, true, + BIT(PMD_ORDER))) __khugepaged_enter(vma->vm_mm); } } @@ -921,10 +922,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, if (!vma) return SCAN_VMA_NULL; - if (!transhuge_vma_suitable(vma, address)) + if (!transhuge_vma_suitable(vma, address, BIT(PMD_ORDER))) return SCAN_ADDRESS_RANGE; if (!hugepage_vma_check(vma, vma->vm_flags, false, false, - cc->is_khugepaged)) + cc->is_khugepaged, BIT(PMD_ORDER))) return SCAN_VMA_CHECK; /* * Anon VMA expected, the address may be unmapped then @@ -1499,7 +1500,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, * and map it by a PMD, regardless of sysfs THP settings. As such, let's * analogously elide sysfs THP settings here. 
*/ - if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false)) + if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false, + BIT(PMD_ORDER))) return SCAN_VMA_CHECK; /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */ @@ -2369,7 +2371,8 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, progress++; break; } - if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true)) { + if (!hugepage_vma_check(vma, vma->vm_flags, false, false, true, + BIT(PMD_ORDER))) { skip: progress++; continue; @@ -2626,7 +2629,7 @@ int start_stop_khugepaged(void) int err = 0; mutex_lock(&khugepaged_mutex); - if (hugepage_flags_enabled()) { + if (hugepage_flags_enabled() && (huge_anon_orders & BIT(PMD_ORDER))) { if (!khugepaged_thread) khugepaged_thread = kthread_run(khugepaged, NULL, "khugepaged"); @@ -2706,7 +2709,8 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, *prev = vma; - if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false)) + if (!hugepage_vma_check(vma, vma->vm_flags, false, false, false, + BIT(PMD_ORDER))) return -EINVAL; cc = kmalloc(sizeof(*cc), GFP_KERNEL); diff --git a/mm/memory.c b/mm/memory.c index e4b0f6a461d8..b5b82fc8e164 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4256,7 +4256,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page) pmd_t entry; vm_fault_t ret = VM_FAULT_FALLBACK; - if (!transhuge_vma_suitable(vma, haddr)) + if (!transhuge_vma_suitable(vma, haddr, BIT(PMD_ORDER))) return ret; page = compound_head(page); @@ -5055,7 +5055,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, return VM_FAULT_OOM; retry_pud: if (pud_none(*vmf.pud) && - hugepage_vma_check(vma, vm_flags, false, true, true)) { + hugepage_vma_check(vma, vm_flags, false, true, true, BIT(PUD_ORDER))) { ret = create_huge_pud(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; @@ -5089,7 +5089,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma, goto retry_pud; if (pmd_none(*vmf.pmd) && - hugepage_vma_check(vma, vm_flags, false, true, true)) { + hugepage_vma_check(vma, vm_flags, false, true, true, BIT(PMD_ORDER))) { ret = create_huge_pmd(&vmf); if (!(ret & VM_FAULT_FALLBACK)) return ret; diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index e0b368e545ed..5f7e89c5b595 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -268,7 +268,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) * cleared *pmd but not decremented compound_mapcount(). 
*/ if ((pvmw->flags & PVMW_SYNC) && - transhuge_vma_suitable(vma, pvmw->address) && + transhuge_vma_suitable(vma, pvmw->address, + BIT(PMD_ORDER)) && (pvmw->nr_pages >= HPAGE_PMD_NR)) { spinlock_t *ptl = pmd_lock(mm, pvmw->pmd); From patchwork Fri Sep 29 11:44:16 2023 X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 13404139 From: Ryan Roberts To: Andrew Morton , Matthew Wilcox , Yin Fengwei , David Hildenbrand , Yu Zhao , Catalin Marinas , Anshuman Khandual , Yang Shi , "Huang, Ying" , Zi Yan , Luis Chamberlain , Itaru Kitayama , "Kirill A.
Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [PATCH v6 5/9] mm: thp: Extend THP to allocate anonymous large folios Date: Fri, 29 Sep 2023 12:44:16 +0100 Message-Id: <20230929114421.3761121-6-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com> References: <20230929114421.3761121-1-ryan.roberts@arm.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230929_044448_392757_E06968C9 X-CRM114-Status: GOOD ( 27.15 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org Introduce the logic to allow THP to be configured (through the new anon_orders interface we just added) to allocate large folios to back anonymous memory, which are smaller than PMD-size (for example order-2, order-3, order-4, etc). These THPs continue to be PTE-mapped, but in many cases can still provide similar benefits to traditional PMD-sized THP: Page faults are significantly reduced (by a factor of e.g. 4, 8, 16, etc. depending on the configured order), but latency spikes are much less prominent because the size of each page isn't as huge as the PMD-sized variant and there is less memory to clear in each page fault. The number of per-page operations (e.g. ref counting, rmap management, lru list management) are also significantly reduced since those ops now become per-folio. Some architectures also employ TLB compression mechanisms to squeeze more entries in when a set of PTEs are virtually and physically contiguous and approporiately aligned. In this case, TLB misses will occur less often. The new behaviour is disabled by default because the anon_orders defaults to only enabling PMD-order, but can be enabled at runtime by writing to anon_orders (see documentation in previous commit). The long term aim is to default anon_orders to include suitable lower orders, but there are some risks around internal fragmentation that need to be better understood first. Signed-off-by: Ryan Roberts --- Documentation/admin-guide/mm/transhuge.rst | 9 +- include/linux/huge_mm.h | 6 +- mm/memory.c | 108 +++++++++++++++++++-- 3 files changed, 111 insertions(+), 12 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 9f954e73a4ca..732c3b2f4ba8 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -353,7 +353,9 @@ anonymous transparent huge pages, it is necessary to read ``/proc/PID/smaps`` and count the AnonHugePages and AnonHugePteMap fields for each mapping. Note that in both cases, AnonHugePages refers only to PMD-mapped THPs. AnonHugePteMap refers to THPs that are mapped -using PTEs. +using PTEs. This includes all THPs whose order is smaller than +PMD-order, as well as any PMD-order THPs that happen to be PTE-mapped +for other reasons. The number of file transparent huge pages mapped to userspace is available by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. @@ -367,6 +369,11 @@ frequently will incur overhead. 
There are a number of counters in ``/proc/vmstat`` that may be used to monitor how successfully the system is providing huge pages for use. +.. note:: + Currently the below counters only record events relating to + PMD-order THPs. Events relating to smaller order THPs are not + included. + thp_fault_alloc is incremented every time a huge page is successfully allocated to handle a page fault. diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2e7c338229a6..c4860476a1f5 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -68,9 +68,11 @@ extern struct kobj_attribute shmem_enabled_attr; #define HPAGE_PMD_NR (1<pte + i))) + return true; + } + + return false; +} + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +static struct folio *alloc_anon_folio(struct vm_fault *vmf) +{ + gfp_t gfp; + pte_t *pte; + unsigned long addr; + struct folio *folio; + struct vm_area_struct *vma = vmf->vma; + unsigned int orders; + int order; + + /* + * If uffd is active for the vma we need per-page fault fidelity to + * maintain the uffd semantics. + */ + if (userfaultfd_armed(vma)) + goto fallback; + + /* + * Get a list of all the (large) orders below PMD_ORDER that are enabled + * for this vma. Then filter out the orders that can't be allocated over + * the faulting address and still be fully contained in the vma. + */ + orders = hugepage_vma_check(vma, vma->vm_flags, false, true, true, + BIT(PMD_ORDER) - 1); + orders = transhuge_vma_suitable(vma, vmf->address, orders); + + if (!orders) + goto fallback; + + pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK); + if (!pte) + return ERR_PTR(-EAGAIN); + + order = first_order(orders); + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + vmf->pte = pte + pte_index(addr); + if (!vmf_pte_range_changed(vmf, 1 << order)) + break; + order = next_order(&orders, order); + } + + vmf->pte = NULL; + pte_unmap(pte); + + gfp = vma_thp_gfp_mask(vma); + + while (orders) { + addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order); + folio = vma_alloc_folio(gfp, order, vma, addr, true); + if (folio) { + clear_huge_page(&folio->page, addr, 1 << order); + return folio; + } + order = next_order(&orders, order); + } + +fallback: + return vma_alloc_zeroed_movable_folio(vma, vmf->address); +} +#else +#define alloc_anon_folio(vmf) \ + vma_alloc_zeroed_movable_folio((vmf)->vma, (vmf)->address) +#endif + /* * We enter with non-exclusive mmap_lock (to exclude vma changes, * but allow concurrent faults), and pte mapped but not yet locked. @@ -4066,6 +4147,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) */ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) { + int i; + int nr_pages = 1; + unsigned long addr = vmf->address; bool uffd_wp = vmf_orig_pte_uffd_wp(vmf); struct vm_area_struct *vma = vmf->vma; struct folio *folio; @@ -4110,10 +4194,15 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) /* Allocate our own private page. 
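* alloc_anon_folio() may now return a large folio of one of the enabled anon orders, an ERR_PTR(-EAGAIN) if the pte table could not be mapped (the caller returns 0 so the fault is simply retried), or NULL on allocation failure.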
*/ if (unlikely(anon_vma_prepare(vma))) goto oom; - folio = vma_alloc_zeroed_movable_folio(vma, vmf->address); + folio = alloc_anon_folio(vmf); + if (IS_ERR(folio)) + return 0; if (!folio) goto oom; + nr_pages = folio_nr_pages(folio); + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE); + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) goto oom_free_page; folio_throttle_swaprate(folio, GFP_KERNEL); @@ -4130,12 +4219,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) if (vma->vm_flags & VM_WRITE) entry = pte_mkwrite(pte_mkdirty(entry), vma); - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, - &vmf->ptl); + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl); if (!vmf->pte) goto release; - if (vmf_pte_changed(vmf)) { - update_mmu_tlb(vma, vmf->address, vmf->pte); + if (vmf_pte_range_changed(vmf, nr_pages)) { + for (i = 0; i < nr_pages; i++) + update_mmu_tlb(vma, addr + PAGE_SIZE * i, vmf->pte + i); goto release; } @@ -4150,16 +4239,17 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) return handle_userfault(vmf, VM_UFFD_MISSING); } - inc_mm_counter(vma->vm_mm, MM_ANONPAGES); - folio_add_new_anon_rmap(folio, vma, vmf->address); + folio_ref_add(folio, nr_pages - 1); + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages); + folio_add_new_anon_rmap(folio, vma, addr); folio_add_lru_vma(folio, vma); setpte: if (uffd_wp) entry = pte_mkuffd_wp(entry); - set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr_pages); /* No need to invalidate - it was non-present before */ - update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1); + update_mmu_cache_range(vmf, vma, addr, vmf->pte, nr_pages); unlock: if (vmf->pte) pte_unmap_unlock(vmf->pte, vmf->ptl); From patchwork Fri Sep 29 11:44:17 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 13404140 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id EDC63E7F154 for ; Fri, 29 Sep 2023 11:46:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:Cc:To:From:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=RvZbUj4wNF4FGeakisFL84QObU5poW4GDhlax5f8Us4=; b=e5UyoEGQO1Lk3R XVafpT89lOVSRtT+Bru0SF2IBBR8SVDRP1RIWhAJyxyChvpv2HEcxaC7Ym7FN0orCHljwj0rZwZ+i woZ1yZxRkC7eXu6ThC/EJfNbiWbvQvI/yIozkdsc3TUbkElbDblLMKvVzQ+QcpBGLlq0+rGm/8PT8 j9YDwFFBsvs5zhUnduZa5RVVADe66dIPCw3u96NGHR0tbgCXKAmki7rQ5Nwf0tCbjvrXPMI0Z1POY u2FiOKgR7WCCJwwL670/kk7ZZvXGQ5uu0yz4VKFLqjZqZPm/68DxtJXtPZnpvi6Fq6SOBfsgP/yeq 0EpKbwdfgL7Dj8zgvepA==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 1qmBwD-007pdZ-1k; Fri, 29 Sep 2023 11:45:41 +0000 Received: from foss.arm.com ([217.140.110.172]) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 1qmBvP-007p8E-1H for 
linux-arm-kernel@lists.infradead.org; Fri, 29 Sep 2023 11:44:56 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 523D91007; Fri, 29 Sep 2023 04:45:28 -0700 (PDT) Received: from e125769.cambridge.arm.com (e125769.cambridge.arm.com [10.1.196.26]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 7040B3F59C; Fri, 29 Sep 2023 04:44:47 -0700 (PDT) From: Ryan Roberts To: Andrew Morton , Matthew Wilcox , Yin Fengwei , David Hildenbrand , Yu Zhao , Catalin Marinas , Anshuman Khandual , Yang Shi , "Huang, Ying" , Zi Yan , Luis Chamberlain , Itaru Kitayama , "Kirill A. Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [PATCH v6 6/9] mm: thp: Add "recommend" option for anon_orders Date: Fri, 29 Sep 2023 12:44:17 +0100 Message-Id: <20230929114421.3761121-7-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com> References: <20230929114421.3761121-1-ryan.roberts@arm.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230929_044451_526335_7BBCBF96 X-CRM114-Status: GOOD ( 18.54 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org In addition to passing a bitfield of folio orders to enable for THP, allow the string "recommend" to be written, which has the effect of causing the system to enable the orders preferred by the architecture and by the mm. The user can see what these orders are by subsequently reading back the file. Note that these recommended orders are expected to be static for a given boot of the system, and so the keyword "auto" was deliberately not used, as I want to reserve it for a possible future use where the "best" order is chosen more dynamically at runtime. Recommended orders are determined as follows: - PMD_ORDER: The traditional THP size - arch_wants_pte_order() if implemented by the arch - PAGE_ALLOC_COSTLY_ORDER: The largest order kept on per-cpu free list arch_wants_pte_order() can be overridden by the architecture if desired. Some architectures (e.g. arm64) can coalsece TLB entries if a contiguous set of ptes map physically contigious, naturally aligned memory, so this mechanism allows the architecture to optimize as required. Here we add the default implementation of arch_wants_pte_order(), used when the architecture does not define it, which returns -1, implying that the HW has no preference. Signed-off-by: Ryan Roberts --- Documentation/admin-guide/mm/transhuge.rst | 4 ++++ include/linux/pgtable.h | 13 +++++++++++++ mm/huge_memory.c | 14 +++++++++++--- 3 files changed, 28 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 732c3b2f4ba8..d6363d4efa3a 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -187,6 +187,10 @@ pages (=16K if the page size is 4K). The example above enables order-9 By enabling multiple orders, allocation of each order will be attempted, highest to lowest, until a successful allocation is made. 
If the PMD-order is unset, then no PMD-sized THPs will be allocated. +It is also possible to enable the recommended set of orders, which +will be optimized for the architecture and mm:: + + echo recommend >/sys/kernel/mm/transparent_hugepage/anon_orders The kernel will ignore any orders that it does not support so read the file back to determine which orders are enabled:: diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index af7639c3b0a3..0e110ce57cc3 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -393,6 +393,19 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma, } #endif +#ifndef arch_wants_pte_order +/* + * Returns preferred folio order for pte-mapped memory. Must be in range [0, + * PMD_ORDER) and must not be order-1 since THP requires large folios to be at + * least order-2. Negative value implies that the HW has no preference and mm + * will choose it's own default order. + */ +static inline int arch_wants_pte_order(void) +{ + return -1; +} +#endif + #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long address, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bcecce769017..e2e2d3906a21 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -464,10 +464,18 @@ static ssize_t anon_orders_store(struct kobject *kobj, int err; int ret = count; unsigned int orders; + int arch; - err = kstrtouint(buf, 0, &orders); - if (err) - ret = -EINVAL; + if (sysfs_streq(buf, "recommend")) { + arch = max(arch_wants_pte_order(), PAGE_ALLOC_COSTLY_ORDER); + orders = BIT(arch); + orders |= BIT(PAGE_ALLOC_COSTLY_ORDER); + orders |= BIT(PMD_ORDER); + } else { + err = kstrtouint(buf, 0, &orders); + if (err) + ret = -EINVAL; + } if (ret > 0) { orders &= THP_ORDERS_ALL_ANON; From patchwork Fri Sep 29 11:44:18 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 13404142 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 81654E80ABE for ; Fri, 29 Sep 2023 11:46:09 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:Cc:To:From:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=BdmfQjneXDlm+ooKIUjsh2+K252DHK6CaNf8OYE6Ars=; b=N6VIIcYZqAmGmd EZ8c8A1QR/PHtYbgb9nf8aXaktzD8b7AyqIz0iJ/1mw4GYTarRJAT62f+RZWpSaS5CGh1npETdcoF GvQ6I7we6t4RJ6TFPDwQzx6lper6WtN9dRoCNmhm2ad06I18O5Zif9FYOc2Jr4Osv0z+JlZRkm79W 3b+JOSBVA96d2ocmTiXp/6bXtx1iE2XEW8U+0FAUkAzJ5hd+vfmMJqZLEtue7Vp3L/abUV6q/ksX5 58gkovMTDlptLzZw7pKvlbtrYYxCsys87ytzFEmSl5Xn9z3ReclX9WJ9pFZ7b7CdEKeePwRAIJsb+ IUMSbvWXY3T/wKsThEfQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 1qmBwE-007pe0-0N; Fri, 29 Sep 2023 11:45:42 +0000 Received: from foss.arm.com ([217.140.110.172]) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 
1qmBvR-007p9u-10 for linux-arm-kernel@lists.infradead.org; Fri, 29 Sep 2023 11:44:59 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 0EAF31FB; Fri, 29 Sep 2023 04:45:31 -0700 (PDT) Received: from e125769.cambridge.arm.com (e125769.cambridge.arm.com [10.1.196.26]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 2C6323F59C; Fri, 29 Sep 2023 04:44:50 -0700 (PDT) From: Ryan Roberts To: Andrew Morton , Matthew Wilcox , Yin Fengwei , David Hildenbrand , Yu Zhao , Catalin Marinas , Anshuman Khandual , Yang Shi , "Huang, Ying" , Zi Yan , Luis Chamberlain , Itaru Kitayama , "Kirill A. Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [PATCH v6 7/9] arm64/mm: Override arch_wants_pte_order() Date: Fri, 29 Sep 2023 12:44:18 +0100 Message-Id: <20230929114421.3761121-8-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com> References: <20230929114421.3761121-1-ryan.roberts@arm.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230929_044453_409592_0FCDABA4 X-CRM114-Status: UNSURE ( 9.84 ) X-CRM114-Notice: Please train this message. X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org Define an arch-specific override of arch_wants_pte_order() so that when anon_orders=recommend is set, large folios will be allocated for anonymous memory with an order that is compatible with arm64's HPA uarch feature. Reviewed-by: Yu Zhao Signed-off-by: Ryan Roberts Acked-by: Catalin Marinas --- arch/arm64/include/asm/pgtable.h | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 7f7d9b1df4e5..e3d2449dec5c 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -1110,6 +1110,16 @@ extern pte_t ptep_modify_prot_start(struct vm_area_struct *vma, extern void ptep_modify_prot_commit(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep, pte_t old_pte, pte_t new_pte); + +#define arch_wants_pte_order arch_wants_pte_order +static inline int arch_wants_pte_order(void) +{ + /* + * Many arm64 CPUs support hardware page aggregation (HPA), which can + * coalesce 4 contiguous pages into a single TLB entry. 
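+ * Returning 2 therefore requests order-2 (4-page) folios, e.g. 16K with a 4K base page size, when the "recommend" setting is used.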
+ */ + return 2; +} #endif /* !__ASSEMBLY__ */ #endif /* __ASM_PGTABLE_H */ From patchwork Fri Sep 29 11:44:19 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 13404141 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 90194E7F154 for ; Fri, 29 Sep 2023 11:46:08 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:Cc:To:From:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=pn1ai8M6/bmYQaaSS8sElxX0dTlvINpr9nzXGosdiEY=; b=zW/xQdVgigmm13 g3LZJAF7PoSRS4cj3QLQ5vVdyCPzOrumHtdXSjZZ3RZRJWlc/brb2swQCbGRWLs9sSfttbMVoSEWJ Rsk//LYKK1GOqOtgtNBT9PgCzkxb92YRCoEJUbbO/vesmHJXtR4hnwkn3cdoyWX7mstSXhdgwUc3K xuTwQM8cLIXAnGqg/m3WEjuJq1+byUAin7ZfoMjri60kC54ngUGzsW5u7O7F6mqQZU7EE6JUGyiXt TpeDQ1SW1gU9zBcHUKsUWs1EHUTkSa/p/u9VgTyA55xqj88y1oNCTKWGhrFsuPxVF/5htIidSYqTm akkFusiq4xIEnEWZWWow==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 1qmBwE-007peS-2u; Fri, 29 Sep 2023 11:45:42 +0000 Received: from foss.arm.com ([217.140.110.172]) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 1qmBvU-007pGA-0P for linux-arm-kernel@lists.infradead.org; Fri, 29 Sep 2023 11:45:01 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id BF64FDA7; Fri, 29 Sep 2023 04:45:33 -0700 (PDT) Received: from e125769.cambridge.arm.com (e125769.cambridge.arm.com [10.1.196.26]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id DDAFE3F59C; Fri, 29 Sep 2023 04:44:52 -0700 (PDT) From: Ryan Roberts To: Andrew Morton , Matthew Wilcox , Yin Fengwei , David Hildenbrand , Yu Zhao , Catalin Marinas , Anshuman Khandual , Yang Shi , "Huang, Ying" , Zi Yan , Luis Chamberlain , Itaru Kitayama , "Kirill A. 
Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [PATCH v6 8/9] selftests/mm/cow: Generalize do_run_with_thp() helper Date: Fri, 29 Sep 2023 12:44:19 +0100 Message-Id: <20230929114421.3761121-9-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com> References: <20230929114421.3761121-1-ryan.roberts@arm.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230929_044456_291064_2081D96A X-CRM114-Status: GOOD ( 20.10 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org do_run_with_thp() prepares (PMD-sized) THP memory into different states before running tests. With the introduction of THP orders that are smaller than PMD_ORDER, we would like to reuse this logic to also test those smaller orders. So let's add a size parameter which tells the function what size THP it should operate on. No functional change intended here, but a separate commit will add new tests for smaller order THP, where available. Signed-off-by: Ryan Roberts --- tools/testing/selftests/mm/cow.c | 151 +++++++++++++++++-------------- 1 file changed, 84 insertions(+), 67 deletions(-) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index 7324ce5363c0..d887ce454e34 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -32,7 +32,7 @@ static size_t pagesize; static int pagemap_fd; -static size_t thpsize; +static size_t pmdsize; static int nr_hugetlbsizes; static size_t hugetlbsizes[10]; static int gup_fd; @@ -734,14 +734,14 @@ enum thp_run { THP_RUN_PARTIAL_SHARED, }; -static void do_run_with_thp(test_fn fn, enum thp_run thp_run) +static void do_run_with_thp(test_fn fn, enum thp_run thp_run, size_t size) { char *mem, *mmap_mem, *tmp, *mremap_mem = MAP_FAILED; - size_t size, mmap_size, mremap_size; + size_t mmap_size, mremap_size; int ret; - /* For alignment purposes, we need twice the thp size. */ - mmap_size = 2 * thpsize; + /* For alignment purposes, we need twice the requested size. */ + mmap_size = 2 * size; mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (mmap_mem == MAP_FAILED) { @@ -749,36 +749,40 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) return; } - /* We need a THP-aligned memory area. */ - mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1)); + /* We need to naturally align the memory area. */ + mem = (char *)(((uintptr_t)mmap_mem + size) & ~(size - 1)); - ret = madvise(mem, thpsize, MADV_HUGEPAGE); + ret = madvise(mem, size, MADV_HUGEPAGE); if (ret) { ksft_test_result_fail("MADV_HUGEPAGE failed\n"); goto munmap; } /* - * Try to populate a THP. Touch the first sub-page and test if we get - * another sub-page populated automatically. + * Try to populate a THP. Touch the first sub-page and test if + * we get the last sub-page populated automatically. 
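+ * Checking the last sub-page keeps this correct for any folio size passed to this helper, not just PMD-sized THP.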
*/ mem[0] = 0; - if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) { + if (!pagemap_is_populated(pagemap_fd, mem + size - pagesize)) { ksft_test_result_skip("Did not get a THP populated\n"); goto munmap; } - memset(mem, 0, thpsize); + memset(mem, 0, size); - size = thpsize; switch (thp_run) { case THP_RUN_PMD: case THP_RUN_PMD_SWAPOUT: + if (size != pmdsize) { + ksft_test_result_fail("test bug: can't PMD-map size\n"); + goto munmap; + } break; case THP_RUN_PTE: case THP_RUN_PTE_SWAPOUT: /* * Trigger PTE-mapping the THP by temporarily mapping a single - * subpage R/O. + * subpage R/O. This is a noop if the THP is not pmdsize (and + * therefore already PTE-mapped). */ ret = mprotect(mem + pagesize, pagesize, PROT_READ); if (ret) { @@ -797,7 +801,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) * Discard all but a single subpage of that PTE-mapped THP. What * remains is a single PTE mapping a single subpage. */ - ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTNEED); + ret = madvise(mem + pagesize, size - pagesize, MADV_DONTNEED); if (ret) { ksft_test_result_fail("MADV_DONTNEED failed\n"); goto munmap; @@ -809,7 +813,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) * Remap half of the THP. We need some new memory location * for that. */ - mremap_size = thpsize / 2; + mremap_size = size / 2; mremap_mem = mmap(NULL, mremap_size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (mem == MAP_FAILED) { @@ -830,7 +834,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) * child. This will result in some parts of the THP never * have been shared. */ - ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DONTFORK); + ret = madvise(mem + pagesize, size - pagesize, MADV_DONTFORK); if (ret) { ksft_test_result_fail("MADV_DONTFORK failed\n"); goto munmap; @@ -844,7 +848,7 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) } wait(&ret); /* Allow for sharing all pages again. */ - ret = madvise(mem + pagesize, thpsize - pagesize, MADV_DOFORK); + ret = madvise(mem + pagesize, size - pagesize, MADV_DOFORK); if (ret) { ksft_test_result_fail("MADV_DOFORK failed\n"); goto munmap; @@ -875,52 +879,65 @@ static void do_run_with_thp(test_fn fn, enum thp_run thp_run) munmap(mremap_mem, mremap_size); } -static void run_with_thp(test_fn fn, const char *desc) +static int sz2ord(size_t size) +{ + return __builtin_ctzll(size / pagesize); +} + +static void run_with_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with THP\n", desc); - do_run_with_thp(fn, THP_RUN_PMD); + ksft_print_msg("[RUN] %s ... with order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PMD, size); } -static void run_with_thp_swap(test_fn fn, const char *desc) +static void run_with_thp_swap(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with swapped-out THP\n", desc); - do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT); + ksft_print_msg("[RUN] %s ... with swapped-out order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PMD_SWAPOUT, size); } -static void run_with_pte_mapped_thp(test_fn fn, const char *desc) +static void run_with_pte_mapped_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with PTE-mapped THP\n", desc); - do_run_with_thp(fn, THP_RUN_PTE); + ksft_print_msg("[RUN] %s ... 
with PTE-mapped order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PTE, size); } -static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc) +static void run_with_pte_mapped_thp_swap(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped THP\n", desc); - do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT); + ksft_print_msg("[RUN] %s ... with swapped-out, PTE-mapped order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PTE_SWAPOUT, size); } -static void run_with_single_pte_of_thp(test_fn fn, const char *desc) +static void run_with_single_pte_of_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with single PTE of THP\n", desc); - do_run_with_thp(fn, THP_RUN_SINGLE_PTE); + ksft_print_msg("[RUN] %s ... with single PTE of order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_SINGLE_PTE, size); } -static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc) +static void run_with_single_pte_of_thp_swap(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with single PTE of swapped-out THP\n", desc); - do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT); + ksft_print_msg("[RUN] %s ... with single PTE of swapped-out order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_SINGLE_PTE_SWAPOUT, size); } -static void run_with_partial_mremap_thp(test_fn fn, const char *desc) +static void run_with_partial_mremap_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with partially mremap()'ed THP\n", desc); - do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP); + ksft_print_msg("[RUN] %s ... with partially mremap()'ed order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PARTIAL_MREMAP, size); } -static void run_with_partial_shared_thp(test_fn fn, const char *desc) +static void run_with_partial_shared_thp(test_fn fn, const char *desc, size_t size) { - ksft_print_msg("[RUN] %s ... with partially shared THP\n", desc); - do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED); + ksft_print_msg("[RUN] %s ... 
with partially shared order-%d THP\n", + desc, sz2ord(size)); + do_run_with_thp(fn, THP_RUN_PARTIAL_SHARED, size); } static void run_with_hugetlb(test_fn fn, const char *desc, size_t hugetlbsize) @@ -1091,15 +1108,15 @@ static void run_anon_test_case(struct test_case const *test_case) run_with_base_page(test_case->fn, test_case->desc); run_with_base_page_swap(test_case->fn, test_case->desc); - if (thpsize) { - run_with_thp(test_case->fn, test_case->desc); - run_with_thp_swap(test_case->fn, test_case->desc); - run_with_pte_mapped_thp(test_case->fn, test_case->desc); - run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc); - run_with_single_pte_of_thp(test_case->fn, test_case->desc); - run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc); - run_with_partial_mremap_thp(test_case->fn, test_case->desc); - run_with_partial_shared_thp(test_case->fn, test_case->desc); + if (pmdsize) { + run_with_thp(test_case->fn, test_case->desc, pmdsize); + run_with_thp_swap(test_case->fn, test_case->desc, pmdsize); + run_with_pte_mapped_thp(test_case->fn, test_case->desc, pmdsize); + run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, pmdsize); + run_with_single_pte_of_thp(test_case->fn, test_case->desc, pmdsize); + run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, pmdsize); + run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize); + run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize); } for (i = 0; i < nr_hugetlbsizes; i++) run_with_hugetlb(test_case->fn, test_case->desc, @@ -1120,7 +1137,7 @@ static int tests_per_anon_test_case(void) { int tests = 2 + nr_hugetlbsizes; - if (thpsize) + if (pmdsize) tests += 8; return tests; } @@ -1329,7 +1346,7 @@ static void run_anon_thp_test_cases(void) { int i; - if (!thpsize) + if (!pmdsize) return; ksft_print_msg("[INFO] Anonymous THP tests\n"); @@ -1338,13 +1355,13 @@ static void run_anon_thp_test_cases(void) struct test_case const *test_case = &anon_thp_test_cases[i]; ksft_print_msg("[RUN] %s\n", test_case->desc); - do_run_with_thp(test_case->fn, THP_RUN_PMD); + do_run_with_thp(test_case->fn, THP_RUN_PMD, pmdsize); } } static int tests_per_anon_thp_test_case(void) { - return thpsize ? 1 : 0; + return pmdsize ? 1 : 0; } typedef void (*non_anon_test_fn)(char *mem, const char *smem, size_t size); @@ -1419,7 +1436,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) } /* For alignment purposes, we need twice the thp size. */ - mmap_size = 2 * thpsize; + mmap_size = 2 * pmdsize; mmap_mem = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (mmap_mem == MAP_FAILED) { @@ -1434,11 +1451,11 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) } /* We need a THP-aligned memory area. 
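* The huge zeropage is always PMD-sized, so this path continues to use pmdsize for alignment.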
*/ - mem = (char *)(((uintptr_t)mmap_mem + thpsize) & ~(thpsize - 1)); - smem = (char *)(((uintptr_t)mmap_smem + thpsize) & ~(thpsize - 1)); + mem = (char *)(((uintptr_t)mmap_mem + pmdsize) & ~(pmdsize - 1)); + smem = (char *)(((uintptr_t)mmap_smem + pmdsize) & ~(pmdsize - 1)); - ret = madvise(mem, thpsize, MADV_HUGEPAGE); - ret |= madvise(smem, thpsize, MADV_HUGEPAGE); + ret = madvise(mem, pmdsize, MADV_HUGEPAGE); + ret |= madvise(smem, pmdsize, MADV_HUGEPAGE); if (ret) { ksft_test_result_fail("MADV_HUGEPAGE failed\n"); goto munmap; @@ -1457,7 +1474,7 @@ static void run_with_huge_zeropage(non_anon_test_fn fn, const char *desc) goto munmap; } - fn(mem, smem, thpsize); + fn(mem, smem, pmdsize); munmap: munmap(mmap_mem, mmap_size); if (mmap_smem != MAP_FAILED) @@ -1650,7 +1667,7 @@ static void run_non_anon_test_case(struct non_anon_test_case const *test_case) run_with_zeropage(test_case->fn, test_case->desc); run_with_memfd(test_case->fn, test_case->desc); run_with_tmpfile(test_case->fn, test_case->desc); - if (thpsize) + if (pmdsize) run_with_huge_zeropage(test_case->fn, test_case->desc); for (i = 0; i < nr_hugetlbsizes; i++) run_with_memfd_hugetlb(test_case->fn, test_case->desc, @@ -1671,7 +1688,7 @@ static int tests_per_non_anon_test_case(void) { int tests = 3 + nr_hugetlbsizes; - if (thpsize) + if (pmdsize) tests += 1; return tests; } @@ -1681,10 +1698,10 @@ int main(int argc, char **argv) int err; pagesize = getpagesize(); - thpsize = read_pmd_pagesize(); - if (thpsize) - ksft_print_msg("[INFO] detected THP size: %zu KiB\n", - thpsize / 1024); + pmdsize = read_pmd_pagesize(); + if (pmdsize) + ksft_print_msg("[INFO] detected PMD-mapped THP size: %zu KiB\n", + pmdsize / 1024); nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes, ARRAY_SIZE(hugetlbsizes)); detect_huge_zeropage(); From patchwork Fri Sep 29 11:44:20 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Ryan Roberts X-Patchwork-Id: 13404143 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id AAA06E810D5 for ; Fri, 29 Sep 2023 11:46:12 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:References:In-Reply-To: Message-Id:Date:Subject:Cc:To:From:Reply-To:Content-ID:Content-Description: Resent-Date:Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID: List-Owner; bh=A/HYklAq3LtZOmAWNLLtsxXA3KZWYZPRccWAzz/DCCA=; b=lMiZZLSzwwN9Fk d+GQNstsG0CkH4SzE4CFKz3GwU9o2qinKIKRfDHzmws0XlOM2lpOhTrmj4hrtULeOmeTvb6NCbW9d YIlgvqy3PwiTb4sEtH1RxsB/Eoej5jv5bNILqCDuZAG/uMhLXiHjXIRFf1AiMhr69atPgImVdMYV4 fmAr0TCrIbdPUqJXHZTORSAAkarqBu/RiK9FU+gwS5Y+TiVj6jpVPUIPkkF9fVuFvIo/v/WlsAF08 c3ols7NWdAcsHbjgUsXwIuMDshSTZ5Qy0NCXfDukQi7JX/CQBSzE8jPKkRp6xIgx/f0sKD01aHl06 vptI6oJ3SzQxx9Aiqkjg==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 1qmBwG-007pfF-0M; Fri, 29 Sep 2023 11:45:44 +0000 Received: from foss.arm.com ([217.140.110.172]) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 
1qmBvW-007pI7-2v for linux-arm-kernel@lists.infradead.org; Fri, 29 Sep 2023 11:45:02 +0000 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 7C4B61007; Fri, 29 Sep 2023 04:45:36 -0700 (PDT) Received: from e125769.cambridge.arm.com (e125769.cambridge.arm.com [10.1.196.26]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 9A1783F59C; Fri, 29 Sep 2023 04:44:55 -0700 (PDT) From: Ryan Roberts To: Andrew Morton , Matthew Wilcox , Yin Fengwei , David Hildenbrand , Yu Zhao , Catalin Marinas , Anshuman Khandual , Yang Shi , "Huang, Ying" , Zi Yan , Luis Chamberlain , Itaru Kitayama , "Kirill A. Shutemov" , John Hubbard , David Rientjes , Vlastimil Babka , Hugh Dickins Cc: Ryan Roberts , linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org Subject: [PATCH v6 9/9] selftests/mm/cow: Add tests for small-order anon THP Date: Fri, 29 Sep 2023 12:44:20 +0100 Message-Id: <20230929114421.3761121-10-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20230929114421.3761121-1-ryan.roberts@arm.com> References: <20230929114421.3761121-1-ryan.roberts@arm.com> MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20230929_044459_069264_D64333DF X-CRM114-Status: GOOD ( 17.71 ) X-BeenThere: linux-arm-kernel@lists.infradead.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "linux-arm-kernel" Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org Add tests similar to the existing THP tests, but which operate on memory backed by smaller-order, PTE-mapped THP. This reuses all the existing infrastructure. If the test suite detects that small-order THP is not supported by the kernel, the new tests are skipped. 
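To illustrate the detection flow described above, here is a minimal userspace sketch (not part of this patch; the sysfs path is the anon_orders interface added earlier in this series, and everything else is illustrative only) that checks whether the kernel exposes configurable anon THP orders and prints the currently enabled bitfield. The selftest's save_thp_anon_orders() below relies on the same file, additionally writing back a set of orders that includes PMD_ORDER - 1:

#include <stdio.h>

#define ANON_ORDERS_FILE "/sys/kernel/mm/transparent_hugepage/anon_orders"

int main(void)
{
	unsigned int orders;
	FILE *f = fopen(ANON_ORDERS_FILE, "r");

	if (!f) {
		/* File absent: kernel without this series, only PMD-sized THP. */
		printf("anon_orders not present; small-order anon THP unsupported\n");
		return 1;
	}
	if (fscanf(f, "%x", &orders) != 1)
		orders = 0;
	fclose(f);

	/* Each set bit N means order-N anonymous folios may be allocated. */
	printf("enabled anon THP orders bitfield: 0x%x\n", orders);
	return 0;
}

Built with any C compiler and run on the target kernel, a bitfield with bits set below PMD_ORDER indicates that the new small-order tests can be exercised.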
Signed-off-by: Ryan Roberts --- tools/testing/selftests/mm/cow.c | 93 ++++++++++++++++++++++++++++++++ 1 file changed, 93 insertions(+) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index d887ce454e34..6c5e37d8bb69 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -33,10 +33,13 @@ static size_t pagesize; static int pagemap_fd; static size_t pmdsize; +static size_t ptesize; static int nr_hugetlbsizes; static size_t hugetlbsizes[10]; static int gup_fd; static bool has_huge_zeropage; +static unsigned int orig_anon_orders; +static bool orig_anon_orders_valid; static void detect_huge_zeropage(void) { @@ -1118,6 +1121,14 @@ static void run_anon_test_case(struct test_case const *test_case) run_with_partial_mremap_thp(test_case->fn, test_case->desc, pmdsize); run_with_partial_shared_thp(test_case->fn, test_case->desc, pmdsize); } + if (ptesize) { + run_with_pte_mapped_thp(test_case->fn, test_case->desc, ptesize); + run_with_pte_mapped_thp_swap(test_case->fn, test_case->desc, ptesize); + run_with_single_pte_of_thp(test_case->fn, test_case->desc, ptesize); + run_with_single_pte_of_thp_swap(test_case->fn, test_case->desc, ptesize); + run_with_partial_mremap_thp(test_case->fn, test_case->desc, ptesize); + run_with_partial_shared_thp(test_case->fn, test_case->desc, ptesize); + } for (i = 0; i < nr_hugetlbsizes; i++) run_with_hugetlb(test_case->fn, test_case->desc, hugetlbsizes[i]); @@ -1139,6 +1150,8 @@ static int tests_per_anon_test_case(void) if (pmdsize) tests += 8; + if (ptesize) + tests += 6; return tests; } @@ -1693,6 +1706,80 @@ static int tests_per_non_anon_test_case(void) return tests; } +#define ANON_ORDERS_FILE "/sys/kernel/mm/transparent_hugepage/anon_orders" + +static int read_anon_orders(unsigned int *orders) +{ + ssize_t buflen = 80; + char buf[buflen]; + int fd; + + fd = open(ANON_ORDERS_FILE, O_RDONLY); + if (fd == -1) + return -1; + + buflen = read(fd, buf, buflen); + close(fd); + + if (buflen < 1) + return -1; + + *orders = strtoul(buf, NULL, 16); + + return 0; +} + +static int write_anon_orders(unsigned int orders) +{ + ssize_t buflen = 80; + char buf[buflen]; + int fd; + + fd = open(ANON_ORDERS_FILE, O_WRONLY); + if (fd == -1) + return -1; + + buflen = snprintf(buf, buflen, "0x%08x\n", orders); + buflen = write(fd, buf, buflen); + close(fd); + + if (buflen < 1) + return -1; + + return 0; +} + +static size_t save_thp_anon_orders(void) +{ + /* + * If the kernel supports multiple orders for anon THP (indicated by the + * presence of anon_orders file), configure it for the PMD-order and the + * PMD-order - 1, which we will report back and use as the PTE-order THP + * size. Save the original value so that it can be restored on exit. If + * the kernel does not support multiple orders, report back 0 for the + * PTE-size so those tests are skipped. 
+ */ + + int pteorder = sz2ord(pmdsize) - 1; + unsigned int orders = (1UL << sz2ord(pmdsize)) | (1UL << pteorder); + + if (read_anon_orders(&orig_anon_orders)) + return 0; + + orig_anon_orders_valid = true; + + if (write_anon_orders(orders)) + return 0; + + return pagesize << pteorder; +} + +static void restore_thp_anon_orders(void) +{ + if (orig_anon_orders_valid) + write_anon_orders(orig_anon_orders); +} + int main(int argc, char **argv) { int err; @@ -1702,6 +1789,10 @@ int main(int argc, char **argv) if (pmdsize) ksft_print_msg("[INFO] detected PMD-mapped THP size: %zu KiB\n", pmdsize / 1024); + ptesize = save_thp_anon_orders(); + if (ptesize) + ksft_print_msg("[INFO] configured PTE-mapped THP size: %zu KiB\n", + ptesize / 1024); nr_hugetlbsizes = detect_hugetlb_page_sizes(hugetlbsizes, ARRAY_SIZE(hugetlbsizes)); detect_huge_zeropage(); @@ -1720,6 +1811,8 @@ int main(int argc, char **argv) run_anon_thp_test_cases(); run_non_anon_test_cases(); + restore_thp_anon_orders(); + err = ksft_get_fail_cnt(); if (err) ksft_exit_fail_msg("%d out of %d tests failed\n",