From patchwork Thu Jun 22 14:41:55 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 13289248
From: Ryan Roberts
To: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier,
 Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
 Andrey Ryabinin, Alexander Potapenko, Andrey Konovalov, Dmitry Vyukov,
 Vincenzo Frascino, Andrew Morton, Anshuman Khandual, Matthew Wilcox,
 Yu Zhao, Mark Rutland
Cc: Ryan Roberts, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v1 00/14] Transparent Contiguous PTEs for User Mappings
Date: Thu, 22 Jun 2023 15:41:55 +0100
Message-Id: <20230622144210.2623299-1-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1

Hi All,

This is a series to opportunistically and transparently use contpte mappings
(set the contiguous bit in ptes) for user memory when those mappings meet the
requirements. It is part of a wider effort to improve performance of the 4K
kernel with the aim of approaching the performance of the 16K kernel, but
without breaking compatibility and without the associated increase in memory.
It also benefits the 16K and 64K kernels by enabling 2M THP, since this is the
contpte size for those kernels.

Of course this is only one half of the change. We require the mapped physical
memory to be the correct size and alignment for this to actually be useful
(i.e. 64K for 4K pages, or 2M for 16K/64K pages). Fortunately folios are
solving this problem for us. Filesystems that support it (XFS, AFS, EROFS,
tmpfs) will allocate large folios up to the PMD size today, and more
filesystems are coming. And the other half of my work, to enable the use of
large folios for anonymous memory, aims to make contpte-sized folios prevalent
for anonymous memory too.


Dependencies
------------

While this patch set has a complicated set of hard and soft dependencies, I
wanted to split it out as best I could and kick off proper review
independently. The series applies on top of these other patch sets, with a
tree at:
https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v1

v6.4-rc6
  - base

set_ptes()
  - hard dependency
  - Patch set from Matthew Wilcox to set multiple ptes with a single API call
  - Allows arch backend to more optimally apply contpte mappings
  - https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/

ptep_get() pte encapsulation
  - hard dependency
  - Enabler series from me to ensure none of the core code ever directly
    dereferences a pte_t that lies within a live page table.
  - Enables gathering access/dirty bits from across the whole contpte range
  - in mm-stable and linux-next at time of writing
  - https://lore.kernel.org/linux-mm/d38dc237-6093-d4c5-993e-e8ffdd6cb6fa@arm.com/

Report on physically contiguous memory in smaps
  - soft dependency
  - Enables visibility on how much memory is physically contiguous and how
    much is contpte-mapped - useful for debug
  - https://lore.kernel.org/linux-mm/20230613160950.3554675-1-ryan.roberts@arm.com/

Additionally there are a couple of other dependencies:

anonfolio
  - soft dependency
  - ensures more anonymous memory is allocated in contpte-sized folios, so
    needed to realize the performance improvements (this is the "other half"
    mentioned above).
  - RFC: https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
  - Intending to post v1 shortly.

exefolio
  - soft dependency
  - Tweaks readahead to ensure executable memory is in 64K-sized folios, so
    needed to see a reduction in iTLB pressure.
  - Don't intend to post this until we are further down the track with
    contpte and anonfolio.

Arm ARM Clarification
  - hard dependency
  - Current wording disallows the fork() optimization in the final patch.
  - Arm (ATG) have proposed tightening the wording to permit it.
  - In conversation with partners to check this wouldn't cause problems for
    any existing HW deployments

All of the _hard_ dependencies need to be resolved before this can be
considered for merging.


Performance
-----------

The results below show 2 benchmarks: kernel compilation and Speedometer 2.0 (a
JavaScript benchmark running in Chromium). Both are run on Ampere Altra with 1
NUMA node enabled, Ubuntu 22.04 and an XFS filesystem. Each benchmark is
repeated 15 times over 5 reboots and averaged.

All improvements are relative to baseline-4k. anonfolio and exefolio are as
described above. contpte is this series. (Note that exefolio only gives an
improvement because contpte is already in place).

Kernel Compilation (smaller is better):

| kernel       |   real-time |   kern-time |   user-time |
|:-------------|------------:|------------:|------------:|
| baseline-4k  |        0.0% |        0.0% |        0.0% |
| anonfolio    |       -5.4% |      -46.0% |       -0.3% |
| contpte      |       -6.8% |      -45.7% |       -2.1% |
| exefolio     |       -8.4% |      -46.4% |       -3.7% |
| baseline-16k |       -8.7% |      -49.2% |       -3.7% |
| baseline-64k |      -10.5% |      -66.0% |       -3.5% |

Speedometer 2.0 (bigger is better):

| kernel       |   runs_per_min |
|:-------------|---------------:|
| baseline-4k  |           0.0% |
| anonfolio    |           1.2% |
| contpte      |           3.1% |
| exefolio     |           4.2% |
| baseline-16k |           5.3% |

I've also run Speedometer 2.0 on Pixel 6 with an Ubuntu SW stack and see
similar gains.

I've also verified that running the contpte changes without anonfolio and
exefolio does not cause any regression vs baseline-4k.


Opens
-----

The only potential issue that I see right now is that due to there only being
1 access/dirty bit per contpte range, if a single page in the range is
accessed/dirtied then all the adjacent pages are reported as accessed/dirtied
too. Access/dirty is managed by the kernel per _folio_, so this information
gets collapsed down anyway, and nothing changes there. However, the per _page_
access/dirty information is reported through pagemap to user space. I'm not
sure if this would/should be considered a break? Thoughts?
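
To make the access/dirty concern above concrete, here is a minimal userspace
model of it. This is only a sketch and not code from the series: model_pte_t,
the MODEL_* flags and both functions are hypothetical names, and the model
assumes that a touch is recorded in just one entry of the block and that a
contpte-aware getter ORs the bits together across the block. The only number
taken from this letter is the block size for 4K pages (64K / 4K = 16 entries).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model for illustration; not the kernel's types. */
typedef uint32_t model_pte_t;

#define MODEL_CONT_PTES  16u        /* 64K contpte block / 4K pages */
#define MODEL_PTE_YOUNG  (1u << 0)  /* stand-in for the access flag */
#define MODEL_PTE_DIRTY  (1u << 1)  /* stand-in for the dirty state */

/*
 * Read entry idx the way a contpte-aware getter might: take the entry
 * itself, then OR in the young/dirty bits found anywhere in its block.
 * Assumes the table covers the whole block that contains idx.
 */
static model_pte_t model_contpte_get(const model_pte_t *table, size_t idx)
{
	size_t start = idx & ~(size_t)(MODEL_CONT_PTES - 1); /* align down */
	model_pte_t pte = table[idx];
	size_t i;

	for (i = 0; i < MODEL_CONT_PTES; i++)
		pte |= table[start + i] & (MODEL_PTE_YOUNG | MODEL_PTE_DIRTY);

	return pte;
}

/*
 * Count how many of the first nr pages would be reported dirty.
 * nr must be a multiple of MODEL_CONT_PTES.
 */
static size_t model_count_reported_dirty(const model_pte_t *table, size_t nr)
{
	size_t i, count = 0;

	for (i = 0; i < nr; i++)
		if (model_contpte_get(table, i) & MODEL_PTE_DIRTY)
			count++;

	return count;
}
```

In this model, with a zeroed 16-entry table, setting MODEL_PTE_DIRTY on entry 5
alone makes model_count_reported_dirty(table, 16) return 16: every page in the
block reads back as dirty, which is the per-page view that pagemap would
expose.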

Thanks,
Ryan


Ryan Roberts (14):
  arm64/mm: set_pte(): New layer to manage contig bit
  arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit
  arm64/mm: pte_clear(): New layer to manage contig bit
  arm64/mm: ptep_get_and_clear(): New layer to manage contig bit
  arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit
  arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit
  arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit
  arm64/mm: ptep_set_access_flags(): New layer to manage contig bit
  arm64/mm: ptep_get(): New layer to manage contig bit
  arm64/mm: Split __flush_tlb_range() to elide trailing DSB
  arm64/mm: Wire up PTE_CONT for user mappings
  arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown
  mm: Batch-copy PTE ranges during fork()
  arm64/mm: Implement ptep_set_wrprotects() to optimize fork()

 arch/arm64/include/asm/pgtable.h  | 305 +++++++++++++++++---
 arch/arm64/include/asm/tlbflush.h |  11 +-
 arch/arm64/kernel/efi.c           |   4 +-
 arch/arm64/kernel/mte.c           |   2 +-
 arch/arm64/kvm/guest.c            |   2 +-
 arch/arm64/mm/Makefile            |   3 +-
 arch/arm64/mm/contpte.c           | 443 ++++++++++++++++++++++++++++++
 arch/arm64/mm/fault.c             |  12 +-
 arch/arm64/mm/fixmap.c            |   4 +-
 arch/arm64/mm/hugetlbpage.c       |  40 +--
 arch/arm64/mm/kasan_init.c        |   6 +-
 arch/arm64/mm/mmu.c               |  16 +-
 arch/arm64/mm/pageattr.c          |   6 +-
 arch/arm64/mm/trans_pgd.c         |   6 +-
 include/linux/pgtable.h           |  13 +
 mm/memory.c                       | 149 +++++++---
 16 files changed, 896 insertions(+), 126 deletions(-)
 create mode 100644 arch/arm64/mm/contpte.c

---
2.25.1