From patchwork Mon Jun 26 17:14:20 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ryan Roberts
X-Patchwork-Id: 13293229
From: Ryan Roberts
To: Andrew Morton, "Matthew Wilcox (Oracle)", "Kirill A. Shutemov",
 Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
 Geert Uytterhoeven, Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin"
Cc: Ryan Roberts, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-alpha@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 linux-ia64@vger.kernel.org, linux-m68k@lists.linux-m68k.org,
 linux-s390@vger.kernel.org
Subject: [PATCH v1 00/10] variable-order, large folios for anonymous memory
Date: Mon, 26 Jun 2023 18:14:20 +0100
Message-Id: <20230626171430.3167004-1-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.25.1
Sender: owner-linux-mm@kvack.org

Hi All,

Following on from the previous RFCv2 [1], this series implements variable order,
large folios for anonymous memory. The objective of this is to improve
performance by allocating larger chunks of memory during anonymous page faults:

 - Since SW (the kernel) is dealing with larger chunks of memory than base
   pages, there are efficiency savings to be had; fewer page faults, batched PTE
   and RMAP manipulation, fewer items on lists, etc. In short, we reduce kernel
   overhead. This should benefit all architectures.
 - Since we are now mapping physically contiguous chunks of memory, we can take
   advantage of HW TLB compression techniques. A reduction in TLB pressure
   speeds up kernel and user space. arm64 systems have 2 mechanisms to coalesce
   TLB entries; "the contiguous bit" (architectural) and HPA (uarch).

This patch set deals with the SW side of things only and based on feedback from
the RFC, aims to be the most minimal initial change, upon which future
incremental changes can be added. For this reason, the new behaviour is hidden
behind a new Kconfig switch, CONFIG_LARGE_ANON_FOLIO, which is disabled by
default. Although the code has been refactored to parameterize the desired order
of the allocation, when the feature is disabled (by forcing the order to be
always 0) my performance tests measure no regression. So I'm hoping this will be
a suitable mechanism to allow incremental submissions to the kernel without
affecting the rest of the world.

The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series
[2], which is a hard dependency. I'm not sure of Matthew's exact plans for
getting that series into the kernel, but I'm hoping we can start the review
process on this patch set independently. I have a branch at [3].

I've posted a separate series concerning the HW part (contpte mapping) for arm64
at [4].

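As a rough illustration of the order parameterization and the "fewer page
faults" point above, here is a minimal user-space sketch. It is not code from
the series: every name below is hypothetical, and it assumes that each fault
populates one naturally aligned folio of 2^order base pages and that every
allocation succeeds at the requested order.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the order the anonymous fault path would
 * request. 'feature_enabled' plays the role of the Kconfig switch and
 * 'arch_max_order' the role of the arch-provided maximum; neither name
 * is taken from the patches.
 */
static int sketch_desired_order(int feature_enabled, int arch_max_order)
{
	assert(arch_max_order >= 0);

	/* With the feature off, the order is forced to 0 (one base page). */
	return feature_enabled ? arch_max_order : 0;
}

/*
 * Number of faults needed to populate 'nr_base_pages' contiguous base
 * pages when each fault maps a folio of 2^order pages (assumption: no
 * fallback to smaller orders, no partially covered VMAs).
 */
static unsigned long sketch_faults_to_populate(unsigned long nr_base_pages,
					       int order)
{
	unsigned long pages_per_folio;

	assert(order >= 0);
	pages_per_folio = 1UL << order;

	/* Round up so a trailing partial folio still costs one fault. */
	return (nr_base_pages + pages_per_folio - 1) / pages_per_folio;
}
```

Under these assumptions, sequentially touching 1024 base pages costs 1024
faults at order 0 and 64 faults at order 4. Real behaviour would differ
wherever the allocator cannot supply the requested order.
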
Performance
-----------

Below results show 2 benchmarks; kernel compilation and speedometer 2.0 (a
javascript benchmark running in Chromium). Both cases are running on Ampere
Altra with 1 NUMA node enabled, Ubuntu 22.04 and XFS filesystem. Each benchmark
is repeated 15 times over 5 reboots and averaged.

All improvements are relative to baseline-4k. 'anonfolio-basic' is this series.
'anonfolio' is the full patch set similar to the RFC with the additional changes
to the extra 3 fault paths. The rest of the configs are described at [4].

Kernel Compilation (smaller is better):

| kernel          |   real-time |   kern-time |   user-time |
|:----------------|------------:|------------:|------------:|
| baseline-4k     |        0.0% |        0.0% |        0.0% |
| anonfolio-basic |       -5.3% |      -42.9% |       -0.6% |
| anonfolio       |       -5.4% |      -46.0% |       -0.3% |
| contpte         |       -6.8% |      -45.7% |       -2.1% |
| exefolio        |       -8.4% |      -46.4% |       -3.7% |
| baseline-16k    |       -8.7% |      -49.2% |       -3.7% |
| baseline-64k    |      -10.5% |      -66.0% |       -3.5% |

Speedometer 2.0 (bigger is better):

| kernel          |   runs_per_min |
|:----------------|---------------:|
| baseline-4k     |           0.0% |
| anonfolio-basic |           0.7% |
| anonfolio       |           1.2% |
| contpte         |           3.1% |
| exefolio        |           4.2% |
| baseline-16k    |           5.3% |

Changes since RFCv2
-------------------

  - Simplified series to bare minimum (on David Hildenbrand's advice)
    - Removed changes to 3 fault paths:
      - write fault on zero page: wp_page_copy()
      - write fault on non-exclusive CoW page: wp_page_copy()
      - write fault on exclusive CoW page: do_wp_page()/wp_page_reuse()
    - Only 1 fault path change remains:
      - write fault on unallocated address: do_anonymous_page()
  - Removed support patches that are no longer needed
  - Added Kconfig CONFIG_LARGE_ANON_FOLIO and friends
    - Whole feature defaults to off
    - Arch opts-in to allowing feature and provides max allocation order

Future Work
-----------

Once this series is in, there are some more incremental changes I plan to follow
up with:

  - Add the other 3 fault path changes back in
  - Properly support pte-mapped folios for:
    - numa balancing (do_numa_page())
    - fix assumptions about exclusivity for large folios in madvise()
    - compaction (although I think this is already a problem for large folios in
      the file cache so perhaps someone is working on it?)

[1] https://lore.kernel.org/linux-mm/20230414130303.2345383-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/
[3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v1
[4] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/

Thanks,
Ryan

Ryan Roberts (10):
  mm: Expose clear_huge_page() unconditionally
  mm: pass gfp flags and order to vma_alloc_zeroed_movable_folio()
  mm: Introduce try_vma_alloc_movable_folio()
  mm: Implement folio_add_new_anon_rmap_range()
  mm: Implement folio_remove_rmap_range()
  mm: Allow deferred splitting of arbitrary large anon folios
  mm: Batch-zap large anonymous folio PTE mappings
  mm: Kconfig hooks to determine max anon folio allocation order
  arm64: mm: Declare support for large anonymous folios
  mm: Allocate large folios for anonymous memory

 arch/alpha/include/asm/page.h   |   5 +-
 arch/arm64/Kconfig              |  13 ++
 arch/arm64/include/asm/page.h   |   3 +-
 arch/arm64/mm/fault.c           |   7 +-
 arch/ia64/include/asm/page.h    |   5 +-
 arch/m68k/include/asm/page_no.h |   7 +-
 arch/s390/include/asm/page.h    |   5 +-
 arch/x86/include/asm/page.h     |   5 +-
 include/linux/highmem.h         |  23 ++-
 include/linux/mm.h              |   3 +-
 include/linux/rmap.h            |   4 +
 mm/Kconfig                      |  39 ++++
 mm/memory.c                     | 324 ++++++++++++++++++++++++++++++--
 mm/rmap.c                       | 107 ++++++++++-
 14 files changed, 506 insertions(+), 44 deletions(-)

---
2.25.1