From patchwork Wed Jul 17 07:12:54 2024
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton, Hugh Dickins, Jonathan Corbet, "Matthew Wilcox (Oracle)",
 David Hildenbrand, Barry Song, Lance Yang, Baolin Wang, Gavin Shan,
 Pankaj Raghav, Daniel Gomez
Cc: Ryan Roberts, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH v1 2/4] mm: Introduce "always+exec" for mTHP file_enabled control
Date: Wed, 17 Jul 2024 08:12:54 +0100
Message-ID: <20240717071257.4141363-3-ryan.roberts@arm.com>
In-Reply-To: <20240717071257.4141363-1-ryan.roberts@arm.com>
References: <20240717071257.4141363-1-ryan.roberts@arm.com>
In addition to `always` and `never`, add `always+exec` as an option for:

  /sys/kernel/mm/transparent_hugepage/hugepages-*kB/file_enabled

`always+exec` acts like `always` but additionally marks the hugepage size as
the preferred hugepage size for sections of any file mapped with execute
permission. A maximum of one hugepage size can be marked as `exec` at a time,
so applying it to a new size implicitly removes it from any size it was
previously set for.
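As a usage sketch of the control described above (a config fragment, not part of the patch; it assumes a kernel with this series applied and a system exposing 64kB and 128kB mTHP sizes):

```shell
# Allow 64K file folios and mark 64K as the preferred size for exec mappings:
echo always+exec > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/file_enabled

# Re-applying "always+exec" to another size moves the exec preference there:
echo always+exec > /sys/kernel/mm/transparent_hugepage/hugepages-128kB/file_enabled

# 64K remains in the allowed set but is no longer the exec preference:
cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/file_enabled
# -> [always] always+exec never
```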
Change readahead to use this flagged exec size; when a request is made for an
executable mapping, do a synchronous read of the size in a naturally aligned
manner.

On arm64, if memory is physically contiguous and naturally aligned to the
"contpte" size, we can use contpte mappings, which improves utilization of the
TLB. When paired with the "multi-size THP" changes, this works well to reduce
dTLB pressure. However, iTLB pressure is still high due to executable mappings
having a low likelihood of being in the required folio size and mapping
alignment, even when the filesystem supports readahead into large folios
(e.g. XFS).

The reason for the low likelihood is that the current readahead algorithm
starts with an order-2 folio and increases the folio order by 2 every time the
readahead mark is hit. But most executable memory is faulted in fairly randomly
and so the readahead mark is rarely hit and most executable folios remain
order-2. This is observed empirically and confirmed from discussion with a GNU
linker expert; in general, the linker does nothing to group temporally accessed
text together spatially. Additionally, with the current read-around approach
there are no alignment guarantees between the file and folio. This is
insufficient for arm64's contpte mapping requirement (order-4 for 4K base
pages).

So it seems reasonable to special-case the read(ahead) logic for executable
mappings. The trade-off is performance improvement (due to more efficient
storage of the translations in iTLB) vs potential read amplification (due to
reading too much data around the fault which won't be used), and the latter is
independent of base page size. Of course, if no hugepage size is marked as
`always+exec` the old behaviour is maintained.

Performance Benchmarking
------------------------

The below shows kernel compilation and Speedometer javascript benchmarks on an
Ampere Altra arm64 system. When the patch is applied, `always+exec` is set for
64K folios.
First, confirmation that this patch causes more memory to be contained in 64K
folios (this is for all file-backed memory so includes non-executable too):

| File-backed folios      | Speedometer     | Kernel Compile  |
| by size as percentage   |-----------------|-----------------|
| of all mapped file mem  | before | after  | before | after  |
|=========================|========|========|========|========|
|file-thp-aligned-16kB    |    45% |     9% |    46% |     7% |
|file-thp-aligned-32kB    |     2% |     0% |     3% |     1% |
|file-thp-aligned-64kB    |     3% |    63% |     5% |    80% |
|file-thp-aligned-128kB   |    11% |    11% |     0% |     0% |
|file-thp-unaligned-16kB  |     1% |     0% |     3% |     1% |
|file-thp-unaligned-128kB |     1% |     0% |     0% |     0% |
|file-thp-partial         |     0% |     0% |     0% |     0% |
|-------------------------|--------|--------|--------|--------|
|file-cont-aligned-64kB   |    16% |    75% |     5% |    80% |

The above shows that for both use cases, the amount of file memory backed by
16K folios reduces and the amount backed by 64K folios increases significantly.
And the amount of memory that is contpte-mapped significantly increases (last
line). This is reflected in performance improvement:

Kernel Compilation (smaller is faster):

| kernel   | real-time | kern-time | user-time | peak memory |
|----------|-----------|-----------|-----------|-------------|
| before   |      0.0% |      0.0% |      0.0% |        0.0% |
| after    |     -1.6% |     -2.1% |     -1.7% |        0.0% |

Speedometer (bigger is faster):

| kernel   | runs_per_min | peak memory |
|----------|--------------|-------------|
| before   |         0.0% |        0.0% |
| after    |         1.3% |        1.0% |

Both benchmarks show a ~1.5% improvement once the patch is applied.
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  6 +++++
 include/linux/huge_mm.h                    | 11 ++++++++
 mm/filemap.c                               | 11 ++++++++
 mm/huge_memory.c                           | 31 +++++++++++++++++-----
 4 files changed, 52 insertions(+), 7 deletions(-)

--
2.43.0

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 9f3ed504c646..1aaf8e3a0b5a 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -292,12 +292,18 @@ memory from a set of allowed sizes. By default all THP sizes that the page cache
 supports are allowed, but this set can be modified with one of::

   echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled
+  echo always+exec >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled
   echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled

 where <size> is the hugepage size being addressed, the available sizes for which
 vary by system. ``always`` adds the hugepage size to the set of allowed sizes,
 and ``never`` removes the hugepage size from the set of allowed sizes.

+``always+exec`` acts like ``always`` but additionally marks the hugepage size as
+the preferred hugepage size for sections of any file mapped executable. A
+maximum of one hugepage size can be marked as ``exec`` at a time, so applying it
+to a new size implicitly removes it from any size it was previously set for.
+
 In some situations, constraining the allowed sizes can reduce memory
 fragmentation, resulting in fewer allocation fallbacks and improved system
 performance.

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 19ced8192d39..3571ea0c3d8c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -177,12 +177,18 @@ extern unsigned long huge_anon_orders_always;
 extern unsigned long huge_anon_orders_madvise;
 extern unsigned long huge_anon_orders_inherit;
 extern unsigned long huge_file_orders_always;
+extern int huge_file_exec_order;

 static inline unsigned long file_orders_always(void)
 {
 	return READ_ONCE(huge_file_orders_always);
 }

+static inline int file_exec_order(void)
+{
+	return READ_ONCE(huge_file_exec_order);
+}
+
 static inline bool hugepage_global_enabled(void)
 {
 	return transparent_hugepage_flags &
@@ -453,6 +459,11 @@ static inline unsigned long file_orders_always(void)
 	return 0;
 }

+static inline int file_exec_order(void)
+{
+	return -1;
+}
+
 static inline bool folio_test_pmd_mappable(struct folio *folio)
 {
 	return false;
diff --git a/mm/filemap.c b/mm/filemap.c
index 870016fcfdde..c4a3cc6a2e46 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3128,6 +3128,7 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	struct file *fpin = NULL;
 	unsigned long vm_flags = vmf->vma->vm_flags;
 	unsigned int mmap_miss;
+	int exec_order = file_exec_order();

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* Use the readahead code, even if readahead is disabled */
@@ -3147,6 +3148,16 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	}
 #endif

+	/* If explicit order is set for exec mappings, use it. */
+	if ((vm_flags & VM_EXEC) && exec_order >= 0) {
+		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
+		ra->size = 1UL << exec_order;
+		ra->async_size = 0;
+		ractl._index &= ~((unsigned long)ra->size - 1);
+		page_cache_ra_order(&ractl, ra, exec_order);
+		return fpin;
+	}
+
 	/* If we don't want any read-ahead, don't bother */
 	if (vm_flags & VM_RAND_READ)
 		return fpin;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e8fe28fe9cf9..4249c0bc9388 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -81,6 +81,7 @@ unsigned long huge_anon_orders_always __read_mostly;
 unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
 unsigned long huge_file_orders_always __read_mostly;
+int huge_file_exec_order __read_mostly = -1;

 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long vm_flags,
@@ -462,6 +463,7 @@ static const struct attribute_group hugepage_attr_group = {
 static void hugepage_exit_sysfs(struct kobject *hugepage_kobj);
 static void thpsize_release(struct kobject *kobj);
 static DEFINE_SPINLOCK(huge_anon_orders_lock);
+static DEFINE_SPINLOCK(huge_file_orders_lock);
 static LIST_HEAD(thpsize_list);

 static ssize_t anon_enabled_show(struct kobject *kobj,
@@ -531,11 +533,15 @@ static ssize_t file_enabled_show(struct kobject *kobj,
 {
 	int order = to_thpsize(kobj)->order;
 	const char *output;
+	bool exec;

-	if (test_bit(order, &huge_file_orders_always))
-		output = "[always] never";
-	else
-		output = "always [never]";
+	if (test_bit(order, &huge_file_orders_always)) {
+		exec = READ_ONCE(huge_file_exec_order) == order;
+		output = exec ? "always [always+exec] never" :
+				"[always] always+exec never";
+	} else {
+		output = "always always+exec [never]";
+	}

 	return sysfs_emit(buf, "%s\n", output);
 }
@@ -547,13 +553,24 @@ static ssize_t file_enabled_store(struct kobject *kobj,
 	int order = to_thpsize(kobj)->order;
 	ssize_t ret = count;

-	if (sysfs_streq(buf, "always"))
+	spin_lock(&huge_file_orders_lock);
+
+	if (sysfs_streq(buf, "always")) {
 		set_bit(order, &huge_file_orders_always);
-	else if (sysfs_streq(buf, "never"))
+		if (huge_file_exec_order == order)
+			huge_file_exec_order = -1;
+	} else if (sysfs_streq(buf, "always+exec")) {
+		set_bit(order, &huge_file_orders_always);
+		huge_file_exec_order = order;
+	} else if (sysfs_streq(buf, "never")) {
 		clear_bit(order, &huge_file_orders_always);
-	else
+		if (huge_file_exec_order == order)
+			huge_file_exec_order = -1;
+	} else {
 		ret = -EINVAL;
+	}
+
+	spin_unlock(&huge_file_orders_lock);

 	return ret;
 }