From patchwork Wed Jul 17 07:12:53 2024
X-Patchwork-Submitter: Ryan Roberts <ryan.roberts@arm.com>
X-Patchwork-Id: 13735171
From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton, Hugh Dickins, Jonathan Corbet, "Matthew Wilcox (Oracle)",
 David Hildenbrand, Barry Song, Lance Yang, Baolin Wang, Gavin Shan,
 Pankaj Raghav, Daniel Gomez
Cc: Ryan Roberts, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH v1 1/4] mm: mTHP user controls to configure pagecache large folio sizes
Date: Wed, 17 Jul 2024 08:12:53 +0100
Message-ID: <20240717071257.4141363-2-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20240717071257.4141363-1-ryan.roberts@arm.com>
References: <20240717071257.4141363-1-ryan.roberts@arm.com>

Add mTHP controls to sysfs to allow user space to configure the folio
sizes that can be considered for allocation of file-backed memory:

  /sys/kernel/mm/transparent_hugepage/hugepages-*kB/file_enabled

For now, the control can be set to either `always` or `never` to enable
or disable that size. More options may be added in future.

By default, at boot, all folio sizes are enabled and the algorithm used
to select a folio size remains conceptually unchanged: increase by 2
enabled orders each time a readahead marker is hit, then reduce to the
closest enabled order that fits within the bounds of the readahead size,
index alignment and EOF. So when all folio sizes are enabled, behaviour
should be unchanged. When folio sizes are disabled, the algorithm will
never select them.

Systems such as Android are always under extreme memory pressure, and as
a result fragmentation often causes attempts to allocate large folios to
fail and fall back to smaller folios. By fixing the pagecache to one
large folio size (e.g. 64K) plus fallback to small folios, a large
source of this fragmentation can be removed and 64K mTHP allocations
succeed more often, allowing the system to benefit from improved
performance on arm64 and other arches that support "contpte".
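
As a rough illustration of the "reduce to the closest enabled order"
half of the rule above: the enabled sizes form a bitmap of orders that
is masked down by the preferred order and by the index alignment, and
the highest remaining bit wins, with order 0 always kept as the
fallback. The userspace sketch below is only an illustration, not the
kernel code; pick_order() and enabled are made-up names, and the
kernel's bit helpers are replaced with compiler builtins (assuming a
64-bit unsigned long):

  #include <stdio.h>

  /* Hypothetical stand-in for the kernel's highest_order() helper. */
  static int highest_order(unsigned long orders)
  {
          return orders ? 63 - __builtin_clzl(orders) : -1;
  }

  static int pick_order(unsigned long enabled, int preferred, unsigned long index)
  {
          unsigned long orders = enabled | 1UL;   /* order 0 is always allowed */

          /* Never exceed the preferred order. */
          orders &= (1UL << (preferred + 1)) - 1;
          /* If the index isn't aligned, limit to what the alignment allows. */
          if (index & ((1UL << preferred) - 1))
                  orders &= (1UL << (__builtin_ctzl(index) + 1)) - 1;

          return highest_order(orders);           /* largest size still permitted */
  }

  int main(void)
  {
          unsigned long only_64k = 1UL << 4;      /* 64K is order 4 with 4K pages */

          printf("%d\n", pick_order(only_64k, 6, 0));   /* 4: clamped to the enabled 64K */
          printf("%d\n", pick_order(only_64k, 6, 16));  /* 4: index 16 is order-4 aligned */
          printf("%d\n", pick_order(only_64k, 6, 8));   /* 0: misaligned, falls back */
          return 0;
  }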
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 21 +++++++++
 include/linux/huge_mm.h                    | 50 +++++++++++++---------
 mm/filemap.c                               | 15 ++++---
 mm/huge_memory.c                           | 43 +++++++++++++++++++
 mm/readahead.c                             | 43 +++++++++++++++----
 5 files changed, 138 insertions(+), 34 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index d4857e457add..9f3ed504c646 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -284,6 +284,27 @@ that THP is shared. Exceeding the number would block the collapse::
 
 A higher value may increase memory footprint for some workloads.
 
+File-Backed Hugepages
+---------------------
+
+The kernel will automatically select an appropriate THP size for file-backed
+memory from a set of allowed sizes. By default all THP sizes that the page cache
+supports are allowed, but this set can be modified with one of::
+
+	echo always >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled
+	echo never >/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/file_enabled
+
+where <size> is the hugepage size being addressed, the available sizes for which
+vary by system. ``always`` adds the hugepage size to the set of allowed sizes,
+and ``never`` removes the hugepage size from the set of allowed sizes.
+
+In some situations, constraining the allowed sizes can reduce memory
+fragmentation, resulting in fewer allocation fallbacks and improved system
+performance.
+
+Note that any changes to the allowed set of sizes only apply to future
+file-backed THP allocations.
+
 Boot parameter
 ==============
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4f9109fcdded..19ced8192d39 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -114,6 +114,24 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
 #define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
 #define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
 
+static inline int lowest_order(unsigned long orders)
+{
+	if (orders)
+		return __ffs(orders);
+	return -1;
+}
+
+static inline int highest_order(unsigned long orders)
+{
+	return fls_long(orders) - 1;
+}
+
+static inline int next_order(unsigned long *orders, int prev)
+{
+	*orders &= ~BIT(prev);
+	return highest_order(*orders);
+}
+
 enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
@@ -158,6 +176,12 @@ extern unsigned long transparent_hugepage_flags;
 extern unsigned long huge_anon_orders_always;
 extern unsigned long huge_anon_orders_madvise;
 extern unsigned long huge_anon_orders_inherit;
+extern unsigned long huge_file_orders_always;
+
+static inline unsigned long file_orders_always(void)
+{
+	return READ_ONCE(huge_file_orders_always);
+}
 
 static inline bool hugepage_global_enabled(void)
 {
@@ -172,17 +196,6 @@ static inline bool hugepage_global_always(void)
 			(1<<TRANSPARENT_HUGEPAGE_FLAG);
diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
-		if (order > MAX_PAGECACHE_ORDER)
-			order = MAX_PAGECACHE_ORDER;
+
+		orders = file_orders_always() | BIT(0);
+		orders &= BIT(order + 1) - 1;
 		/* If we're not aligned, allocate a smaller folio */
 		if (index & ((1UL << order) - 1))
-			order = __ffs(index);
+			orders &= BIT(__ffs(index) + 1) - 1;
+		order = highest_order(orders);
 
-		do {
+		while (orders) {
 			gfp_t alloc_gfp = gfp;
 
 			err = -ENOMEM;
@@ -1962,7 +1965,9 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 				break;
 			folio_put(folio);
 			folio = NULL;
-		} while (order-- > 0);
+
+			order = next_order(&orders, order);
+		};
 
 		if (err == -EEXIST)
 			goto repeat;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 26d558e3e80f..e8fe28fe9cf9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -80,6 +80,7 @@ unsigned long huge_zero_pfn __read_mostly = ~0UL;
 unsigned long huge_anon_orders_always __read_mostly;
 unsigned long huge_anon_orders_madvise __read_mostly;
 unsigned long huge_anon_orders_inherit __read_mostly;
+unsigned long huge_file_orders_always __read_mostly;
 
 unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
 					 unsigned long vm_flags,
@@ -525,6 +526,37 @@ static ssize_t anon_enabled_store(struct kobject *kobj,
 	return ret;
 }
 
+static ssize_t file_enabled_show(struct kobject *kobj,
+				 struct kobj_attribute *attr, char *buf)
+{
+	int order = to_thpsize(kobj)->order;
+	const char *output;
+
+	if (test_bit(order, &huge_file_orders_always))
+		output = "[always] never";
+	else
+		output = "always [never]";
+
+	return sysfs_emit(buf, "%s\n", output);
+}
+
+static ssize_t file_enabled_store(struct kobject *kobj,
+				  struct kobj_attribute *attr,
+				  const char *buf, size_t count)
+{
+	int order = to_thpsize(kobj)->order;
+	ssize_t ret = count;
+
+	if (sysfs_streq(buf, "always"))
+		set_bit(order, &huge_file_orders_always);
+	else if (sysfs_streq(buf, "never"))
+		clear_bit(order, &huge_file_orders_always);
+	else
+		ret = -EINVAL;
+
+	return ret;
+}
+
 static struct kobj_attribute anon_enabled_attr =
 	__ATTR(enabled, 0644, anon_enabled_show, anon_enabled_store);
 
@@ -537,7 +569,11 @@ static const struct attribute_group anon_ctrl_attr_grp = {
 	.attrs = anon_ctrl_attrs,
 };
 
+static struct kobj_attribute file_enabled_attr =
+	__ATTR(file_enabled, 0644, file_enabled_show, file_enabled_store);
+
 static struct attribute *file_ctrl_attrs[] = {
+	&file_enabled_attr.attr,
 #ifdef CONFIG_SHMEM
 	&thpsize_shmem_enabled_attr.attr,
 #endif
@@ -712,6 +748,13 @@ static int __init hugepage_init_sysfs(struct kobject **hugepage_kobj)
 	 */
 	huge_anon_orders_inherit = BIT(PMD_ORDER);
 
+	/*
+	 * For pagecache, default to enabling all orders. powerpc's PMD_ORDER
+	 * (and therefore THP_ORDERS_ALL_FILE_DEFAULT) isn't a compile-time
+	 * constant so we have to do this here.
+	 */
+	huge_file_orders_always = THP_ORDERS_ALL_FILE_DEFAULT;
+
 	*hugepage_kobj = kobject_create_and_add("transparent_hugepage", mm_kobj);
 	if (unlikely(!*hugepage_kobj)) {
 		pr_err("failed to create transparent hugepage kobject\n");
diff --git a/mm/readahead.c b/mm/readahead.c
index 517c0be7ce66..e05f85974396 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -432,6 +432,34 @@ static inline int ra_alloc_folio(struct readahead_control *ractl, pgoff_t index,
 	return 0;
 }
 
+static int select_new_order(int old_order, int max_order, unsigned long orders)
+{
+	unsigned long hi_orders, lo_orders;
+
+	/*
+	 * Select the next order to use from the set in `orders`, while ensuring
+	 * we don't go above max_order. Prefer the next + 1 highest allowed
+	 * order after old_order, unless there isn't one, in which case return
+	 * the closest allowed order, which is either the next highest allowed
+	 * order or less than or equal to old_order. The "next + 1" skip
+	 * behaviour is intended to allow ramping up to large folios quickly.
+	 */
+
+	orders &= BIT(max_order + 1) - 1;
+	VM_WARN_ON(!orders);
+	hi_orders = orders & ~(BIT(old_order + 1) - 1);
+
+	if (hi_orders) {
+		old_order = lowest_order(hi_orders);
+		hi_orders &= ~BIT(old_order);
+		if (hi_orders)
+			return lowest_order(hi_orders);
+	}
+
+	lo_orders = orders & (BIT(old_order + 1) - 1);
+	return highest_order(lo_orders);
+}
+
 void page_cache_ra_order(struct readahead_control *ractl,
 		struct file_ra_state *ra, unsigned int new_order)
 {
@@ -443,17 +471,15 @@ void page_cache_ra_order(struct readahead_control *ractl,
 	unsigned int nofs;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
+	unsigned long orders;
 
-	if (!mapping_large_folio_support(mapping) || ra->size < 4)
+	if (!mapping_large_folio_support(mapping))
 		goto fallback;
 
 	limit = min(limit, index + ra->size - 1);
 
-	if (new_order < MAX_PAGECACHE_ORDER)
-		new_order += 2;
-
-	new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
-	new_order = min_t(unsigned int, new_order, ilog2(ra->size));
+	orders = file_orders_always() | BIT(0);
+	new_order = select_new_order(new_order, ilog2(ra->size), orders);
 
 	/* See comment in page_cache_ra_unbounded() */
 	nofs = memalloc_nofs_save();
@@ -463,9 +489,10 @@ void page_cache_ra_order(struct readahead_control *ractl,
 
 		/* Align with smaller pages if needed */
 		if (index & ((1UL << order) - 1))
-			order = __ffs(index);
+			order = select_new_order(order, __ffs(index), orders);
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (index + (1UL << order) - 1 > limit &&
+		       (BIT(order) & orders) == 0)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
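
To see how the "next + 1" ramp in select_new_order() above behaves, the
function can be exercised in userspace. The sketch below is only an
illustration built from the hunk quoted above, not kernel code: the
lowest_order()/highest_order() helpers are reimplemented with compiler
builtins (assuming a 64-bit unsigned long) and VM_WARN_ON() is replaced
by assert():

  #include <assert.h>
  #include <stdio.h>

  /* Userspace stand-ins for the kernel's lowest_order()/highest_order(). */
  static int lowest_order(unsigned long orders)
  {
          return orders ? __builtin_ctzl(orders) : -1;
  }

  static int highest_order(unsigned long orders)
  {
          return orders ? 63 - __builtin_clzl(orders) : -1;
  }

  /* Same logic as the patch's select_new_order(), minus the kernel macros. */
  static int select_new_order(int old_order, int max_order, unsigned long orders)
  {
          unsigned long hi_orders, lo_orders;

          orders &= (1UL << (max_order + 1)) - 1;
          assert(orders);
          hi_orders = orders & ~((1UL << (old_order + 1)) - 1);

          if (hi_orders) {
                  old_order = lowest_order(hi_orders);
                  hi_orders &= ~(1UL << old_order);
                  if (hi_orders)
                          return lowest_order(hi_orders);
          }

          lo_orders = orders & ((1UL << (old_order + 1)) - 1);
          return highest_order(lo_orders);
  }

  int main(void)
  {
          unsigned long all = 0x7f;       /* orders 0-6 enabled */
          unsigned long only_64k = 0x11;  /* only orders 0 and 4 enabled */
          int order = 0;

          /* All orders enabled: starting from 0 this prints 2 4 6 (the "+2" ramp). */
          while (order < 6) {
                  order = select_new_order(order, 6, all);
                  printf("%d ", order);
          }
          printf("\n");

          /* Only 64K enabled: jumps straight to order 4. */
          printf("%d\n", select_new_order(0, 6, only_64k));

          /* max_order caps the result: reduced to the closest allowed order. */
          printf("%d\n", select_new_order(4, 3, all));
          return 0;
  }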