From patchwork Tue Feb 18 18:16:47 2025
X-Patchwork-Submitter: Frank van der Linden
X-Patchwork-Id: 13980418
Date: Tue, 18 Feb 2025 18:16:47 +0000
In-Reply-To: <20250218181656.207178-1-fvdl@google.com>
References: <20250218181656.207178-1-fvdl@google.com>
Message-ID: <20250218181656.207178-20-fvdl@google.com>
Subject: [PATCH v4 19/27] mm/hugetlb: do pre-HVO for bootmem allocated pages
From: Frank van der Linden
To: akpm@linux-foundation.org, muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: yuzhao@google.com, usamaarif642@gmail.com, joao.m.martins@oracle.com, roman.gushchin@linux.dev, Frank van der Linden

For large systems, the overhead of vmemmap pages for hugetlb
is substantial. It's about 1.5% of memory, which is about
45G for a 3T system. If you want to configure most of that
system for hugetlb (e.g.
to use as backing memory for VMs), there is a chance of
running out of memory on boot, even though you know that
the 45G will become available later.

To avoid this scenario, and since it's a waste to first
allocate and then free that 45G during boot, do pre-HVO
for hugetlb bootmem allocated pages ('gigantic' pages).

Pre-HVO is done by adding functions that are called from
sparse_vmemmap_init_nid_early and sparse_vmemmap_init_nid_late.
The first is called before memmap allocation, so it takes care
of allocating memmap HVO-style. The second verifies that all
bootmem pages look good; specifically, it checks that they do
not intersect with multiple zones. This can only be done from
the sparse_vmemmap_init_nid_late path, when zones have been
initialized.

The hugetlb page size must be aligned to the section size,
and aligned to the size of memory described by the number
of page structures contained in one PMD (since pre-HVO is
not prepared to split PMDs). This should be true for most
'gigantic' pages; it is for 1G pages on x86, where both of
these alignment requirements are 128M.

This will only have an effect if hugetlb_bootmem_alloc was
called early in boot. If not, it won't do anything, and HVO
for bootmem hugetlb pages works as before.

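As a rough illustration of the alignment rule and the overhead figure
(not part of the patch), the userspace sketch below mirrors the two
alignment tests in vmemmap_should_optimize_bootmem_page(). It assumes
typical x86-64 values: 4K base pages, a 64-byte struct page, 2M PMD
mappings and 128M sections. All ex_/EX_ names are hypothetical and
exist only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86-64 values; none of these constants come from the patch. */
#define EX_PAGE_SHIFT		12			/* 4K base pages */
#define EX_PMD_SIZE		(UINT64_C(1) << 21)	/* 2M PMD mapping */
#define EX_STRUCT_PAGE_SIZE	UINT64_C(64)		/* assumed sizeof(struct page) */
#define EX_SECTION_SHIFT	27			/* assumed PA_SECTION_SHIFT (128M) */

/* Power-of-two alignment test, like the kernel's IS_ALIGNED(). */
static bool ex_is_aligned(uint64_t x, uint64_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x & (a - 1)) == 0;
}

static uint64_t ex_section_size(void)
{
	return UINT64_C(1) << EX_SECTION_SHIFT;
}

/* Memory described by the struct pages that fit in one vmemmap PMD. */
static uint64_t ex_pmd_vmemmap_size(void)
{
	return (EX_PMD_SIZE / EX_STRUCT_PAGE_SIZE) << EX_PAGE_SHIFT;
}

/* Both the physical address and the page size must pass both tests. */
static bool ex_pre_hvo_aligned(uint64_t paddr, uint64_t psize)
{
	return ex_is_aligned(paddr, ex_section_size()) &&
	       ex_is_aligned(psize, ex_section_size()) &&
	       ex_is_aligned(paddr, ex_pmd_vmemmap_size()) &&
	       ex_is_aligned(psize, ex_pmd_vmemmap_size());
}

/* Bytes of memmap (struct pages) needed to describe 'bytes' of memory. */
static uint64_t ex_memmap_bytes(uint64_t bytes)
{
	return (bytes >> EX_PAGE_SHIFT) * EX_STRUCT_PAGE_SIZE;
}
```

Under these assumptions both alignment requirements come out at 128M,
so a 1G page at a 1G-aligned address qualifies, while the same page
offset by 64M does not. ex_memmap_bytes() gives 48G of memmap for 3T
of memory (64/4096, about 1.56%), the same ballpark as the figure
quoted above.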
Signed-off-by: Frank van der Linden
---
 include/linux/hugetlb.h |   2 +
 mm/hugetlb.c            |  17 ++++-
 mm/hugetlb_vmemmap.c    | 143 ++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h    |  14 ++++
 mm/sparse-vmemmap.c     |   4 ++
 5 files changed, 177 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 10a7ce2b95e1..2512463bca49 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -687,6 +687,8 @@ struct huge_bootmem_page {
 #define HUGE_BOOTMEM_HVO		0x0001
 #define HUGE_BOOTMEM_ZONES_VALID	0x0002
 
+bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
+
 int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 40c88c46b34f..634dc53f1e3e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3211,7 +3211,18 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 	 */
 	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
 		huge_page_size(h) - PAGE_SIZE);
-	/* Put them into a private list first because mem_map is not up yet */
+
+	/*
+	 * Put them into a private list first because mem_map is not up yet.
+	 *
+	 * For pre-HVO to work correctly, pages need to be on the list for
+	 * the node they were actually allocated from. That node may be
+	 * different in the case of fallback by memblock_alloc_try_nid_raw.
+	 * So, extract the actual node first.
+	 */
+	if (nid == NUMA_NO_NODE)
+		node = early_pfn_to_nid(PHYS_PFN(virt_to_phys(m)));
+
 	INIT_LIST_HEAD(&m->list);
 	list_add(&m->list, &huge_boot_pages[node]);
 	m->hstate = h;
@@ -3306,8 +3317,8 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 	}
 }
 
-static bool __init hugetlb_bootmem_page_zones_valid(int nid,
-		struct huge_bootmem_page *m)
+bool __init hugetlb_bootmem_page_zones_valid(int nid,
+					      struct huge_bootmem_page *m)
 {
 	unsigned long start_pfn;
 	bool valid;
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index be6b33ecbc8e..9a99dfa3c495 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -743,6 +743,149 @@ void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head
 	__hugetlb_vmemmap_optimize_folios(h, folio_list, true);
 }
 
+#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
+
+/* Return true if a bootmem allocated HugeTLB page should be pre-HVO-ed */
+static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
+{
+	unsigned long section_size, psize, pmd_vmemmap_size;
+	phys_addr_t paddr;
+
+	if (!READ_ONCE(vmemmap_optimize_enabled))
+		return false;
+
+	if (!hugetlb_vmemmap_optimizable(m->hstate))
+		return false;
+
+	psize = huge_page_size(m->hstate);
+	paddr = virt_to_phys(m);
+
+	/*
+	 * Pre-HVO only works if the bootmem huge page
+	 * is aligned to the section size.
+	 */
+	section_size = (1UL << PA_SECTION_SHIFT);
+	if (!IS_ALIGNED(paddr, section_size) ||
+	    !IS_ALIGNED(psize, section_size))
+		return false;
+
+	/*
+	 * The pre-HVO code does not deal with splitting PMDs,
+	 * so the bootmem page must be aligned to the number
+	 * of base pages that can be mapped with one vmemmap PMD.
+	 */
+	pmd_vmemmap_size = (PMD_SIZE / (sizeof(struct page))) << PAGE_SHIFT;
+	if (!IS_ALIGNED(paddr, pmd_vmemmap_size) ||
+	    !IS_ALIGNED(psize, pmd_vmemmap_size))
+		return false;
+
+	return true;
+}
+
+/*
+ * Initialize memmap section for a gigantic page, HVO-style.
+ */
+void __init hugetlb_vmemmap_init_early(int nid)
+{
+	unsigned long psize, paddr, section_size;
+	unsigned long ns, i, pnum, pfn, nr_pages;
+	unsigned long start, end;
+	struct huge_bootmem_page *m = NULL;
+	void *map;
+
+	/*
+	 * Nothing to do if bootmem pages were not allocated
+	 * early in boot, or if HVO wasn't enabled in the
+	 * first place.
+	 */
+	if (!hugetlb_bootmem_allocated())
+		return;
+
+	if (!READ_ONCE(vmemmap_optimize_enabled))
+		return;
+
+	section_size = (1UL << PA_SECTION_SHIFT);
+
+	list_for_each_entry(m, &huge_boot_pages[nid], list) {
+		if (!vmemmap_should_optimize_bootmem_page(m))
+			continue;
+
+		nr_pages = pages_per_huge_page(m->hstate);
+		psize = nr_pages << PAGE_SHIFT;
+		paddr = virt_to_phys(m);
+		pfn = PHYS_PFN(paddr);
+		map = pfn_to_page(pfn);
+		start = (unsigned long)map;
+		end = start + nr_pages * sizeof(struct page);
+
+		if (vmemmap_populate_hvo(start, end, nid,
+					HUGETLB_VMEMMAP_RESERVE_SIZE) < 0)
+			continue;
+
+		memmap_boot_pages_add(HUGETLB_VMEMMAP_RESERVE_SIZE / PAGE_SIZE);
+
+		pnum = pfn_to_section_nr(pfn);
+		ns = psize / section_size;
+
+		for (i = 0; i < ns; i++) {
+			sparse_init_early_section(nid, map, pnum,
+					SECTION_IS_VMEMMAP_PREINIT);
+			map += section_map_size();
+			pnum++;
+		}
+
+		m->flags |= HUGE_BOOTMEM_HVO;
+	}
+}
+
+void __init hugetlb_vmemmap_init_late(int nid)
+{
+	struct huge_bootmem_page *m, *tm;
+	unsigned long phys, nr_pages, start, end;
+	unsigned long pfn, nr_mmap;
+	struct hstate *h;
+	void *map;
+
+	if (!hugetlb_bootmem_allocated())
+		return;
+
+	if (!READ_ONCE(vmemmap_optimize_enabled))
+		return;
+
+	list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
+		if (!(m->flags & HUGE_BOOTMEM_HVO))
+			continue;
+
+		phys = virt_to_phys(m);
+		h = m->hstate;
+		pfn = PHYS_PFN(phys);
+		nr_pages = pages_per_huge_page(h);
+
+		if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
+			/*
+			 * Oops, the hugetlb page spans multiple zones.
+			 * Remove it from the list, and undo HVO.
+			 */
+			list_del(&m->list);
+
+			map = pfn_to_page(pfn);
+
+			start = (unsigned long)map;
+			end = start + nr_pages * sizeof(struct page);
+
+			vmemmap_undo_hvo(start, end, nid,
+					 HUGETLB_VMEMMAP_RESERVE_SIZE);
+			nr_mmap = end - start - HUGETLB_VMEMMAP_RESERVE_SIZE;
+			memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
+
+			memblock_phys_free(phys, huge_page_size(h));
+			continue;
+		} else
+			m->flags |= HUGE_BOOTMEM_ZONES_VALID;
+	}
+}
+#endif
+
 static const struct ctl_table hugetlb_vmemmap_sysctls[] = {
 	{
 		.procname	= "hugetlb_optimize_vmemmap",
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 62d3d645a793..18b490825215 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -9,6 +9,8 @@
 #ifndef _LINUX_HUGETLB_VMEMMAP_H
 #define _LINUX_HUGETLB_VMEMMAP_H
 #include
+#include
+#include
 
 /*
  * Reserve one vmemmap page, all vmemmap addresses are mapped to it. See
@@ -25,6 +27,10 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
 void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list);
+#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
+void hugetlb_vmemmap_init_early(int nid);
+void hugetlb_vmemmap_init_late(int nid);
+#endif
 
 
 static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
@@ -71,6 +77,14 @@ static inline void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h,
 {
 }
 
+static inline void hugetlb_vmemmap_init_early(int nid)
+{
+}
+
+static inline void hugetlb_vmemmap_init_late(int nid)
+{
+}
+
 static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate *h)
 {
 	return 0;
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 8cc848c4b17c..fd2ab5118e13 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -32,6 +32,8 @@
 #include
 #include
 
+#include "hugetlb_vmemmap.h"
+
 /*
  * Flags for vmemmap_populate_range and friends.
  */
@@ -594,6 +596,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
  */
 void __init sparse_vmemmap_init_nid_early(int nid)
 {
+	hugetlb_vmemmap_init_early(nid);
 }
 
 /*
@@ -604,5 +607,6 @@ void __init sparse_vmemmap_init_nid_early(int nid)
  */
 void __init sparse_vmemmap_init_nid_late(int nid)
 {
+	hugetlb_vmemmap_init_late(nid);
 }
 #endif
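
The two-pass flow of hugetlb_vmemmap_init_early() and
hugetlb_vmemmap_init_late() can be modelled in userspace roughly as
follows. This is a toy sketch, not kernel code: the ex_/EX_ names and
the boolean fields are hypothetical stand-ins for the real alignment
and zone checks, and it assumes a 1G huge page, 4K base pages, a
64-byte struct page and a one-page HUGETLB_VMEMMAP_RESERVE_SIZE.

```c
#include <stdbool.h>
#include <stddef.h>

#define EX_HVO		0x0001	/* stands in for HUGE_BOOTMEM_HVO */
#define EX_ZONES_VALID	0x0002	/* stands in for HUGE_BOOTMEM_ZONES_VALID */

/* Assumed sizes: 1G huge page, 4K base pages, 64-byte struct page. */
#define EX_PAGE_SIZE	4096UL
#define EX_NR_PAGES	262144UL
#define EX_STRUCT_PAGE	64UL
#define EX_RESERVE_SIZE	EX_PAGE_SIZE	/* one vmemmap page kept per huge page */

struct ex_boot_page {
	unsigned int flags;
	bool aligned;		/* would pass the alignment checks */
	bool single_zone;	/* would pass the zone check */
	bool on_list;
};

/* Early pass: runs before memmap allocation; zones are not known yet. */
static void ex_init_early(struct ex_boot_page *p, size_t n,
			  unsigned long *memmap_pages)
{
	for (size_t i = 0; i < n; i++) {
		if (!p[i].on_list || !p[i].aligned)
			continue;
		/* Only the reserved vmemmap page is accounted for now. */
		*memmap_pages += EX_RESERVE_SIZE / EX_PAGE_SIZE;
		p[i].flags |= EX_HVO;
	}
}

/* Late pass: zones are known; undo pre-HVO for pages spanning zones. */
static void ex_init_late(struct ex_boot_page *p, size_t n,
			 unsigned long *memmap_pages)
{
	unsigned long vmemmap_bytes = EX_NR_PAGES * EX_STRUCT_PAGE;

	for (size_t i = 0; i < n; i++) {
		if (!p[i].on_list || !(p[i].flags & EX_HVO))
			continue;
		if (!p[i].single_zone) {
			unsigned long rest = vmemmap_bytes - EX_RESERVE_SIZE;

			/* Drop the page and account for the rest of its memmap. */
			p[i].on_list = false;
			*memmap_pages += (rest + EX_PAGE_SIZE - 1) / EX_PAGE_SIZE;
			continue;
		}
		p[i].flags |= EX_ZONES_VALID;
	}
}
```

With these sizes, a page that is pre-HVO-ed and later found to span
zones is accounted as 1 + 4095 = 4096 memmap pages in total, which is
the full 16M of struct pages for a 1G huge page. Pages that fail the
alignment checks are left untouched by both passes.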