From patchwork Mon Jan 27 23:22:00 2025
X-Patchwork-Submitter: Frank van der Linden
X-Patchwork-Id: 13951862
Date: Mon, 27 Jan 2025 23:22:00 +0000
In-Reply-To: <20250127232207.3888640-1-fvdl@google.com>
References: <20250127232207.3888640-1-fvdl@google.com>
X-Mailer: git-send-email 2.48.1.262.g85cc9f2d1e-goog
Message-ID: <20250127232207.3888640-21-fvdl@google.com>
Subject: [PATCH 20/27] mm/hugetlb: do pre-HVO for bootmem allocated pages
From: Frank van der Linden
To: akpm@linux-foundation.org, muchun.song@linux.dev, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Cc: yuzhao@google.com, usama.arif@bytedance.com, joao.m.martins@oracle.com,
 roman.gushchin@linux.dev, Frank van der Linden
For large systems, the overhead of vmemmap pages for hugetlb is
substantial. It's about 1.5% of memory, which is about 45G for a
3T system. If you want to configure most of that system for hugetlb
(e.g. to use as backing memory for VMs), there is a chance of running
out of memory on boot, even though you know that the 45G will become
available later.

To avoid this scenario, and since it's a waste to first allocate and
then free that 45G during boot, do pre-HVO for hugetlb bootmem
allocated pages ('gigantic' pages).

Pre-HVO is done by adding functions that are called from
sparse_vmemmap_init_nid_early and sparse_vmemmap_init_nid_late. The
first is called before memmap allocation, so it takes care of
allocating memmap HVO-style. The second verifies that all bootmem
pages look good; specifically, it checks that they do not intersect
with multiple zones. This can only be done from the
sparse_vmemmap_init_nid_late path, when zones have been initialized.

The hugetlb page size must be aligned to the section size, and also
aligned to the size of memory described by the number of page
structures contained in one PMD (since pre-HVO is not prepared to
split PMDs). This should be true for most 'gigantic' pages; it is for
1G pages on x86, where both of these alignment requirements are 128M.

This will only have an effect if hugetlb_bootmem_alloc was called
early in boot. If not, it won't do anything, and HVO for bootmem
hugetlb pages works as before.
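The overhead estimate and the two alignment figures above can be checked with a quick back-of-the-envelope calculation. The sketch below assumes typical x86-64 values, which the patch itself does not spell out: 4 KiB base pages, a 64-byte struct page, 2 MiB PMDs, and PA_SECTION_SHIFT = 27 (128 MiB sections).

```python
# Assumed x86-64 constants (not taken from the patch).
PAGE_SIZE = 4096                 # base page size
STRUCT_PAGE_SIZE = 64            # sizeof(struct page)
PMD_SIZE = 2 * 1024 * 1024       # one PMD maps 2 MiB
SECTION_SIZE = 1 << 27           # 1UL << PA_SECTION_SHIFT = 128 MiB

# vmemmap overhead: one struct page per base page of memory.
overhead = STRUCT_PAGE_SIZE / PAGE_SIZE          # 0.015625, i.e. ~1.5%
vmemmap_for_3t = 3 * 2**40 * overhead            # vmemmap bytes for 3 TiB

# Memory described by the struct pages that fit in one PMD mapping,
# mirroring: (PMD_SIZE / sizeof(struct page)) << PAGE_SHIFT
pmd_vmemmap_size = (PMD_SIZE // STRUCT_PAGE_SIZE) * PAGE_SIZE

print(f"overhead:            {overhead:.2%}")
print(f"vmemmap for 3 TiB:   {vmemmap_for_3t / 2**30:.0f} GiB")
print(f"PMD-based alignment: {pmd_vmemmap_size // 2**20} MiB")
print(f"section alignment:   {SECTION_SIZE // 2**20} MiB")
```

With these numbers the overhead is 1.5625%, i.e. 48 GiB of vmemmap for a 3 TiB machine (the same ballpark as the "about 45G" above), and both alignment requirements come out to 128 MiB, so x86 1G pages satisfy them.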
Signed-off-by: Frank van der Linden
---
 include/linux/hugetlb.h |   2 +
 mm/hugetlb.c            |   4 +-
 mm/hugetlb_vmemmap.c    | 143 ++++++++++++++++++++++++++++++++++++++++
 mm/hugetlb_vmemmap.h    |   6 ++
 mm/sparse-vmemmap.c     |   4 ++
 5 files changed, 157 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 10a7ce2b95e1..2512463bca49 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -687,6 +687,8 @@ struct huge_bootmem_page {
 #define HUGE_BOOTMEM_HVO		0x0001
 #define HUGE_BOOTMEM_ZONES_VALID	0x0002
 
+bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
+
 int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
 int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 05c5a65e605f..28653214f23d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3311,8 +3311,8 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 	}
 }
 
-static bool __init hugetlb_bootmem_page_zones_valid(int nid,
-					struct huge_bootmem_page *m)
+bool __init hugetlb_bootmem_page_zones_valid(int nid,
+					struct huge_bootmem_page *m)
 {
 	unsigned long start_pfn;
 	bool valid;
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 4eddf3c30d62..49cbd82a2f82 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -743,6 +743,149 @@ void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head
 	__hugetlb_vmemmap_optimize_folios(h, folio_list, true);
 }
 
+#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
+
+/* Return true if a bootmem allocated HugeTLB page should be pre-HVO-ed */
+static bool vmemmap_should_optimize_bootmem_page(struct huge_bootmem_page *m)
+{
+	unsigned long section_size, psize, pmd_vmemmap_size;
+	phys_addr_t paddr;
+
+	if (!READ_ONCE(vmemmap_optimize_enabled))
+		return false;
+
+	if (!hugetlb_vmemmap_optimizable(m->hstate))
+		return false;
+
+	psize = huge_page_size(m->hstate);
+	paddr = virt_to_phys(m);
+
+	/*
+	 * Pre-HVO only works if the bootmem huge page
+	 * is aligned to the section size.
+	 */
+	section_size = (1UL << PA_SECTION_SHIFT);
+	if (!IS_ALIGNED(paddr, section_size) ||
+	    !IS_ALIGNED(psize, section_size))
+		return false;
+
+	/*
+	 * The pre-HVO code does not deal with splitting PMDs,
+	 * so the bootmem page must be aligned to the number
+	 * of base pages that can be mapped with one vmemmap PMD.
+	 */
+	pmd_vmemmap_size = (PMD_SIZE / (sizeof(struct page))) << PAGE_SHIFT;
+	if (!IS_ALIGNED(paddr, pmd_vmemmap_size) ||
+	    !IS_ALIGNED(psize, pmd_vmemmap_size))
+		return false;
+
+	return true;
+}
+
+/*
+ * Initialize memmap section for a gigantic page, HVO-style.
+ */
+void __init hugetlb_vmemmap_init_early(int nid)
+{
+	unsigned long psize, paddr, section_size;
+	unsigned long ns, i, pnum, pfn, nr_pages;
+	unsigned long start, end;
+	struct huge_bootmem_page *m = NULL;
+	void *map;
+
+	/*
+	 * Nothing to do if bootmem pages were not allocated
+	 * early in boot, or if HVO wasn't enabled in the
+	 * first place.
+	 */
+	if (!hugetlb_bootmem_allocated())
+		return;
+
+	if (!READ_ONCE(vmemmap_optimize_enabled))
+		return;
+
+	section_size = (1UL << PA_SECTION_SHIFT);
+
+	list_for_each_entry(m, &huge_boot_pages[nid], list) {
+		if (!vmemmap_should_optimize_bootmem_page(m))
+			continue;
+
+		nr_pages = pages_per_huge_page(m->hstate);
+		psize = nr_pages << PAGE_SHIFT;
+		paddr = virt_to_phys(m);
+		pfn = PHYS_PFN(paddr);
+		map = pfn_to_page(pfn);
+		start = (unsigned long)map;
+		end = start + nr_pages * sizeof(struct page);
+
+		if (vmemmap_populate_hvo(start, end, nid,
+					HUGETLB_VMEMMAP_RESERVE_SIZE) < 0)
+			continue;
+
+		memmap_boot_pages_add(HUGETLB_VMEMMAP_RESERVE_SIZE / PAGE_SIZE);
+
+		pnum = pfn_to_section_nr(pfn);
+		ns = psize / section_size;
+
+		for (i = 0; i < ns; i++) {
+			sparse_init_early_section(nid, map, pnum,
+						  SECTION_IS_VMEMMAP_PREINIT);
+			map += section_map_size();
+			pnum++;
+		}
+
+		m->flags |= HUGE_BOOTMEM_HVO;
+	}
+}
+
+void __init hugetlb_vmemmap_init_late(int nid)
+{
+	struct huge_bootmem_page *m, *tm;
+	unsigned long phys, nr_pages, start, end;
+	unsigned long pfn, nr_mmap;
+	struct hstate *h;
+	void *map;
+
+	if (!hugetlb_bootmem_allocated())
+		return;
+
+	if (!READ_ONCE(vmemmap_optimize_enabled))
+		return;
+
+	list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
+		if (!(m->flags & HUGE_BOOTMEM_HVO))
+			continue;
+
+		phys = virt_to_phys(m);
+		h = m->hstate;
+		pfn = PHYS_PFN(phys);
+		nr_pages = pages_per_huge_page(h);
+
+		if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
+			/*
+			 * Oops, the hugetlb page spans multiple zones.
+			 * Remove it from the list, and undo HVO.
+			 */
+			list_del(&m->list);
+
+			map = pfn_to_page(pfn);
+
+			start = (unsigned long)map;
+			end = start + nr_pages * sizeof(struct page);
+
+			vmemmap_undo_hvo(start, end, nid,
+					 HUGETLB_VMEMMAP_RESERVE_SIZE);
+			nr_mmap = end - start - HUGETLB_VMEMMAP_RESERVE_SIZE;
+			memmap_boot_pages_add(DIV_ROUND_UP(nr_mmap, PAGE_SIZE));
+
+			memblock_phys_free(phys, huge_page_size(h));
+			continue;
+		} else
+			m->flags |= HUGE_BOOTMEM_ZONES_VALID;
+	}
+}
+#endif
+
 static struct ctl_table hugetlb_vmemmap_sysctls[] = {
 	{
 		.procname	= "hugetlb_optimize_vmemmap",
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 926b8b27b5cb..0031e49b12f7 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -9,6 +9,8 @@
 #ifndef _LINUX_HUGETLB_VMEMMAP_H
 #define _LINUX_HUGETLB_VMEMMAP_H
 #include
+#include
+#include
 
 /*
  * Reserve one vmemmap page, all vmemmap addresses are mapped to it. See
@@ -25,6 +27,10 @@ long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
 void hugetlb_vmemmap_optimize_bootmem_folios(struct hstate *h, struct list_head *folio_list);
+#ifdef CONFIG_SPARSEMEM_VMEMMAP_PREINIT
+void hugetlb_vmemmap_init_early(int nid);
+void hugetlb_vmemmap_init_late(int nid);
+#endif
 
 static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index bee22ca93654..29647fd3d606 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -32,6 +32,8 @@
 #include
 #include
 
+#include "hugetlb_vmemmap.h"
+
 /*
  * Flags for vmemmap_populate_range and friends.
  */
@@ -594,6 +596,7 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
  */
 void __init sparse_vmemmap_init_nid_early(int nid)
 {
+	hugetlb_vmemmap_init_early(nid);
 }
 
 /*
@@ -604,5 +607,6 @@ void __init sparse_vmemmap_init_nid_early(int nid)
  */
 void __init sparse_vmemmap_init_nid_late(int nid)
 {
+	hugetlb_vmemmap_init_late(nid);
 }
 #endif