From patchwork Mon Jan 27 23:21:54 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Frank van der Linden
X-Patchwork-Id: 13951855
Date: Mon, 27 Jan 2025 23:21:54 +0000
In-Reply-To: <20250127232207.3888640-1-fvdl@google.com>
References: <20250127232207.3888640-1-fvdl@google.com>
X-Mailer: git-send-email 2.48.1.262.g85cc9f2d1e-goog
Message-ID: <20250127232207.3888640-15-fvdl@google.com>
Subject: [PATCH 14/27] mm/hugetlb: check bootmem pages for zone intersections
From: Frank van der Linden
To: akpm@linux-foundation.org, muchun.song@linux.dev, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
Cc: yuzhao@google.com, usama.arif@bytedance.com, joao.m.martins@oracle.com,
 roman.gushchin@linux.dev, Frank van der Linden
Bootmem hugetlb pages are allocated using memblock, which isn't (and
mostly can't be) aware of zones. So, they may end up crossing zone
boundaries. This would create confusion: a hugetlb page that is part
of multiple zones is bad. Worse, HVO might then end up stealthily
re-assigning pages to a different zone when a hugetlb page is freed,
since the tail page structures beyond the first vmemmap page would
inherit the zone of the first page structures.

While the chance of this happening is low, you can definitely create
a configuration where this happens (especially using ZONE_MOVABLE).

To avoid this issue, check if bootmem hugetlb pages intersect with
multiple zones during the gather phase, and if they do, discard them,
handing them to the page allocator.

Record the number of invalid bootmem pages per node and subtract them
from the number of available pages at the end, making it easier to do
these checks in multiple places later on.

Signed-off-by: Frank van der Linden
---
 mm/hugetlb.c  | 61 +++++++++++++++++++++++++++++++++++++++++++++++++--
 mm/internal.h |  2 ++
 mm/mm_init.c  | 25 +++++++++++++++++++++
 3 files changed, 86 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9969717b7dd8..a4d29a4f3efe 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -63,6 +63,7 @@ static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
 static unsigned long hugetlb_cma_size __initdata;
 
 __initdata struct list_head huge_boot_pages[MAX_NUMNODES];
+__initdata unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE];
 
 /*
  * Due to ordering constraints across the init code for various
@@ -3309,6 +3310,44 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 	}
 }
 
+static bool __init hugetlb_bootmem_page_zones_valid(int nid,
+					struct huge_bootmem_page *m)
+{
+	unsigned long start_pfn;
+	bool valid;
+
+	start_pfn = virt_to_phys(m) >> PAGE_SHIFT;
+
+	valid = !pfn_range_intersects_zones(nid, start_pfn,
+			pages_per_huge_page(m->hstate));
+	if (!valid)
+		hstate_boot_nrinvalid[hstate_index(m->hstate)]++;
+
+	return valid;
+}
+
+/*
+ * Free a bootmem page that was found to be invalid (intersecting with
+ * multiple zones).
+ *
+ * Since it intersects with multiple zones, we can't just do a free
+ * operation on all pages at once, but instead have to walk all
+ * pages, freeing them one by one.
+ */
+static void __init hugetlb_bootmem_free_invalid_page(int nid, struct page *page,
+					struct hstate *h)
+{
+	unsigned long npages = pages_per_huge_page(h);
+	unsigned long pfn;
+
+	while (npages--) {
+		pfn = page_to_pfn(page);
+		__init_reserved_page_zone(pfn, nid);
+		free_reserved_page(page);
+		page++;
+	}
+}
+
 /*
  * Put bootmem huge pages into the standard lists after mem_map is up.
  * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
@@ -3316,14 +3355,25 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
 static void __init gather_bootmem_prealloc_node(unsigned long nid)
 {
 	LIST_HEAD(folio_list);
-	struct huge_bootmem_page *m;
+	struct huge_bootmem_page *m, *tm;
 	struct hstate *h = NULL, *prev_h = NULL;
 
-	list_for_each_entry(m, &huge_boot_pages[nid], list) {
+	list_for_each_entry_safe(m, tm, &huge_boot_pages[nid], list) {
 		struct page *page = virt_to_page(m);
 		struct folio *folio = (void *)page;
 
 		h = m->hstate;
+		if (!hugetlb_bootmem_page_zones_valid(nid, m)) {
+			/*
+			 * Can't use this page. Initialize the
+			 * page structures if that hasn't already
+			 * been done, and give them to the page
+			 * allocator.
+			 */
+			hugetlb_bootmem_free_invalid_page(nid, page, h);
+			continue;
+		}
+
 		/*
 		 * It is possible to have multiple huge page sizes (hstates)
 		 * in this list. If so, process each size separately.
@@ -3595,13 +3645,20 @@ static void __init hugetlb_init_hstates(void)
 static void __init report_hugepages(void)
 {
 	struct hstate *h;
+	unsigned long nrinvalid;
 
 	for_each_hstate(h) {
 		char buf[32];
 
+		nrinvalid = hstate_boot_nrinvalid[hstate_index(h)];
+		h->max_huge_pages -= nrinvalid;
+
 		string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
 		pr_info("HugeTLB: registered %s page size, pre-allocated %ld pages\n",
 			buf, h->free_huge_pages);
+		if (nrinvalid)
+			pr_info("HugeTLB: %s page size: %lu invalid page%s discarded\n",
+				buf, nrinvalid, nrinvalid > 1 ? "s" : "");
 		pr_info("HugeTLB: %d KiB vmemmap can be freed for a %s page\n",
 			hugetlb_vmemmap_optimizable_size(h) / SZ_1K, buf);
 	}
diff --git a/mm/internal.h b/mm/internal.h
index 57662141930e..63fda9bb9426 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -658,6 +658,8 @@ static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
 }
 
 void set_zone_contiguous(struct zone *zone);
+bool pfn_range_intersects_zones(int nid, unsigned long start_pfn,
+			unsigned long nr_pages);
 
 static inline void clear_zone_contiguous(struct zone *zone)
 {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 925ed6564572..f7d5b4fe1ae9 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -2287,6 +2287,31 @@ void set_zone_contiguous(struct zone *zone)
 	zone->contiguous = true;
 }
 
+/*
+ * Check if a PFN range intersects multiple zones on one or more
+ * NUMA nodes. Specify the @nid argument if it is known that this
+ * PFN range is on one node, NUMA_NO_NODE otherwise.
+ */
+bool pfn_range_intersects_zones(int nid, unsigned long start_pfn,
+			unsigned long nr_pages)
+{
+	struct zone *zone, *izone = NULL;
+
+	for_each_zone(zone) {
+		if (nid != NUMA_NO_NODE && zone_to_nid(zone) != nid)
+			continue;
+
+		if (zone_intersects(zone, start_pfn, nr_pages)) {
+			if (izone != NULL)
+				return true;
+			izone = zone;
+		}
+
+	}
+
+	return false;
+}
+
 static void __init mem_init_print_info(void);
 
 void __init page_alloc_init_late(void)
 {
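
Illustrative note, not part of the patch: the core of the new check is
the "second intersecting zone" walk in pfn_range_intersects_zones().
Remember the first zone that overlaps the PFN range, and report an
intersection as soon as a second one overlaps it too. Below is a
minimal standalone sketch of that logic in userspace C, with a
hypothetical mock_zone table standing in for the kernel's
for_each_zone()/zone_intersects(); every name prefixed mock_ is made
up for the example.

#include <stdbool.h>
#include <stdio.h>

#define NUMA_NO_NODE (-1)

/* Hypothetical stand-in for struct zone: a PFN span on one node. */
struct mock_zone {
	int nid;
	unsigned long start_pfn;
	unsigned long end_pfn;	/* exclusive */
};

/* Half-open interval test, mirroring what zone_intersects() decides. */
static bool mock_zone_intersects(const struct mock_zone *z,
				 unsigned long start_pfn,
				 unsigned long nr_pages)
{
	return start_pfn < z->end_pfn && z->start_pfn < start_pfn + nr_pages;
}

/*
 * Same structure as the patch: remember the first intersecting zone,
 * return true as soon as a second one is found.
 */
static bool mock_pfn_range_intersects_zones(const struct mock_zone *zones,
					    int nzones, int nid,
					    unsigned long start_pfn,
					    unsigned long nr_pages)
{
	const struct mock_zone *izone = NULL;
	int i;

	for (i = 0; i < nzones; i++) {
		if (nid != NUMA_NO_NODE && zones[i].nid != nid)
			continue;

		if (mock_zone_intersects(&zones[i], start_pfn, nr_pages)) {
			if (izone != NULL)
				return true;
			izone = &zones[i];
		}
	}

	return false;
}

int main(void)
{
	/* Node 0: one zone below PFN 0x80000, a ZONE_MOVABLE-like zone above. */
	struct mock_zone zones[] = {
		{ 0, 0x00000, 0x80000 },
		{ 0, 0x80000, 0x100000 },
	};
	/* A 1G page is 0x40000 PFNs with a 4K base page size. */
	unsigned long nr = 0x40000;

	/* Straddles the zone boundary: prints 1 (invalid, would be discarded). */
	printf("straddles boundary: %d\n",
	       mock_pfn_range_intersects_zones(zones, 2, 0, 0x60000, nr));
	/* Fully inside the second zone: prints 0 (valid). */
	printf("fully inside zone:  %d\n",
	       mock_pfn_range_intersects_zones(zones, 2, 0, 0x80000, nr));
	return 0;
}

A range that intersects zero zones or exactly one zone returns false,
i.e. it counts as valid here; only a range spanning a zone boundary is
flagged, which is exactly the case the gather phase discards.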