From patchwork Thu Oct 3 20:00:43 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Ritesh Harjani (IBM)"
X-Patchwork-Id: 13821493
From: "Ritesh Harjani (IBM)"
To: linux-mm@kvack.org
Cc: "Ritesh Harjani (IBM)", Donet Tom, Gang Li, Daniel Jordan, Muchun Song, David Rientjes
Subject: [PATCH] mm/hugetlb: Fix hugepage allocation for interleaved memory nodes
Date: Fri, 4 Oct 2024 01:30:43 +0530
Message-ID: <7e0ca1e8acd7dd5c1fe7cbb252de4eb55a8e851b.1727984881.git.ritesh.list@gmail.com>
X-Mailer: git-send-email 2.46.0
Sender: owner-linux-mm@kvack.org
Precedence: bulk

gather_bootmem_prealloc() function assumes the start nid as 0 and size as num_node_state(N_MEMORY).
Since memory-attached NUMA nodes can be interleaved in any fashion,
ensure that the current code checks all online NUMA nodes as part of
gather_bootmem_prealloc_parallel(). In the example below, node 0 is
memoryless and all memory sits on node 1, so num_node_state(N_MEMORY) is 1
and the range [0, 1) never reaches node 1.
Let's still keep max_threads at num_node_state(N_MEMORY) so that we can
possibly have a uniform distribution of online nodes among these parallel
threads.

e.g. qemu cmdline
========================
numa_cmd="-numa node,nodeid=1,memdev=mem1,cpus=2-3 -numa node,nodeid=0,cpus=0-1 -numa dist,src=0,dst=1,val=20"
mem_cmd="-object memory-backend-ram,id=mem1,size=16G"

w/o this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
==========================
~ # cat /proc/meminfo |grep -i huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 0 kB

with this patch for cmdline (default_hugepagesz=1GB hugepagesz=1GB hugepages=2):
===========================
~ # cat /proc/meminfo |grep -i huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 2
HugePages_Free: 2
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
Hugetlb: 2097152 kB

Fixes: b78b27d02930 ("hugetlb: parallelize 1G hugetlb initialization")
Signed-off-by: Ritesh Harjani (IBM)
Cc: Donet Tom
Cc: Gang Li
Cc: Daniel Jordan
Cc: Muchun Song
Cc: David Rientjes
Cc: linux-mm@kvack.org
---

==== Additional data ====

w/o this patch:
================
~ # dmesg |grep -Ei "numa|node|huge"
[ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes.
[ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
[ 0.000000][ T0] numa: NODE_DATA(0) on node 1
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
[ 0.000000][ T0] Movable zone start for each node
[ 0.000000][ T0] Early memory node ranges
[ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Initmem setup node 0 as memoryless
[ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
[ 0.000000][ T0] Fallback order for Node 0: 1
[ 0.000000][ T0] Fallback order for Node 1: 1
[ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
[ 0.044978][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 0.209159][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
[ 0.414281][ T1] smp: Brought up 2 nodes, 4 CPUs
[ 0.415268][ T1] numa: Node 0 CPUs: 0-1
[ 0.416030][ T1] numa: Node 1 CPUs: 2-3
[ 13.644459][ T41] node 1 deferred pages initialised in 12040ms
[ 14.241701][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
[ 14.242781][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
[ 14.243806][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[ 14.244753][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
[ 16.490452][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 27.804266][ T1] Demotion targets for Node 1: null

with this patch:
=================
~ # dmesg |grep -Ei "numa|node|huge"
[ 0.000000][ T0] numa: Partition configured for 2 NUMA nodes.
[ 0.000000][ T0] memory[0x0] [0x0000000000000000-0x00000003ffffffff], 0x0000000400000000 bytes on node 1 flags: 0x0
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde50800-0x3dde57fff]
[ 0.000000][ T0] numa: NODE_DATA(0) on node 1
[ 0.000000][ T0] numa: NODE_DATA [mem 0x3dde49000-0x3dde507ff]
[ 0.000000][ T0] Movable zone start for each node
[ 0.000000][ T0] Early memory node ranges
[ 0.000000][ T0] node 1: [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Initmem setup node 0 as memoryless
[ 0.000000][ T0] Initmem setup node 1 [mem 0x0000000000000000-0x00000003ffffffff]
[ 0.000000][ T0] Kernel command line: root=/dev/vda1 console=ttyS0 nokaslr slub_max_order=0 norandmaps memblock=debug noreboot default_hugepagesz=1GB hugepagesz=1GB hugepages=2
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] memblock_alloc_try_nid_raw: 1073741824 bytes align=0x40000000 nid=1 from=0x0000000000000000 max_addr=0x0000000000000000 __alloc_bootmem_huge_page+0x1ac/0x2c8
[ 0.000000][ T0] Inode-cache hash table entries: 1048576 (order: 7, 8388608 bytes, linear)
[ 0.000000][ T0] Fallback order for Node 0: 1
[ 0.000000][ T0] Fallback order for Node 1: 1
[ 0.000000][ T0] SLUB: HWalign=128, Order=0-0, MinObjects=0, CPUs=4, Nodes=2
[ 0.048825][ T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[ 0.204211][ T1] Timer migration: 2 hierarchy levels; 8 children per group; 1 crossnode level
[ 0.378821][ T1] smp: Brought up 2 nodes, 4 CPUs
[ 0.379642][ T1] numa: Node 0 CPUs: 0-1
[ 0.380302][ T1] numa: Node 1 CPUs: 2-3
[ 11.577527][ T41] node 1 deferred pages initialised in 10250ms
[ 12.557856][ T1] HugeTLB: registered 1.00 GiB page size, pre-allocated 2 pages
[ 12.574197][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
[ 12.576339][ T1] HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
[ 12.577262][ T1] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
[ 15.102445][ T1] pci_bus 0000:00: Unknown NUMA node; performance will be reduced
[ 26.173888][ T1] Demotion targets for Node 1: null

 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--
2.39.5

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9a3a6e2dee97..60f45314c151 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3443,7 +3443,7 @@ static void __init gather_bootmem_prealloc(void)
 		.thread_fn	= gather_bootmem_prealloc_parallel,
 		.fn_arg		= NULL,
 		.start		= 0,
-		.size		= num_node_state(N_MEMORY),
+		.size		= num_node_state(N_ONLINE),
 		.align		= 1,
 		.min_chunk	= 1,
 		.max_threads	= num_node_state(N_MEMORY),