From patchwork Wed Oct 18 10:42:50 2023
X-Patchwork-Submitter: Ingo Molnar
X-Patchwork-Id: 13426847
Date: Wed, 18 Oct 2023 12:42:50 +0200
From: Ingo Molnar
To: Mike Rapoport
Cc: x86@kernel.org, Andrew Morton, Andy Lutomirski, Borislav Petkov,
 Dave Hansen, David Hildenbrand, "H. Peter Anvin", Ingo Molnar,
 Michal Hocko, Peter Zijlstra, Qi Zheng, Thomas Gleixner,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2] x86/mm: Drop 4MB restriction on minimal NUMA node memory size
References: <20231017062215.171670-1-rppt@kernel.org>
In-Reply-To: <20231017062215.171670-1-rppt@kernel.org>

* Mike Rapoport wrote:

> From: "Mike Rapoport (IBM)"
>
> Qi Zheng reports crashes in a production environment and provides a
> simplified example as a reproducer:
>
> For example, if we use qemu to start a two NUMA node kernel,
> one of the nodes has 2M memory (less than NODE_MIN_SIZE),
> and the other node has 2G, then we will encounter the
> following panic:
>
> [    0.149844] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [    0.150783] #PF: supervisor write access in kernel mode
> [    0.151488] #PF: error_code(0x0002) - not-present page
> <...>
> [    0.156056] RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
> <...>
> [    0.169781] Call Trace:
> [    0.170159]
> [    0.170448]  deactivate_slab+0x187/0x3c0
> [    0.171031]  ? bootstrap+0x1b/0x10e
> [    0.171559]  ? preempt_count_sub+0x9/0xa0
> [    0.172145]  ? kmem_cache_alloc+0x12c/0x440
> [    0.172735]  ? bootstrap+0x1b/0x10e
> [    0.173236]  bootstrap+0x6b/0x10e
> [    0.173720]  kmem_cache_init+0x10a/0x188
> [    0.174240]  start_kernel+0x415/0x6ac
> [    0.174738]  secondary_startup_64_no_verify+0xe0/0xeb
> [    0.175417]
> [    0.175713] Modules linked in:
> [    0.176117] CR2: 0000000000000000
>
> The crashes happen because of inconsistency between nodemask that has
> nodes with less than 4MB as memoryless and the actual memory fed into
> core mm.

Presumably the core MM got fixed too to not just crash, but provide some
sort of warning?

> The commit 9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring
> empty node in SRAT parsing") that introduced minimal size of a NUMA node
> does not explain why a node size cannot be less than 4MB and what boot
> failures this restriction might fix.
>
> Since then a lot has changed and core mm won't confuse badly about small
> node sizes.

Core MM won't get confused ... other than by the above weird Qemu topology,
to which it responds with a ... NULL pointer dereference?

Seems quite close to the literal definition of 'get confused badly' to me,
and doesn't give me the warm fuzzy feeling that giving the core MM even
*more* weird topologies is super safe ... :-/

> Drop the limitation for the minimal node size.

While I agree with dropping the limitation, and I agree that 9391a3f9c7f1
should have provided more of a justification, I believe a core MM fix is in
order as well, for it to not crash. [ If it's fixed upstream already,
please reference the relevant commit ID. ]

Also, the changelog spelling & general presentation were quite low quality
- I've fixed it up a bit below, please carry this version going forward.

Please spell-check your patches before sending out Nth versions of them,
maybe maintainers are skipping them for a reason!

Thanks,

	Ingo

=================>
From: "Mike Rapoport (IBM)"
Date: Tue, 17 Oct 2023 09:22:15 +0300
Subject: [PATCH] x86/mm: Drop 4MB restriction on minimal NUMA node memory size

Qi Zheng reported crashes in a production environment and provided a
simplified example as a reproducer:

 | For example, if we use qemu to start a two NUMA node kernel,
 | one of the nodes has 2M memory (less than NODE_MIN_SIZE),
 | and the other node has 2G, then we will encounter the
 | following panic:
 |
 |   BUG: kernel NULL pointer dereference, address: 0000000000000000
 |   <...>
 |   RIP: 0010:_raw_spin_lock_irqsave+0x22/0x40
 |   <...>
 |   Call Trace:
 |
 |     deactivate_slab()
 |     bootstrap()
 |     kmem_cache_init()
 |     start_kernel()
 |     secondary_startup_64_no_verify()

The crashes happen because of inconsistency between the nodemask that
has nodes with less than 4MB as memoryless, and the actual memory fed
into the core mm.

The commit:

  9391a3f9c7f1 ("[PATCH] x86_64: Clear more state when ignoring empty node in SRAT parsing")

... that introduced minimal size of a NUMA node does not explain why
a node size cannot be less than 4MB and what boot failures this
restriction might fix.

In the 17 years since then a lot has changed and core mm won't get
confused about small node sizes.

Drop the limitation for the minimal node size.

[ mingo: Improved changelog clarity. ]

Reported-by: Qi Zheng
Signed-off-by: Mike Rapoport (IBM)
Not-Yet-Signed-off-by: Ingo Molnar
Acked-by: David Hildenbrand
Acked-by: Michal Hocko
Link: https://lore.kernel.org/all/20230212110305.93670-1-zhengqi.arch@bytedance.com/
---
 arch/x86/include/asm/numa.h | 7 -------
 arch/x86/mm/numa.c          | 7 -------
 2 files changed, 14 deletions(-)

diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0d..ef2844d69173 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -12,13 +12,6 @@
 
 #define NR_NODE_MEMBLKS		(MAX_NUMNODES*2)
 
-/*
- * Too small node sizes may confuse the VM badly. Usually they
- * result from BIOS bugs. So dont recognize nodes as standalone
- * NUMA entities that have less than this amount of RAM listed:
- */
-#define NODE_MIN_SIZE (4*1024*1024)
-
 extern int numa_off;
 
 /*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index c01c5506fd4a..aa39d678fe81 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -602,13 +602,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
 		if (start >= end)
 			continue;
 
-		/*
-		 * Don't confuse VM with a node that doesn't have the
-		 * minimum amount of memory:
-		 */
-		if (end && (end - start) < NODE_MIN_SIZE)
-			continue;
-
 		alloc_node_data(nid);
 	}