From patchwork Sat Jul 20 17:35:43 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Jerome Glisse
X-Patchwork-Id: 13737863
Date: Sat, 20 Jul 2024 10:35:43 -0700
X-Mailer: git-send-email 2.45.2.1089.g2a221341d9-goog
Message-ID: <20240720173543.897972-1-jglisse@google.com>
Subject: [PATCH] mm: fix maxnode for mbind(), set_mempolicy() and migrate_pages()
From: Jerome Glisse
To: Andrew Morton, linux-mm@kvack.org
Cc: Jerome Glisse, linux-kernel@vger.kernel.org, stable@vger.kernel.org

Because of the maxnode bug there is no way to bind or migrate_pages to
the last node in a multi-node NUMA system unless you lie about maxnode
when making the mbind, set_mempolicy or migrate_pages syscall.

The manpages for those syscalls describe maxnode as the number of bits
in the node bitmap ("bit mask of nodes containing up to maxnode bits").
Thus if maxnode is n then we expect an n-bit bitmap, which means that
the mask of valid bits is ((1 << n) - 1). The get_nodes() decrement
leads to the mask being ((1 << (n - 1)) - 1).

The three syscalls use a common helper, get_nodes(), and the first thing
this helper does is decrement maxnode by 1, which leads to using only
n-1 bits of the provided mask of nodes (see get_bitmap(), a helper
function to get_nodes()).

This leads to two bugs: either the last node in the provided bitmap is
not used by any of the three syscalls, or the syscalls error out and
return EINVAL if the only bit set in the bitmap is the last bit in the
mask of nodes (that bit is ignored because of the bug, and an empty mask
of nodes is an invalid argument).

I am surprised this bug was never caught ... it has been in the kernel
since forever.

People can use the following function to detect if the kernel has the
bug:

bool kernel_has_maxnodes_bug(void)
{
    unsigned long nodemask = 1;
    bool has_bug;
    long res;

    /* One-bit mask with its only (and last) bit set: node 0. */
    res = set_mempolicy(MPOL_BIND, &nodemask, 1);
    has_bug = res && (errno == EINVAL);
    /* Restore the default policy whatever the outcome. */
    set_mempolicy(MPOL_DEFAULT, NULL, 0);
    return has_bug;
}

You can test with any of the three programs below:

gcc mbind.c -o mbind -lnuma
gcc set_mempolicy.c -o set_mempolicy -lnuma
gcc migrate_pages.c -o migrate_pages -lnuma

The first argument is maxnode, the second argument is the bit index to
set in the mask of nodes (0 sets the first bit, 1 the second bit, ...).
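
To make the mask arithmetic described above concrete, here is a minimal
userspace sketch. It is not kernel code: valid_mask_expected(),
valid_mask_buggy() and node_is_kept() are names made up for this
illustration, and it assumes a nodemask that fits in a single unsigned
long.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_ULONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: valid bits when maxnode is the number of bits. */
unsigned long valid_mask_expected(unsigned long maxnode)
{
	/* Shifting by the full word width is undefined, so bound maxnode. */
	assert(maxnode > 0 && maxnode < BITS_PER_ULONG);
	return (1UL << maxnode) - 1;
}

/* Hypothetical helper: valid bits once maxnode has been decremented. */
unsigned long valid_mask_buggy(unsigned long maxnode)
{
	assert(maxnode > 0 && maxnode < BITS_PER_ULONG);
	return (1UL << (maxnode - 1)) - 1;
}

/* Nonzero if bit `node` of a one-word nodemask survives valid_mask. */
int node_is_kept(unsigned long valid_mask, unsigned long node)
{
	assert(node < BITS_PER_ULONG);
	return ((1UL << node) & valid_mask) != 0;
}
```

With maxnode = 2 and bit index 1, node_is_kept(valid_mask_expected(2), 1)
is 1 while node_is_kept(valid_mask_buggy(2), 1) is 0: the mask shrinks
from 0x3 to 0x1 and the last node is dropped.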

./mbind 2 1 & sleep 2 && numastat -n -p `pidof mbind` && fg
./set_mempolicy 2 1 & sleep 2 && numastat -n -p `pidof set_mempolicy` && fg
./migrate_pages 2 1 & sleep 2 && numastat -n -p `pidof migrate_pages` && fg

mbind.c %< ----------------------------------------------------------
/*
 * The #include lines and the NPAGES definition did not survive in this
 * copy; the ones below are reconstructed and NPAGES is an assumed value.
 */
#include <errno.h>
#include <numaif.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NPAGES 1024 /* assumed */

void *anon_mem(size_t size)
{
    void *ret;

    ret = mmap(NULL, size, PROT_READ|
               PROT_WRITE, MAP_PRIVATE|
               MAP_ANON, -1, 0);
    return ret == MAP_FAILED ? NULL : ret;
}

/* Round v up to the next multiple of m (always adds at least one byte). */
unsigned long mround(unsigned long v, unsigned long m)
{
    if (m == 0) {
        return v;
    }
    return v + m - (v % m);
}

void bitmap_set(void *_bitmap, unsigned long b)
{
    uint8_t *bitmap = _bitmap;

    bitmap[b >> 3] |= (1 << (b & 7));
}

int main(int argc, char *argv[])
{
    unsigned long *nodemask, maxnode, node;
    size_t bytes;
    int8_t *mem;
    long res;

    if (argv[1] == NULL || argv[2] == NULL) {
        printf("missing argument: %s maxnodes node\n", argv[0]);
        return -1;
    }
    maxnode = atoi(argv[1]);
    node = atoi(argv[2]);

    bytes = mround(mround(maxnode, 8) >> 3,
                   sizeof(unsigned long));
    nodemask = calloc(bytes, 1);
    mem = anon_mem(NPAGES << 12);
    if (!mem || !nodemask) {
        return -1;
    }

    // Try to bind memory to node
    bitmap_set(nodemask, node);
    res = mbind(mem, NPAGES << 12, MPOL_BIND,
                nodemask, maxnode, 0);
    if (res) {
        printf("mbind(mem, NPAGES << 12, MPOL_BIND, "
               "nodemask, %lu, 0) failed with %d\n",
               maxnode, errno);
        return -1;
    }

    // Write something so the pages stop mapping the zero page
    for (unsigned i = 0; i < NPAGES; i++) {
        mem[i << 12] = i + 1;
    }

    // Allow numastat to gather statistics
    getchar();

    return 0;
}

set_mempolicy %< ----------------------------------------------------
/*
 * The #include lines and the NPAGES definition did not survive in this
 * copy; the ones below are reconstructed and NPAGES is an assumed value.
 */
#include <errno.h>
#include <numaif.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NPAGES 1024 /* assumed */

void *anon_mem(size_t size)
{
    void *ret;

    ret = mmap(NULL, size, PROT_READ|
               PROT_WRITE, MAP_PRIVATE|
               MAP_ANON, -1, 0);
    return ret == MAP_FAILED ? NULL : ret;
}

/* Round v up to the next multiple of m (always adds at least one byte). */
unsigned long mround(unsigned long v, unsigned long m)
{
    if (m == 0) {
        return v;
    }
    return v + m - (v % m);
}

void bitmap_set(void *_bitmap, unsigned long b)
{
    uint8_t *bitmap = _bitmap;

    bitmap[b >> 3] |= (1 << (b & 7));
}

int main(int argc, char *argv[])
{
    unsigned long *nodemask, maxnode, node, i;
    size_t bytes;
    int8_t *mem;
    long res;

    if (argv[1] == NULL || argv[2] == NULL) {
        printf("missing argument: %s maxnodes node\n", argv[0]);
        return -1;
    }
    maxnode = atoi(argv[1]);
    node = atoi(argv[2]);

    // bind memory to node 0 ...
    i = 1;
    res = set_mempolicy(MPOL_BIND, &i, 2);
    if (res) {
        printf("set_mempolicy(MPOL_BIND, []=1, 2) "
               "failed with %d\n", errno);
        return -1;
    }

    bytes = mround(mround(maxnode, 8) >> 3,
                   sizeof(unsigned long));
    nodemask = calloc(bytes, 1);
    mem = anon_mem(NPAGES << 12);
    if (!mem || !nodemask) {
        return -1;
    }

    // Try to bind memory to node
    bitmap_set(nodemask, node);
    res = set_mempolicy(MPOL_BIND, nodemask, maxnode);
    if (res) {
        printf("set_mempolicy(MPOL_BIND, nodemask, %lu) "
               "failed with %d\n", maxnode, errno);
        return -1;
    }

    // Write something so the pages stop mapping the zero page
    for (unsigned i = 0; i < NPAGES; i++) {
        mem[i << 12] = i + 1;
    }

    // Allow numastat to gather statistics
    getchar();

    return 0;
}

migrate_pages %< ----------------------------------------------------
/*
 * The #include lines and the NPAGES definition did not survive in this
 * copy; the ones below are reconstructed and NPAGES is an assumed value.
 */
#include <errno.h>
#include <numaif.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define NPAGES 1024 /* assumed */

void *anon_mem(size_t size)
{
    void *ret;

    ret = mmap(NULL, size, PROT_READ|
               PROT_WRITE, MAP_PRIVATE|
               MAP_ANON, -1, 0);
    return ret == MAP_FAILED ? NULL : ret;
}

/* Round v up to the next multiple of m (always adds at least one byte). */
unsigned long mround(unsigned long v, unsigned long m)
{
    if (m == 0) {
        return v;
    }
    return v + m - (v % m);
}

void bitmap_set(void *_bitmap, unsigned long b)
{
    uint8_t *bitmap = _bitmap;

    bitmap[b >> 3] |= (1 << (b & 7));
}

int main(int argc, char *argv[])
{
    unsigned long *old_nodes, *new_nodes, maxnode, node, i;
    size_t bytes;
    int8_t *mem;
    long res;

    if (argv[1] == NULL || argv[2] == NULL) {
        printf("missing argument: %s maxnodes node\n", argv[0]);
        return -1;
    }
    maxnode = atoi(argv[1]);
    node = atoi(argv[2]);

    // bind memory to node 0 ...
    i = 1;
    res = set_mempolicy(MPOL_BIND, &i, 2);
    if (res) {
        printf("set_mempolicy(MPOL_BIND, []=1, 2) "
               "failed with %d\n", errno);
        return -1;
    }

    bytes = mround(mround(maxnode, 8) >> 3,
                   sizeof(unsigned long));
    old_nodes = calloc(bytes, 1);
    new_nodes = calloc(bytes, 1);
    mem = anon_mem(NPAGES << 12);
    if (!mem || !new_nodes || !old_nodes) {
        return -1;
    }

    // Write something so the pages stop mapping the zero page
    for (unsigned i = 0; i < NPAGES; i++) {
        mem[i << 12] = i + 1;
    }

    // Try to migrate memory from node 0 to node
    bitmap_set(old_nodes, 0);
    bitmap_set(new_nodes, node);
    res = migrate_pages(getpid(), maxnode,
                        old_nodes, new_nodes);
    if (res) {
        printf("migrate_pages(pid, %lu, old_nodes, "
               "new_nodes) failed with %d\n",
               maxnode, errno);
        return -1;
    }

    // Allow numastat to gather statistics
    getchar();

    return 0;
}

Signed-off-by: Jérôme Glisse
To: Andrew Morton
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
---
 mm/mempolicy.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aec756ae5637..658e5366d266 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1434,7 +1434,6 @@ static int get_bitmap(unsigned long *mask, const unsigned long __user *nmask,
 static int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
 		     unsigned long maxnode)
 {
-	--maxnode;
 	nodes_clear(*nodes);
 	if (maxnode == 0 || !nmask)
 		return 0;