From patchwork Thu Dec 19 19:18:45 2024
X-Patchwork-Submitter: Joshua Hahn <joshua.hahnjy@gmail.com>
X-Patchwork-Id: 13915539
From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: gourry@gourry.net, hyeonggon.yoo@sk.com
Cc: rafael@kernel.org, lenb@kernel.org, gregkh@linuxfoundation.org,
 akpm@linux-foundation.org, honggyu.kim@sk.com, ying.huang@linux.alibaba.com,
 rakie.kim@sk.com, dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
 dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
 linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
 linux-mm@kvack.org, kernel-team@meta.com
Subject: [RFC PATCH v2] Weighted interleave auto-tuning
Date: Thu, 19 Dec 2024 11:18:45 -0800
Message-ID: <20241219191845.3506370-1-joshua.hahnjy@gmail.com>
X-Mailer: git-send-email 2.43.5
MIME-Version: 1.0
On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allows pages to be allocated across NUMA nodes according to user-set
ratios.

Ideally, these weights should be proportional to each node's bandwidth, so
that under bandwidth pressure, each node uses its maximal efficient
bandwidth and latency does not increase exponentially. At the same time,
we want the weights to be as small as possible. Ratios that involve large
co-prime numbers like 7639:1345:7 lead to awkward and inefficient
allocations, since the node with weight 7 remains mostly unused (and,
despite being proportional to its bandwidth, does not help relieve the
pressure present in the other two nodes).

This patch introduces auto-configuration of the interleave weights, which
aims to balance the two goals of setting node weights proportional to
their bandwidths and keeping the weight values low. This balance is
controlled by a value "weightiness", which defines the interleaving
aggression. Higher values lead to less interleaving (255:1), while lower
values lead to more interleaving (1:1). Large weightiness values generally
improve weight-bandwidth proportionality, but can leave nodes
underutilized (the worst case being 1:max_node_weight). Lower weightiness
reduces the effect of underutilized nodes, but may lead to improperly
loaded distributions.

This knob is exposed as a sysfs interface with a default value of 32.
Weights are re-calculated once at boot time, and again every time the knob
is changed by the user or the ACPI table is updated.
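To make the scaling concrete, here is a standalone userspace sketch of the proposed auto-tuning math (an illustration, not the kernel code; the helper names `scale_weights` and `gcd64` are made up for this example). Bandwidths are first reduced to percentage shares with a floor of 1, then rescaled into [1, weightiness], and finally divided by their collective GCD:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative userspace model of the proposed weight auto-tuning:
 * scale per-node bandwidths into interleave weights in [1, weightiness],
 * then reduce the result by the weights' GCD. Helper names are
 * hypothetical; the in-kernel code uses gcd() from <linux/gcd.h>.
 */
static uint64_t gcd64(uint64_t a, uint64_t b)
{
	while (b) {
		uint64_t t = a % b;

		a = b;
		b = t;
	}
	return a;
}

static void scale_weights(const uint64_t *bw, uint8_t *iw, int n,
			  int weightiness)
{
	uint64_t ttl_bw = 0, ttl_iw = 0, iw_gcd = 0;
	int i;

	for (i = 0; i < n; i++)
		ttl_bw += bw[i];

	/* Percentage share of total bandwidth, floored at 1 */
	for (i = 0; i < n; i++) {
		uint64_t pct = bw[i] ? 100 * bw[i] / ttl_bw : 0;

		iw[i] = pct > 1 ? (uint8_t)pct : 1;
		ttl_iw += iw[i];
	}

	/* Rescale shares into [1, weightiness], tracking their GCD */
	for (i = 0; i < n; i++) {
		uint64_t w = (uint64_t)weightiness * iw[i] / ttl_iw;

		iw[i] = (uint8_t)(w ? w : 1);
		iw_gcd = i ? gcd64(iw_gcd, iw[i]) : iw[i];
	}

	/* 1:2 is strictly better than 16:32: reduce by the GCD */
	for (i = 0; i < n; i++)
		iw[i] /= (uint8_t)iw_gcd;
}
```

For example, two nodes with a 3:1 bandwidth ratio and the default weightiness of 32 reduce to weights 3:1 (75% and 25% shares become 24 and 8, whose common factor of 8 is divided out).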
[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Co-developed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Gregory Price <gourry@gourry.net>

---
Changelog v2:
- The name of the interface has changed from v1:
  "max_node_weight" --> "weightiness"
- The default interleave weight table no longer exists. Rather, the
  interleave weight table is initialized with the defaults, if bandwidth
  information is available.
- In addition, all sections that handle iw_table have been changed to
  reference iw_table if it exists, otherwise defaulting to 1.
- All instances of unsigned long are converted to uint64_t to guarantee
  support for both 32-bit and 64-bit machines.
- sysfs initialization cleanup.
- Documentation has been rewritten to explicitly outline expected
  behavior and expand on the interpretation of "weightiness".
- kzalloc replaced with kcalloc for readability.
- Thank you Gregory and Hyeonggon for your review & feedback!

 ...fs-kernel-mm-mempolicy-weighted-interleave |  36 ++++
 drivers/acpi/numa/hmat.c                      |   1 +
 drivers/base/node.c                           |   7 +
 include/linux/mempolicy.h                     |   4 +
 mm/mempolicy.c                                | 183 +++++++++++++++---
 5 files changed, 209 insertions(+), 22 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
index 0b7972de04e9..edb2c1f4753f 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -23,3 +23,39 @@ Description:	Weight configuration interface for nodeN
 		Writing an empty string or `0` will reset the weight to the
 		system default. The system default may be set by the kernel
 		or drivers at boot or during hotplug events.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/weightiness
+Date:		December 2024
+Contact:	Linux memory management mailing list
+Description:	Weight limiting / scaling interface
+
+		"Weightiness": a measure of interleave aggression between
+		memory nodes. Higher values lead to less interleaving (255:1),
+		while lower values lead to more interleaving (1:1).
+
+		When this value is updated, all node weights are re-calculated
+		to reflect the new weightiness. These re-calculated values
+		overwrite all existing node weights, including those manually
+		set by writing to the nodeN files.
+
+		Node weight re-calculation is performed by scaling down
+		bandwidth values reported in the ACPI HMAT to the range
+		[1, weightiness]. Note that re-calculation uses only the
+		weightiness parameter and bandwidth values, and ignores all
+		current node weights.
+
+		Minimum weight: 1
+		Default value: 32
+		Maximum weight: 255
+
+		Writing an empty string will set the value to be the default
+		(32). Writing a value outside the valid range will return
+		EINVAL and will not re-trigger a weight scaling.
+
+		If there is no bandwidth data in the ACPI HMAT, then this file
+		will return ENODEV on an attempted write and perform no updates.
+		Furthermore, if there is no bandwidth information available,
+		all nodes' weights will default to 1.
+
+		Setting weightiness to 1 is equivalent to unweighted
+		interleave.
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index a2f9e7a4b479..83f3858a773f 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include

diff --git a/drivers/base/node.c b/drivers/base/node.c
index eb72580288e6..d45216386c03 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include <linux/mempolicy.h>
 #include
 #include
 #include
@@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
 			break;
 		}
 	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access == ACCESS_COORDINATE_CPU) {
+		if (mempolicy_set_node_perf(nid, coord))
+			pr_info("failed to set node%d mempolicy attrs\n", nid);
+	}
 }
 EXPORT_SYMBOL_GPL(node_set_perf_attrs);

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..d564e9e893ea 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -177,6 +178,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)

 extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);

+extern int mempolicy_set_node_perf(unsigned int node,
+				   struct access_coordinate *coords);
+
 #else

 struct mempolicy {};

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ee32a10e992c..cb355bdcdd12 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -109,6 +109,7 @@
 #include
 #include
 #include
+#include <linux/gcd.h>
 #include
 #include

@@ -145,17 +146,19 @@ enum {
 };

 static unsigned int mempolicy_behavior;
+static uint64_t *node_bw_table;
+
 /*
- * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
- * system-default value should be used. A NULL iw_table also denotes that
- * system-default values should be used. Until the system-default table
- * is implemented, the system-default is always 1.
+ * iw_table is the interleave weight table, with default values populated
+ * using ACPI-reported bandwidths (if they exist) and scaled down to [1,255].
 *
 * iw_table is RCU protected
 */
 static u8 __rcu *iw_table;
 static DEFINE_MUTEX(iw_table_lock);

+static int weightiness = 32;
+
 static u8 get_il_weight(int node)
 {
 	u8 *table;
@@ -163,14 +166,97 @@ static u8 get_il_weight(int node)
 	rcu_read_lock();
 	table = rcu_dereference(iw_table);
-	/* if no iw_table, use system default */
 	weight = table ? table[node] : 1;
-	/* if value in iw_table is 0, use system default */
-	weight = weight ? weight : 1;
 	rcu_read_unlock();
 	return weight;
 }

+/*
+ * Convert ACPI-reported bandwidths into weighted interleave weights for
+ * informed page allocation.
+ * Call with iw_table_lock held.
+ */
+static void reduce_interleave_weights(uint64_t *bw, u8 *new_iw)
+{
+	uint64_t ttl_bw = 0, ttl_iw = 0, scaling_factor = 1, iw_gcd = 1;
+	unsigned int i = 0;
+
+	/* Recalculate the bandwidth distribution given the new info */
+	for (i = 0; i < nr_node_ids; i++)
+		ttl_bw += bw[i];
+
+	/* If node is not set or has < 1% of total bw, use minimum value of 1 */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (bw[i]) {
+			scaling_factor = 100 * bw[i];
+			new_iw[i] = max(scaling_factor / ttl_bw, 1);
+		} else {
+			new_iw[i] = 1;
+		}
+		ttl_iw += new_iw[i];
+	}
+
+	/*
+	 * Scale each node's share of the total bandwidth from percentages
+	 * to whole numbers in the range [1, weightiness]
+	 */
+	for (i = 0; i < nr_node_ids; i++) {
+		scaling_factor = weightiness * new_iw[i];
+		new_iw[i] = max(scaling_factor / ttl_iw, 1);
+		if (unlikely(i == 0))
+			iw_gcd = new_iw[0];
+		iw_gcd = gcd(iw_gcd, new_iw[i]);
+	}
+
+	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD.
+	 */
+	for (i = 0; i < nr_node_ids; i++)
+		new_iw[i] /= iw_gcd;
+}
+
+int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
+{
+	uint64_t *old_bw, *new_bw;
+	uint64_t bw_val;
+	u8 *old_iw, *new_iw;
+
+	/*
+	 * Bandwidths above this limit cause rounding errors when reducing
+	 * weights. This value is ~16 exabytes, which is unreasonable anyways.
+	 */
+	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
+	if (bw_val > (U64_MAX / 10))
+		return -EINVAL;
+
+	new_bw = kcalloc(nr_node_ids, sizeof(uint64_t), GFP_KERNEL);
+	if (!new_bw)
+		return -ENOMEM;
+
+	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
+	if (!new_iw) {
+		kfree(new_bw);
+		return -ENOMEM;
+	}
+
+	mutex_lock(&iw_table_lock);
+	old_bw = node_bw_table;
+	old_iw = rcu_dereference_protected(iw_table,
+					   lockdep_is_held(&iw_table_lock));
+
+	if (old_bw)
+		memcpy(new_bw, old_bw, nr_node_ids * sizeof(uint64_t));
+	new_bw[node] = bw_val;
+	node_bw_table = new_bw;
+
+	reduce_interleave_weights(new_bw, new_iw);
+	rcu_assign_pointer(iw_table, new_iw);
+
+	mutex_unlock(&iw_table_lock);
+	synchronize_rcu();
+	kfree(old_bw);
+	kfree(old_iw);
+	return 0;
+}
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -2014,10 +2100,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	table = rcu_dereference(iw_table);
 	/* calculate the total weight */
 	for_each_node_mask(nid, nodemask) {
-		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
-		weight_total += weight;
+		weight_total += table ? table[nid] : 1;
 	}

 	/* Calculate the node offset based on totals */
@@ -2026,7 +2109,6 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	while (target) {
 		/* detect system default usage */
 		weight = table ? table[nid] : 1;
-		weight = weight ?
weight : 1;
 		if (target < weight)
 			break;
 		target -= weight;

@@ -2409,7 +2491,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
-	u8 *table, *weights, weight;
+	u8 *weights, weight;
 	unsigned int weight_total = 0;
 	unsigned long rem_pages = nr_pages;
 	nodemask_t nodes;
@@ -2458,16 +2540,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	if (!weights)
 		return total_allocated;

-	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	if (table)
-		memcpy(weights, table, nr_node_ids);
-	rcu_read_unlock();
-
-	/* calculate total, detect system default usage */
 	for_each_node_mask(node, nodes) {
-		if (!weights[node])
-			weights[node] = 1;
+		weights[node] = get_il_weight(node);
 		weight_total += weights[node];
 	}

@@ -3397,6 +3471,54 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,

 static struct iw_node_attr **node_attrs;

+static ssize_t weightiness_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", weightiness);
+}
+
+static ssize_t weightiness_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	uint64_t *bw;
+	u8 *old_iw, *new_iw;
+	u8 new_weightiness;
+
+	if (count == 0 || sysfs_streq(buf, ""))
+		new_weightiness = 32;
+	else if (kstrtou8(buf, 0, &new_weightiness) || new_weightiness == 0)
+		return -EINVAL;
+
+	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
+	if (!new_iw)
+		return -ENOMEM;
+
+	mutex_lock(&iw_table_lock);
+	bw = node_bw_table;
+
+	if (!bw) {
+		mutex_unlock(&iw_table_lock);
+		kfree(new_iw);
+		return -ENODEV;
+	}
+
+	weightiness = new_weightiness;
+	old_iw = rcu_dereference_protected(iw_table,
+					   lockdep_is_held(&iw_table_lock));
+
+	reduce_interleave_weights(bw, new_iw);
+	rcu_assign_pointer(iw_table, new_iw);
+	mutex_unlock(&iw_table_lock);
+
+	synchronize_rcu();
+	kfree(old_iw);
+
+	return count;
+}
+
+static struct kobj_attribute wi_attr =
+	__ATTR(weightiness, 0664, weightiness_show, weightiness_store);
+
 static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
 				  struct kobject *parent)
 {
@@ -3413,6 +3535,7 @@ static void sysfs_wi_release(struct kobject *wi_kobj)

 	for (i = 0; i < nr_node_ids; i++)
 		sysfs_wi_node_release(node_attrs[i], wi_kobj);
+
 	kobject_put(wi_kobj);
 }

@@ -3454,6 +3577,15 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
 	return 0;
 }

+static struct attribute *wi_default_attrs[] = {
+	&wi_attr.attr,
+	NULL
+};
+
+static const struct attribute_group wi_attr_group = {
+	.attrs = wi_default_attrs,
+};
+
 static int add_weighted_interleave_group(struct kobject *root_kobj)
 {
 	struct kobject *wi_kobj;
@@ -3470,6 +3602,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
 		return err;
 	}

+	err = sysfs_create_group(wi_kobj, &wi_attr_group);
+	if (err) {
+		pr_err("failed to add sysfs [weightiness]\n");
+		kobject_put(wi_kobj);
+		return err;
+	}
+
 	for_each_node_state(nid, N_POSSIBLE) {
 		err = add_weight_node(nid, wi_kobj);
 		if (err) {