From patchwork Tue Dec 10 21:54:39 2024
From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: gourry@gourry.net
Cc: rafael@kernel.org, lenb@kernel.org, gregkh@linuxfoundation.org,
    akpm@linux-foundation.org, honggyu.kim@sk.com,
    ying.huang@linux.alibaba.com, rakie.kim@sk.com,
    dan.j.williams@intel.com, Jonathan.Cameron@huawei.com,
    dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org,
    linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org,
    linux-mm@kvack.org, kernel-team@meta.com
Subject: [RFC PATCH] mm/mempolicy: Weighted interleave auto-tuning
Date: Tue, 10 Dec 2024 13:54:39 -0800
Message-ID: <20241210215439.94819-1-joshua.hahnjy@gmail.com>
X-Mailer: git-send-email 2.43.5
On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allows pages to be allocated across NUMA nodes according to user-set
ratios.

Ideally, these weights should be proportional to each node's bandwidth,
so that under bandwidth pressure each node runs at its maximal efficient
bandwidth and latency does not increase exponentially. At the same time,
we want the weights to be as small as possible. Ratios that involve
large co-prime numbers like 7639:1345:7 lead to awkward and inefficient
allocations: the node with weight 7 remains mostly unused and, despite
being proportional to its bandwidth, does little to relieve the pressure
on the other two nodes.

This patch introduces auto-configuration of the interleave weights that
balances two goals: keeping node weights proportional to their
bandwidths, and keeping the weight values themselves small. The balance
is controlled by max_node_weight, the maximum weight a single node can
take. A large max_node_weight generally improves weight-to-bandwidth
proportionality, but can leave nodes underutilized (consider the
worst-case distribution 1:max_node_weight); a small max_node_weight
limits the cost of an underutilized node, but may distribute load less
accurately.

The knob is exposed as a sysfs interface with a default value of 32.
Weights are recalculated once at boot time, and thereafter whenever the
knob is changed by the user or the ACPI table is updated.
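To make the scaling concrete, below is a small standalone userspace
sketch of the pipeline just described (illustrative only; the bandwidth
figures are invented and the helper names are not the kernel's): each
node's bandwidth becomes a percentage share, shares are scaled into
[1, max_node_weight], and the results are reduced by their GCD.

	/* weights_sketch.c: standalone model of the auto-tuning math */
	#include <stdio.h>
	#include <stdint.h>

	static uint64_t gcd64(uint64_t a, uint64_t b)
	{
		while (b) {
			uint64_t t = a % b;
			a = b;
			b = t;
		}
		return a;
	}

	int main(void)
	{
		/* invented example: node0 = 300 GB/s DRAM, node1 = 100 GB/s CXL */
		uint64_t bw[] = { 300, 100 };
		uint64_t iw[2], ttl_bw = 0, ttl_iw = 0, g = 0;
		const uint64_t max_node_weight = 32;
		int i, n = 2;

		for (i = 0; i < n; i++)
			ttl_bw += bw[i];

		/* step 1: percentage share of total bandwidth, minimum 1 */
		for (i = 0; i < n; i++) {
			iw[i] = bw[i] ? (100 * bw[i]) / ttl_bw : 1;
			if (!iw[i])
				iw[i] = 1;
			ttl_iw += iw[i];
		}

		/* step 2: scale shares into [1, max_node_weight] */
		for (i = 0; i < n; i++) {
			iw[i] = (max_node_weight * iw[i]) / ttl_iw;
			if (!iw[i])
				iw[i] = 1;
			g = i ? gcd64(g, iw[i]) : iw[i];
		}

		/* step 3: reduce by the GCD; 300:100 -> 75:25 -> 24:8 -> 3:1 */
		for (i = 0; i < n; i++)
			printf("node%d weight: %llu\n", i,
			       (unsigned long long)(iw[i] / g));
		return 0;
	}

The GCD pass is what turns the already-proportional 24:8 into the
minimal 3:1, keeping the interleave period short.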
[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Co-developed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 ...fs-kernel-mm-mempolicy-weighted-interleave |  24 +++
 drivers/acpi/numa/hmat.c                      |   1 +
 drivers/base/node.c                           |   7 +
 include/linux/mempolicy.h                     |   4 +
 mm/mempolicy.c                                | 195 ++++++++++++++++--
 5 files changed, 211 insertions(+), 20 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
index 0b7972de04e9..2ef9a87ce878 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -23,3 +23,27 @@ Description:	Weight configuration interface for nodeN
 		Writing an empty string or `0` will reset the weight to the
 		system default. The system default may be set by the kernel
 		or drivers at boot or during hotplug events.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/max_node_weight
+Date:		December 2024
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Weight limiting / scaling interface
+
+		The maximum interleave weight for a memory node. When it is
+		updated, any previous changes to interleave weights (i.e. via
+		the nodeN sysfs interfaces) are ignored, and new weights are
+		calculated using ACPI-reported bandwidths and scaled.
+
+		It is possible for weights to be greater than max_node_weight
+		if the nodeN interfaces are directly modified to be greater.
+
+		Minimum weight: 1
+		Default value:  32
+		Maximum weight: 255
+
+		Writing an empty string will set the value to be the default
+		(32). Writing a value outside the valid range will return
+		EINVAL and will not re-trigger a weight scaling.
+
+		Setting max_node_weight to 1 is equivalent to unweighted
+		interleave.
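Assuming this patch is applied, exercising the new knob is
straightforward: reading
`/sys/kernel/mm/mempolicy/weighted_interleave/max_node_weight` returns
the current cap (32 by default); writing a value such as `64` raises the
cap and re-triggers the bandwidth-based rescaling; writing an empty
string restores the default. The per-node `nodeN` files continue to work
and override the auto-tuned defaults, which is why weights larger than
max_node_weight can still appear.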
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index a2f9e7a4b479..83f3858a773f 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include <linux/mempolicy.h>
 #include
 #include
 #include
diff --git a/drivers/base/node.c b/drivers/base/node.c
index eb72580288e6..d45216386c03 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include <linux/mempolicy.h>
 #include
 #include
 #include
@@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
 			break;
 		}
 	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access == ACCESS_COORDINATE_CPU) {
+		if (mempolicy_set_node_perf(nid, coord))
+			pr_info("failed to set node%d mempolicy attrs\n", nid);
+	}
 }
 EXPORT_SYMBOL_GPL(node_set_perf_attrs);
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..d564e9e893ea 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include <linux/node.h>
 #include
 #include
 #include
@@ -177,6 +178,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 
 extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
 
+extern int mempolicy_set_node_perf(unsigned int node,
+				   struct access_coordinate *coords);
+
 #else
 
 struct mempolicy {};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ee32a10e992c..f789280acdcb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -109,6 +109,7 @@
 #include
 #include
 #include
+#include <linux/gcd.h>
 #include
 #include
 #include
@@ -153,24 +154,116 @@ static unsigned int mempolicy_behavior;
  *
  * iw_table is RCU protected
  */
+static unsigned long *node_bw_table;
+static u8 __rcu *default_iw_table;
+static DEFINE_MUTEX(default_iwt_lock);
+
 static u8 __rcu *iw_table;
 static DEFINE_MUTEX(iw_table_lock);
 
+static int max_node_weight = 32;
+
 static u8 get_il_weight(int node)
 {
-	u8 *table;
+	u8 *table, *defaults;
 	u8 weight;
 
 	rcu_read_lock();
+	defaults = rcu_dereference(default_iw_table);
 	table = rcu_dereference(iw_table);
-	/* if no iw_table, use system default */
-	weight = table ? table[node] : 1;
-	/* if value in iw_table is 0, use system default */
-	weight = weight ? weight : 1;
+	/* if no iw_table, use system default - if no default, use 1 */
+	weight = table ? table[node] : 0;
+	weight = weight ? weight : (defaults ? defaults[node] : 1);
 	rcu_read_unlock();
 	return weight;
 }
 
+/*
+ * Convert ACPI-reported bandwidths into weighted interleave weights for
+ * informed page allocation.
+ * Call with default_iwt_lock held
+ */
+static void reduce_interleave_weights(unsigned long *bw, u8 *new_iw)
+{
+	uint64_t ttl_bw = 0, ttl_iw = 0, scaling_factor = 1;
+	unsigned int iw_gcd = 1, i = 0;
+
+	/* Recalculate the bandwidth distribution given the new info */
+	for (i = 0; i < nr_node_ids; i++)
+		ttl_bw += bw[i];
+
+	/* If node is not set or has < 1% of total bw, use minimum value of 1 */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (bw[i]) {
+			scaling_factor = 100 * bw[i];
+			new_iw[i] = max(scaling_factor / ttl_bw, 1);
+		} else {
+			new_iw[i] = 1;
+		}
+		ttl_iw += new_iw[i];
+	}
+
+	/*
+	 * Scale each node's share of the total bandwidth from percentages
+	 * to whole numbers in the range [1, max_node_weight]
+	 */
+	for (i = 0; i < nr_node_ids; i++) {
+		scaling_factor = max_node_weight * new_iw[i];
+		new_iw[i] = max(scaling_factor / ttl_iw, 1);
+		if (unlikely(i == 0))
+			iw_gcd = new_iw[0];
+		iw_gcd = gcd(iw_gcd, new_iw[i]);
+	}
+
+	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
+	for (i = 0; i < nr_node_ids; i++)
+		new_iw[i] /= iw_gcd;
+}
+
+int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
+{
+	unsigned long *old_bw, *new_bw;
+	unsigned long bw_val;
+	u8 *old_iw, *new_iw;
+
+	/*
+	 * The scaling below multiplies bandwidth by 100. Bandwidths above
+	 * this limit would overflow and cause rounding errors when reducing
+	 * weights; such a bandwidth is unreasonably large anyway.
+	 */
+	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
+	if (bw_val > (U64_MAX / 100))
+		return -EINVAL;
+
+	new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
+	if (!new_bw)
+		return -ENOMEM;
+
+	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
+	if (!new_iw) {
+		kfree(new_bw);
+		return -ENOMEM;
+	}
+
+	mutex_lock(&default_iwt_lock);
+	old_bw = node_bw_table;
+	old_iw = rcu_dereference_protected(default_iw_table,
+					   lockdep_is_held(&default_iwt_lock));
+
+	if (old_bw)
+		memcpy(new_bw, old_bw, nr_node_ids * sizeof(unsigned long));
+	new_bw[node] = bw_val;
+	node_bw_table = new_bw;
+
+	reduce_interleave_weights(new_bw, new_iw);
+	rcu_assign_pointer(default_iw_table, new_iw);
+
+	mutex_unlock(&default_iwt_lock);
+	synchronize_rcu();
+	kfree(old_bw);
+	kfree(old_iw);
+	return 0;
+}
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -2001,7 +2094,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
 	nodemask_t nodemask;
 	unsigned int target, nr_nodes;
-	u8 *table;
+	u8 *table, *defaults;
 	unsigned int weight_total = 0;
 	u8 weight;
 	int nid;
@@ -2012,11 +2105,12 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 
 	rcu_read_lock();
 	table = rcu_dereference(iw_table);
+	defaults = rcu_dereference(default_iw_table);
 	/* calculate the total weight */
 	for_each_node_mask(nid, nodemask) {
 		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
+		weight = table ? table[nid] : 0;
+		weight = weight ? weight : (defaults ? defaults[nid] : 1);
 		weight_total += weight;
 	}
 
@@ -2025,8 +2119,8 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	nid = first_node(nodemask);
 	while (target) {
 		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
+		weight = table ? table[nid] : 0;
+		weight = weight ? weight : (defaults ? defaults[nid] : 1);
 		if (target < weight)
 			break;
 		target -= weight;
@@ -2409,7 +2503,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
-	u8 *table, *weights, weight;
+	u8 *weights, weight;
 	unsigned int weight_total = 0;
 	unsigned long rem_pages = nr_pages;
 	nodemask_t nodes;
@@ -2458,16 +2552,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	if (!weights)
 		return total_allocated;
 
-	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	if (table)
-		memcpy(weights, table, nr_node_ids);
-	rcu_read_unlock();
-
-	/* calculate total, detect system default usage */
 	for_each_node_mask(node, nodes) {
-		if (!weights[node])
-			weights[node] = 1;
+		weights[node] = get_il_weight(node);
 		weight_total += weights[node];
 	}
 
@@ -3396,6 +3482,7 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 }
 
 static struct iw_node_attr **node_attrs;
+static struct kobj_attribute *max_nw_attr;
 
 static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
 				  struct kobject *parent)
@@ -3413,6 +3500,10 @@ static void sysfs_wi_release(struct kobject *wi_kobj)
 
 	for (i = 0; i < nr_node_ids; i++)
 		sysfs_wi_node_release(node_attrs[i], wi_kobj);
+
+	sysfs_remove_file(wi_kobj, &max_nw_attr->attr);
+	kfree(max_nw_attr->attr.name);
+	kfree(max_nw_attr);
 	kobject_put(wi_kobj);
 }
 
@@ -3454,6 +3545,63 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
 	return 0;
 }
 
+static ssize_t max_nw_show(struct kobject *kobj, struct kobj_attribute *attr,
+			   char *buf)
+{
+	return sysfs_emit(buf, "%d\n", max_node_weight);
+}
+
+static ssize_t max_nw_store(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    const char *buf, size_t count)
+{
+	unsigned long *bw;
+	u8 *old_iw, *new_iw;
+	u8 max_weight;
+
+	if (count == 0 || sysfs_streq(buf, ""))
+		max_weight = 32;
+	else if (kstrtou8(buf, 0, &max_weight) || max_weight == 0)
+		return -EINVAL;
+
+	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
+	if (!new_iw)
+		return -ENOMEM;
+
+	mutex_lock(&default_iwt_lock);
+	bw = node_bw_table;
+
+	if (!bw) {
+		mutex_unlock(&default_iwt_lock);
+		kfree(new_iw);
+		return -ENODEV;
+	}
+
+	max_node_weight = max_weight;
+	old_iw = rcu_dereference_protected(default_iw_table,
+					   lockdep_is_held(&default_iwt_lock));
+
+	reduce_interleave_weights(bw, new_iw);
+	rcu_assign_pointer(default_iw_table, new_iw);
+	mutex_unlock(&default_iwt_lock);
+
+	synchronize_rcu();
+	kfree(old_iw);
+
+	return count;
+}
+
+static struct kobj_attribute wi_attr =
+	__ATTR(max_node_weight, 0664, max_nw_show, max_nw_store);
+
+static struct attribute *wi_default_attrs[] = {
+	&wi_attr.attr,
+	NULL
+};
+
+static const struct attribute_group wi_attr_group = {
+	.attrs = wi_default_attrs,
+};
+
 static int add_weighted_interleave_group(struct kobject *root_kobj)
 {
 	struct kobject *wi_kobj;
@@ -3470,6 +3618,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
 		return err;
 	}
 
+	err = sysfs_create_group(wi_kobj, &wi_attr_group);
+	if (err) {
+		pr_err("failed to add sysfs [max_node_weight]\n");
+		kobject_put(wi_kobj);
+		return err;
+	}
+
 	for_each_node_state(nid, N_POSSIBLE) {
 		err = add_weight_node(nid, wi_kobj);
 		if (err) {
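Independent of the kernel-side changes, the effect of the derived
weights can be observed from userspace. The sketch below is a
hypothetical test program, not part of this patch: it assumes a kernel
with MPOL_WEIGHTED_INTERLEAVE (v6.9+) and nodes 0-1 online, and uses the
raw syscall to avoid a libnuma dependency. It opts the calling task into
weighted interleave and touches memory so allocations follow whatever
weights are in effect, whether auto-tuned or set via the nodeN files.

	/* demo_weighted_interleave.c: opt a process into weighted interleave */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#ifndef MPOL_WEIGHTED_INTERLEAVE
	#define MPOL_WEIGHTED_INTERLEAVE 6	/* uapi value since v6.9 */
	#endif

	int main(void)
	{
		unsigned long nodemask = 0x3;	/* interleave across nodes 0 and 1 */
		size_t len = 64UL << 20;	/* 64 MiB */
		char *buf;

		if (syscall(SYS_set_mempolicy, MPOL_WEIGHTED_INTERLEAVE,
			    &nodemask, 8 * sizeof(nodemask))) {
			perror("set_mempolicy");
			return 1;
		}

		buf = malloc(len);
		if (!buf)
			return 1;
		/* faulting the pages in spreads them per the effective weights */
		memset(buf, 1, len);
		printf("touched %zu bytes under weighted interleave\n", len);
		free(buf);
		return 0;
	}

Per-node placement can then be inspected via the numa_maps of the
process to confirm the pages landed in the configured ratio.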