From patchwork Tue Oct 31 00:38:07 2023
X-Patchwork-Submitter: Gregory Price
X-Patchwork-Id: 13440954
From: Gregory Price
To: linux-kernel@vger.kernel.org
Cc: linux-cxl@vger.kernel.org, linux-mm@kvack.org, ying.huang@intel.com, akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com, weixugc@google.com, apopple@nvidia.com, hannes@cmpxchg.org, tim.c.chen@intel.com, dave.hansen@intel.com, mhocko@kernel.org, shy828301@gmail.com, gregkh@linuxfoundation.org, rafael@kernel.org, Gregory Price
Subject: [RFC PATCH v3 1/4] base/node.c: initialize the accessor list before registering
Date: Mon, 30 Oct 2023 20:38:07 -0400
Message-Id: <20231031003810.4532-2-gregory.price@memverge.com>
In-Reply-To: <20231031003810.4532-1-gregory.price@memverge.com>
References: <20231031003810.4532-1-gregory.price@memverge.com>

The current code registers the node as available in the node array before
initializing the accessor list. As a result, anything that walks the
accessor list during this window (for example, as a side effect of an
allocation) performs an undefined memory access. In one example, an
extension that accesses hmat data during interleave hit this undefined
access via a bulk allocation that occurs during node initialization,
before the accessor list is initialized.

Initialize the accessor list before making the node generally available
to the global system.

Signed-off-by: Gregory Price
---
 drivers/base/node.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 493d533f8375..4d588f4658c8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -868,11 +868,15 @@ int __register_one_node(int nid)
 {
 	int error;
 	int cpu;
+	struct node *node;
 
-	node_devices[nid] = kzalloc(sizeof(struct node), GFP_KERNEL);
-	if (!node_devices[nid])
+	node = kzalloc(sizeof(struct node), GFP_KERNEL);
+	if (!node)
 		return -ENOMEM;
 
+	INIT_LIST_HEAD(&node->access_list);
+	node_devices[nid] = node;
+
 	error = register_node(node_devices[nid], nid);
 
 	/* link cpu under this node */
@@ -881,7 +885,6 @@ int __register_one_node(int nid)
 			register_cpu_under_node(cpu, nid);
 	}
 
-	INIT_LIST_HEAD(&node_devices[nid]->access_list);
 	node_init_caches(nid);
 
 	return error;

From patchwork Tue Oct 31 00:38:08 2023
X-Patchwork-Submitter: Gregory Price
X-Patchwork-Id: 13440955
From: Gregory Price
To: linux-kernel@vger.kernel.org
Cc: linux-cxl@vger.kernel.org, linux-mm@kvack.org, ying.huang@intel.com, akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com, weixugc@google.com, apopple@nvidia.com, hannes@cmpxchg.org, tim.c.chen@intel.com, dave.hansen@intel.com, mhocko@kernel.org, shy828301@gmail.com, gregkh@linuxfoundation.org, rafael@kernel.org, Gregory Price
Subject: [RFC PATCH v3 2/4] node: add accessors to sysfs when nodes are created
Date: Mon, 30 Oct 2023 20:38:08 -0400
Message-Id: <20231031003810.4532-3-gregory.price@memverge.com>
In-Reply-To: <20231031003810.4532-1-gregory.price@memverge.com>
References: <20231031003810.4532-1-gregory.price@memverge.com>

Accessor information is presently only exposed when hmat information is
registered. Add accessor information at node creation to allow new
attributes to be added even in the absence of hmat information.
Signed-off-by: Gregory Price
---
 drivers/base/node.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 4d588f4658c8..b09c9c8e6830 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -868,7 +868,9 @@ int __register_one_node(int nid)
 {
 	int error;
 	int cpu;
+	int onid;
 	struct node *node;
+	struct node_access_nodes *acc;
 
 	node = kzalloc(sizeof(struct node), GFP_KERNEL);
 	if (!node)
@@ -887,6 +889,20 @@ int __register_one_node(int nid)
 
 	node_init_caches(nid);
 
+	/*
+	 * for each cpu node - add accessor to this node
+	 * if this is a cpu node, add accessor to each other node
+	 */
+	for_each_online_node(onid) {
+		/* During system bringup nodes may not be fully initialized */
+		if (!node_devices[onid])
+			continue;
+		if (node_state(onid, N_CPU))
+			acc = node_init_node_access(node, onid);
+		if (node_state(nid, N_CPU))
+			acc = node_init_node_access(node_devices[onid], nid);
+	}
+
 	return error;
 }

From patchwork Tue Oct 31 00:38:09 2023
X-Patchwork-Submitter: Gregory Price
X-Patchwork-Id: 13440956
From: Gregory Price
To: linux-kernel@vger.kernel.org
Cc: linux-cxl@vger.kernel.org, linux-mm@kvack.org, ying.huang@intel.com, akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com, weixugc@google.com, apopple@nvidia.com, hannes@cmpxchg.org, tim.c.chen@intel.com, dave.hansen@intel.com, mhocko@kernel.org, shy828301@gmail.com, gregkh@linuxfoundation.org, rafael@kernel.org, Gregory Price, Ravi Shankar
Subject: [RFC PATCH v3 3/4] node: add interleave weights to node accessor
Date: Mon, 30 Oct 2023 20:38:09 -0400
Message-Id: <20231031003810.4532-4-gregory.price@memverge.com>
In-Reply-To: <20231031003810.4532-1-gregory.price@memverge.com>
References: <20231031003810.4532-1-gregory.price@memverge.com>

Add a configurable interleave weight to the node for each possible
accessor. The intent of this weight is to enable set_mempolicy() to
distribute memory across nodes based on the accessor and the effective
bandwidth available. The goal is to maximize the effective use of
available bandwidth.

The default weight is 1 for all nodes, which mimics the current
interleave (basic round-robin).
Signed-off-by: Gregory Price
Suggested-by: Ying Huang
Suggested-by: Ravi Shankar
---
 drivers/base/node.c  | 95 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/node.h | 17 ++++++++
 2 files changed, 112 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index b09c9c8e6830..29bb3874a885 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -83,9 +83,84 @@ struct node_access_nodes {
 #ifdef CONFIG_HMEM_REPORTING
 	struct node_hmem_attrs hmem_attrs;
 #endif
+	unsigned char il_weight;
 };
 #define to_access_nodes(dev) container_of(dev, struct node_access_nodes, dev)
 
+#define MAX_NODE_INTERLEAVE_WEIGHT 100
+static ssize_t il_weight_show(struct device *dev,
+			      struct device_attribute *attr,
+			      char *buf)
+{
+	return sysfs_emit(buf, "%u\n",
+			  to_access_nodes(dev)->il_weight);
+}
+
+static ssize_t il_weight_store(struct device *dev,
+			       struct device_attribute *attr,
+			       const char *buf, size_t len)
+{
+	unsigned char weight;
+	int ret;
+
+	ret = kstrtou8(buf, 0, &weight);
+	if (ret)
+		return ret;
+
+	if (!weight || weight > MAX_NODE_INTERLEAVE_WEIGHT)
+		return -EINVAL;
+
+	to_access_nodes(dev)->il_weight = weight;
+	return len;
+}
+DEVICE_ATTR_RW(il_weight);
+
+unsigned char node_get_il_weight(unsigned int nid, unsigned int access_nid)
+{
+	struct node *node;
+	struct node_access_nodes *c;
+	unsigned char weight = 1;
+
+	node = node_devices[nid];
+	if (!node)
+		return weight;
+
+	list_for_each_entry(c, &node->access_list, list_node) {
+		if (c->access != access_nid)
+			continue;
+		weight = c->il_weight;
+		break;
+	}
+	return weight;
+}
+
+unsigned int nodes_get_il_weights(unsigned int access_nid, nodemask_t *nodes,
+				  unsigned char *weights)
+{
+	unsigned int nid;
+	struct node *node;
+	struct node_access_nodes *c;
+	unsigned int ttl_weight = 0;
+	unsigned char weight = 1;
+
+	for_each_node_mask(nid, *nodes) {
+		weight = 1;
+		node = node_devices[nid];
+		if (!node)
+			goto next_node;
+		list_for_each_entry(c, &node->access_list, list_node) {
+			if (c->access != access_nid)
+				continue;
+			weight = c->il_weight;
+			break;
+		}
+next_node:
+		weights[nid] = weight;
+		ttl_weight += weight;
+	}
+	return ttl_weight;
+}
+
 static struct attribute *node_init_access_node_attrs[] = {
 	NULL,
 };
@@ -116,6 +191,7 @@ static void node_remove_accesses(struct node *node)
 
 	list_for_each_entry_safe(c, cnext, &node->access_list, list_node) {
 		list_del(&c->list_node);
+		device_remove_file(&c->dev, &dev_attr_il_weight);
 		device_unregister(&c->dev);
 	}
 }
@@ -140,6 +216,7 @@ static struct node_access_nodes *node_init_node_access(struct node *node,
 		return NULL;
 
 	access_node->access = access;
+	access_node->il_weight = 1;
 	dev = &access_node->dev;
 	dev->parent = &node->dev;
 	dev->release = node_access_release;
@@ -150,6 +227,9 @@ static struct node_access_nodes *node_init_node_access(struct node *node,
 	if (device_register(dev))
 		goto free_name;
 
+	if (device_create_file(dev, &dev_attr_il_weight))
+		dev_warn(dev, "failed to add il_weight attribute\n");
+
 	pm_runtime_no_callbacks(dev);
 	list_add_tail(&access_node->list_node, &node->access_list);
 	return access_node;
@@ -363,6 +443,21 @@ static void node_init_caches(unsigned int nid)
 #else
 static void node_init_caches(unsigned int nid) { }
 static void node_remove_caches(struct node *node) { }
+
+unsigned char node_get_il_weight(unsigned int nid, unsigned int access_nid)
+{
+	return 1;
+}
+
+unsigned int nodes_get_il_weights(unsigned int access_nid, nodemask_t *nodes,
+				  unsigned char *weights)
+{
+	unsigned int nid;
+
+	for_each_node_mask(nid, *nodes)
+		weights[nid] = 1;
+	return nodes_weight(nodes);
+}
 #endif
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
diff --git a/include/linux/node.h b/include/linux/node.h
index 427a5975cf40..3c7a6dd2d954 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -138,6 +138,12 @@ extern void unregister_memory_block_under_nodes(struct memory_block *mem_blk);
 extern int register_memory_node_under_compute_node(unsigned int mem_nid,
 						   unsigned int cpu_nid,
 						   unsigned access);
+
+extern unsigned char node_get_il_weight(unsigned int nid,
+					unsigned int access_nid);
+extern unsigned int nodes_get_il_weights(unsigned int access_nid,
+					 nodemask_t *nodes,
+					 unsigned char *weights);
 #else
 static inline void node_dev_init(void)
 {
@@ -165,6 +171,17 @@ static inline int unregister_cpu_under_node(unsigned int cpu, unsigned int nid)
 static inline void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
 {
 }
+static inline unsigned char node_get_il_weight(unsigned int nid,
+					       unsigned int access_nid)
+{
+	return 0;
+}
+static inline unsigned int nodes_get_il_weights(unsigned int access_nid,
+						nodemask_t *nodes,
+						unsigned char *weights)
+{
+	return 0;
+}
 #endif
 
 #define to_node(device) container_of(device, struct node, dev)

From patchwork Tue Oct 31 00:38:10 2023
X-Patchwork-Submitter: Gregory Price
X-Patchwork-Id: 13440957
From: Gregory Price
To: linux-kernel@vger.kernel.org
Cc: linux-cxl@vger.kernel.org, linux-mm@kvack.org, ying.huang@intel.com, akpm@linux-foundation.org, aneesh.kumar@linux.ibm.com, weixugc@google.com, apopple@nvidia.com, hannes@cmpxchg.org, tim.c.chen@intel.com, dave.hansen@intel.com, mhocko@kernel.org, shy828301@gmail.com, gregkh@linuxfoundation.org, rafael@kernel.org, Gregory Price
Subject: [RFC PATCH v3 4/4] mm/mempolicy: modify interleave mempolicy to use node weights
Date: Mon, 30 Oct 2023 20:38:10 -0400
Message-Id: <20231031003810.4532-5-gregory.price@memverge.com>
In-Reply-To: <20231031003810.4532-1-gregory.price@memverge.com>
References: <20231031003810.4532-1-gregory.price@memverge.com>

The node subsystem implements interleave weighting for the purpose of
bandwidth optimization. Each node may have different weights in relation
to each compute node ("access node").

The mempolicy MPOL_INTERLEAVE utilizes the node weights to implement
weighted interleave. By default, since all nodes default to a weight of
1, the original interleave behavior is retained.
Examples:

Weight settings:
  echo 4 > node0/access0/il_weight
  echo 1 > node0/access1/il_weight
  echo 3 > node1/access0/il_weight
  echo 2 > node1/access1/il_weight

Results:
  Task A:
    cpunode: 0
    nodemask: [0,1]
    weights: [4,3]
    allocation result: [0,0,0,0,1,1,1 repeat]
  Task B:
    cpunode: 1
    nodemask: [0,1]
    weights: [1,2]
    allocation result: [0,1,1 repeat]

Weights are relative to the access node.

Signed-off-by: Gregory Price
Signed-off-by: Srinivasulu Thanneeru
---
 include/linux/mempolicy.h |   4 ++
 mm/mempolicy.c            | 138 +++++++++++++++++++++++++++++---------
 2 files changed, 112 insertions(+), 30 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index d232de7cdc56..240468b669fd 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -48,6 +48,10 @@ struct mempolicy {
 	nodemask_t nodes;	/* interleave/bind/perfer */
 	int home_node;	/* Home node to use for MPOL_BIND and MPOL_PREFERRED_MANY */
 
+	/* weighted interleave settings */
+	unsigned char cur_weight;
+	unsigned char il_weights[MAX_NUMNODES];
+
 	union {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 29ebf1e7898c..d62e942a13bd 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -102,6 +102,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -300,6 +301,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	policy->mode = mode;
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
+	policy->cur_weight = 0;
 
 	return policy;
 }
@@ -334,6 +336,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
 
 	tmp = *nodes;
 	pol->nodes = tmp;
+	pol->cur_weight = 0;
 }
 
 static void mpol_rebind_preferred(struct mempolicy *pol,
@@ -881,8 +884,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && new->mode == MPOL_INTERLEAVE) {
 		current->il_prev = MAX_NUMNODES-1;
+		new->cur_weight = 0;
+	}
+
 	task_unlock(current);
 	mpol_put(old);
 	ret = 0;
@@ -1903,12 +1909,21 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 /* Do dynamic interleaving for a process */
 static unsigned interleave_nodes(struct mempolicy *policy)
 {
-	unsigned next;
+	unsigned int next;
+	unsigned char next_weight;
 	struct task_struct *me = current;
 
 	next = next_node_in(me->il_prev, policy->nodes);
-	if (next < MAX_NUMNODES)
+	if (!policy->cur_weight) {
+		/* If the node is set, at least 1 allocation is required */
+		next_weight = node_get_il_weight(next, numa_node_id());
+		policy->cur_weight = next_weight ? next_weight : 1;
+	}
+
+	policy->cur_weight--;
+	if (next < MAX_NUMNODES && !policy->cur_weight)
 		me->il_prev = next;
+
 	return next;
 }
 
@@ -1967,25 +1982,37 @@ unsigned int mempolicy_slab_node(void)
 static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 {
 	nodemask_t nodemask = pol->nodes;
-	unsigned int target, nnodes;
-	int i;
+	unsigned int target, nnodes, il_weight;
+	unsigned char weight;
 	int nid;
+	int cur_node = numa_node_id();
+
 	/*
 	 * The barrier will stabilize the nodemask in a register or on
 	 * the stack so that it will stop changing under the code.
 	 *
 	 * Between first_node() and next_node(), pol->nodes could be changed
 	 * by other threads. So we put pol->nodes in a local stack.
+	 *
+	 * Additionally, place the cur_node on the stack in case of a migration
 	 */
 	barrier();
 
 	nnodes = nodes_weight(nodemask);
 	if (!nnodes)
-		return numa_node_id();
-	target = (unsigned int)n % nnodes;
+		return cur_node;
+
+	il_weight = nodes_get_il_weights(cur_node, &nodemask, pol->il_weights);
+	target = (unsigned int)n % il_weight;
 	nid = first_node(nodemask);
-	for (i = 0; i < target; i++)
-		nid = next_node(nid, nodemask);
+	while (target) {
+		weight = pol->il_weights[nid];
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+
 	return nid;
 }
 
@@ -2319,32 +2346,83 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
 {
-	int nodes;
-	unsigned long nr_pages_per_node;
-	int delta;
-	int i;
-	unsigned long nr_allocated;
+	struct task_struct *me = current;
 	unsigned long total_allocated = 0;
+	unsigned long nr_allocated;
+	unsigned long rounds;
+	unsigned long node_pages, delta;
+	unsigned char weight;
+	unsigned long il_weight;
+	unsigned long req_pages = nr_pages;
+	int nnodes, node, prev_node;
+	int cur_node = numa_node_id();
+	int i;
 
-	nodes = nodes_weight(pol->nodes);
-	nr_pages_per_node = nr_pages / nodes;
-	delta = nr_pages - nodes * nr_pages_per_node;
-
-	for (i = 0; i < nodes; i++) {
-		if (delta) {
-			nr_allocated = __alloc_pages_bulk(gfp,
-					interleave_nodes(pol), NULL,
-					nr_pages_per_node + 1, NULL,
-					page_array);
-			delta--;
-		} else {
-			nr_allocated = __alloc_pages_bulk(gfp,
-				interleave_nodes(pol), NULL,
-				nr_pages_per_node, NULL, page_array);
+	prev_node = me->il_prev;
+	nnodes = nodes_weight(pol->nodes);
+	/* Continue allocating from most recent node */
+	if (pol->cur_weight) {
+		node = next_node_in(prev_node, pol->nodes);
+		node_pages = pol->cur_weight;
+		if (node_pages > nr_pages)
+			node_pages = nr_pages;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		/* if that's all the pages, no need to interleave */
+		if (req_pages <= pol->cur_weight) {
+			pol->cur_weight -= req_pages;
+			return total_allocated;
 		}
-
+		/* Otherwise we adjust req_pages down, and continue from there */
+		req_pages -= pol->cur_weight;
+		pol->cur_weight = 0;
+		prev_node = node;
+	}
+
+	il_weight = nodes_get_il_weights(cur_node, &pol->nodes,
+					 pol->il_weights);
+	rounds = req_pages / il_weight;
+	delta = req_pages % il_weight;
+	for (i = 0; i < nnodes; i++) {
+		node = next_node_in(prev_node, pol->nodes);
+		weight = pol->il_weights[node];
+		node_pages = weight * rounds;
+		if (delta > weight) {
+			node_pages += weight;
+			delta -= weight;
+		} else if (delta) {
+			node_pages += delta;
+			delta = 0;
+		}
+		/* The number of requested pages may not hit every node */
+		if (!node_pages)
+			break;
+		/* If an over-allocation would occur, floor it */
+		if (node_pages + total_allocated > nr_pages) {
+			node_pages = nr_pages - total_allocated;
+			delta = 0;
+		}
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
 		page_array += nr_allocated;
 		total_allocated += nr_allocated;
+		prev_node = node;
+	}
+
+	/*
+	 * Finally, we need to update me->il_prev and pol->cur_weight
+	 * If the last node allocated on has un-used weight, apply
+	 * the remainder as the cur_weight, otherwise proceed to next node
+	 */
+	if (node_pages) {
+		me->il_prev = prev_node;
+		node_pages %= weight;
+		pol->cur_weight = weight - node_pages;
+	} else {
+		me->il_prev = node;
+		pol->cur_weight = 0;
+	}
 
 	return total_allocated;