From patchwork Fri Feb 2 17:02:35 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Gregory Price X-Patchwork-Id: 13543174 Received: from mail-pf1-f194.google.com (mail-pf1-f194.google.com [209.85.210.194]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 18FA314AD04; Fri, 2 Feb 2024 17:02:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.194 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706893371; cv=none; b=aNXOS/ZSgAoxdOO1vCHdln08OajATFWbm2kgLPckcHLoA2aCiMAgAHLE3sTRr8CaUaTw+6veDyDaCN7TkcmPf5hTus2n4wD3Om3hGODhwgGWe5XvWvDf4vmPb2HfpbZqcPtewhw/PCLWaQknKBxvPZvUtFYf7azZY3PJdlJNZ4g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706893371; c=relaxed/simple; bh=wRnjIaYoM3krBdw0RecHHRZC1Sp/+4vnWQZxds1afs0=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version:Content-Type; b=Wpj5ENba3xra+KLqVGQ6TFRVglAs2Ui+HGlTlHbQ1jUVY8JJHJmcf2eajSGgZy+/IqU3NXphycRiCcKoxRHxDHaCGAuT9ZWIL7QJc3Nd0fc1PneWEwMrkwSPIETDOkElRBa9at5ePgOlssuRudkPV+FKbXtc6eWFZC5m/qEjXGI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=O/8mGh13; arc=none smtp.client-ip=209.85.210.194 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="O/8mGh13" Received: by mail-pf1-f194.google.com with SMTP id d2e1a72fcca58-6de0f53f8e8so2738991b3a.0; Fri, 02 Feb 2024 09:02:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1706893369; x=1707498169; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=1pDRIrlrQ4UtMpe5hzZtBydhtO2WkdGXjanRY84YnLE=; b=O/8mGh13v+a3Srk6nXmKRc9vT/elrFXAAmNne0ChFRg5Zq7DfqE9AncCZr/KVVyhiR C6ooYqJ+HSF79ViGsAoC1ZDPLTucjTFKO07slz9P+Qo2KnJpqpf1/oc9w45etGkIWVVB 0Y9/S2MAN9nk0rCBvabEdGOYflHscJVm91sQm+1baxdmdaBcB0uWvkVkRolbuiLSVbuz 7Je0kOLa2rmZHA+n1d9tSPwL3GRgAE/ui8h0JbnWhKjNXDXaKFnfBiR20pZnjphH5Zmm DhjrGL72mNK802KdRyiz746OlXsr8lGHOkVHhTPGoFWjKDDSWO7MU75voy1GdI6/gPhL xABA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706893369; x=1707498169; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=1pDRIrlrQ4UtMpe5hzZtBydhtO2WkdGXjanRY84YnLE=; b=vSXBoElRey7jkXXhgSPdLKgy7poZjp64EhdkCK1PELVLJJRNEleQhYrhDJ+ipCsNeN 4MjeOOrdt+iDjd0HiGNUsiAmmZxyV1zHALdrnk4gbY2L7MaXezlsLr0nBv8M92qKHgKY zYf7DxAI3Mz004w1btooK0N1ZM3OgBo1q5ji2Br0uu4mDqC3WzmAP8manI522Ef5DmRM cy+Oz4NgNld37Z7rgLAqpYi/qxTvtM/1Ck00P5cd42LHAio+HZJDYsGfzwLJadJ4NB8u NLe+boFXQFuQE2So3OOG6yXDv+nCfs3wJykrikFzdnnBXwv7M1gFg5ezyCUmm49X2WIc GKZA== X-Gm-Message-State: AOJu0YwDlElsWM/on2PUqStdQheGt/lbOm5poVuIHg/6oPfH7JE/j6+k 7cqXS0YAJrLovgox46AN4b/XVz3aWllj2IzDnMIvVY09HmTPEMg= X-Google-Smtp-Source: AGHT+IEi0ywj/8sRmBXyXsM66L0qrQazVSY5CleNRDvWOz/EKdM4P7MBc4iBts3CCmK1F90toljgRA== X-Received: by 2002:a05:6a00:2d09:b0:6da:bcea:4cd4 with SMTP id fa9-20020a056a002d0900b006dabcea4cd4mr4198277pfb.16.1706893369247; Fri, 02 Feb 2024 09:02:49 -0800 (PST) X-Forwarded-Encrypted: i=0; AJvYcCU1kz89Qv1XP+xUI9TWw/AYEC2UL/HgNVc490rP5Y7IX6xAlFUjUHtRkGqH2hXHAfI0JV4CBjP7TrlF32KaLHI5Ea83xq2PdC+AoncPzs58T5FZf1EEaB6ZhMlMPm6zyY/TU/TnDViEF8wYBL3mgo7R/S8j8WP6WNlUKA0HvEv+O0VUu6bNW2RtCd7/Ty0kTgojnF+NbaFxa0SR9LYAqSCkDT04Q7fWyxBt0ojzFTUsiu70LjRCBLlPq0eC7Mo/ya/NmZclCts6pp0LQMgg+dCpUQa469YLWK2hkrykaQiq8D9TSdBG9O2Jw4juCj65H75PesrMPF2wwTOb5GFGCOISNQ6HYgTDX/gZQnQlU1vSKDMRISQ4I2w/tHPjTDKnBHZfLUgKTTjymq3PAguvlj84h/xPIOKcNzYjopMtkmDJY1aiUISZWFxgHfxP77mJwoX9KOWj0BSgTDCcIqjlgpumbE7N68jQVrh7xzuHITqf8dSLZsVAMdUuELkr9rkUBVcKLnbDcVdQiksRUeEFP3ooRO8Nc2hUCMZVX6OFtzifmki5kwegyM6JFQ5S4sp/E2CkgDQg0LxI1bPR68uPJQ3Gqhxf68QuRx67ARBnEkWxXrtimbmYQFdcFLl2LwXttsgEBIjp8PM/yB5vhFHQ6mQmWYvhkQe+heSoBvSa/kU= Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id z22-20020aa785d6000000b006ddddc7701fsm1866578pfn.4.2024.02.02.09.02.46 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 02 Feb 2024 09:02:48 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, corbet@lwn.net, akpm@linux-foundation.org, gregory.price@memverge.com, honggyu.kim@sk.com, rakie.kim@sk.com, hyeongtak.ji@sk.com, mhocko@kernel.org, ying.huang@intel.com, vtavarespetr@micron.com, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, hannes@cmpxchg.org, dan.j.williams@intel.com Subject: [PATCH v5 1/4] mm/mempolicy: implement the sysfs-based weighted_interleave interface Date: Fri, 2 Feb 2024 12:02:35 -0500 Message-Id: <20240202170238.90004-2-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20240202170238.90004-1-gregory.price@memverge.com> References: <20240202170238.90004-1-gregory.price@memverge.com> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Rakie Kim This patch provides a way to set interleave weight information under sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN The sysfs structure is designed as follows. $ tree /sys/kernel/mm/mempolicy/ /sys/kernel/mm/mempolicy/ [1] └── weighted_interleave [2] ├── node0 [3] └── node1 Each file above can be explained as follows. [1] mm/mempolicy: configuration interface for mempolicy subsystem [2] weighted_interleave/: config interface for weighted interleave policy [3] weighted_interleave/nodeN: weight for nodeN If a node value is set to `0`, the system-default value will be used. As of this patch, the system-default for all nodes is always 1. Suggested-by: "Huang, Ying" Signed-off-by: Rakie Kim Signed-off-by: Honggyu Kim Co-developed-by: Gregory Price Signed-off-by: Gregory Price Co-developed-by: Hyeongtak Ji Signed-off-by: Hyeongtak Ji Reviewed-by: "Huang, Ying" --- .../ABI/testing/sysfs-kernel-mm-mempolicy | 4 + ...fs-kernel-mm-mempolicy-weighted-interleave | 25 ++ mm/mempolicy.c | 223 ++++++++++++++++++ 3 files changed, 252 insertions(+) create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy new file mode 100644 index 000000000000..8ac327fd7fb6 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy @@ -0,0 +1,4 @@ +What: /sys/kernel/mm/mempolicy/ +Date: January 2024 +Contact: Linux memory management mailing list +Description: Interface for Mempolicy diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave new file mode 100644 index 000000000000..0b7972de04e9 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave @@ -0,0 +1,25 @@ +What: /sys/kernel/mm/mempolicy/weighted_interleave/ +Date: January 2024 +Contact: Linux memory management mailing list +Description: Configuration Interface for the Weighted Interleave policy + +What: /sys/kernel/mm/mempolicy/weighted_interleave/nodeN +Date: January 2024 +Contact: Linux memory management mailing list +Description: Weight configuration interface for nodeN + + The interleave weight for a memory node (N). These weights are + utilized by tasks which have set their mempolicy to + MPOL_WEIGHTED_INTERLEAVE. + + These weights only affect new allocations, and changes at runtime + will not cause migrations on already allocated pages. + + The minimum weight for a node is always 1. + + Minimum weight: 1 + Maximum weight: 255 + + Writing an empty string or `0` will reset the weight to the + system default. The system default may be set by the kernel + or drivers at boot or during hotplug events. diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 10a590ee1c89..41e58c4c0d01 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -131,6 +131,32 @@ static struct mempolicy default_policy = { static struct mempolicy preferred_node_policy[MAX_NUMNODES]; +/* + * iw_table is the sysfs-set interleave weight table, a value of 0 denotes + * system-default value should be used. A NULL iw_table also denotes that + * system-default values should be used. Until the system-default table + * is implemented, the system-default is always 1. + * + * iw_table is RCU protected + */ +static u8 __rcu *iw_table; +static DEFINE_MUTEX(iw_table_lock); + +static u8 get_il_weight(int node) +{ + u8 *table; + u8 weight; + + rcu_read_lock(); + table = rcu_dereference(iw_table); + /* if no iw_table, use system default */ + weight = table ? table[node] : 1; + /* if value in iw_table is 0, use system default */ + weight = weight ? weight : 1; + rcu_read_unlock(); + return weight; +} + /** * numa_nearest_node - Find nearest node by state * @node: Node id to start the search @@ -3067,3 +3093,200 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) p += scnprintf(p, buffer + maxlen - p, ":%*pbl", nodemask_pr_args(&nodes)); } + +#ifdef CONFIG_SYSFS +struct iw_node_attr { + struct kobj_attribute kobj_attr; + int nid; +}; + +static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr, + char *buf) +{ + struct iw_node_attr *node_attr; + u8 weight; + + node_attr = container_of(attr, struct iw_node_attr, kobj_attr); + weight = get_il_weight(node_attr->nid); + return sysfs_emit(buf, "%d\n", weight); +} + +static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr, + const char *buf, size_t count) +{ + struct iw_node_attr *node_attr; + u8 *new; + u8 *old; + u8 weight = 0; + + node_attr = container_of(attr, struct iw_node_attr, kobj_attr); + if (count == 0 || sysfs_streq(buf, "")) + weight = 0; + else if (kstrtou8(buf, 0, &weight)) + return -EINVAL; + + new = kzalloc(nr_node_ids, GFP_KERNEL); + if (!new) + return -ENOMEM; + + mutex_lock(&iw_table_lock); + old = rcu_dereference_protected(iw_table, + lockdep_is_held(&iw_table_lock)); + if (old) + memcpy(new, old, nr_node_ids); + new[node_attr->nid] = weight; + rcu_assign_pointer(iw_table, new); + mutex_unlock(&iw_table_lock); + synchronize_rcu(); + kfree(old); + return count; +} + +static struct iw_node_attr **node_attrs; + +static void sysfs_wi_node_release(struct iw_node_attr *node_attr, + struct kobject *parent) +{ + if (!node_attr) + return; + sysfs_remove_file(parent, &node_attr->kobj_attr.attr); + kfree(node_attr->kobj_attr.attr.name); + kfree(node_attr); +} + +static void sysfs_wi_release(struct kobject *wi_kobj) +{ + int i; + + for (i = 0; i < nr_node_ids; i++) + sysfs_wi_node_release(node_attrs[i], wi_kobj); + kobject_put(wi_kobj); +} + +static const struct kobj_type wi_ktype = { + .sysfs_ops = &kobj_sysfs_ops, + .release = sysfs_wi_release, +}; + +static int add_weight_node(int nid, struct kobject *wi_kobj) +{ + struct iw_node_attr *node_attr; + char *name; + + node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL); + if (!node_attr) + return -ENOMEM; + + name = kasprintf(GFP_KERNEL, "node%d", nid); + if (!name) { + kfree(node_attr); + return -ENOMEM; + } + + sysfs_attr_init(&node_attr->kobj_attr.attr); + node_attr->kobj_attr.attr.name = name; + node_attr->kobj_attr.attr.mode = 0644; + node_attr->kobj_attr.show = node_show; + node_attr->kobj_attr.store = node_store; + node_attr->nid = nid; + + if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) { + kfree(node_attr->kobj_attr.attr.name); + kfree(node_attr); + pr_err("failed to add attribute to weighted_interleave\n"); + return -ENOMEM; + } + + node_attrs[nid] = node_attr; + return 0; +} + +static int add_weighted_interleave_group(struct kobject *root_kobj) +{ + struct kobject *wi_kobj; + int nid, err; + + wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL); + if (!wi_kobj) + return -ENOMEM; + + err = kobject_init_and_add(wi_kobj, &wi_ktype, root_kobj, + "weighted_interleave"); + if (err) { + kfree(wi_kobj); + return err; + } + + for_each_node_state(nid, N_POSSIBLE) { + err = add_weight_node(nid, wi_kobj); + if (err) { + pr_err("failed to add sysfs [node%d]\n", nid); + break; + } + } + if (err) + kobject_put(wi_kobj); + return 0; +} + +static void mempolicy_kobj_release(struct kobject *kobj) +{ + u8 *old; + + mutex_lock(&iw_table_lock); + old = rcu_dereference_protected(iw_table, + lockdep_is_held(&iw_table_lock)); + rcu_assign_pointer(iw_table, NULL); + mutex_unlock(&iw_table_lock); + synchronize_rcu(); + kfree(old); + kfree(node_attrs); + kfree(kobj); +} + +static const struct kobj_type mempolicy_ktype = { + .release = mempolicy_kobj_release +}; + +static int __init mempolicy_sysfs_init(void) +{ + int err; + static struct kobject *mempolicy_kobj; + + mempolicy_kobj = kzalloc(sizeof(*mempolicy_kobj), GFP_KERNEL); + if (!mempolicy_kobj) { + err = -ENOMEM; + goto err_out; + } + + node_attrs = kcalloc(nr_node_ids, sizeof(struct iw_node_attr *), + GFP_KERNEL); + if (!node_attrs) { + err = -ENOMEM; + goto mempol_out; + } + + err = kobject_init_and_add(mempolicy_kobj, &mempolicy_ktype, mm_kobj, + "mempolicy"); + if (err) + goto node_out; + + err = add_weighted_interleave_group(mempolicy_kobj); + if (err) { + pr_err("mempolicy sysfs structure failed to initialize\n"); + kobject_put(mempolicy_kobj); + return err; + } + + return err; +node_out: + kfree(node_attrs); +mempol_out: + kfree(mempolicy_kobj); +err_out: + pr_err("failed to add mempolicy kobject to the system\n"); + return err; +} + +late_initcall(mempolicy_sysfs_init); +#endif /* CONFIG_SYSFS */ From patchwork Fri Feb 2 17:02:36 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gregory Price X-Patchwork-Id: 13543175 Received: from mail-pg1-f196.google.com (mail-pg1-f196.google.com [209.85.215.196]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D3DAA14A4CC; Fri, 2 Feb 2024 17:02:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.196 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706893375; cv=none; b=YdtH+L5hCuIbqo/cILUIHSikrjkQsxpu4Cpoa8pH9NYvM444oqEnOtVKnqZGjYrDlTFOEqNVmvL5rCt7u89WTSXaj9L1aUcbFMsh2pfMCVw960n7Ode/EaoW7Na4iDQU4fHpa961M1XjbBMBc8rBYsKXEoxM10MeZ9l00SVwrB0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706893375; c=relaxed/simple; bh=auQoPWvfixbKGhEErYzSn6fvQh0PFU2VbL8KK5Ns2f0=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=KvEk5g+3zNstvYTh599oV8PC3PDRlazk/k7nHkyRlGEWA5Ix9IG9tKebpBv2cL0HA0LJfscme1faNYWdSmtb3iB8sqpj/oZ4aAcvqhmPxlUaV23N/qzHDFEEDQc2dcG22azwKXer0/Hj3lcl751bau+iiRZiH1y01ub3ZWR5e9Y= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=lcL/MHkO; arc=none smtp.client-ip=209.85.215.196 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="lcL/MHkO" Received: by mail-pg1-f196.google.com with SMTP id 41be03b00d2f7-5d8b519e438so2260657a12.1; Fri, 02 Feb 2024 09:02:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1706893373; x=1707498173; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=5AF6y+MlS2Aaikvfug37kciivlA8KerG3tHxNDefUBs=; b=lcL/MHkOh05MtiQAg5A1r5/tG8COkihFG8FqrzZeaHzu6n3bfWAlDTrWo2J3hw/wWN dBThYCnz0/AcVUolKJRx5RfKBKal4APPSTRmFVWARXDNra+0feVbEiZ47M7au9Vzqv8m vswKICzbdmr7ROkm/pWDuLmQALd1vFPp9RuVWVrCM/AMQRsnM0BvCY+keB1DdQjG5Vqs +3PLcYS6Oi4Wz6lejYJSItaDiHFBr5t1khbTPjxs+8zQAB7hlSrtrPrCtE8TPyxHPn26 RnjW7XjSoQqRngbMbtGWUZWpEt5ZhOp3yIjehDn6rGdkh8C1kCgPq1+B+T7fzG+5bC6F yJLw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706893373; x=1707498173; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=5AF6y+MlS2Aaikvfug37kciivlA8KerG3tHxNDefUBs=; b=V28FngkLPO6PwiJ3qQuiaQ1Spgg2DwkTYPrPPtZhxiKXLwxN2clryZVJvAZT3HmGja eA2HSh1hOHJRXLXYfYGMbHQcPMb+GqPD0EklmW6CfiQGoTLcE77OeArATyF2aAnj0we9 99RMIjRvWTK3HfM4JdQPJoch7/vcTgg+vtpbUAaF9m6LGO0vuZLC0cTuaBg8YhpjRLlL oRFrBKWNm5NvDRmrvjCwGaatVLN9AhPm2v+zBu7vy+2enIYcUpOZTn3iA+64D3wOG9dN rGAKuznwScTIt1+yiT/8fibJ7jXNlFwBwIJMI/+O7ktGkJHj2oUe7mR88eUvnYHjx+gl k5aA== X-Gm-Message-State: AOJu0YyxC/LTgVaqoEtGOyIyPg46WKCcVXlg2AyZ2s+EDo0AjGHCccw2 boJLZ1YiwSqfEGUl0CQFMPpu12XWOB56Kz1Y7ix8TjH6kIm1VdU= X-Google-Smtp-Source: AGHT+IGEuQgxleUGw2gTIz9dWiT2Ix+7OW0Ula2oMbyOkn4BnUCm750NIoGlS3+JmLjBQmBSRbQCPg== X-Received: by 2002:a05:6a20:11a5:b0:19c:a03f:9e5b with SMTP id v37-20020a056a2011a500b0019ca03f9e5bmr5472769pze.5.1706893373148; Fri, 02 Feb 2024 09:02:53 -0800 (PST) X-Forwarded-Encrypted: i=0; AJvYcCWxSeV0i/3cN4FkA1qBO1h6KJWwIQ3BOVFyBkzVfgpbI4zhiU+/ry+RXGd2YHfKOFGBzl679cwDxXk5cqAvgFrLqJ8O1KZXD1Q9EpnRdpm9sXcO8cisT+CHJtr5h/kDNROOF9Va7TjOLDkjaNKd3cqlWA+gy03VX/OlvH5GnHBeR1Gt8IEuBFg99qs7oWP6KwV0lu2lZAIOhl/jrWgoYjteDCd9eRR3SAXyzTTxNSidXpdoeLXuw2fYaN7t24eZGHjxJdIw3Jt/8DZvEQuRtIYRqI2CPkPFHbfeTPR/3xLlVtpPLXy+Xu6aVvWLB0fQ5U0Dxj8we/t9n9jJTkFOBwI+VcZlKD+HhZ4OLz20he5eFmQU+uwn/2HaElnr+Rbyf6+1o27Z39KUA7wphCy8EV2tylhXnl1rnPwVIskJkZ821xQmnsjd58oZDz7LH/tygJzxdoIOfpKMslnBUlBtvZ9uNS+yJiancVN5G5vUOEGuu1AAysUYar0K1cUreuu7jYnaDHcDYfLrH0x24Z/JJ9GgYh70jaKsJRu1XDCa25tYEbH+AYBkbBDh4w6Z/gIi5RkEvYlVV3BVaYTdbxs0yWEIpS5m2prUcz/8nWAigpUqhtLAlr+8EnektxAbM/I+rbmlyZ6HA1TchyHctsmFmlNHQDnLwpEBrpQUhvQHTVs= Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id z22-20020aa785d6000000b006ddddc7701fsm1866578pfn.4.2024.02.02.09.02.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 02 Feb 2024 09:02:52 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, corbet@lwn.net, akpm@linux-foundation.org, gregory.price@memverge.com, honggyu.kim@sk.com, rakie.kim@sk.com, hyeongtak.ji@sk.com, mhocko@kernel.org, ying.huang@intel.com, vtavarespetr@micron.com, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, hannes@cmpxchg.org, dan.j.williams@intel.com Subject: [PATCH v5 2/4] mm/mempolicy: refactor a read-once mechanism into a function for re-use Date: Fri, 2 Feb 2024 12:02:36 -0500 Message-Id: <20240202170238.90004-3-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20240202170238.90004-1-gregory.price@memverge.com> References: <20240202170238.90004-1-gregory.price@memverge.com> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 move the use of barrier() to force policy->nodemask onto the stack into a function `read_once_policy_nodemask` so that it may be re-used. Suggested-by: "Huang, Ying" Signed-off-by: Gregory Price Reviewed-by: "Huang, Ying" --- mm/mempolicy.c | 26 ++++++++++++++++---------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 41e58c4c0d01..697f2a791c24 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1909,6 +1909,20 @@ unsigned int mempolicy_slab_node(void) } } +static unsigned int read_once_policy_nodemask(struct mempolicy *pol, + nodemask_t *mask) +{ + /* + * barrier stabilizes the nodemask locally so that it can be iterated + * over safely without concern for changes. Allocators validate node + * selection does not violate mems_allowed, so this is safe. + */ + barrier(); + memcpy(mask, &pol->nodes, sizeof(nodemask_t)); + barrier(); + return nodes_weight(*mask); +} + /* * Do static interleaving for interleave index @ilx. Returns the ilx'th * node in pol->nodes (starting from ilx=0), wrapping around if ilx @@ -1916,20 +1930,12 @@ unsigned int mempolicy_slab_node(void) */ static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx) { - nodemask_t nodemask = pol->nodes; + nodemask_t nodemask; unsigned int target, nnodes; int i; int nid; - /* - * The barrier will stabilize the nodemask in a register or on - * the stack so that it will stop changing under the code. - * - * Between first_node() and next_node(), pol->nodes could be changed - * by other threads. So we put pol->nodes in a local stack. - */ - barrier(); - nnodes = nodes_weight(nodemask); + nnodes = read_once_policy_nodemask(pol, &nodemask); if (!nnodes) return numa_node_id(); target = ilx % nnodes; From patchwork Fri Feb 2 17:02:37 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gregory Price X-Patchwork-Id: 13543176 Received: from mail-pf1-f195.google.com (mail-pf1-f195.google.com [209.85.210.195]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7444614A4DD; Fri, 2 Feb 2024 17:02:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.195 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706893380; cv=none; b=knC6Wsdy+CVFnYxoSBFujeDla7/PHk9msa32L7Sl2V//+KIieHCAXzZisNfOs1zWHCgcRP0LExmkCEif/sf5vKlB+9qtf9K7zXfuq0JzDGcTM6Usm9UH23tvevgXqUMCA91niyRzjVjRUYl78tZqyKKQt8gxWZO9VkDDgaL9Yto= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706893380; c=relaxed/simple; bh=xdGg0A0ROiM9rosaVuY9/qbt+h88JFMyUZFBRqGDHtQ=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=u/K8rgT4vVETurHcvGQ7beb0YaqkmEAQAIbJ1bulYmhJk3ke0cIp2WrrupgyzSEaaW5qKThrExG5NjHeaBaJjlIobHeBszjf9zBZU09p6+jqKXXRMy8vaF+IM6jMDEcCUPkqMcTChBJ9+RM/FLJzjauWG/7p2tVAbV5ObOdeJTo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=OhgYhySa; arc=none smtp.client-ip=209.85.210.195 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="OhgYhySa" Received: by mail-pf1-f195.google.com with SMTP id d2e1a72fcca58-6ddd1fc67d2so1761379b3a.2; Fri, 02 Feb 2024 09:02:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1706893377; x=1707498177; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=HDBhiCh0hHoxepd+zDOKD4lrTU3D/afSJsCELhYrrcE=; b=OhgYhySa8OsX/uopilebGXTud24Htsh44cYOm4FO6ODvqvuAtuKa5kfZ2A7bVokuO3 0VKFVmOyP8CziJFiM12yKB9sjcL1vczny0EVvP4Og4/bfq9hlxTRYnSc94N/vAx+DPA/ rOZw0FtpF5Gs7T5je+3bstizIq5+XTdgWjcOsehI3BwMUCz2ZbCjvUsM52DIjtiYdZ4Q DkL9qW4KX1zmncrWIJwikA6Rh8l9RA6+Me4XQPa3ztrVu+gTC65+l9N+le65jbbbkNaU zEITlNgwL9pmKtQKq+R0gozVi6FDnFCTGMryPJtfxLjXysB+iADZhi+G9SZDODNkmIu7 GP+g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706893377; x=1707498177; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=HDBhiCh0hHoxepd+zDOKD4lrTU3D/afSJsCELhYrrcE=; b=TFD+NcQ6bsEct9W4c5UNNqDDzyl52NJGP5NbXdr66SZkH+nO16ySXbjUKjillbgR8C z0ledxama6dEhizE5O4bgkpaTzMrlB/GAgLO+2CYkJT8nbuC1q7SNehWJp++1Iv2JgcV T719LsoaNRBArSuVaAPY5w23AXOnTT4X2A1gJgGlwr/ZU0qI4t2lDDP1aL4vbwVBK20v I198TFKhxLh7rJizJdgTsOIUQzLrIaGqA7VoM+r4W+yeVG3xZxQNi28pCm9DGb94wk7k AGNb7wkYj5dWqRQULFQujqIKqyJuO40AzG1py1vrArp1MkG56uLI4Ch6TPWSebw9jY/x n3FQ== X-Gm-Message-State: AOJu0YyF4BKCkm0jQotKv46eeZmpw9hatIUC3oSC4661g/pk2h2N6fnU 2VEUE6kXagvgd2RY8O6PcJ7DUvJ+kZPsPKSLYxo6fN1cFEHKqpo= X-Google-Smtp-Source: AGHT+IFhtd0rxGxid2/9EaPZeefmEXggfOG4X65lz/usfgNqp1EtoV4AY/SWcQdHXjzQc50no5HWbw== X-Received: by 2002:a05:6a00:1881:b0:6dd:8743:9e03 with SMTP id x1-20020a056a00188100b006dd87439e03mr9543905pfh.34.1706893377362; Fri, 02 Feb 2024 09:02:57 -0800 (PST) X-Forwarded-Encrypted: i=0; AJvYcCW/3MTQ+dQRV6CKB67MD50G0nS4UHSREnq7zVe5/FqtccGDhN7mWVkaxmEw6L0z9aJKr8JTZqfPj/e8bhuLSVh8DedjAOeTuUZuUE54tLid5LttTE6b+nSLDnzqr2LP7UXJJx87FQr81zRteVUxt4gudQoB1ECzUx96HYdx0puG7ekwynBliQ+pjsaE3ol0AqrpAZ/EzqPZS+eyspjoTXu+QfzYDGzVKs3X3BMOhcn8e1BHIrAKgyMyjgfDWwUvoOusnXiR8JiWwS+2pWfUA9VdfMiZjRLKwQJyzn+i1ehhBmXY1ncDqjL4HmzZzqGA4fBGU0QYh2k5paUl3yURYT2o4wtjufQCWVO+8IGrgkMWa+f188k8/1MpxU+VYh+C8JLNTgpciOkLPFdRfiWHq7OWLW5DsvrQ/DBEnpJu/kd0mvgYEFDp1R0iLP26ERhhQEG+AdVKU2DOWdnSZJB4pOouAfCyM7nMGN+BsPdkmEKltt0a57r7tDb1CjzXPXjxDcKwQR7m+JFBk5CAsS00t8O+qI/YDQtMPkFX2l3+wbG1vubpURupCBvJ5ukcj8qJUTEpxf8E9Ycp/oXj9PAdHCX5dQMPgOpfyioAkqE3RYzOcOoJZ6GNQFRTCVxxSNAJvpG4wym5d/SArSC8qmqh2+7G5nIVTMr0MRCYb+/N3e00GA2JxT3EylwoltH2WVyB8TXBg6JiTG1xdekFraMH Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id z22-20020aa785d6000000b006ddddc7701fsm1866578pfn.4.2024.02.02.09.02.54 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 02 Feb 2024 09:02:57 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, corbet@lwn.net, akpm@linux-foundation.org, gregory.price@memverge.com, honggyu.kim@sk.com, rakie.kim@sk.com, hyeongtak.ji@sk.com, mhocko@kernel.org, ying.huang@intel.com, vtavarespetr@micron.com, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, hannes@cmpxchg.org, dan.j.williams@intel.com, Srinivasulu Thanneeru Subject: [PATCH v5 3/4] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving Date: Fri, 2 Feb 2024 12:02:37 -0500 Message-Id: <20240202170238.90004-4-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20240202170238.90004-1-gregory.price@memverge.com> References: <20240202170238.90004-1-gregory.price@memverge.com> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 When a system has multiple NUMA nodes and it becomes bandwidth hungry, using the current MPOL_INTERLEAVE could be an wise option. However, if those NUMA nodes consist of different types of memory such as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based interleave policy does not optimally distribute data to make use of their different bandwidth characteristics. Instead, interleave is more effective when the allocation policy follows each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution. This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, enabling weighted interleave between NUMA nodes. Weighted interleave allows for proportional distribution of memory across multiple numa nodes, preferably apportioned to match the bandwidth of each node. For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight distribution is (2:1). Weights for each node can be assigned via the new sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/ For now, the default value of all nodes will be `1`, which matches the behavior of standard 1:1 round-robin interleave. An extension will be added in the future to allow default values to be registered at kernel and device bringup time. The policy allocates a number of pages equal to the set weights. For example, if the weights are (2,1), then 2 pages will be allocated on node0 for every 1 page allocated on node1. The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) and mbind(2). Some high level notes about the pieces of weighted interleave: current->il_prev: Tracks the node previously allocated from. current->il_weight: The active weight of the current node (current->il_prev) When this reaches 0, current->il_prev is set to the next node and current->il_weight is set to the next weight. weighted_interleave_nodes: Counts the number of allocations as they occur, and applies the weight for the current node. When the weight reaches 0, switch to the next node. Operates only on task->mempolicy. weighted_interleave_nid: Gets the total weight of the nodemask as well as each individual node weight, then calculates the node based on the given index. Operates on VMA policies. bulk_array_weighted_interleave: Gets the total weight of the nodemask as well as each individual node weight, then calculates the number of "interleave rounds" as well as any delta ("partial round"). Calculates the number of pages for each node and allocates them. If a node was scheduled for interleave via interleave_nodes, the current weight will be allocated first. Operates only on the task->mempolicy. One piece of complexity is the interaction between a recent refactor which split the logic to acquire the "ilx" (interleave index) of an allocation and the actually application of the interleave. If a call to alloc_pages_mpol() were made with a weighted-interleave policy and ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA policy - violating the description above. An inspection of all callers of alloc_pages_mpol() shows that all external callers set ilx to `0`, an index value, or will call get_vma_policy() to acquire the ilx. For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks all set (pgoff_t ilx) or end up in `get_vma_policy()`. This enforces the `weighted_interleave_nodes()` and `weighted_interleave_nid()` policy requirements (task/vma respectively). Suggested-by: Hasan Al Maruf Signed-off-by: Gregory Price Co-developed-by: Rakie Kim Signed-off-by: Rakie Kim Co-developed-by: Honggyu Kim Signed-off-by: Honggyu Kim Co-developed-by: Hyeongtak Ji Signed-off-by: Hyeongtak Ji Co-developed-by: Srinivasulu Thanneeru Signed-off-by: Srinivasulu Thanneeru Co-developed-by: Ravi Jonnalagadda Signed-off-by: Ravi Jonnalagadda Reviewed-by: "Huang, Ying" --- .../admin-guide/mm/numa_memory_policy.rst | 9 + include/linux/sched.h | 1 + include/uapi/linux/mempolicy.h | 1 + mm/mempolicy.c | 218 +++++++++++++++++- 4 files changed, 225 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index eca38fa81e0f..a70f20ce1ffb 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY can fall back to all existing numa nodes. This is effectively MPOL_PREFERRED allowed for a mask rather than a single node. +MPOL_WEIGHTED_INTERLEAVE + This mode operates the same as MPOL_INTERLEAVE, except that + interleaving behavior is executed based on weights set in + /sys/kernel/mm/mempolicy/weighted_interleave/ + + Weighted interleave allocates pages on nodes according to a + weight. For example if nodes [0,1] are weighted [5,2], 5 pages + will be allocated on node0 for every 2 pages allocated on node1. + NUMA memory policy supports the following optional mode flags: MPOL_F_STATIC_NODES diff --git a/include/linux/sched.h b/include/linux/sched.h index ffe8f618ab86..b9ce285d8c9c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1259,6 +1259,7 @@ struct task_struct { /* Protected by alloc_lock: */ struct mempolicy *mempolicy; short il_prev; + u8 il_weight; short pref_node_fork; #endif #ifdef CONFIG_NUMA_BALANCING diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index a8963f7ef4c2..1f9bb10d1a47 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -23,6 +23,7 @@ enum { MPOL_INTERLEAVE, MPOL_LOCAL, MPOL_PREFERRED_MANY, + MPOL_WEIGHTED_INTERLEAVE, MPOL_MAX, /* always last member of enum */ }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 697f2a791c24..d8cc3a577986 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -19,6 +19,13 @@ * for anonymous memory. For process policy an process counter * is used. * + * weighted interleave + * Allocate memory interleaved over a set of nodes based on + * a set of weights (per-node), with normal fallback if it + * fails. Otherwise operates the same as interleave. + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated + * on node 0 for every 1 page allocated on node 1. + * * bind Only allocate memory on a specific set of nodes, * no fallback. * FIXME: memory is allocated starting with the first node @@ -441,6 +448,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { .create = mpol_new_nodemask, .rebind = mpol_rebind_preferred, }, + [MPOL_WEIGHTED_INTERLEAVE] = { + .create = mpol_new_nodemask, + .rebind = mpol_rebind_nodemask, + }, }; static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, @@ -862,8 +873,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, old = current->mempolicy; current->mempolicy = new; - if (new && new->mode == MPOL_INTERLEAVE) + if (new && (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_WEIGHTED_INTERLEAVE)) { current->il_prev = MAX_NUMNODES-1; + current->il_weight = 0; + } task_unlock(current); mpol_put(old); ret = 0; @@ -888,6 +902,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: *nodes = pol->nodes; break; case MPOL_LOCAL: @@ -972,6 +987,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, } else if (pol == current->mempolicy && pol->mode == MPOL_INTERLEAVE) { *policy = next_node_in(current->il_prev, pol->nodes); + } else if (pol == current->mempolicy && + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { + if (current->il_weight) + *policy = current->il_prev; + else + *policy = next_node_in(current->il_prev, + pol->nodes); } else { err = -EINVAL; goto out; @@ -1336,7 +1358,8 @@ static long do_mbind(unsigned long start, unsigned long len, * VMAs, the nodes will still be interleaved from the targeted * nodemask, but one by one may be selected differently. */ - if (new->mode == MPOL_INTERLEAVE) { + if (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_WEIGHTED_INTERLEAVE) { struct page *page; unsigned int order; unsigned long addr = -EFAULT; @@ -1784,7 +1807,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, * @vma: virtual memory area whose policy is sought * @addr: address in @vma for shared policy lookup * @order: 0, or appropriate huge_page_order for interleaving - * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE + * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or + * MPOL_WEIGHTED_INTERLEAVE * * Returns effective policy for a VMA at specified address. * Falls back to current->mempolicy or system default policy, as necessary. @@ -1801,7 +1825,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma, pol = __get_vma_policy(vma, addr, ilx); if (!pol) pol = get_task_policy(current); - if (pol->mode == MPOL_INTERLEAVE) { + if (pol->mode == MPOL_INTERLEAVE || + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { *ilx += vma->vm_pgoff >> order; *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); } @@ -1851,6 +1876,22 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) return zone >= dynamic_policy_zone; } +static unsigned int weighted_interleave_nodes(struct mempolicy *policy) +{ + unsigned int node = current->il_prev; + + if (!current->il_weight || !node_isset(node, policy->nodes)) { + node = next_node_in(node, policy->nodes); + /* can only happen if nodemask is being rebound */ + if (node == MAX_NUMNODES) + return node; + current->il_prev = node; + current->il_weight = get_il_weight(node); + } + current->il_weight--; + return node; +} + /* Do dynamic interleaving for a process */ static unsigned int interleave_nodes(struct mempolicy *policy) { @@ -1885,6 +1926,9 @@ unsigned int mempolicy_slab_node(void) case MPOL_INTERLEAVE: return interleave_nodes(policy); + case MPOL_WEIGHTED_INTERLEAVE: + return weighted_interleave_nodes(policy); + case MPOL_BIND: case MPOL_PREFERRED_MANY: { @@ -1923,6 +1967,45 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol, return nodes_weight(*mask); } +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) +{ + nodemask_t nodemask; + unsigned int target, nr_nodes; + u8 *table; + unsigned int weight_total = 0; + u8 weight; + int nid; + + nr_nodes = read_once_policy_nodemask(pol, &nodemask); + if (!nr_nodes) + return numa_node_id(); + + rcu_read_lock(); + table = rcu_dereference(iw_table); + /* calculate the total weight */ + for_each_node_mask(nid, nodemask) { + /* detect system default usage */ + weight = table ? table[nid] : 1; + weight = weight ? weight : 1; + weight_total += weight; + } + + /* Calculate the node offset based on totals */ + target = ilx % weight_total; + nid = first_node(nodemask); + while (target) { + /* detect system default usage */ + weight = table ? table[nid] : 1; + weight = weight ? weight : 1; + if (target < weight) + break; + target -= weight; + nid = next_node_in(nid, nodemask); + } + rcu_read_unlock(); + return nid; +} + /* * Do static interleaving for interleave index @ilx. Returns the ilx'th * node in pol->nodes (starting from ilx=0), wrapping around if ilx @@ -1983,6 +2066,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, *nid = (ilx == NO_INTERLEAVE_INDEX) ? interleave_nodes(pol) : interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + *nid = (ilx == NO_INTERLEAVE_INDEX) ? + weighted_interleave_nodes(pol) : + weighted_interleave_nid(pol, ilx); + break; } return nodemask; @@ -2044,6 +2132,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: *mask = mempolicy->nodes; break; @@ -2144,6 +2233,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, * node in its nodemask, we allocate the standard way. */ if (pol->mode != MPOL_INTERLEAVE && + pol->mode != MPOL_WEIGHTED_INTERLEAVE && (!nodemask || node_isset(nid, *nodemask))) { /* * First, try to allocate THP only on local node, but @@ -2279,6 +2369,114 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp, return total_allocated; } +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, + struct mempolicy *pol, unsigned long nr_pages, + struct page **page_array) +{ + struct task_struct *me = current; + unsigned long total_allocated = 0; + unsigned long nr_allocated = 0; + unsigned long rounds; + unsigned long node_pages, delta; + u8 *table, *weights, weight; + unsigned int weight_total = 0; + unsigned long rem_pages = nr_pages; + nodemask_t nodes; + int nnodes, node; + int resume_node = MAX_NUMNODES - 1; + u8 resume_weight = 0; + int prev_node; + int i; + + if (!nr_pages) + return 0; + + nnodes = read_once_policy_nodemask(pol, &nodes); + if (!nnodes) + return 0; + + /* Continue allocating from most recent node and adjust the nr_pages */ + node = me->il_prev; + weight = me->il_weight; + if (weight && node_isset(node, nodes)) { + node_pages = min(rem_pages, weight); + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + /* if that's all the pages, no need to interleave */ + if (rem_pages <= weight) { + me->il_weight -= rem_pages; + return total_allocated; + } + /* Otherwise we adjust remaining pages, continue from there */ + rem_pages -= weight; + } + /* clear active weight in case of an allocation failure */ + me->il_weight = 0; + prev_node = node; + + /* create a local copy of node weights to operate on outside rcu */ + weights = kzalloc(nr_node_ids, GFP_KERNEL); + if (!weights) + return total_allocated; + + rcu_read_lock(); + table = rcu_dereference(iw_table); + if (table) + memcpy(weights, table, nr_node_ids); + rcu_read_unlock(); + + /* calculate total, detect system default usage */ + for_each_node_mask(node, nodes) { + if (!weights[node]) + weights[node] = 1; + weight_total += weights[node]; + } + + /* + * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls. + * Track which node weighted interleave should resume from. + * + * if (rounds > 0) and (delta == 0), resume_node will always be + * the node following prev_node and its weight. + */ + rounds = rem_pages / weight_total; + delta = rem_pages % weight_total; + resume_node = next_node_in(prev_node, nodes); + resume_weight = weights[resume_node]; + for (i = 0; i < nnodes; i++) { + node = next_node_in(prev_node, nodes); + weight = weights[node]; + node_pages = weight * rounds; + /* If a delta exists, add this node's portion of the delta */ + if (delta > weight) { + node_pages += weight; + delta -= weight; + } else if (delta) { + /* when delta is depleted, resume from that node */ + node_pages += delta; + resume_node = node; + resume_weight = weight - delta; + delta = 0; + } + /* node_pages can be 0 if an allocation fails and rounds == 0 */ + if (!node_pages) + break; + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + if (total_allocated == nr_pages) + break; + prev_node = node; + } + me->il_prev = resume_node; + me->il_weight = resume_weight; + kfree(weights); + return total_allocated; +} + static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid, struct mempolicy *pol, unsigned long nr_pages, struct page **page_array) @@ -2319,6 +2517,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, return alloc_pages_bulk_array_interleave(gfp, pol, nr_pages, page_array); + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE) + return alloc_pages_bulk_array_weighted_interleave( + gfp, pol, nr_pages, page_array); + if (pol->mode == MPOL_PREFERRED_MANY) return alloc_pages_bulk_array_preferred_many(gfp, numa_node_id(), pol, nr_pages, page_array); @@ -2394,6 +2596,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: return !!nodes_equal(a->nodes, b->nodes); case MPOL_LOCAL: return true; @@ -2530,6 +2733,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, polnid = interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + polnid = weighted_interleave_nid(pol, ilx); + break; + case MPOL_PREFERRED: if (node_isset(curnid, pol->nodes)) goto out; @@ -2904,6 +3111,7 @@ static const char * const policy_modes[] = [MPOL_PREFERRED] = "prefer", [MPOL_BIND] = "bind", [MPOL_INTERLEAVE] = "interleave", + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave", [MPOL_LOCAL] = "local", [MPOL_PREFERRED_MANY] = "prefer (many)", }; @@ -2963,6 +3171,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) } break; case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: /* * Default to online nodes with memory if no nodelist */ @@ -3073,6 +3282,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: nodes = pol->nodes; break; default: From patchwork Fri Feb 2 17:02:38 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gregory Price X-Patchwork-Id: 13543177 Received: from mail-pf1-f194.google.com (mail-pf1-f194.google.com [209.85.210.194]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E192D14A4E7; Fri, 2 Feb 2024 17:03:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.194 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706893383; cv=none; b=V8SdDNVu8mCLpm4eYc4FyjUS6TlNyzzSx5s9oxDCE6ZmwCrJu7XAE2ErskSc014ULfSDtCP5vtkjpb4yycu4LokK+YqvW9PUGF6ROZ7CSnFUE7atTojEC1XbENx4SQgHocH7qVu2toJ16bpll7P+PJRFD6M3m641h2h1udEU1rE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706893383; c=relaxed/simple; bh=WGXGeGTdrW7NoVRK7cYfyrjHnGc5NCJjZnCnYePfUZQ=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=ATl4mIuHJzYkeuTnof2LM3qsRVv9o4+qCHygIkHYJHDS0h3AxJqezry5FCXwPhdqFSDVl52SgX6W72lzyFN8q7McTB/q0Ne1zLZgHQ8Bg7wzZWfTguT+jK9685tCr8amM/mgSB2GfbD0ZM/m/HVS1bxtEA1M4vwyQd3wbtfJaMs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=WdIbO7Ou; arc=none smtp.client-ip=209.85.210.194 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="WdIbO7Ou" Received: by mail-pf1-f194.google.com with SMTP id d2e1a72fcca58-6dfc321c677so1724918b3a.3; Fri, 02 Feb 2024 09:03:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1706893381; x=1707498181; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=HhrTG5AYewh5phNKX1uPxhcy8QhnMld2uJe/mWMdTe0=; b=WdIbO7Ouw3Z3py0EEW5CcPqIlylJU5eowr5F68NhhsduW+K7P/0JlYNtzBhTRLyNmB 0/UDkimsfnoSj2r/vSfIattYQL5NwWnsWNUmdLMqST0uvnXPPJD+yOsyJzuDkX9DZq2T Gf2d6TjzKYTcKtvVn+6KJWpJT1cFn62m3k9aF7lkOUnDftZszjNV/y+0WAVdi7ZUWAo/ ngKzGNmmQW/yI7SWEkHwlspyAx1i+PhZaOGdeapzN1gZpqs+Uv0Hjg4O9AxzEe39Lb0U XW6XPyNF6g/I57BOhrAgzwh3y6T7wcsJ7BmKCeYgnPQ+qY9TRo8kc0IbqeZLv1EbdNb0 ZRnA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1706893381; x=1707498181; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=HhrTG5AYewh5phNKX1uPxhcy8QhnMld2uJe/mWMdTe0=; b=gMV/llJJKF+mm6g2aMYc89BpZ/onRKuL5VLgNjOknGshjj0T6jyMimOdUqw7J5tBII d+0R2wFlDOR5JXB3dAx3xMYBcfmq8zefCeLoqhRyKc7Q37g3ofXdb5TJPcxwdM1zLXUt q+ckRFS/yvyD5rLtvOIxUW2OIg8iF13rr1BasJnFbfBdAfwbzYfJSkb0WMCi4nxUiPOa iwyRzRAPOW0ADxgfZiyqZCiIrV9EGGnMV3LEiLmMP42BwDaxCbao54K/lKVu/rs4JjvA 80bg8NCmY83ZpCmNaPbfZxmFPXj3suBDH2K6OBzuaWV+H1wAR68Bk2qDSNOFWvsoH+O1 zQfg== X-Gm-Message-State: AOJu0YzuLpnAjFOBMv7wmS15We5UMAKkI0Y+ldHKI+xKI5xGmFsKFJR7 RqJGJD1JJXk9/78wHODYneTOnCpisYPsB69PhIO1bZNdjHcqyYQ= X-Google-Smtp-Source: AGHT+IFogIopQqwUpwpEFsFzdYwSc8X+34bpQwF6wcEJlxSEsQ2LoOTeQtLBa1R1mPNFivEZAibTKQ== X-Received: by 2002:aa7:8698:0:b0:6dd:c3b1:797e with SMTP id d24-20020aa78698000000b006ddc3b1797emr2585012pfo.19.1706893381133; Fri, 02 Feb 2024 09:03:01 -0800 (PST) X-Forwarded-Encrypted: i=0; AJvYcCVt1bpVi7/1usRCD00UfVroAEBs2kkA7ivPuJBFs7e4G+50XRtr/OkpAzFSMfFRAHdUrmWFaRgbmluqN9+kP6pYoISXSBVGj3gGym0kGZAPJqyC9zCnaxeKxoVqIg+YsKhcBnckhYr0jqT+88HBKWVVvKol1k9qKs6zfze27eMF8LMwoCCImKMQ1QboF2LNyUiQlGDhILX78kacwGH3CEFwJF4MjCCPH0WZzUCY/EqU11rSp+7VSCcPyhqnACvJLqIenX7ox5uk6g+0BRmKDyW+u0Kc96vapdvPzWcSfp5uXlV8r4+b1LugH1OLSTzxHTdLvSuzkh+bhPyPGuGqqkRp6Jq/WmFwjFmTpaVnvynSrPgg6EXt8aJXZqPdvgU1hCvlfMTDAyWGPgEprUT0uA7CjpAMGLx9aO2UBB/uFO4lpMVdoIkgg+FhMnZNXJxNugpKs6x6pYKxEzulQq0Qc4VtjlWtT3Ml2/G1MpSSw2VARz4d2EBU7mh4um+v+jw+7cNDhrTAWNgBx7r4cq/lD29zYteZKCs+AhvgyOw94L0HxtBf8Mh8WR9T8cZvnNh0uxpT4ozEnaaTYxN/bWhlS0dY1lsdRGSWlqopk5eik5zKkpWz/fv0D5jTdAaQ5ARjZhTX69CwPcQUtgzfBrZ0y9Ic17GPXYKjlKblDPYcsHw= Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id z22-20020aa785d6000000b006ddddc7701fsm1866578pfn.4.2024.02.02.09.02.58 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 02 Feb 2024 09:03:00 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, corbet@lwn.net, akpm@linux-foundation.org, gregory.price@memverge.com, honggyu.kim@sk.com, rakie.kim@sk.com, hyeongtak.ji@sk.com, mhocko@kernel.org, ying.huang@intel.com, vtavarespetr@micron.com, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com, seungjun.ha@samsung.com, hannes@cmpxchg.org, dan.j.williams@intel.com Subject: [PATCH v5 4/4] mm/mempolicy: protect task interleave functions with tsk->mems_allowed_seq Date: Fri, 2 Feb 2024 12:02:38 -0500 Message-Id: <20240202170238.90004-5-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20240202170238.90004-1-gregory.price@memverge.com> References: <20240202170238.90004-1-gregory.price@memverge.com> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 In the event of rebind, pol->nodemask can change at the same time as an allocation occurs. We can detect this with tsk->mems_allowed_seq and prevent a miscount or an allocation failure from occurring. The same thing happens in the allocators to detect failure, but this can prevent spurious failures in a much smaller critical section. Suggested-by: "Huang, Ying" Signed-off-by: Gregory Price --- mm/mempolicy.c | 31 +++++++++++++++++++++++++------ 1 file changed, 25 insertions(+), 6 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d8cc3a577986..ed0d5d2d456a 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) static unsigned int weighted_interleave_nodes(struct mempolicy *policy) { - unsigned int node = current->il_prev; - - if (!current->il_weight || !node_isset(node, policy->nodes)) { + unsigned int node; + unsigned int cpuset_mems_cookie; + +retry: + /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */ + cpuset_mems_cookie = read_mems_allowed_begin(); + node = current->il_prev; + if (!node || !node_isset(node, policy->nodes)) { node = next_node_in(node, policy->nodes); - /* can only happen if nodemask is being rebound */ + if (read_mems_allowed_retry(cpuset_mems_cookie)) + goto retry; if (node == MAX_NUMNODES) return node; current->il_prev = node; @@ -1896,8 +1902,14 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy) static unsigned int interleave_nodes(struct mempolicy *policy) { unsigned int nid; + unsigned int cpuset_mems_cookie; + + /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */ + do { + cpuset_mems_cookie = read_mems_allowed_begin(); + nid = next_node_in(current->il_prev, policy->nodes); + } while (read_mems_allowed_retry(cpuset_mems_cookie)); - nid = next_node_in(current->il_prev, policy->nodes); if (nid < MAX_NUMNODES) current->il_prev = nid; return nid; @@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, struct page **page_array) { struct task_struct *me = current; + unsigned int cpuset_mems_cookie; unsigned long total_allocated = 0; unsigned long nr_allocated = 0; unsigned long rounds; @@ -2391,7 +2404,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, if (!nr_pages) return 0; - nnodes = read_once_policy_nodemask(pol, &nodes); + /* read the nodes onto the stack, retry if done during rebind */ + do { + cpuset_mems_cookie = read_mems_allowed_begin(); + nnodes = read_once_policy_nodemask(pol, &nodes); + } while (read_mems_allowed_retry(cpuset_mems_cookie)); + + /* if the nodemask has become invalid, we cannot do anything */ if (!nnodes) return 0;