From patchwork Thu Dec 7 00:27:49 2023
X-Patchwork-Submitter: Gregory Price
X-Patchwork-Id: 13482516
From: Gregory Price
To: linux-mm@kvack.org, jgroves@micron.com, ravis.opensrc@micron.com,
    sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
    linux-kernel@vger.kernel.org, akpm@linux-foundation.org, arnd@arndb.de,
    tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de,
    dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
    mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com,
    gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com,
    hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com,
    peterz@infradead.org
Subject: [RFC PATCH 01/11] mm/mempolicy: implement the sysfs-based
 weighted_interleave interface
Date: Wed, 6 Dec 2023 19:27:49 -0500
Message-Id: <20231207002759.51418-2-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20231207002759.51418-1-gregory.price@memverge.com>
References: <20231207002759.51418-1-gregory.price@memverge.com>

From: Rakie Kim

This patch provides a way
to set interleave weight information under sysfs at:

  /sys/kernel/mm/mempolicy/weighted_interleave/node*/node*/weight

The sysfs structure is designed as follows.

  $ tree /sys/kernel/mm/mempolicy/
  /sys/kernel/mm/mempolicy/            [1]
  ├── cpu_nodes                        [2]
  ├── possible_nodes                   [3]
  └── weighted_interleave              [4]
      ├── node0                        [5]
      │   ├── node0                    [6]
      │   │   └── weight               [7]
      │   └── node1
      │       └── weight
      └── node1
          ├── node0
          │   └── weight
          └── node1
              └── weight

Each file above can be explained as follows.

[1] mm/mempolicy: configuration interface for the mempolicy subsystem

[2] cpu_nodes: informational interface listing the CPU nodes, which
    describes which nodes may generate sub-folders under each policy
    interface. For example, the weighted_interleave policy generates a
    nodeN folder for each CPU node.

[3] possible_nodes: informational interface listing the possible nodes,
    which may be used across multiple memory policy configurations. A
    `possible` node is one which has been reserved by the kernel at
    boot, but may or may not be online. For example, the
    weighted_interleave policy generates a nodeN/nodeM folder for each
    CPU node and memory node combination [N,M].

[4] weighted_interleave/: configuration interface for the weighted
    interleave policy

[5] weighted_interleave/nodeN/: initiator node configurations. Each CPU
    node receives its own weighting table, allowing (src,dst) weighting
    to be accomplished, where src is the CPU node the task is running
    on and dst is an index into that source node's array of weights.

[6] weighted_interleave/nodeN/nodeM/: memory node configurations

[7] weighted_interleave/nodeN/nodeM/weight: the weight for [N,M]. The
    weight table for nodeN can be programmed to weight each target
    (nodeM) differently. This is important for allowing re-weighting to
    occur automatically on a task migration event, whether via
    scheduler-initiated migration or a cgroup cpusets/mems_allowed
    policy change.
Signed-off-by: Rakie Kim
Signed-off-by: Honggyu Kim
Co-developed-by: Gregory Price
Signed-off-by: Gregory Price
Co-developed-by: Hyeongtak Ji
Signed-off-by: Hyeongtak Ji
---
 .../ABI/testing/sysfs-kernel-mm-mempolicy     |  33 +++
 ...fs-kernel-mm-mempolicy-weighted-interleave |  35 +++
 mm/mempolicy.c                                | 226 ++++++++++++++++++
 3 files changed, 294 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
new file mode 100644
index 000000000000..8dc1129d4ab1
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
@@ -0,0 +1,33 @@
+What:		/sys/kernel/mm/mempolicy/
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	Interface for Mempolicy
+
+What:		/sys/kernel/mm/mempolicy/cpu_nodes
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	The NUMA nodes from which accesses can be generated
+
+		A CPU NUMA node is one which has at least 1 CPU. These nodes
+		are capable of generating accesses to memory NUMA nodes, and
+		will have an interleave weight table.
+
+		Example output:
+
+		=====	=================================================
+		"0,1"	nodes 0 and 1 have CPUs which may generate access
+		=====	=================================================
+
+What:		/sys/kernel/mm/mempolicy/possible_nodes
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	The NUMA nodes which are possible to come online
+
+		A possible NUMA node is one which has been reserved by the
+		system at boot, but may or may not be online at runtime.
+
+		Example output:
+
+		=========	========================================
+		"0,1,2,3"	nodes 0-3 are possibly online or offline
+		=========	========================================
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
new file mode 100644
index 000000000000..75554895ede3
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -0,0 +1,35 @@
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	Configuration Interface for the Weighted Interleave policy
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN/
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	Configuration interface for accesses initiated from nodeN
+
+		The directory to configure access initiator weights for nodeN.
+
+		Possible NUMA nodes which have not been marked as a CPU node
+		at boot will not have a nodeN directory made for them at
+		boot. Hotplug for CPU nodes is not supported.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN/nodeM
+		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN/nodeM/weight
+Date:		December 2023
+Contact:	Linux memory management mailing list
+Description:	Configuration interface for target nodes accessed from nodeN
+
+		The interleave weight for a memory node (M) from initiating
+		node (N). These weights are utilized by processes which have
+		set their mempolicy to MPOL_WEIGHTED_INTERLEAVE and have opted
+		into global weights by omitting a task-local weight array.
+
+		These weights only affect new allocations, and changes at
+		runtime will not cause migrations on already allocated pages.
+
+		If a weight of 0 is desired, the appropriate way to do this is
+		by removing the node from the weighted interleave nodemask.
+
+		Minimum weight: 1
+		Maximum weight: 255
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..ce332b5e7a03 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -131,6 +131,11 @@ static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
+struct interleave_weight_table {
+	unsigned char weights[MAX_NUMNODES];
+};
+static struct interleave_weight_table *iw_table;
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -3067,3 +3072,224 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+struct iw_node_info {
+	struct kobject kobj;
+	int src;
+	int dst;
+};
+
+static ssize_t node_weight_show(struct kobject *kobj,
+				struct kobj_attribute *attr, char *buf)
+{
+	struct iw_node_info *node_info = container_of(kobj, struct iw_node_info,
+						      kobj);
+	return sysfs_emit(buf, "%d\n",
+			  iw_table[node_info->src].weights[node_info->dst]);
+}
+
+static ssize_t node_weight_store(struct kobject *kobj,
+				 struct kobj_attribute *attr,
+				 const char *buf, size_t count)
+{
+	unsigned char weight = 0;
+	struct iw_node_info *node_info = NULL;
+
+	node_info = container_of(kobj, struct iw_node_info, kobj);
+
+	if (kstrtou8(buf, 0, &weight) || !weight)
+		return -EINVAL;
+
+	iw_table[node_info->src].weights[node_info->dst] = weight;
+
+	return count;
+}
+
+static struct kobj_attribute node_weight =
+	__ATTR(weight, 0664, node_weight_show, node_weight_store);
+
+static struct attribute *dst_node_attrs[] = {
+	&node_weight.attr,
+	NULL,
+};
+
+static struct attribute_group dst_node_attr_group = {
+	.attrs = dst_node_attrs,
+};
+
+static const struct attribute_group *dst_node_attr_groups[] = {
+	&dst_node_attr_group,
+	NULL,
+};
+
+static const struct kobj_type dst_node_kobj_ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+	.default_groups = dst_node_attr_groups,
+};
+
+static int add_dst_node(int src, int dst, struct kobject *src_kobj)
+{
+	struct iw_node_info *node_info = NULL;
+	int ret;
+
+	node_info = kzalloc(sizeof(struct iw_node_info), GFP_KERNEL);
+	if (!node_info)
+		return -ENOMEM;
+	node_info->src = src;
+	node_info->dst = dst;
+
+	kobject_init(&node_info->kobj, &dst_node_kobj_ktype);
+	ret = kobject_add(&node_info->kobj, src_kobj, "node%d", dst);
+	if (ret) {
+		pr_err("kobject_add error [%d-node%d]: %d", src, dst, ret);
+		kobject_put(&node_info->kobj);
+	}
+	return ret;
+}
+
+static int add_src_node(int src, struct kobject *root_kobj)
+{
+	int err = 0, dst;
+	struct kobject *src_kobj;
+	char name[24];
+
+	snprintf(name, 24, "node%d", src);
+	src_kobj = kobject_create_and_add(name, root_kobj);
+	if (!src_kobj) {
+		pr_err("failed to create source node kobject\n");
+		return -ENOMEM;
+	}
+	for_each_node_state(dst, N_POSSIBLE) {
+		err = add_dst_node(src, dst, src_kobj);
+		if (err)
+			break;
+	}
+	if (err)
+		kobject_put(src_kobj);
+	return err;
+}
+
+static int add_weighted_interleave_group(struct kobject *root_kobj)
+{
+	struct kobject *wi_kobj;
+	int nid, err = 0;
+
+	wi_kobj = kobject_create_and_add("weighted_interleave", root_kobj);
+	if (!wi_kobj) {
+		pr_err("failed to create node kobject\n");
+		return -ENOMEM;
+	}
+
+	for_each_node_state(nid, N_CPU) {
+		err = add_src_node(nid, wi_kobj);
+		if (err) {
+			pr_err("failed to add sysfs [node%d]\n", nid);
+			break;
+		}
+	}
+	if (err)
+		kobject_put(wi_kobj);
+	return err;
+}
+
+static ssize_t cpu_nodes_show(struct kobject *kobj,
+			      struct kobj_attribute *attr, char *buf)
+{
+	int nid, next_nid;
+	int len = 0;
+
+	for_each_node_state(nid, N_CPU) {
+		len += sysfs_emit_at(buf, len, "%d", nid);
+		next_nid = next_node(nid, node_states[N_CPU]);
+		if (next_nid < MAX_NUMNODES)
+			len += sysfs_emit_at(buf, len, ",");
+	}
+	len += sysfs_emit_at(buf, len, "\n");
+
+	return len;
+}
+
+static ssize_t possible_nodes_show(struct kobject *kobj,
+				   struct kobj_attribute *attr, char *buf)
+{
+	int nid, next_nid;
+	int len = 0;
+
+	for_each_node_state(nid, N_POSSIBLE) {
+		len += sysfs_emit_at(buf, len, "%d", nid);
+		next_nid = next_node(nid, node_states[N_POSSIBLE]);
+		if (next_nid < MAX_NUMNODES)
+			len += sysfs_emit_at(buf, len, ",");
+	}
+	len += sysfs_emit_at(buf, len, "\n");
+
+	return len;
+}
+
+static struct kobj_attribute cpu_nodes_attr = __ATTR_RO(cpu_nodes);
+static struct kobj_attribute possible_nodes_attr = __ATTR_RO(possible_nodes);
+
+static struct attribute *mempolicy_attrs[] = {
+	&cpu_nodes_attr.attr,
+	&possible_nodes_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group mempolicy_attr_group = {
+	.attrs = mempolicy_attrs,
+};
+
+static void mempolicy_kobj_release(struct kobject *kobj)
+{
+	kfree(kobj);
+	kfree(iw_table);
+}
+
+static const struct kobj_type mempolicy_kobj_ktype = {
+	.release = mempolicy_kobj_release,
+	.sysfs_ops = &kobj_sysfs_ops,
+};
+
+static int __init mempolicy_sysfs_init(void)
+{
+	int err, nid;
+	int cpunodes = 0;
+	struct kobject *root_kobj;
+
+	for_each_node_state(nid, N_CPU)
+		cpunodes += 1;
+	iw_table = kmalloc_array(cpunodes, sizeof(*iw_table), GFP_KERNEL);
+	if (!iw_table) {
+		pr_err("failed to create interleave weight table\n");
+		return -ENOMEM;
+	}
+	memset(iw_table, 1, cpunodes * sizeof(*iw_table));
+
+	root_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
+	if (!root_kobj)
+		return -ENOMEM;
+
+	kobject_init(root_kobj, &mempolicy_kobj_ktype);
+	err = kobject_add(root_kobj, mm_kobj, "mempolicy");
+	if (err) {
+		pr_err("failed to add kobject to the system\n");
+		goto fail_obj;
+	}
+
+	err = sysfs_create_group(root_kobj, &mempolicy_attr_group);
+	if (err) {
+		pr_err("failed to register mempolicy group\n");
+		goto fail_obj;
+	}
+
+	err = add_weighted_interleave_group(root_kobj);
+fail_obj:
+	if (err)
+		kobject_put(root_kobj);
+	return err;
+}
+late_initcall(mempolicy_sysfs_init);

From patchwork Thu Dec 7 00:27:50 2023
X-Patchwork-Submitter: Gregory Price
X-Patchwork-Id: 13482518
From: Gregory Price
To: linux-mm@kvack.org, jgroves@micron.com, ravis.opensrc@micron.com,
    sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com
Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
    linux-kernel@vger.kernel.org, akpm@linux-foundation.org, arnd@arndb.de,
    tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de,
    dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
    mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com,
    gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com,
    hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com,
    peterz@infradead.org, Srinivasulu Thanneeru
Subject: [RFC PATCH 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE
 for weighted interleaving
Date: Wed, 6 Dec 2023 19:27:50 -0500
Message-Id: <20231207002759.51418-3-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20231207002759.51418-1-gregory.price@memverge.com>
References: <20231207002759.51418-1-gregory.price@memverge.com>

From: Rakie Kim

When a system has multiple NUMA nodes and it becomes bandwidth hungry,
the current MPOL_INTERLEAVE could be a wise option.
However, if those NUMA nodes consist of different types of memory, such
as having local DRAM and CXL memory together, the current round-robin
based interleaving policy does not maximize the overall bandwidth
because of their different bandwidth characteristics.

Instead, interleaving can be more efficient when the allocation policy
follows each NUMA node's bandwidth weight rather than a 1:1 round-robin
allocation.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
which enables weighted interleaving between NUMA nodes. Weighted
interleave allows for a proportional distribution of memory across
multiple NUMA nodes, preferably apportioned to match the bandwidth
capacity of each node from the perspective of the accessing node.

For example, if a system has 1 CPU node (0) and 2 memory nodes (0,1),
with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
appropriate weight distribution is (2:1).

Weights will be acquired from the global weight matrix exposed by the
sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/

The policy will then allocate the number of pages according to the set
weights. For example, if the weights are (2,1), then 2 pages will be
allocated on node0 for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).

There are 3 integration points:

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node. When the weight reaches 0, switch to
    the next node. Applied by `mempolicy_slab_node()` and
    `policy_nodemask()`.

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Applied by `policy_nodemask()` and `mpol_misplaced()`.

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round"). Calculates the number of pages
    for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the
    current weight (pol->cur_weight) will be allocated first, before
    the remaining bulk calculation is done. This simplifies the
    calculation at the cost of an additional allocation call.

One piece of complexity is the interaction with a recent refactor,
which split the logic to acquire the "ilx" (interleave index) of an
allocation from the actual application of the interleave. The
calculation of the interleave index is done by `get_vma_policy()`,
while the actual selection of the node is applied later by the
relevant weighted_interleave function.

Suggested-by: Hasan Al Maruf
Signed-off-by: Rakie Kim
Co-developed-by: Honggyu Kim
Signed-off-by: Honggyu Kim
Co-developed-by: Hyeongtak Ji
Signed-off-by: Hyeongtak Ji
Co-developed-by: Gregory Price
Signed-off-by: Gregory Price
Co-developed-by: Srinivasulu Thanneeru
Signed-off-by: Srinivasulu Thanneeru
Co-developed-by: Ravi Jonnalagadda
Signed-off-by: Ravi Jonnalagadda
---
 .../admin-guide/mm/numa_memory_policy.rst     |  17 ++
 include/linux/mempolicy.h                     |   5 +
 include/uapi/linux/mempolicy.h                |   1 +
 mm/mempolicy.c                                | 181 +++++++++++++++++-
 4 files changed, 201 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..b7b8d3dd420f 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,23 @@ MPOL_PREFERRED_MANY
 	can fall back to all existing numa nodes. This is effectively
 	MPOL_PREFERRED allowed for a mask rather than a single node.
+MPOL_WEIGHTED_INTERLEAVE + This mode operates the same as MPOL_INTERLEAVE, except that + interleaving behavior is executed based on weights set in + /sys/kernel/mm/mempolicy/weighted_interleave/ + rather than simple round-robin interleave (which is the default). + + When utilizing global weights from the sysfs interface, + weights are applied in a src-node relative manner. For example + a task executing on node0 will use the weights from + /sys/kernel/mm/mempolicy/weighted_interleave/node0/ + while a task executing on node1 will use the weights from + /sys/kernel/mm/mempolicy/weighted_interleave/node1/ + + This allows for tasks migrated between nodes (for example + cgroup initiated migrations) to re-weight for the optimal + distribution of bandwidth. + NUMA memory policy supports the following optional mode flags: MPOL_F_STATIC_NODES diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 931b118336f4..ba09167e80f7 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -54,6 +54,11 @@ struct mempolicy { nodemask_t cpuset_mems_allowed; /* relative to these nodes */ nodemask_t user_nodemask; /* nodemask passed by user */ } w; + + /* Weighted interleave settings */ + struct { + unsigned char cur_weight; + } wil; }; /* diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index a8963f7ef4c2..1f9bb10d1a47 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -23,6 +23,7 @@ enum { MPOL_INTERLEAVE, MPOL_LOCAL, MPOL_PREFERRED_MANY, + MPOL_WEIGHTED_INTERLEAVE, MPOL_MAX, /* always last member of enum */ }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index ce332b5e7a03..65e0334a1a18 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -308,6 +308,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, policy->mode = mode; policy->flags = flags; policy->home_node = NUMA_NO_NODE; + policy->wil.cur_weight = 0; return policy; } @@ -420,6 +421,10 @@ static const 
struct mempolicy_operations mpol_ops[MPOL_MAX] = { .create = mpol_new_nodemask, .rebind = mpol_rebind_preferred, }, + [MPOL_WEIGHTED_INTERLEAVE] = { + .create = mpol_new_nodemask, + .rebind = mpol_rebind_nodemask, + }, }; static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, @@ -841,7 +846,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, old = current->mempolicy; current->mempolicy = new; - if (new && new->mode == MPOL_INTERLEAVE) + if (new && (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_WEIGHTED_INTERLEAVE)) current->il_prev = MAX_NUMNODES-1; task_unlock(current); mpol_put(old); @@ -867,6 +873,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: *nodes = pol->nodes; break; case MPOL_LOCAL: @@ -951,6 +958,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, } else if (pol == current->mempolicy && pol->mode == MPOL_INTERLEAVE) { *policy = next_node_in(current->il_prev, pol->nodes); + } else if (pol == current->mempolicy && + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) { + if (pol->wil.cur_weight) + *policy = current->il_prev; + else + *policy = next_node_in(current->il_prev, + pol->nodes); } else { err = -EINVAL; goto out; @@ -1780,7 +1794,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma, pol = __get_vma_policy(vma, addr, ilx); if (!pol) pol = get_task_policy(current); - if (pol->mode == MPOL_INTERLEAVE) { + if (pol->mode == MPOL_INTERLEAVE || + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { *ilx += vma->vm_pgoff >> order; *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); } @@ -1830,6 +1845,24 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) return zone >= dynamic_policy_zone; } +static unsigned int weighted_interleave_nodes(struct mempolicy *policy) +{ + unsigned int next; + struct task_struct *me = current; + + if 
(policy->wil.cur_weight > 0) { + policy->wil.cur_weight--; + return me->il_prev; + } + + next = next_node_in(me->il_prev, policy->nodes); + if (next < MAX_NUMNODES) { + me->il_prev = next; + policy->wil.cur_weight = iw_table[numa_node_id()].weights[next]; + } + return next; +} + /* Do dynamic interleaving for a process */ static unsigned int interleave_nodes(struct mempolicy *policy) { @@ -1864,6 +1897,9 @@ unsigned int mempolicy_slab_node(void) case MPOL_INTERLEAVE: return interleave_nodes(policy); + case MPOL_WEIGHTED_INTERLEAVE: + return weighted_interleave_nodes(policy); + case MPOL_BIND: case MPOL_PREFERRED_MANY: { @@ -1888,6 +1924,34 @@ unsigned int mempolicy_slab_node(void) } } +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) +{ + nodemask_t nodemask = pol->nodes; + unsigned int target, weight_total = 0; + int nid, local_node = numa_node_id(); + unsigned char weights[MAX_NUMNODES]; + unsigned char weight; + + barrier(); + + /* Collect weights and save them on stack so they don't change */ + for_each_node_mask(nid, nodemask) { + weight = iw_table[local_node].weights[nid]; + weight_total += weight; + weights[nid] = weight; + } + + target = (unsigned int)ilx % weight_total; + + for_each_node_mask(nid, nodemask) { + weight = weights[nid]; + if (target < weight) + return nid; + target -= weight; + } + return nid; +} + /* * Do static interleaving for interleave index @ilx. Returns the ilx'th * node in pol->nodes (starting from ilx=0), wrapping around if ilx @@ -1956,6 +2020,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, *nid = (ilx == NO_INTERLEAVE_INDEX) ? interleave_nodes(pol) : interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + *nid = (ilx == NO_INTERLEAVE_INDEX) ? 
+ weighted_interleave_nodes(pol) : + weighted_interleave_nid(pol, ilx); + break; } return nodemask; @@ -2017,6 +2086,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: *mask = mempolicy->nodes; break; @@ -2116,7 +2186,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, * If the policy is interleave or does not allow the current * node in its nodemask, we allocate the standard way. */ - if (pol->mode != MPOL_INTERLEAVE && + if ((pol->mode != MPOL_INTERLEAVE && + pol->mode != MPOL_WEIGHTED_INTERLEAVE) && (!nodemask || node_isset(nid, *nodemask))) { /* * First, try to allocate THP only on local node, but @@ -2252,6 +2323,97 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp, return total_allocated; } +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, + struct mempolicy *pol, unsigned long nr_pages, + struct page **page_array) +{ + struct task_struct *me = current; + unsigned long total_allocated = 0; + unsigned long nr_allocated; + unsigned long rounds; + unsigned long node_pages, delta; + unsigned char weight; + unsigned char weights[MAX_NUMNODES]; + unsigned int weight_total; + nodemask_t nodes = pol->nodes; + int nnodes, node, prev_node; + int i; + + /* Stabilize the nodemask on the stack */ + barrier(); + + nnodes = nodes_weight(nodes); + + /* Collect weights and save them on stack so they don't change */ + for_each_node_mask(node, nodes) { + weight = iw_table[numa_node_id()].weights[node]; + weight_total += weight; + weights[node] = weight; + } + + /* Continue allocating from most recent node and adjust the nr_pages */ + if (pol->wil.cur_weight) { + node = next_node_in(me->il_prev, nodes); + node_pages = pol->wil.cur_weight; + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + /* if that's all the pages, no 
need to interleave */ + if (nr_pages <= pol->wil.cur_weight) { + pol->wil.cur_weight -= nr_pages; + return total_allocated; + } + /* Otherwise we adjust nr_pages down, and continue from there */ + nr_pages -= pol->wil.cur_weight; + pol->wil.cur_weight = 0; + prev_node = node; + } + + /* Now we can continue allocating as if from 0 instead of an offset */ + rounds = nr_pages / weight_total; + delta = nr_pages % weight_total; + for (i = 0; i < nnodes; i++) { + node = next_node_in(prev_node, nodes); + weight = weights[node]; + node_pages = weight * rounds; + if (delta) { + if (delta > weight) { + node_pages += weight; + delta -= weight; + } else { + node_pages += delta; + delta = 0; + } + } + /* We may not make it all the way around */ + if (!node_pages) + break; + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + prev_node = node; + } + + /* + * Finally, we need to update me->il_prev and pol->wil.cur_weight + * if there were overflow pages, but not equivalent to the node + * weight, set the cur_weight to node_weight - delta and the + * me->il_prev to the previous node. 
Otherwise if it was perfect + * we can simply set il_prev to node and cur_weight to 0 + */ + if (node_pages) { + me->il_prev = prev_node; + pol->wil.cur_weight = weight - node_pages; + } else { + me->il_prev = node; + pol->wil.cur_weight = 0; + } + + return total_allocated; +} + static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid, struct mempolicy *pol, unsigned long nr_pages, struct page **page_array) @@ -2292,6 +2454,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, return alloc_pages_bulk_array_interleave(gfp, pol, nr_pages, page_array); + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE) + return alloc_pages_bulk_array_weighted_interleave(gfp, pol, + nr_pages, + page_array); + if (pol->mode == MPOL_PREFERRED_MANY) return alloc_pages_bulk_array_preferred_many(gfp, numa_node_id(), pol, nr_pages, page_array); @@ -2367,6 +2534,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: return !!nodes_equal(a->nodes, b->nodes); case MPOL_LOCAL: return true; @@ -2503,6 +2671,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, polnid = interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + polnid = weighted_interleave_nid(pol, ilx); + break; + case MPOL_PREFERRED: if (node_isset(curnid, pol->nodes)) goto out; @@ -2877,6 +3049,7 @@ static const char * const policy_modes[] = [MPOL_PREFERRED] = "prefer", [MPOL_BIND] = "bind", [MPOL_INTERLEAVE] = "interleave", + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave", [MPOL_LOCAL] = "local", [MPOL_PREFERRED_MANY] = "prefer (many)", }; @@ -2936,6 +3109,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) } break; case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: /* * Default to online nodes with memory if no nodelist */ @@ -3046,6 +3220,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) case MPOL_PREFERRED_MANY: case 
MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: nodes = pol->nodes; break; default:

From patchwork Thu Dec 7 00:27:51 2023
From: Gregory Price
Subject: [RFC PATCH 03/11] mm/mempolicy: refactor sanitize_mpol_flags for reuse
Date: Wed, 6 Dec 2023 19:27:51 -0500
Message-Id: <20231207002759.51418-4-gregory.price@memverge.com>
In-Reply-To: <20231207002759.51418-1-gregory.price@memverge.com>
Split sanitize_mpol_flags into a sanitize step and a validate step. sanitize_mpol_flags() is used by set_mempolicy/mbind to split the combined (int mode) argument into mode and mode_flags, and then validates them. validate_mpol_flags() checks mode and flags that have already been split apart, and will be reused by new syscalls that accept mode and mode_flags as separate arguments. Signed-off-by: Gregory Price --- mm/mempolicy.c | 29 ++++++++++++++++++++++------- 1 file changed, 22 insertions(+), 7 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 65e0334a1a18..eec807d0c6a1 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1466,24 +1466,39 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode, return copy_to_user(mask, nodes_addr(*nodes), copy) ? -EFAULT : 0; } -/* Basic parameter sanity check used by both mbind() and set_mempolicy() */ -static inline int sanitize_mpol_flags(int *mode, unsigned short *flags) +/* + * Basic parameter sanity check used by mbind/set_mempolicy + * May modify flags to include internal flags (e.g. MPOL_F_MOF/F_MORON) + */ +static inline int validate_mpol_flags(unsigned short mode, unsigned short *flags) { - *flags = *mode & MPOL_MODE_FLAGS; - *mode &= ~MPOL_MODE_FLAGS; - - if ((unsigned int)(*mode) >= MPOL_MAX) + if ((unsigned int)(mode) >= MPOL_MAX) return -EINVAL; if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES)) return -EINVAL; if (*flags & MPOL_F_NUMA_BALANCING) { - if (*mode != MPOL_BIND) + if (mode != MPOL_BIND) return -EINVAL; *flags |= (MPOL_F_MOF | MPOL_F_MORON); } return 0; } +/* + * Used by mbind/set_mempolicy to split and validate mode/flags. + * Callers pass a combined (mode | flags); split them out into separate + * fields, returning just the mode in mode_arg and the flags in flags.
+ */ +static inline int sanitize_mpol_flags(int *mode_arg, unsigned short *flags) +{ + unsigned short mode = (*mode_arg & ~MPOL_MODE_FLAGS); + + *flags = *mode_arg & MPOL_MODE_FLAGS; + *mode_arg = mode; + + return validate_mpol_flags(mode, flags); +} + static long kernel_mbind(unsigned long start, unsigned long len, unsigned long mode, const unsigned long __user *nmask, unsigned long maxnode, unsigned int flags)

From patchwork Thu Dec 7 00:27:52 2023
From: Gregory Price
Subject: [RFC PATCH 04/11] mm/mempolicy: create struct mempolicy_args for creating new mempolicies
Date: Wed, 6 Dec 2023 19:27:52 -0500
Message-Id:
<20231207002759.51418-5-gregory.price@memverge.com>
In-Reply-To: <20231207002759.51418-1-gregory.price@memverge.com>

This patch adds a new kernel structure `struct mempolicy_args`, intended to be used for an extensible get/set_mempolicy interface. It implements the fields required to support the existing syscall interfaces, but does not expose any user-facing argument structure. mpol_new is refactored to take the argument structure so that future mempolicy extensions can all be managed in the mempolicy constructor. The get_mempolicy and mbind syscalls are refactored to utilize the new argument structure, as are all the callers of mpol_new() and do_set_mempolicy. Signed-off-by: Gregory Price --- include/linux/mempolicy.h | 14 ++++++++ mm/mempolicy.c | 69 +++++++++++++++++++++++++++++---------- 2 files changed, 65 insertions(+), 18 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index ba09167e80f7..117c5395c6eb 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -61,6 +61,20 @@ struct mempolicy { } wil; }; +/* + * Describes settings of a mempolicy during set/get syscalls and + * kernel internal calls to do_set_mempolicy() + */ +struct mempolicy_args { + unsigned short mode; /* policy mode */ + unsigned short mode_flags; /* policy mode flags */ + nodemask_t *policy_nodes; /* get/set/mbind */ + int policy_node; /* get: policy node information */ + unsigned long addr; /* get: vma address */ + int addr_node; /* get: node the address belongs to */ + int home_node; /* mbind: use MPOL_MF_HOME_NODE */ +}; + /* * Support for managing mempolicy data objects (clone, copy, destroy) * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
diff --git a/mm/mempolicy.c b/mm/mempolicy.c index eec807d0c6a1..4c343218c033 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -268,10 +268,12 @@ static int mpol_set_nodemask(struct mempolicy *pol, * This function just creates a new policy, does some check and simple * initialization. You must invoke mpol_set_nodemask() to set nodes. */ -static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, - nodemask_t *nodes) +static struct mempolicy *mpol_new(struct mempolicy_args *args) { struct mempolicy *policy; + unsigned short mode = args->mode; + unsigned short flags = args->mode_flags; + nodemask_t *nodes = args->policy_nodes; if (mode == MPOL_DEFAULT) { if (nodes && !nodes_empty(*nodes)) @@ -820,8 +822,7 @@ static int mbind_range(struct vma_iterator *vmi, struct vm_area_struct *vma, } /* Set the process memory policy */ -static long do_set_mempolicy(unsigned short mode, unsigned short flags, - nodemask_t *nodes) +static long do_set_mempolicy(struct mempolicy_args *args) { struct mempolicy *new, *old; NODEMASK_SCRATCH(scratch); @@ -830,14 +831,14 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, if (!scratch) return -ENOMEM; - new = mpol_new(mode, flags, nodes); + new = mpol_new(args); if (IS_ERR(new)) { ret = PTR_ERR(new); goto out; } task_lock(current); - ret = mpol_set_nodemask(new, nodes, scratch); + ret = mpol_set_nodemask(new, args->policy_nodes, scratch); if (ret) { task_unlock(current); mpol_put(new); @@ -1235,8 +1236,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src, #endif static long do_mbind(unsigned long start, unsigned long len, - unsigned short mode, unsigned short mode_flags, - nodemask_t *nmask, unsigned long flags) + struct mempolicy_args *margs, unsigned long flags) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev; @@ -1256,7 +1256,7 @@ static long do_mbind(unsigned long start, unsigned long len, if (start & ~PAGE_MASK) return -EINVAL; - if (mode == 
MPOL_DEFAULT) + if (margs->mode == MPOL_DEFAULT) flags &= ~MPOL_MF_STRICT; len = PAGE_ALIGN(len); @@ -1267,7 +1267,7 @@ static long do_mbind(unsigned long start, unsigned long len, if (end == start) return 0; - new = mpol_new(mode, mode_flags, nmask); + new = mpol_new(margs); if (IS_ERR(new)) return PTR_ERR(new); @@ -1284,7 +1284,8 @@ static long do_mbind(unsigned long start, unsigned long len, NODEMASK_SCRATCH(scratch); if (scratch) { mmap_write_lock(mm); - err = mpol_set_nodemask(new, nmask, scratch); + err = mpol_set_nodemask(new, margs->policy_nodes, + scratch); if (err) mmap_write_unlock(mm); } else @@ -1298,7 +1299,7 @@ static long do_mbind(unsigned long start, unsigned long len, * Lock the VMAs before scanning for pages to migrate, * to ensure we don't miss a concurrently inserted page. */ - nr_failed = queue_pages_range(mm, start, end, nmask, + nr_failed = queue_pages_range(mm, start, end, margs->policy_nodes, flags | MPOL_MF_INVERT | MPOL_MF_WRLOCK, &pagelist); if (nr_failed < 0) { @@ -1503,6 +1504,7 @@ static long kernel_mbind(unsigned long start, unsigned long len, unsigned long mode, const unsigned long __user *nmask, unsigned long maxnode, unsigned int flags) { + struct mempolicy_args margs; unsigned short mode_flags; nodemask_t nodes; int lmode = mode; @@ -1517,7 +1519,12 @@ static long kernel_mbind(unsigned long start, unsigned long len, if (err) return err; - return do_mbind(start, len, lmode, mode_flags, &nodes, flags); + memset(&margs, 0, sizeof(margs)); + margs.mode = lmode; + margs.mode_flags = mode_flags; + margs.policy_nodes = &nodes; + + return do_mbind(start, len, &margs, flags); } SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, len, @@ -1598,6 +1605,7 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len, static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask, unsigned long maxnode) { + struct mempolicy_args args; unsigned short mode_flags; nodemask_t nodes; int lmode = mode; 
@@ -1611,7 +1619,12 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask, if (err) return err; - return do_set_mempolicy(lmode, mode_flags, &nodes); + memset(&args, 0, sizeof(args)); + args.mode = lmode; + args.mode_flags = mode_flags; + args.policy_nodes = &nodes; + + return do_set_mempolicy(&args); } SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask, @@ -2877,6 +2890,7 @@ static int shared_policy_replace(struct shared_policy *sp, pgoff_t start, void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol) { int ret; + struct mempolicy_args margs; sp->root = RB_ROOT; /* empty tree == default mempolicy */ rwlock_init(&sp->lock); @@ -2889,8 +2903,12 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol) if (!scratch) goto put_mpol; + memset(&margs, 0, sizeof(margs)); + margs.mode = mpol->mode; + margs.mode_flags = mpol->flags; + margs.policy_nodes = &mpol->w.user_nodemask; /* contextualize the tmpfs mount point mempolicy to this file */ - npol = mpol_new(mpol->mode, mpol->flags, &mpol->w.user_nodemask); + npol = mpol_new(&margs); if (IS_ERR(npol)) goto free_scratch; /* no valid nodemask intersection */ @@ -2998,6 +3016,7 @@ static inline void __init check_numabalancing_enable(void) void __init numa_policy_init(void) { + struct mempolicy_args args; nodemask_t interleave_nodes; unsigned long largest = 0; int nid, prefer = 0; @@ -3043,7 +3062,11 @@ void __init numa_policy_init(void) if (unlikely(nodes_empty(interleave_nodes))) node_set(prefer, interleave_nodes); - if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes)) + memset(&args, 0, sizeof(args)); + args.mode = MPOL_INTERLEAVE; + args.policy_nodes = &interleave_nodes; + + if (do_set_mempolicy(&args)) pr_err("%s: interleaving failed\n", __func__); check_numabalancing_enable(); @@ -3052,7 +3075,12 @@ void __init numa_policy_init(void) /* Reset policy of current process to default */ void numa_default_policy(void) 
{ - do_set_mempolicy(MPOL_DEFAULT, 0, NULL); + struct mempolicy_args args; + + memset(&args, 0, sizeof(args)); + args.mode = MPOL_DEFAULT; + + do_set_mempolicy(&args); } /* @@ -3082,6 +3110,7 @@ static const char * const policy_modes[] = */ int mpol_parse_str(char *str, struct mempolicy **mpol) { + struct mempolicy_args margs; struct mempolicy *new = NULL; unsigned short mode_flags; nodemask_t nodes; @@ -3168,7 +3197,11 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) goto out; } - new = mpol_new(mode, mode_flags, &nodes); + memset(&margs, 0, sizeof(margs)); + margs.mode = mode; + margs.mode_flags = mode_flags; + margs.policy_nodes = &nodes; + new = mpol_new(&margs); if (IS_ERR(new)) goto out;

From patchwork Thu Dec 7 00:27:53 2023
From: Gregory Price
Subject: [RFC PATCH 05/11] mm/mempolicy: refactor kernel_get_mempolicy for code re-use
Date: Wed, 6 Dec 2023 19:27:53 -0500
Message-Id: <20231207002759.51418-6-gregory.price@memverge.com>
In-Reply-To: <20231207002759.51418-1-gregory.price@memverge.com>

Pull operation flag checking from inside do_get_mempolicy out to kernel_get_mempolicy. This allows us to flatten the internal code, and break it into separate functions for future syscalls (get_mempolicy2, process_get_mempolicy) to re-use the code, even after additional extensions are made. The primary change is that the flag is treated as the multiplexer that it actually is.
For get_mempolicy, the flags argument selects between three primary operations:

    if (flags & MPOL_F_MEMS_ALLOWED)
        return task->mems_allowed
    else if (flags & MPOL_F_ADDR)
        return vma mempolicy information
    else
        return task mempolicy information

plus the behavior-modifying flag:

    if (flags & MPOL_F_NODE)
        change the return value of (int __user *policy)
        based on whether MPOL_F_ADDR was set

The original behavior of get_mempolicy is retained, but we utilize the new mempolicy_args structure to pass the operations down the stack. This will allow us to extend the internal functions without affecting the legacy behavior of get_mempolicy. Signed-off-by: Gregory Price --- mm/mempolicy.c | 240 ++++++++++++++++++++++++++++++------------------- 1 file changed, 150 insertions(+), 90 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 4c343218c033..fecdc781b6a0 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -898,106 +898,107 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr) return ret; } -/* Retrieve NUMA policy */ -static long do_get_mempolicy(int *policy, nodemask_t *nmask, - unsigned long addr, unsigned long flags) +/* Retrieve the mems_allowed for current task */ +static inline long do_get_mems_allowed(nodemask_t *nmask) { - int err; - struct mm_struct *mm = current->mm; - struct vm_area_struct *vma = NULL; - struct mempolicy *pol = current->mempolicy, *pol_refcount = NULL; + task_lock(current); + *nmask = cpuset_current_mems_allowed; + task_unlock(current); + return 0; +} - if (flags & - ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED)) - return -EINVAL; +/* If the policy has additional node information to retrieve, return it */ +static long do_get_policy_node(struct mempolicy *pol) +{ + /* + * For MPOL_INTERLEAVE, the extended node information is the next + * node that will be selected for interleave. For weighted interleave + * we return the next node based on the current weight.
+ */ + if (pol == current->mempolicy && pol->mode == MPOL_INTERLEAVE) + return next_node_in(current->il_prev, pol->nodes); - if (flags & MPOL_F_MEMS_ALLOWED) { - if (flags & (MPOL_F_NODE|MPOL_F_ADDR)) - return -EINVAL; - *policy = 0; /* just so it's initialized */ + if (pol == current->mempolicy && + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { + if (pol->wil.cur_weight) + return current->il_prev; + else + return next_node_in(current->il_prev, pol->nodes); + } + return -EINVAL; +} + +/* Handle user_nodemask condition when fetching nodemask for userspace */ +static void do_get_mempolicy_nodemask(struct mempolicy *pol, nodemask_t *nmask) +{ + if (mpol_store_user_nodemask(pol)) { + *nmask = pol->w.user_nodemask; + } else { task_lock(current); - *nmask = cpuset_current_mems_allowed; + get_policy_nodemask(pol, nmask); task_unlock(current); - return 0; } +} - if (flags & MPOL_F_ADDR) { - pgoff_t ilx; /* ignored here */ - /* - * Do NOT fall back to task policy if the - * vma/shared policy at addr is NULL. We - * want to return MPOL_DEFAULT in this case. 
- */ - mmap_read_lock(mm); - vma = vma_lookup(mm, addr); - if (!vma) { - mmap_read_unlock(mm); - return -EFAULT; - } - pol = __get_vma_policy(vma, addr, &ilx); - } else if (addr) - return -EINVAL; +/* Retrieve NUMA policy for a VMA associated with a given address */ +static long do_get_vma_mempolicy(struct mempolicy_args *args) +{ + pgoff_t ilx; + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma = NULL; + struct mempolicy *pol = NULL; + mmap_read_lock(mm); + vma = vma_lookup(mm, args->addr); + if (!vma) { + mmap_read_unlock(mm); + return -EFAULT; + } + pol = __get_vma_policy(vma, args->addr, &ilx); if (!pol) - pol = &default_policy; /* indicates default behavior */ + pol = &default_policy; + /* this may cause a double-reference, resolved by a put+cond_put */ + mpol_get(pol); + mmap_read_unlock(mm); - if (flags & MPOL_F_NODE) { - if (flags & MPOL_F_ADDR) { - /* - * Take a refcount on the mpol, because we are about to - * drop the mmap_lock, after which only "pol" remains - * valid, "vma" is stale. - */ - pol_refcount = pol; - vma = NULL; - mpol_get(pol); - mmap_read_unlock(mm); - err = lookup_node(mm, addr); - if (err < 0) - goto out; - *policy = err; - } else if (pol == current->mempolicy && - pol->mode == MPOL_INTERLEAVE) { - *policy = next_node_in(current->il_prev, pol->nodes); - } else if (pol == current->mempolicy && - (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) { - if (pol->wil.cur_weight) - *policy = current->il_prev; - else - *policy = next_node_in(current->il_prev, - pol->nodes); - } else { - err = -EINVAL; - goto out; - } - } else { - *policy = pol == &default_policy ? MPOL_DEFAULT : - pol->mode; - /* - * Internal mempolicy flags must be masked off before exposing - * the policy to userspace.
- */ - *policy |= (pol->flags & MPOL_MODE_FLAGS); - } + /* Fetch the node for the given address */ + args->addr_node = lookup_node(mm, args->addr); - err = 0; - if (nmask) { - if (mpol_store_user_nodemask(pol)) { - *nmask = pol->w.user_nodemask; - } else { - task_lock(current); - get_policy_nodemask(pol, nmask); - task_unlock(current); - } + args->mode = pol == &default_policy ? MPOL_DEFAULT : pol->mode; + args->mode_flags = (pol->flags & MPOL_MODE_FLAGS); + + /* If this policy has extra node info, fetch that */ + args->policy_node = do_get_policy_node(pol); + + if (args->policy_nodes) + do_get_mempolicy_nodemask(pol, args->policy_nodes); + + if (pol != &default_policy) { + mpol_put(pol); + mpol_cond_put(pol); } - out: - mpol_cond_put(pol); - if (vma) - mmap_read_unlock(mm); - if (pol_refcount) - mpol_put(pol_refcount); - return err; + return 0; +} + +/* Retrieve NUMA policy for the current task */ +static long do_get_task_mempolicy(struct mempolicy_args *args) +{ + struct mempolicy *pol = current->mempolicy; + + if (!pol) + pol = &default_policy; /* indicates default behavior */ + + args->mode = pol == &default_policy ? 
MPOL_DEFAULT : pol->mode; + /* Internal flags must be masked off before exposing to userspace */ + args->mode_flags = (pol->flags & MPOL_MODE_FLAGS); + + args->policy_node = do_get_policy_node(pol); + + if (args->policy_nodes) + do_get_mempolicy_nodemask(pol, args->policy_nodes); + + return 0; } #ifdef CONFIG_MIGRATION @@ -1734,16 +1735,75 @@ static int kernel_get_mempolicy(int __user *policy, unsigned long addr, unsigned long flags) { + struct mempolicy_args args; int err; - int pval; + int pval = 0; nodemask_t nodes; if (nmask != NULL && maxnode < nr_node_ids) return -EINVAL; - addr = untagged_addr(addr); + if (flags & + ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED)) + return -EINVAL; - err = do_get_mempolicy(&pval, &nodes, addr, flags); + /* Ensure any data that may be copied to userland is initialized */ + memset(&args, 0, sizeof(args)); + args.policy_nodes = &nodes; + args.addr = untagged_addr(addr); + + /* + * get_mempolicy was originally multiplexed based on 3 flags: + * MPOL_F_MEMS_ALLOWED: fetch task->mems_allowed + * MPOL_F_ADDR : operate on vma->mempolicy + * MPOL_F_NODE : change return value of *policy + * + * Split this behavior out here, rather than internal functions, + * so that the internal functions can be re-used by future + * get_mempolicy2 interfaces and the arg structure made extensible + */ + if (flags & MPOL_F_MEMS_ALLOWED) { + if (flags & (MPOL_F_NODE|MPOL_F_ADDR)) + return -EINVAL; + pval = 0; /* just so it's initialized */ + err = do_get_mems_allowed(&nodes); + } else if (flags & MPOL_F_ADDR) { + /* If F_ADDR, we operate on a vma policy (or default) */ + err = do_get_vma_mempolicy(&args); + if (err) + return err; + /* if (F_ADDR | F_NODE), *pval is the address' node */ + if (flags & MPOL_F_NODE) { + /* if we failed to fetch, that's likely an EFAULT */ + if (args.addr_node < 0) + return args.addr_node; + pval = args.addr_node; + } else + pval = args.mode | args.mode_flags; + } else { + /* if not F_ADDR and addr != null,
EINVAL */ + if (addr) + return -EINVAL; + + err = do_get_task_mempolicy(&args); + if (err) + return err; + /* + * if F_NODE was set and mode was MPOL_INTERLEAVE + * *pval is equal to next interleave node. + * + * if args.policy_node < 0, this means the mode did + * not have a policy. This presently emulates the + * original behavior of (F_NODE) & (!MPOL_INTERLEAVE) + * producing -EINVAL + */ + if (flags & MPOL_F_NODE) { + if (args.policy_node < 0) + return args.policy_node; + pval = args.policy_node; + } else + pval = args.mode | args.mode_flags; + } if (err) return err; From patchwork Thu Dec 7 00:27:54 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gregory Price X-Patchwork-Id: 13482521 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="AvAGigHs" Received: from mail-yw1-x1141.google.com (mail-yw1-x1141.google.com [IPv6:2607:f8b0:4864:20::1141]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8AD6910C4; Wed, 6 Dec 2023 16:28:22 -0800 (PST) Received: by mail-yw1-x1141.google.com with SMTP id 00721157ae682-5d7b1a8ec90so819167b3.2; Wed, 06 Dec 2023 16:28:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1701908901; x=1702513701; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=LtcIT3+1Vvf1IPQzjETz2qs5AIWZRfMHd/0ujKiNxjc=; b=AvAGigHsIaQj0KF+YJNWdgVdg/RwkGvr+0yuC2i/nZt2+SC80H5h+7H6yaNJd2/dO+ dlAWQKSl3PzCem9BgAjIhMdt/eMLI++mI0lV1kzIiruUqqtm/o6Ts5Eq95q4qM2Fr/Dp GK3we/jOv+jyLZhGHcAGUwBR/dd5oFVLAycZYLnylxiTOqdo9ZREUp1NGVdrWoj4nhc6 lokgAyeSVe5BOS/qKYQ1lV4eMUh1MEKct7JCikljgKWWZpU2jcXDBJIXJJOc9SfHK10q jzqTLgMVz1Ign/7pDKC5YkZpTY0lUfzlXGwI+pp2YyNf8nKGte43rIFIj24U8/Jr+FwW p8HA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; 
t=1701908901; x=1702513701; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=LtcIT3+1Vvf1IPQzjETz2qs5AIWZRfMHd/0ujKiNxjc=; b=hLNUeYrH/HDB6GbuupeJO0cC7vrR5Qecyd3nszZImMpcAQMFY9R1izmrr5SyePVgU7 8LAyFzIpLJi0b3CRsNQCsoyuyYKElmLiYURtMh2ju95SLqw2OoBo8mJX1zrcWp9GYbZm ChlfdyV8BuOq9ohkxh72wDsNSkVpMnUqY2wssNr5VQQ57KBM78GoL++Jc7qyzsRUk6VY lZa4tatjLW67a9GwIkGiQ0o9V/W9j4gLwzxC2IXB3Vu7wCg6Ixi1MdFGutopoehvE6jw FBZumcrNp7Z3TsS3Y9iPi5qU/GJrRK+8poVhZqxeBxsT81drnTvRneL2j02miruJVBVM PNLw== X-Gm-Message-State: AOJu0YxA/KPQjvM8pzW+0+WPAarGUB9vtMDUGlv83WeVy/SCHWxrGnO8 EbPPomYRpmwK1p/HEMJrsnSo9es5UDc1 X-Google-Smtp-Source: AGHT+IFoGq3TrZUWxjktHdZLHqPXr2V42GWf5snPEHWFcJWvB6pDjTGUOlR+5ktf6yPv5nEOg/mPRg== X-Received: by 2002:a05:690c:a06:b0:5d7:1940:8dd5 with SMTP id cg6-20020a05690c0a0600b005d719408dd5mr1208041ywb.60.1701908901467; Wed, 06 Dec 2023 16:28:21 -0800 (PST) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. 
[173.79.56.208]) by smtp.gmail.com with ESMTPSA id x145-20020a81a097000000b005d82fc8cc92sm19539ywg.105.2023.12.06.16.28.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 06 Dec 2023 16:28:21 -0800 (PST) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org, jgroves@micron.com, ravis.opensrc@micron.com, sthanneeru@micron.com, emirakhur@micron.com, Hasan.Maruf@amd.com Cc: linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, arnd@arndb.de, tglx@linutronix.de, luto@kernel.org, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, mhocko@kernel.org, tj@kernel.org, ying.huang@intel.com, gregory.price@memverge.com, corbet@lwn.net, rakie.kim@sk.com, hyeongtak.ji@sk.com, honggyu.kim@sk.com, vtavarespetr@micron.com, peterz@infradead.org Subject: [RFC PATCH 06/11] mm/mempolicy: allow home_node to be set by mpol_new Date: Wed, 6 Dec 2023 19:27:54 -0500 Message-Id: <20231207002759.51418-7-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231207002759.51418-1-gregory.price@memverge.com> References: <20231207002759.51418-1-gregory.price@memverge.com> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 This patch adds the plumbing into mpol_new() to allow the argument structure's home_node field to be set during mempolicy creation. The syscall sys_set_mempolicy_home_node was added to allow a home node to be registered for a vma. For set_mempolicy2 and mbind2 syscalls, it would be useful to add this as an extension to allow the user to submit a fully formed mempolicy configuration in a single call, rather than require multiple calls to configure a mempolicy. 
This will become particularly useful if/when pidfd interfaces to
change process mempolicies from outside the task appear, as each call
to change the mempolicy does an atomic swap of that policy in the
task, rather than mutate the policy.

Signed-off-by: Gregory Price
---
 mm/mempolicy.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fecdc781b6a0..4be63547a4b3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -311,6 +311,7 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
 	policy->wil.cur_weight = 0;
+	policy->home_node = args->home_node;

 	return policy;
 }
@@ -1624,6 +1625,7 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 	args.mode = lmode;
 	args.mode_flags = mode_flags;
 	args.policy_nodes = &nodes;
+	args.home_node = NUMA_NO_NODE;

 	return do_set_mempolicy(&args);
 }
@@ -2967,6 +2969,8 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 		margs.mode = mpol->mode;
 		margs.mode_flags = mpol->flags;
 		margs.policy_nodes = &mpol->w.user_nodemask;
+		margs.home_node = NUMA_NO_NODE;
+
 		/* contextualize the tmpfs mount point mempolicy to this file */
 		npol = mpol_new(&margs);
 		if (IS_ERR(npol))
@@ -3125,6 +3129,7 @@ void __init numa_policy_init(void)
 	memset(&args, 0, sizeof(args));
 	args.mode = MPOL_INTERLEAVE;
 	args.policy_nodes = &interleave_nodes;
+	args.home_node = NUMA_NO_NODE;

 	if (do_set_mempolicy(&args))
 		pr_err("%s: interleaving failed\n", __func__);
@@ -3139,6 +3144,7 @@ void numa_default_policy(void)

 	memset(&args, 0, sizeof(args));
 	args.mode = MPOL_DEFAULT;
+	args.home_node = NUMA_NO_NODE;

 	do_set_mempolicy(&args);
 }
@@ -3261,6 +3267,8 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		margs.mode = mode;
 		margs.mode_flags = mode_flags;
 		margs.policy_nodes = &nodes;
+		margs.home_node = NUMA_NO_NODE;
+
 		new = mpol_new(&margs);
 		if (IS_ERR(new))
 			goto out;

From patchwork Thu Dec 7 00:27:55 2023
From: Gregory Price
Subject: [RFC PATCH 07/11] mm/mempolicy: add userland mempolicy arg structure
Date: Wed, 6 Dec 2023 19:27:55 -0500
Message-Id: <20231207002759.51418-8-gregory.price@memverge.com>

This patch adds the new user-api argument structure intended for
set_mempolicy2 and mbind2.
	struct mpol_args {
		/* Basic mempolicy settings */
		unsigned short mode;
		unsigned short mode_flags;
		unsigned long *pol_nodes;
		unsigned long pol_maxnodes;
		/* get_mempolicy2: policy information (e.g. next interleave node) */
		int policy_node;
		/* get_mempolicy2: memory range policy */
		unsigned long addr;
		int addr_node;
		/* all operations: policy home node */
		unsigned long home_node;
		/* mbind2: address ranges to apply the policy */
		const struct iovec __user *vec;
		size_t vlen;
	};

This structure is intended to be extensible as new mempolicy extensions
are added. For example, set_mempolicy_home_node was added to allow vma
mempolicies to have a preferred/home node assigned. This structure
allows the addition of that setting at the time the mempolicy is set,
rather than requiring additional calls to modify the policy.

Another suggested extension is to allow mbind2 to operate on multiple
memory ranges with a single call. mbind presently operates on a single
(address, length) tuple. It was suggested that mbind2 should operate on
an iovec, which allows many memory ranges to have the same mempolicy
applied to them with a single system call.

Full breakdown of arguments as of this patch:

  mode:         Mempolicy mode (MPOL_DEFAULT, MPOL_INTERLEAVE, ...)

  mode_flags:   Flags previously or'd into mode in set_mempolicy
                (e.g. MPOL_F_STATIC_NODES, MPOL_F_RELATIVE_NODES)

  pol_nodes:    Policy nodemask

  pol_maxnodes: Max number of nodes in the policy nodemask

  policy_node:  for get_mempolicy2. Returns extended information about a
                policy that was previously reported by passing
                MPOL_F_NODE to get_mempolicy. Instead of overriding the
                mode value, simply add a field.

  addr:         for get_mempolicy2. Used with MPOL_F_ADDR to run
                get_mempolicy against the vma the address belongs to
                instead of the task.

  addr_node:    for get_mempolicy2. Returns the node the address belongs
                to. Previously get_mempolicy() would override the output
                value of (mode) if MPOL_F_ADDR and MPOL_F_NODE were set.
                Instead, we extend mpol_args to do this by default if
                MPOL_F_ADDR is set and do away with MPOL_F_NODE.

  vec/vlen:     Used by mbind2 to apply the mempolicy to all address
                ranges described by the iovec.

Suggested-by: Frank van der Linden
Suggested-by: Vinicius Tavares Petrucci
Suggested-by: Hasan Al Maruf
Signed-off-by: Gregory Price
Co-developed-by: Vinicius Tavares Petrucci
Signed-off-by: Vinicius Tavares Petrucci
---
 .../admin-guide/mm/numa_memory_policy.rst | 31 +++++++++++++++++++
 include/uapi/linux/mempolicy.h            | 18 +++++++++++
 2 files changed, 49 insertions(+)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index b7b8d3dd420f..6d645519c2c1 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -488,6 +488,37 @@ closest to which page allocation will come from. Specifying the home node overri
 the default allocation policy to allocate memory close to the local node for an
 executing CPU.

+Extended Mempolicy Arguments::
+
+	struct mpol_args {
+		/* Basic mempolicy settings */
+		unsigned short mode;
+		unsigned short mode_flags;
+		unsigned long *pol_nodes;
+		unsigned long pol_maxnodes;
+
+		/* get_mempolicy2: policy node information */
+		int policy_node;
+
+		/* get_mempolicy2: memory range policy */
+		unsigned long addr;
+		int addr_node;
+
+		/* mbind2: policy home node */
+		unsigned long home_node;
+
+		/* mbind2: address ranges to apply the policy */
+		struct iovec *vec;
+		size_t vlen;
+	};
+
+The extended mempolicy argument structure is defined to allow the mempolicy
+interfaces future extensibility without the need for additional system calls.
+
+The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
+all interfaces relative to their non-extended counterparts. Each additional
+field may only apply to specific extended interfaces. See the respective
+extended interface man page for more details
 Memory Policy Command Line Interface
 ====================================

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 1f9bb10d1a47..e6b50903047c 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -27,6 +27,24 @@ enum {
 	MPOL_MAX,	/* always last member of enum */
 };

+struct mpol_args {
+	/* Basic mempolicy settings */
+	unsigned short mode;
+	unsigned short mode_flags;
+	unsigned long *pol_nodes;
+	unsigned long pol_maxnodes;
+	/* get_mempolicy: policy node information */
+	int policy_node;
+	/* get_mempolicy: memory range policy */
+	unsigned long addr;
+	int addr_node;
+	/* mbind2: policy home node */
+	int home_node;
+	/* mbind2: address ranges to apply the policy */
+	struct iovec *vec;
+	size_t vlen;
+};
+
 /* Flags for set_mempolicy */
 #define MPOL_F_STATIC_NODES	(1 << 15)
 #define MPOL_F_RELATIVE_NODES	(1 << 14)

From patchwork Thu Dec 7 00:27:56 2023
From: Gregory Price
Subject: [RFC PATCH 08/11] mm/mempolicy: add set_mempolicy2 syscall
Date: Wed, 6 Dec 2023 19:27:56 -0500
Message-Id: <20231207002759.51418-9-gregory.price@memverge.com>

set_mempolicy2 is an extensible set_mempolicy interface which allows
a user to set the per-task memory policy. Defined as:

	set_mempolicy2(struct mpol_args *args, size_t size,
		       unsigned long flags);

Relevant mpol_args fields include the following:

  mode:         The MPOL_* policy (DEFAULT, INTERLEAVE, etc.)

  mode_flags:   The MPOL_F_* flags that were previously passed in or'd
                into the mode. This was split to hopefully allow future
                extensions additional mode/flag space.
  pol_nodes:    the nodemask to apply for the memory policy

  pol_maxnodes: the max number of nodes described by pol_nodes

The usize arg is intended for the user to pass in sizeof(mpol_args)
to allow forward/backward compatibility whenever possible.

The flags argument is intended to future-proof the syscall against
future extensions which may require interpreting the arguments in the
structure differently.

Semantics of `set_mempolicy2` are otherwise the same as
`set_mempolicy` as of this patch.

Suggested-by: Michal Hocko
Signed-off-by: Gregory Price
---
 .../admin-guide/mm/numa_memory_policy.rst   | 10 ++++++
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/syscalls.h                    |  2 ++
 include/uapi/asm-generic/unistd.h           |  4 ++-
 mm/mempolicy.c                              | 34 +++++++++++++++++++
 18 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 6d645519c2c1..7195edaeaad9 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -438,6 +438,8 @@ Set [Task] Memory Policy::

	    long set_mempolicy(int mode, const unsigned long *nmask,
			       unsigned long maxnode);
+	    long set_mempolicy2(struct mpol_args *args, size_t size,
+				unsigned long flags);

 Set's the calling task's "task/process memory policy" to mode
 specified by the 'mode' argument and the set of nodes defined by
@@ -446,6 +448,12 @@ specified by the 'mode' argument and the set of nodes defined by
 'mode' argument with the flag (for example: MPOL_INTERLEAVE |
 MPOL_F_STATIC_NODES).

+set_mempolicy2() is an extended version of set_mempolicy() capable
+of setting a mempolicy which requires more information than can be
+passed via set_mempolicy(). For example, weighted interleave with
+task-local weights requires a weight array to be passed via the
+'mpol_args->il_weights' argument in the 'struct mpol_args' arg.
+
 See the set_mempolicy(2) man page for more details
@@ -515,6 +523,8 @@ Extended Mempolicy Arguments::
 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.

+Extended interfaces (set_mempolicy2) use this argument structure.
+
 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts. Each additional
 field may only apply to specific extended interfaces.
 See the respective

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 18c842ca6c32..0dc288a1118a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -496,3 +496,4 @@
 564	common	futex_wake		sys_futex_wake
 565	common	futex_wait		sys_futex_wait
 566	common	futex_requeue		sys_futex_requeue
+567	common	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 584f9528c996..50172ec0e1f5 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -470,3 +470,4 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7a4b780e82cb..839d90c535f2 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 5b6a0b02b7de..567c8b883735 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -462,3 +462,4 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index a842b41c8e06..cc0640e16f2f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -395,3 +395,4 @@
 454	n32	futex_wake		sys_futex_wake
 455	n32	futex_wait		sys_futex_wait
 456	n32	futex_requeue		sys_futex_requeue
+457	n32	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 525cc54bc63b..f7262fde98d9 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -444,3 +444,4 @@
 454	o32	futex_wake		sys_futex_wake
 455	o32	futex_wait		sys_futex_wait
 456	o32	futex_requeue		sys_futex_requeue
+457	o32	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index a47798fed54e..e10f0e8bd064 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -455,3 +455,4 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 7fab411378f2..4f03f5f42b78 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -543,3 +543,4 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 86fec9b080f6..f98dadc2e9df 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -459,3 +459,4 @@
 454	common	futex_wake		sys_futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue	sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2	sys_set_mempolicy2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 363fae0fe9bf..f47ba9f2d05d 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -459,3 +459,4 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 7bcaa3d5ea44..53fb16616728 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -502,3 +502,4 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index c8fac5205803..4b4dc41b24ee 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -461,3 +461,4 @@
 454	i386	futex_wake		sys_futex_wake
 455	i386	futex_wait		sys_futex_wait
 456	i386	futex_requeue		sys_futex_requeue
+457	i386	set_mempolicy2		sys_set_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 8cb8bf68721c..1bc2190bec27 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -378,6 +378,7 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2

 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 06eefa9c1458..e26dc89399eb 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -427,3 +427,4 @@
 454	common	futex_wake		sys_futex_wake
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
+457	common	set_mempolicy2		sys_set_mempolicy2
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fd9d12de7e92..3244cd990858 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -822,6 +822,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy,
				unsigned long addr, unsigned long flags);
 asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
				unsigned long maxnode);
+asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size,
+				unsigned long flags);
 asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
				const unsigned long __user *from,
				const unsigned long __user *to);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 756b013fb832..55486aba099f 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -828,9 +828,11 @@ __SYSCALL(__NR_futex_wake, sys_futex_wake)
 __SYSCALL(__NR_futex_wait, sys_futex_wait)
 #define __NR_futex_requeue 456
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
+#define __NR_set_mempolicy2 457
+__SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)

 #undef __NR_syscalls
-#define __NR_syscalls 457
+#define __NR_syscalls 458

 /*
  * 32 bit systems traditionally used different
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4be63547a4b3..fdc56798226b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1636,6 +1636,40 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
 	return kernel_set_mempolicy(mode, nmask, maxnode);
 }

+SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
+		unsigned long, flags)
+{
+	struct mpol_args kargs;
+	struct mempolicy_args margs;
+	int err;
+	nodemask_t policy_nodemask;
+
+	if (flags)
+		return -EINVAL;
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return err;
+
+	err = validate_mpol_flags(kargs.mode, &kargs.mode_flags);
+	if (err)
+		return err;
+
+	memset(&margs, '\0', sizeof(margs));
+	margs.mode = kargs.mode;
+	margs.mode_flags = kargs.mode_flags;
+	if (kargs.pol_nodes) {
+		err = get_nodes(&policy_nodemask, kargs.pol_nodes,
+				kargs.pol_maxnodes);
+		if (err)
+			return err;
+		margs.policy_nodes = &policy_nodemask;
+	} else
+		margs.policy_nodes = NULL;
+
+	return do_set_mempolicy(&margs);
+}
+
 static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
			     const unsigned long __user *old_nodes,
			     const unsigned long __user *new_nodes)

From patchwork Thu Dec 7 00:27:57 2023
From: Gregory Price
Subject: [RFC PATCH 09/11] mm/mempolicy: add get_mempolicy2 syscall
Date: Wed, 6 Dec 2023 19:27:57 -0500
Message-Id: <20231207002759.51418-10-gregory.price@memverge.com>
In-Reply-To: <20231207002759.51418-1-gregory.price@memverge.com>
References:
<20231207002759.51418-1-gregory.price@memverge.com>

get_mempolicy2 is an extensible get_mempolicy interface which allows a user
to retrieve the memory policy for a task or address.

Defined as:

	get_mempolicy2(struct mpol_args *args, size_t size, unsigned long flags)

Input values include the following fields of mpol_args:

	pol_nodes:    if set, the nodemask of the policy is returned here
	pol_maxnodes: if pol_nodes is set, must describe the max number of
	              nodes to be copied to pol_nodes
	addr:         if MPOL_F_ADDR is passed in `flags`, this address will
	              be used to return the mempolicy details of the vma the
	              address belongs to
	flags:        if MPOL_F_MEMS_ALLOWED, returns mems_allowed in pol_nodes
	              if MPOL_F_ADDR, returns mempolicy info for the vma
	              containing addr
	              otherwise, returns per-task mempolicy information

Output values include the following fields of mpol_args:

	mode:        mempolicy mode
	mode_flags:  mempolicy mode flags
	pol_nodes:   if set, the nodemask for the mempolicy
	policy_node: if the policy has extended node information, it will be
	             placed here. For example, MPOL_INTERLEAVE will return
	             the next node which will be used for allocation
	addr_node:   if MPOL_F_ADDR is set, the NUMA node that the address
	             is located on will be returned
	home_node:   the policy home node will be returned here, or -1 if
	             not set

MPOL_F_NODE has been dropped from get_mempolicy2 (it is ignored) in favor of
returning explicit values in `policy_node` and `addr_node`.
Suggested-by: Michal Hocko
Signed-off-by: Gregory Price
---
 .../admin-guide/mm/numa_memory_policy.rst   |  8 +++-
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/syscalls.h                    |  2 +
 include/uapi/asm-generic/unistd.h           |  4 +-
 mm/mempolicy.c                              | 43 +++++++++++++++++++
 18 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 7195edaeaad9..82cdb765dd58 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -462,11 +462,17 @@ Get [Task] Memory Policy or Related Information::
 	long get_mempolicy(int *mode, const unsigned long *nmask,
 			unsigned long maxnode, void *addr, int flags);
+	long get_mempolicy2(struct mpol_args *args, size_t size,
+			unsigned long flags);

 Queries the "task/process memory policy" of the calling task, or the
 policy or location of a specified virtual address, depending on the
 'flags' argument.

+get_mempolicy2() is an extended version of get_mempolicy() capable of
+acquiring extended information about a mempolicy, including those
+that can only be set via set_mempolicy2() or mbind2().
+
 See the get_mempolicy(2) man page for more details

@@ -523,7 +529,7 @@ Extended Mempolicy Arguments::

 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
-Extended interfaces (set_mempolicy2) use this argument structure.
+Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure.

 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts. Each additional
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 0dc288a1118a..0301a8b0a262 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -497,3 +497,4 @@
 565	common	futex_wait			sys_futex_wait
 566	common	futex_requeue			sys_futex_requeue
 567	common	set_mempolicy2			sys_set_mempolicy2
+568	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 50172ec0e1f5..771a33446e8e 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -471,3 +471,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 839d90c535f2..048a409e684c 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -457,3 +457,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 567c8b883735..327b01bd6793 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -463,3 +463,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index cc0640e16f2f..921d58e1da23 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -396,3 +396,4 @@
 455	n32	futex_wait			sys_futex_wait
 456	n32	futex_requeue			sys_futex_requeue
 457	n32	set_mempolicy2			sys_set_mempolicy2
+458	n32	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index f7262fde98d9..9271c83c9993 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -445,3 +445,4 @@
 455	o32	futex_wait			sys_futex_wait
 456	o32	futex_requeue			sys_futex_requeue
 457	o32	set_mempolicy2			sys_set_mempolicy2
+458	o32	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index e10f0e8bd064..0654f3f89fc7 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 4f03f5f42b78..ac11d2064e7a 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -544,3 +544,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index f98dadc2e9df..1cdcafe1ccca 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455	common	futex_wait		sys_futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue	sys_futex_requeue
 457	common	set_mempolicy2		sys_set_mempolicy2	sys_set_mempolicy2
+458	common	get_mempolicy2		sys_get_mempolicy2	sys_get_mempolicy2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index f47ba9f2d05d..f71742024c29 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -460,3 +460,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 53fb16616728..2fbf5dbe0620 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -503,3 +503,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4b4dc41b24ee..0af813b9a118 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -462,3 +462,4 @@
 455	i386	futex_wait		sys_futex_wait
 456	i386	futex_requeue		sys_futex_requeue
 457	i386	set_mempolicy2		sys_set_mempolicy2
+458	i386	get_mempolicy2		sys_get_mempolicy2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1bc2190bec27..0b777876fc15 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -379,6 +379,7 @@
 455	common	futex_wait		sys_futex_wait
 456	common	futex_requeue		sys_futex_requeue
 457	common	set_mempolicy2		sys_set_mempolicy2
+458	common	get_mempolicy2		sys_get_mempolicy2

 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index e26dc89399eb..4536c9a4227d 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -428,3 +428,4 @@
 455	common	futex_wait			sys_futex_wait
 456	common	futex_requeue			sys_futex_requeue
 457	common	set_mempolicy2			sys_set_mempolicy2
+458	common	get_mempolicy2			sys_get_mempolicy2
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3244cd990858..774512b7934e 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -820,6 +820,8 @@ asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
 				unsigned long maxnode,
 				unsigned long addr, unsigned long flags);
+asmlinkage long sys_get_mempolicy2(struct mpol_args *args, size_t size,
+				unsigned long flags);
 asmlinkage long sys_set_mempolicy(int mode, const unsigned long __user *nmask,
 				unsigned long maxnode);
 asmlinkage long sys_set_mempolicy2(struct mpol_args *args, size_t size,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 55486aba099f..719accc731db 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -830,9 +830,11 @@ __SYSCALL(__NR_futex_wait, sys_futex_wait)
 __SYSCALL(__NR_futex_requeue, sys_futex_requeue)
 #define __NR_set_mempolicy2 457
 __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2)
+#define __NR_get_mempolicy2 458
+__SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2)

 #undef __NR_syscalls
-#define __NR_syscalls 458
+#define __NR_syscalls 459

 /*
  * 32 bit systems traditionally used different
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index fdc56798226b..d1d10b2746e3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1860,6 +1860,49 @@ SYSCALL_DEFINE5(get_mempolicy, int __user *, policy,
 	return kernel_get_mempolicy(policy, nmask, maxnode, addr, flags);
 }
+SYSCALL_DEFINE3(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
+		unsigned long, flags)
+{
+	struct mpol_args kargs;
+	struct mempolicy_args margs;
+	int err;
+	nodemask_t policy_nodemask;
+
+	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
+	if (err)
+		return err;
+
+	if (flags & MPOL_F_MEMS_ALLOWED) {
+		if (!kargs.pol_nodes)
+			return -EINVAL;
+		err = do_get_mems_allowed(&policy_nodemask);
+		if (err)
+			return err;
+		return copy_nodes_to_user(kargs.pol_nodes, kargs.pol_maxnodes,
+					  &policy_nodemask);
+	}
+
+	margs.policy_nodes = kargs.pol_nodes ? &policy_nodemask : NULL;
+	if (flags & MPOL_F_ADDR) {
+		margs.addr = kargs.addr;
+		err = do_get_vma_mempolicy(&margs);
+	} else
+		err = do_get_task_mempolicy(&margs);
+
+	if (err)
+		return err;
+
+	kargs.mode = margs.mode;
+	kargs.mode_flags = margs.mode_flags;
+	kargs.policy_node = margs.policy_node;
+	kargs.addr_node = (flags & MPOL_F_ADDR) ? margs.addr_node : -1;
+	if (kargs.pol_nodes) {
+		err = copy_nodes_to_user(kargs.pol_nodes, kargs.pol_maxnodes,
+					 margs.policy_nodes);
+		if (err)
+			return err;
+	}
+
+	return copy_to_user(uargs, &kargs, usize) ? -EFAULT : 0;
+}
+
 bool vma_migratable(struct vm_area_struct *vma)
 {
 	if (vma->vm_flags & (VM_IO | VM_PFNMAP))

From patchwork Thu Dec 7 00:27:58 2023
X-Patchwork-Submitter: Gregory Price
X-Patchwork-Id: 13482525
From: Gregory Price
Subject: [RFC PATCH 10/11] mm/mempolicy: add the mbind2 syscall
Date: Wed, 6 Dec 2023 19:27:58 -0500
Message-Id: <20231207002759.51418-11-gregory.price@memverge.com>
In-Reply-To: <20231207002759.51418-1-gregory.price@memverge.com>
References: <20231207002759.51418-1-gregory.price@memverge.com>
mbind2 is an extensible mbind interface which allows a user to set the
mempolicy for one or more address ranges.

Defined as:

	mbind2(struct mpol_args *args, size_t size, unsigned long flags)

Input values include the following fields of mpol_args:

	mode:         The MPOL_* policy (DEFAULT, INTERLEAVE, etc.)
	mode_flags:   The MPOL_F_* flags that were previously passed in or'd
	              into the mode. This was split to hopefully allow future
	              extensions additional mode/flag space.
	pol_nodes:    the nodemask to apply for the memory policy
	pol_maxnodes: the max number of nodes described by pol_nodes
	home_node:    if MPOL_MF_HOME_NODE, set the home node of the policy
	              to this node
	vec:          the vector of (address, len) memory ranges to operate on
	vlen:         the number of entries in vec

The semantics are otherwise the same as mbind(), except that the home_node
can be set, and all address ranges defined by vec/vlen will be operated on.

Valid flags for mbind2 include the same flags as mbind, plus
MPOL_MF_HOME_NODE, which informs the syscall to utilize the value of
mpol_args->home_node to set the mempolicy home node.
Suggested-by: Michal Hocko
Suggested-by: Frank van der Linden
Suggested-by: Vinicius Tavares Petrucci
Suggested-by: Rakie Kim
Suggested-by: Hyeongtak Ji
Suggested-by: Honggyu Kim
Signed-off-by: Gregory Price
Co-developed-by: Vinicius Tavares Petrucci
---
 .../admin-guide/mm/numa_memory_policy.rst   | 12 +++-
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 include/linux/syscalls.h                    |  2 +
 include/uapi/asm-generic/unistd.h           |  4 +-
 include/uapi/linux/mempolicy.h              |  5 +-
 mm/mempolicy.c                              | 65 +++++++++++++++++++
 19 files changed, 98 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 82cdb765dd58..72ab21e24ec2 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -481,12 +481,18 @@ Install VMA/Shared Policy for a Range of Task's Address Space::
 	long mbind(void *start, unsigned long len, int mode,
 		const unsigned long *nmask, unsigned long maxnode,
 		unsigned flags);
+	long mbind2(struct mpol_args *args, size_t size,
+		unsigned long flags);

 mbind() installs the policy specified by (mode, nmask, maxnodes) as a
 VMA policy for the range of the calling task's address space specified
 by the 'start' and 'len' arguments. Additional actions may be
 requested via the 'flags' argument.

+mbind2() is an extended version of mbind() capable of operating on multiple
+memory ranges in one syscall, and which is capable of setting the home node
+for the memory policy without an additional call to set_mempolicy_home_node().
+
 See the mbind(2) man page for more details.

 Set home node for a Range of Task's Address Space::

@@ -502,6 +508,9 @@ closest to which page allocation will come from. Specifying the home node
 overrides the default allocation policy to allocate memory close to the
 local node for an executing CPU.

+mbind2() also provides a way for the home node to be set at the time the
+mempolicy is set. See the mbind(2) man page for more details.
+
 Extended Mempolicy Arguments::

 	struct mpol_args {

@@ -529,7 +538,8 @@ Extended Mempolicy Arguments::

 The extended mempolicy argument structure is defined to allow the mempolicy
 interfaces future extensibility without the need for additional system calls.
-Extended interfaces (set_mempolicy2 and get_mempolicy2) use this structure.
+Extended interfaces (set_mempolicy2, get_mempolicy2, and mbind2) use this
+argument structure.

 The core arguments (mode, mode_flags, pol_nodes, and pol_maxnodes) apply to
 all interfaces relative to their non-extended counterparts.
Each additional diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 0301a8b0a262..e8239293c35a 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -498,3 +498,4 @@ 566 common futex_requeue sys_futex_requeue 567 common set_mempolicy2 sys_set_mempolicy2 568 common get_mempolicy2 sys_get_mempolicy2 +569 common mbind2 sys_mbind2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 771a33446e8e..a3f39750257a 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -472,3 +472,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 048a409e684c..9a12dface18e 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -458,3 +458,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index 327b01bd6793..6cb740123137 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -464,3 +464,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 921d58e1da23..52cf720f8ae2 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -397,3 +397,4 @@ 456 n32 futex_requeue sys_futex_requeue 457 n32 set_mempolicy2 sys_set_mempolicy2 458 n32 get_mempolicy2 sys_get_mempolicy2 +459 n32 
mbind2 sys_mbind2 diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index 9271c83c9993..fd37c5301a48 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -446,3 +446,4 @@ 456 o32 futex_requeue sys_futex_requeue 457 o32 set_mempolicy2 sys_set_mempolicy2 458 o32 get_mempolicy2 sys_get_mempolicy2 +459 o32 mbind2 sys_mbind2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index 0654f3f89fc7..fcd67bc405b1 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -457,3 +457,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index ac11d2064e7a..89715417014c 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -545,3 +545,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 1cdcafe1ccca..c8304e0d0aa7 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -461,3 +461,4 @@ 456 common futex_requeue sys_futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 sys_mbind2 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index f71742024c29..e5c51b6c367f 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -461,3 +461,4 @@ 456 common futex_requeue sys_futex_requeue 457 
common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index 2fbf5dbe0620..74527f585500 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -504,3 +504,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 0af813b9a118..be2e2aa17dd8 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -463,3 +463,4 @@ 456 i386 futex_requeue sys_futex_requeue 457 i386 set_mempolicy2 sys_set_mempolicy2 458 i386 get_mempolicy2 sys_get_mempolicy2 +459 i386 mbind2 sys_mbind2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 0b777876fc15..6e2347eb8773 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -380,6 +380,7 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 # # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index 4536c9a4227d..f00a21317dc0 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -429,3 +429,4 @@ 456 common futex_requeue sys_futex_requeue 457 common set_mempolicy2 sys_set_mempolicy2 458 common get_mempolicy2 sys_get_mempolicy2 +459 common mbind2 sys_mbind2 diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 774512b7934e..a49a67e496bc 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -816,6 +816,8 
@@ asmlinkage long sys_mbind(unsigned long start, unsigned long len, const unsigned long __user *nmask, unsigned long maxnode, unsigned flags); +asmlinkage long sys_mbind2(struct mpol_args *args, size_t size, + unsigned long flags); asmlinkage long sys_get_mempolicy(int __user *policy, unsigned long __user *nmask, unsigned long maxnode, diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 719accc731db..cd31599bb9cc 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -832,9 +832,11 @@ __SYSCALL(__NR_futex_requeue, sys_futex_requeue) __SYSCALL(__NR_set_mempolicy2, sys_set_mempolicy2) #define __NR_get_mempolicy2 458 __SYSCALL(__NR_get_mempolicy2, sys_get_mempolicy2) +#define __NR_mbind2 459 +__SYSCALL(__NR_mbind2, sys_mbind2) #undef __NR_syscalls -#define __NR_syscalls 459 +#define __NR_syscalls 460 /* * 32 bit systems traditionally used different diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index e6b50903047c..3e463442fe28 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -62,13 +62,14 @@ struct mpol_args { #define MPOL_F_ADDR (1<<1) /* look up vma using address */ #define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */ -/* Flags for mbind */ +/* Flags for mbind/mbind2 */ #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ #define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to policy */ #define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to policy */ #define MPOL_MF_LAZY (1<<3) /* UNSUPPORTED FLAG: Lazy migrate on fault */ -#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */ +#define MPOL_MF_HOME_NODE (1<<4) /* mbind2: set home node */ +#define MPOL_MF_INTERNAL (1<<5) /* Internal flags start here */ #define MPOL_MF_VALID (MPOL_MF_STRICT | \ MPOL_MF_MOVE | \ diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d1d10b2746e3..c203cea52ce9 100644 --- 
a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1603,6 +1603,71 @@ SYSCALL_DEFINE6(mbind, unsigned long, start, unsigned long, len, return kernel_mbind(start, len, mode, nmask, maxnode, flags); } +SYSCALL_DEFINE3(mbind2, struct mpol_args __user *, uargs, size_t, usize, + unsigned long, flags) +{ + struct mpol_args kargs; + struct mempolicy_args margs; + nodemask_t policy_nodes; + struct iovec iovstack[UIO_FASTIOV]; + struct iovec *iov = iovstack; + struct iov_iter iter; + int err; + + err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize); + if (err) + return -EINVAL; + + err = validate_mpol_flags(kargs.mode, &kargs.mode_flags); + if (err) + return err; + + if (!kargs.vec || !kargs.vlen) + return -EINVAL; + + margs.mode = kargs.mode; + margs.mode_flags = kargs.mode_flags; + margs.addr = kargs.addr; + + /* if home node given, validate it is online */ + if (flags & MPOL_MF_HOME_NODE) { + if ((kargs.home_node >= MAX_NUMNODES) || + !node_online(kargs.home_node)) + return -EINVAL; + margs.home_node = kargs.home_node; + } else + margs.home_node = NUMA_NO_NODE; + flags &= ~MPOL_MF_HOME_NODE; + + if (kargs.pol_nodes) { + err = get_nodes(&policy_nodes, kargs.pol_nodes, + kargs.pol_maxnodes); + if (err) + return err; + margs.policy_nodes = &policy_nodes; + } else + margs.policy_nodes = NULL; + + /* For each address range in vector, do_mbind */ + err = import_iovec(ITER_DEST, kargs.vec, kargs.vlen, + ARRAY_SIZE(iovstack), &iov, &iter); + if (err) + return err; + while (iov_iter_count(&iter)) { + unsigned long start, len; + + start = untagged_addr((unsigned long)iter_iov_addr(&iter)); + len = iter_iov_len(&iter); + err = do_mbind(start, len, &margs, flags); + if (err) + break; + iov_iter_advance(&iter, iter_iov_len(&iter)); + } + + kfree(iov); + return err; +} + /* Set the process memory policy */ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask, unsigned long maxnode) From patchwork Thu Dec 7 00:27:59 2023 Content-Type: text/plain; 
charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: Gregory Price
Subject: [RFC PATCH 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave
Date: Wed, 6 Dec 2023 19:27:59 -0500
Message-Id: <20231207002759.51418-12-gregory.price@memverge.com>
In-Reply-To: <20231207002759.51418-1-gregory.price@memverge.com>
References: <20231207002759.51418-1-gregory.price@memverge.com>

From: Gregory Price

Extend set_mempolicy2 and mbind2 to support weighted interleave, and
demonstrate the extensibility of
the mpol_args structure.

To support weighted interleave we add interleave weight fields to the
following structures:

Kernel Internal:  (include/linux/mempolicy.h)

	struct mempolicy {
		/* task-local weights to apply to weighted interleave */
		unsigned char weights[MAX_NUMNODES];
	}

	struct mempolicy_args {
		/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
		unsigned char *il_weights;	/* of size MAX_NUMNODES */
	}

UAPI:  (/include/uapi/linux/mempolicy.h)

	struct mpol_args {
		/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
		unsigned char *il_weights;	/* of size pol_max_nodes */
	}

The task-local weights are a single, one-dimensional array of weights
that apply to all possible nodes on the system.  If a node is set in the
mempolicy nodemask, the weight in `il_weights` must be >= 1, otherwise
set_mempolicy2() will return -EINVAL.  If a node is not set in
pol_nodemask, the weight will default to `1` in the task policy.

The default value of `1` is required to handle the situation where a
task migrates to a set of nodes for which weights were not set (up to
and including the local numa node).  For example, a migrated task whose
nodemask changes entirely will have all its weights defaulted back to
`1`, or if the nodemask changes to include a mix of nodes that were not
previously accounted for, the weighted interleave may be suboptimal.

If migrations are expected, a task should prefer not to use task-local
interleave weights, and instead utilize the global settings for natural
re-weighting on migration.

To support global vs local weighting, we add the kernel-internal flag:

	MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */

This flag is set when il_weights is omitted by set_mempolicy2(), or
when MPOL_WEIGHTED_INTERLEAVE is set by set_mempolicy().
This internal mode_flag dictates whether global weights or task-local
weights are utilized by the various weighted interleave functions:

* weighted_interleave_nodes
* weighted_interleave_nid
* alloc_pages_bulk_array_weighted_interleave

	if (pol->flags & MPOL_F_GWEIGHT)
		pol_weights = iw_table[numa_node_id()].weights;
	else
		pol_weights = pol->wil.weights;

To simplify creation and duplication of mempolicies, the weights are
added as a structure directly within mempolicy.  This allows the
existing logic in __mpol_dup to copy the weights without additional
allocations:

	if (old == current->mempolicy) {
		task_lock(current);
		*new = *old;
		task_unlock(current);
	} else
		*new = *old;

Suggested-by: Rakie Kim
Suggested-by: Hyeongtak Ji
Suggested-by: Honggyu Kim
Suggested-by: Vinicius Tavares Petrucci
Signed-off-by: Gregory Price
Co-developed-by: Rakie Kim
Signed-off-by: Rakie Kim
Co-developed-by: Hyeongtak Ji
Signed-off-by: Hyeongtak Ji
Co-developed-by: Honggyu Kim
Signed-off-by: Honggyu Kim
Co-developed-by: Vinicius Tavares Petrucci
Signed-off-by: Vinicius Tavares Petrucci
---
 .../admin-guide/mm/numa_memory_policy.rst | 13 ++-
 include/linux/mempolicy.h                 |  2 +
 include/uapi/linux/mempolicy.h            |  3 +
 mm/mempolicy.c                            | 87 ++++++++++++++++++-
 4 files changed, 100 insertions(+), 5 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 72ab21e24ec2..f3a9dcbaa7ed 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -254,7 +254,8 @@ MPOL_WEIGHTED_INTERLEAVE
	This mode operates the same as MPOL_INTERLEAVE, except that
	interleaving behavior is executed based on weights set in
	/sys/kernel/mm/mempolicy/weighted_interleave/
-	rather than simple round-robin interleave (which is the default).
+	when configured to utilize global weights, or based on task-local
+	weights configured with set_mempolicy2(2) or mbind2(2).
	When utilizing global weights from the sysfs interface, weights
	are applied in a src-node relative manner.  For example
@@ -267,6 +268,13 @@ MPOL_WEIGHTED_INTERLEAVE
	cgroup initiated migrations) to re-weight for the optimal
	distribution of bandwidth.

+	When utilizing task-local weights, weights are not rebalanced
+	in the event of a task migration.  If a weight has not been
+	explicitly set for a node set in the new nodemask, the
+	value of that weight defaults to "1".  For this reason, if
+	migrations are expected or possible, users should consider
+	utilizing global interleave weights.
+
 NUMA memory policy supports the following optional mode flags:

 MPOL_F_STATIC_NODES
@@ -533,6 +541,9 @@ Extended Mempolicy Arguments::
		/* mbind2: address ranges to apply the policy */
		struct iovec *vec;
		size_t vlen;
+
+		/* weighted interleave settings */
+		unsigned char *il_weights;	/* of size pol_maxnodes */
	};

 The extended mempolicy argument structure is defined to allow the mempolicy
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 117c5395c6eb..c78874bd84dd 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -58,6 +58,7 @@ struct mempolicy {
	/* Weighted interleave settings */
	struct {
		unsigned char cur_weight;
+		unsigned char weights[MAX_NUMNODES];
	} wil;
 };

@@ -73,6 +74,7 @@ struct mempolicy_args {
	unsigned long addr;		/* get: vma address */
	int addr_node;			/* get: node the address belongs to */
	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
+	unsigned char *il_weights;	/* for mode MPOL_WEIGHTED_INTERLEAVE */
 };

 /*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3e463442fe28..c2f229037be3 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -43,6 +43,8 @@ struct mpol_args {
	/* mbind2: address ranges to apply the policy */
	struct iovec *vec;
	size_t vlen;
+	/* weighted interleave settings */
+	unsigned char *il_weights;	/* of size pol_maxnodes */
 };

 /* Flags for
set_mempolicy */
@@ -83,6 +85,7 @@ struct mpol_args {
 #define MPOL_F_SHARED	(1 << 0)	/* identify shared policies */
 #define MPOL_F_MOF	(1 << 3)	/* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4)	/* Migrate On protnone Reference On Node */
+#define MPOL_F_GWEIGHT	(1 << 5)	/* Utilize global weights */

 /*
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c203cea52ce9..7273bb9540fa 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -274,6 +274,7 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
	unsigned short mode = args->mode;
	unsigned short flags = args->mode_flags;
	nodemask_t *nodes = args->policy_nodes;
+	int node;

	if (mode == MPOL_DEFAULT) {
		if (nodes && !nodes_empty(*nodes))
@@ -300,6 +301,19 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
		    (flags & MPOL_F_STATIC_NODES) ||
		    (flags & MPOL_F_RELATIVE_NODES))
			return ERR_PTR(-EINVAL);
+	} else if (mode == MPOL_WEIGHTED_INTERLEAVE) {
+		/* weighted interleave requires a nodemask and weights > 0 */
+		if (nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		if (args->il_weights) {
+			node = first_node(*nodes);
+			while (node != MAX_NUMNODES) {
+				if (!args->il_weights[node])
+					return ERR_PTR(-EINVAL);
+				node = next_node(node, *nodes);
+			}
+		} else if (!(args->mode_flags & MPOL_F_GWEIGHT))
+			return ERR_PTR(-EINVAL);
	} else if (nodes_empty(*nodes))
		return ERR_PTR(-EINVAL);

@@ -312,6 +326,16 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
	policy->home_node = NUMA_NO_NODE;
	policy->wil.cur_weight = 0;
	policy->home_node = args->home_node;
+	if (policy->mode == MPOL_WEIGHTED_INTERLEAVE && args->il_weights) {
+		policy->wil.cur_weight = 0;
+		/* Minimum weight value is always 1 */
+		memset(policy->wil.weights, 1, MAX_NUMNODES);
+		node = first_node(*nodes);
+		while (node != MAX_NUMNODES) {
+			policy->wil.weights[node] = args->il_weights[node];
+			node = next_node(node, *nodes);
+		}
+	}
	return policy;
 }

@@ -1612,6 +1636,7 @@ SYSCALL_DEFINE3(mbind2, struct mpol_args __user *, uargs, size_t, usize,
	struct iovec iovstack[UIO_FASTIOV];
	struct iovec *iov = iovstack;
	struct iov_iter iter;
+	unsigned char weights[MAX_NUMNODES];
	int err;

	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
@@ -1648,6 +1673,19 @@ SYSCALL_DEFINE3(mbind2, struct mpol_args __user *, uargs, size_t, usize,
	} else
		margs.policy_nodes = NULL;

+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE) {
+		err = copy_struct_from_user(&weights,
+					    sizeof(weights),
+					    &kargs.il_weights,
+					    kargs.pol_maxnodes);
+		if (err)
+			return err;
+		margs.il_weights = weights;
+	} else {
+		margs.il_weights = NULL;
+		flags |= MPOL_F_GWEIGHT;
+	}
+
	/* For each address range in vector, do_mbind */
	err = import_iovec(ITER_DEST, kargs.vec, kargs.vlen,
			   ARRAY_SIZE(iovstack), &iov, &iter);
@@ -1686,6 +1724,9 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
	if (err)
		return err;

+	if (mode & MPOL_WEIGHTED_INTERLEAVE)
+		mode_flags |= MPOL_F_GWEIGHT;
+
	memset(&args, 0, sizeof(args));
	args.mode = lmode;
	args.mode_flags = mode_flags;
@@ -1708,6 +1749,7 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
	struct mempolicy_args margs;
	int err;
	nodemask_t policy_nodemask;
+	unsigned char weights[MAX_NUMNODES];

	if (flags)
		return -EINVAL;
@@ -1732,6 +1774,19 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
	} else
		margs.policy_nodes = NULL;

+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) {
+		err = copy_struct_from_user(weights,
+					    sizeof(weights),
+					    kargs.il_weights,
+					    kargs.pol_maxnodes);
+		if (err)
+			return err;
+		margs.il_weights = weights;
+	} else {
+		margs.il_weights = NULL;
+		flags |= MPOL_F_GWEIGHT;
+	}
+
	return do_set_mempolicy(&margs);
 }

@@ -2081,16 +2136,22 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
	unsigned int next;
	struct task_struct *me = current;
+	unsigned char *weights;
	if (policy->wil.cur_weight > 0) {
		policy->wil.cur_weight--;
		return me->il_prev;
	}

+	if (policy->flags & MPOL_F_GWEIGHT)
+		weights = iw_table[numa_node_id()].weights;
+	else
+		weights = policy->wil.weights;
+
	next = next_node_in(me->il_prev, policy->nodes);
	if (next < MAX_NUMNODES) {
		me->il_prev = next;
-		policy->wil.cur_weight = iw_table[numa_node_id()].weights[next];
+		policy->wil.cur_weight = weights[next];
	}
	return next;
 }

@@ -2160,15 +2221,21 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
	nodemask_t nodemask = pol->nodes;
	unsigned int target, weight_total = 0;
-	int nid, local_node = numa_node_id();
+	int nid;
+	unsigned char *pol_weights;
	unsigned char weights[MAX_NUMNODES];
	unsigned char weight;

	barrier();

+	if (pol->flags & MPOL_F_GWEIGHT)
+		pol_weights = iw_table[numa_node_id()].weights;
+	else
+		pol_weights = pol->wil.weights;
+
	/* Collect weights and save them on stack so they don't change */
	for_each_node_mask(nid, nodemask) {
-		weight = iw_table[local_node].weights[nid];
+		weight = pol_weights[nid];
		weight_total += weight;
		weights[nid] = weight;
	}
@@ -2564,6 +2631,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
	unsigned long nr_allocated;
	unsigned long rounds;
	unsigned long node_pages, delta;
+	unsigned char *pol_weights;
	unsigned char weight;
	unsigned char weights[MAX_NUMNODES];
	unsigned int weight_total;
@@ -2576,9 +2644,14 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,

	nnodes = nodes_weight(nodes);

+	if (pol->flags & MPOL_F_GWEIGHT)
+		pol_weights = iw_table[numa_node_id()].weights;
+	else
+		pol_weights = pol->wil.weights;
+
	/* Collect weights and save them on stack so they don't change */
	for_each_node_mask(node, nodes) {
-		weight = iw_table[numa_node_id()].weights[node];
+		weight = pol_weights[node];
		weight_total += weight;
		weights[node] = weight;
	}
@@ -3095,6 +3168,7 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 {
	int ret;
	struct mempolicy_args margs;
+	unsigned char weights[MAX_NUMNODES];

	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
	rwlock_init(&sp->lock);

@@ -3112,6 +3186,11 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
		margs.mode_flags = mpol->flags;
		margs.policy_nodes = &mpol->w.user_nodemask;
		margs.home_node = NUMA_NO_NODE;
+		if (margs.mode == MPOL_WEIGHTED_INTERLEAVE &&
+		    !(margs.mode_flags & MPOL_F_GWEIGHT)) {
+			memcpy(weights, mpol->wil.weights, sizeof(weights));
+			margs.il_weights = weights;
+		}

		/* contextualize the tmpfs mount point mempolicy to this file */
		npol = mpol_new(&margs);