[RFC,0/3] mm/mempolicy: set/get_mempolicy2

This patch set is a proposal for set_mempolicy2 and get_mempolicy2 system calls. This is an extension to the existing mempolicy syscalls that allow for a more extensible mempolicy interface and new, complex memory policies. This RFC is broken into 3 patches for discussion: 1) A refactor of do_set_mempolicy that allows code reuse for the new syscalls and centralizes the mempolicy swap code. 2) The implementation of get_mempolicy2 and set_mempolicy2 which includes a new uapi type: "struct mempolicy_args" and denotes the original mempolicies as "legacy". This allows the existing policies to be routed through the original interface. (note: only implemented on x86 at this time, though can be hacked into other architectures somewhat trivially) 3) The implementation of a sample mempolicy ("partial-interleave") which was not possible on the old interface. x) next planned patches: selftest/ltp test/example programs/etc. I wanted to start discussion before i went too deep. Besides the obvious proposal of extending the mempolicy subsystem for new policies, the core proposal is the addition of the new uapi type "struct mempolicy". In this proposal, the get and set interfaces use the same structure, and some fields may be ignored depending on the requested operation. This sample implementation of get_mempolicy allows for the retrieval of all information that would have previously required multiple calls to get_mempolicy, and implements an area for per-policy information. The multiple err fields would allow for continuation of information retrieval should one or more failures occur (though notably this is probably not defensible, and should probably just error out - mostly a debugging interface for now). This allows for future extensibility, and would avoid the need for additional syscalls into the future, so long as the args structure is versioned or checked based on size. struct mempolicy_args { int err; unsigned short mode; unsigned long *nodemask; unsigned long maxnode; unsigned short flags; struct { /* Memory allowed */ struct { int err; unsigned long maxnode; unsigned long *nodemask; } allowed; /* Address information */ struct { int err; unsigned long addr; unsigned long node; unsigned short mode; unsigned short flags; } addr; } get; union { /* Interleave */ struct { unsigned long next_node; /* get only */ } interleave; /* Partial Interleave */ struct { unsigned long interval; /* get and set */ unsigned long next_node; /* get only */ } part_int; }; }; In the third patch, we implement a sample Partial-Interleave mempolicy that is not possible to implement given the existing mempolicy interface - and would either require the exposure of new interfaces to set the value described. We extend the internal mempolicy structure to include to include a new union area which can be used to host complex policy data. Example: union { /* Partial Interleave: Allocate local count, then interleave */ struct { int interval; /* allocation interval at which to interleave */ int count; /* the current allocation count */ } part_int; }; Summary of Partial Interleave: ============================= nodeset=0,1,2 interval=3 cpunode=0 The preferred node (cpunode) is taken by default to be the node on which [interval] allocations are made before an interleave occurs. Over 10 consecutive allocations, the following nodes will be selected: [0,0,0,1,2,0,0,0,1,2] In this example, there is a 60%/20%/20% distribution of memory across the node set. Some notes for discussion ========================= 0) Why? In the coming age of CXL and a many-numa-node system with memory hosted on the PCIe bus, new memory policies are likely to be beneficial to experiment with and ultimately implement new allocation-time placement policies. Presently, much focus is placed on memory-usage monitoring and data migration, but these methods steal performance to accomplish what could be optimized for up-front. For example, if maximum memory bandwidth is required for an operation, then a statistical distribution of memory can be calculated fairly easily based on approximate expected memory usage. Getting a fair approximation of distribution at allocation can help reduce the migration load required after-the fact. This is the intent of the included partial-interleave example, which allows for an approximate distribution of memory, where the local node is still the preferred location for the majority of memory. 1) Maybe this should be a set of sysfs interfaces? This would involve adding a /proc/pid/mempolicy interface that allows for external processes to interrogate and change the mempolicy of running processes. This would be a fundamental change to the mempolicy subsystem, as (so far as i can tell) this is not possible as of present. Additionally, the policy is per-thread, not per-pid. Making this work on a per-thread, so it would be /proc/pid/task/tid/mempolicy. I avoided that for this RFC as it seemed more radical than simply proposing a set/get_mempolicy2 interface. Though technically it could be done. 2) Do we need this level extensibility? Presently the ability to dictate allocation-time placement is limited to a few primitive mechanisms: 1) existing mempolicy, and those that can be implemented using the existing interface. 2) numa-aware applications, requiring code changes. 3) LDPRELOAD methods, which have compability issues. For the sake of compatibility, being able to extent numactl to include newer, more complex policies would be beneficial. While partial-interleave passes a simple interval as an interger, more complex policies may want to pass multiple, complex pieces of data. For example, a 'statistical-interleave' policy may pass a list of integers that dictates exactly how many allocations should happen per-node during interleave. Another policy may take in one or more nodemask's and do more complex distributions. Gregory Price (3): mm/mempolicy: refactor do_set_mempolicy for code re-use mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls mm/mempolicy: implement a partial-interleave mempolicy arch/x86/entry/syscalls/syscall_32.tbl | 2 + arch/x86/entry/syscalls/syscall_64.tbl | 2 + include/linux/mempolicy.h | 8 + include/linux/syscalls.h | 2 + include/uapi/asm-generic/unistd.h | 10 +- include/uapi/linux/mempolicy.h | 37 +++ mm/mempolicy.c | 420 +++++++++++++++++++++++-- 7 files changed, 456 insertions(+), 25 deletions(-)

Message ID	20230914235457.482710-1-gregory.price@memverge.com
Headers	show Return-Path: <linux-cxl-owner@vger.kernel.org> From: Gregory Price <gourry.memverge@gmail.com> To: linux-mm@vger.kernel.org Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, linux-cxl@vger.kernel.org, luto@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, arnd@arndb.de, akpm@linux-foundation.org, x86@kernel.org, Gregory Price <gregory.price@memverge.com> Subject: [RFC PATCH 0/3] mm/mempolicy: set/get_mempolicy2 Date: Thu, 14 Sep 2023 19:54:54 -0400 Message-Id: <20230914235457.482710-1-gregory.price@memverge.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	mm/mempolicy: set/get_mempolicy2 \| expand [RFC,0/3] mm/mempolicy: set/get_mempolicy2 [RFC,1/3] mm/mempolicy: refactor do_set_mempolicy for code re-use [RFC,2/3] mm/mempolicy: Implement set_mempolicy2 and get_mempolicy2 syscalls [RFC,3/3] mm/mempolicy: implement a partial-interleave mempolicy

[RFC,0/3] mm/mempolicy: set/get_mempolicy2

Message