From patchwork Tue Oct 3 00:21:55 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Gregory Price X-Patchwork-Id: 13406638 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E0815E776DF for ; Tue, 3 Oct 2023 00:22:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238908AbjJCAWU (ORCPT ); Mon, 2 Oct 2023 20:22:20 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59966 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S237475AbjJCAWL (ORCPT ); Mon, 2 Oct 2023 20:22:11 -0400 Received: from mail-oa1-x43.google.com (mail-oa1-x43.google.com [IPv6:2001:4860:4864:20::43]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D3998CE; Mon, 2 Oct 2023 17:22:07 -0700 (PDT) Received: by mail-oa1-x43.google.com with SMTP id 586e51a60fabf-1e10ba12fd3so208593fac.1; Mon, 02 Oct 2023 17:22:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1696292527; x=1696897327; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=M+oV1ZgyPCblrOh65x12ZqFYHoDSqMkCfYV5/Np3CLc=; b=iGSn5mlZ9uN3glRU+nqG/iL5dxscvtmbjxqGcGlkJ5bpoEhj6J7YK32DjiU+YDpTHF JwHBR3hWl4rX4aziRVgpcehOK2oBbbZ5Yft9lCQpVAVZYOQvdQrxUn3CI7Nrcwmzb8C/ q2bPwfQqpUVUCgR1znVzdHePViE+1Ok3aozUy6dFrP1IP1Hz/W/N+8yCdhjeERleHGQn 2irrnvlotJ2EK1840pqfQD79wXfqaklLX/T0ynZSGxz4cHIv3DsBygbSmcHeTh9ZiUAU 79EsEowoUXCWnslb4s9G8B2w9a2jQvZqaf9tC9jbcjkkGdzGAsktOn1GKnjQhOC9Amq0 a20A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1696292527; x=1696897327; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=M+oV1ZgyPCblrOh65x12ZqFYHoDSqMkCfYV5/Np3CLc=; b=RvYEGYAgZEYi3657MYi9uZeCQSu1sRemAIP50AE7ntoPxwd6nIhrS8lwyoSddRkdoF kWndQ+R9Rn8YxIDnYXPX3DhwfkmQDlSYRem3x/GoJcHSIg/gYzfJPY/g1wPLP4GTSl1u Cqpx67RxTowisjmwCEgpPRqkRjL2Ma5yw6PD5KzFM1YsKx3yBMblLErSQ98wFx2tjuJv uZ4j7l9kINpwS3iS9vKkQMW7teZDdqg4MeMzfPGZRI0b+UdJTbT6/sm3BrNrt42wmCPe CbBD3mJJxxYm0WkEsc/sINaFpHwrW7jaqpGRCr3/+hqcKUMD93+lrA63QXg2KUgLS7Zz qlOA== X-Gm-Message-State: AOJu0YxWqjMzgs0u3I8XLr9wexQTWzHy3jDGNWQU32/w/BFpIf32SsDX aEyLVn+Iy8xJ/IprUtQnHg== X-Google-Smtp-Source: AGHT+IF3H7cFXE9uqnzYRiSwmCzVCT00Oy06v4WEmrurh9FJPLnPgHbJ2bdP2Q5gVA4VPLuva4WPHg== X-Received: by 2002:a05:6870:3282:b0:1d6:439d:d04e with SMTP id q2-20020a056870328200b001d6439dd04emr14842596oac.53.1696292526913; Mon, 02 Oct 2023 17:22:06 -0700 (PDT) Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net. [173.79.56.208]) by smtp.gmail.com with ESMTPSA id a2-20020a056870618200b001e135f4f849sm24725oah.9.2023.10.02.17.22.05 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 02 Oct 2023 17:22:06 -0700 (PDT) From: Gregory Price X-Google-Original-From: Gregory Price To: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-api@vger.kernel.org, linux-cxl@vger.kernel.org, luto@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, arnd@arndb.de, akpm@linux-foundation.org, x86@kernel.org, Gregory Price Subject: [RFC PATCH v2 3/4] mm/mempolicy: implement a preferred-interleave Date: Mon, 2 Oct 2023 20:21:55 -0400 Message-Id: <20231003002156.740595-4-gregory.price@memverge.com> X-Mailer: git-send-email 2.39.1 In-Reply-To: <20231003002156.740595-1-gregory.price@memverge.com> References: <20231003002156.740595-1-gregory.price@memverge.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-cxl@vger.kernel.org The preferred-interleave mempolicy implements single-weight interleave mechanism where the preferred node is the local node. If the local node is not set in nodemask, the first node in the node mask is the preferred node. When set, N (weight) pages will be allocated on the preferred node beforce an interleave pass occurs. For example: nodes=0,1,2 interval=3 cpunode=0 Over 10 consecutive allocations, the following nodes will be selected: [0,0,0,1,2,0,0,0,1,2] In this example, there is a 60%/20%/20% distribution of memory. Using this mechanism, it becomes possible to define an approximate distribution percentage of memory across a set of nodes: local_node% : interval/((nr_nodes-1)+interval-1) other_node% : (1-local_node%)/(nr_nodes-1) The behavior can be preferred over a fully-weighted interleave (where each node has a separate weight) when migrations or multiple sockets may be in use. If a task migrates, the weight applies to the new local node without a need for the task to "rebalance" its weights. Similarly, if nodes are removed from the nodemask, no weights need to be recalculated. The exception to this is when the local node is removed from the nodemask, which is a rare situation. Similarly, consider a task executing on a 2-socket system which creates a new thread. If the first thread is scheduled to execute on socket 0 and the second thread is scheduled to execute on socket 1, weightings set by thread 1 (which are inherited by thread 2) would very likely be a poor interleave strategy for the new thread. In this scheme, thread 2 would inherit the same weight, but it would apply to the local node of thread 2, leading to more predictable behavior for new allocations. Signed-off-by: Gregory Price --- include/linux/mempolicy.h | 8 ++ include/uapi/linux/mempolicy.h | 6 + mm/mempolicy.c | 203 ++++++++++++++++++++++++++++++++- 3 files changed, 212 insertions(+), 5 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index d232de7cdc56..8f918488c61c 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -48,6 +48,14 @@ struct mempolicy { nodemask_t nodes; /* interleave/bind/perfer */ int home_node; /* Home node to use for MPOL_BIND and MPOL_PREFERRED_MANY */ + union { + /* Preferred Interleave: Weight local, then interleave */ + struct { + int weight; + int count; + } pil; + }; + union { nodemask_t cpuset_mems_allowed; /* relative to these nodes */ nodemask_t user_nodemask; /* nodemask passed by user */ diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index ea386872094b..41c35f404c5e 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -24,6 +24,7 @@ enum { MPOL_LOCAL, MPOL_PREFERRED_MANY, MPOL_LEGACY, /* set_mempolicy limited to above modes */ + MPOL_PREFERRED_INTERLEAVE, MPOL_MAX, /* always last member of enum */ }; @@ -52,6 +53,11 @@ struct mempolicy_args { struct { unsigned long next_node; /* get only */ } interleave; + /* Partial interleave */ + struct { + unsigned long weight; /* get and set */ + unsigned long next_node; /* get only */ + } pil; }; }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 936c641f554e..6374312cef5f 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -399,6 +399,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { .create = mpol_new_nodemask, .rebind = mpol_rebind_nodemask, }, + [MPOL_PREFERRED_INTERLEAVE] = { + .create = mpol_new_nodemask, + .rebind = mpol_rebind_nodemask, + }, [MPOL_PREFERRED] = { .create = mpol_new_preferred, .rebind = mpol_rebind_preferred, @@ -873,7 +877,8 @@ static long replace_mempolicy(struct mempolicy *new, nodemask_t *nodes) old = current->mempolicy; current->mempolicy = new; - if (new && new->mode == MPOL_INTERLEAVE) + if (new && (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_PREFERRED_INTERLEAVE)) current->il_prev = MAX_NUMNODES-1; out: task_unlock(current); @@ -915,6 +920,7 @@ static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes) switch (p->mode) { case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_PREFERRED_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: *nodes = p->nodes; @@ -1609,6 +1615,23 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask, return kernel_set_mempolicy(mode, nmask, maxnode); } +static long do_set_preferred_interleave(struct mempolicy_args *args, + struct mempolicy *new, + nodemask_t *nodes) +{ + /* Preferred interleave cannot be done with no nodemask */ + if (nodes_empty(*nodes)) + return -EINVAL; + + /* Preferred interleave weight cannot be <= 0 */ + if (args->pil.weight <= 0) + return -EINVAL; + + new->pil.weight = args->pil.weight; + new->pil.count = 0; + return 0; +} + static long do_set_mempolicy2(struct mempolicy_args *args) { struct mempolicy *new = NULL; @@ -1630,6 +1653,9 @@ static long do_set_mempolicy2(struct mempolicy_args *args) return PTR_ERR(new); switch (args->mode) { + case MPOL_PREFERRED_INTERLEAVE: + err = do_set_preferred_interleave(args, new, &nodes); + break; default: BUG(); } @@ -1767,6 +1793,12 @@ static long do_get_mempolicy2(struct mempolicy_args *kargs) pol->nodes); rc = 0; break; + case MPOL_PREFERRED_INTERLEAVE: + kargs->pil.next_node = next_node_in(current->il_prev, + pol->nodes); + kargs->pil.weight = pol->pil.weight; + rc = 0; + break; default: BUG(); } @@ -2102,12 +2134,41 @@ static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd) return nd; } +static unsigned int preferred_interleave_nodes(struct mempolicy *policy) +{ + int mynode = numa_node_id(); + struct task_struct *me = current; + int next; + + /* + * If the local node is not in the node mask, we treat the + * lowest node as the preferred node. This can happen if the + * cpu is bound to a node that is not present in the mempolicy + */ + if (!node_isset(mynode, policy->nodes)) + mynode = first_node(policy->nodes); + + next = next_node_in(me->il_prev, policy->nodes); + if (next == mynode) { + if (++policy->pil.count >= policy->pil.weight) { + policy->pil.count = 0; + me->il_prev = next; + } + } else if (next < MAX_NUMNODES) { + me->il_prev = next; + } + return next; +} + /* Do dynamic interleaving for a process */ static unsigned interleave_nodes(struct mempolicy *policy) { unsigned next; struct task_struct *me = current; + if (policy->mode == MPOL_PREFERRED_INTERLEAVE) + return preferred_interleave_nodes(policy); + next = next_node_in(me->il_prev, policy->nodes); if (next < MAX_NUMNODES) me->il_prev = next; @@ -2135,6 +2196,7 @@ unsigned int mempolicy_slab_node(void) return first_node(policy->nodes); case MPOL_INTERLEAVE: + case MPOL_PREFERRED_INTERLEAVE: return interleave_nodes(policy); case MPOL_BIND: @@ -2161,6 +2223,56 @@ unsigned int mempolicy_slab_node(void) } } +static unsigned int offset_pil_node(struct mempolicy *pol, unsigned long n) +{ + nodemask_t nodemask = pol->nodes; + unsigned int target, nnodes; + int i; + int nid = MAX_NUMNODES; + int weight = pol->pil.weight; + + /* + * The barrier will stabilize the nodemask in a register or on + * the stack so that it will stop changing under the code. + * + * Between first_node() and next_node(), pol->nodes could be changed + * by other threads. So we put pol->nodes in a local stack. + */ + barrier(); + + nnodes = nodes_weight(nodemask); + + /* + * If the local node ID is not set (cpu is bound to a node + * but that node is not set in the memory nodemask), interleave + * based on the lowest set node. + */ + nid = numa_node_id(); + if (!node_isset(nid, nodemask)) + nid = first_node(nodemask); + /* + * Mode or weight can change so default to basic interleave + * if the weight has become invalid. Basic interleave is + * equivalent to weight=1. Don't double-count the base node + */ + if (weight == 0) + weight = 1; + weight -= 1; + + /* If target <= the weight, no need to call next_node */ + target = ((unsigned int)n % (nnodes + weight)); + target -= (target > weight) ? weight : target; + target %= MAX_NUMNODES; + + /* Target may not be the first node, so use next_node_in to wrap */ + for (i = 0; i < target; i++) { + nid = next_node_in(nid, nodemask); + if (nid == MAX_NUMNODES) + nid = first_node(nodemask); + } + return nid; +} + /* * Do static interleaving for a VMA with known offset @n. Returns the n'th * node in pol->nodes (starting from n=0), wrapping around if n exceeds the @@ -2168,10 +2280,16 @@ unsigned int mempolicy_slab_node(void) */ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n) { - nodemask_t nodemask = pol->nodes; + nodemask_t nodemask; unsigned int target, nnodes; int i; int nid; + + if (pol->mode == MPOL_PREFERRED_INTERLEAVE) + return offset_pil_node(pol, n); + + nodemask = pol->nodes; + /* * The barrier will stabilize the nodemask in a register or on * the stack so that it will stop changing under the code. @@ -2239,7 +2357,8 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags, *nodemask = NULL; mode = (*mpol)->mode; - if (unlikely(mode == MPOL_INTERLEAVE)) { + if (unlikely(mode == MPOL_INTERLEAVE) || + unlikely(mode == MPOL_PREFERRED_INTERLEAVE)) { nid = interleave_nid(*mpol, vma, addr, huge_page_shift(hstate_vma(vma))); } else { @@ -2280,6 +2399,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_PREFERRED_INTERLEAVE: *mask = mempolicy->nodes; break; @@ -2390,7 +2510,8 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma, pol = get_vma_policy(vma, addr); - if (pol->mode == MPOL_INTERLEAVE) { + if (pol->mode == MPOL_INTERLEAVE || + pol->mode == MPOL_PREFERRED_INTERLEAVE) { struct page *page; unsigned nid; @@ -2492,7 +2613,8 @@ struct page *alloc_pages(gfp_t gfp, unsigned order) * No reference counting needed for current->mempolicy * nor system default_policy */ - if (pol->mode == MPOL_INTERLEAVE) + if (pol->mode == MPOL_INTERLEAVE || + pol->mode == MPOL_PREFERRED_INTERLEAVE) page = alloc_page_interleave(gfp, order, interleave_nodes(pol)); else if (pol->mode == MPOL_PREFERRED_MANY) page = alloc_pages_preferred_many(gfp, order, @@ -2552,6 +2674,69 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp, return total_allocated; } +static unsigned long alloc_pages_bulk_array_pil(gfp_t gfp, + struct mempolicy *pol, + unsigned long nr_pages, + struct page **page_array) +{ + nodemask_t nodemask = pol->nodes; + unsigned long nr_pages_main; + unsigned long nr_pages_other; + unsigned long total_cycle; + unsigned long delta; + unsigned long weight; + int allocated = 0; + int start_nid; + int nnodes; + int prev, next; + int i; + + /* This stabilizes nodes on the stack incase pol->nodes changes */ + barrier(); + + nnodes = nodes_weight(nodemask); + start_nid = numa_node_id(); + + if (!node_isset(start_nid, nodemask)) + start_nid = first_node(nodemask); + + if (nnodes == 1) { + allocated = __alloc_pages_bulk(gfp, start_nid, + NULL, nr_pages_main, + NULL, page_array); + return allocated; + } + /* We don't want to double-count the main node in calculations */ + nnodes--; + + weight = pol->pil.weight; + total_cycle = (weight + nnodes); + /* Number of pages on main node: (cycles*weight + up to weight) */ + nr_pages_main = ((nr_pages / total_cycle) * weight); + nr_pages_main += (nr_pages % total_cycle % (weight + 1)); + /* Number of pages on others: (remaining/nodes) + 1 page if delta */ + nr_pages_other = (nr_pages - nr_pages_main) / nnodes; + nr_pages_other /= nnodes; + /* Delta is number of pages beyond weight up to full cycle */ + delta = nr_pages - (nr_pages_main + (nr_pages_other * nnodes)); + + /* start by allocating for the main node, then interleave rest */ + prev = start_nid; + allocated = __alloc_pages_bulk(gfp, start_nid, NULL, nr_pages_main, + NULL, page_array); + for (i = 0; i < nnodes; i++) { + int pages = nr_pages_other + (delta-- ? 1 : 0); + + next = next_node_in(prev, nodemask); + if (next < MAX_NUMNODES) + prev = next; + allocated += __alloc_pages_bulk(gfp, next, NULL, pages, + NULL, page_array); + } + + return allocated; +} + static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid, struct mempolicy *pol, unsigned long nr_pages, struct page **page_array) @@ -2590,6 +2775,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, return alloc_pages_bulk_array_interleave(gfp, pol, nr_pages, page_array); + if (pol->mode == MPOL_PREFERRED_INTERLEAVE) + return alloc_pages_bulk_array_pil(gfp, pol, nr_pages, + page_array); + if (pol->mode == MPOL_PREFERRED_MANY) return alloc_pages_bulk_array_preferred_many(gfp, numa_node_id(), pol, nr_pages, page_array); @@ -2662,6 +2851,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) switch (a->mode) { case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_PREFERRED_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: return !!nodes_equal(a->nodes, b->nodes); @@ -2798,6 +2988,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long switch (pol->mode) { case MPOL_INTERLEAVE: + case MPOL_PREFERRED_INTERLEAVE: pgoff = vma->vm_pgoff; pgoff += (addr - vma->vm_start) >> PAGE_SHIFT; polnid = offset_il_node(pol, pgoff); @@ -3185,6 +3376,7 @@ static const char * const policy_modes[] = [MPOL_PREFERRED] = "prefer", [MPOL_BIND] = "bind", [MPOL_INTERLEAVE] = "interleave", + [MPOL_PREFERRED_INTERLEAVE] = "preferred interleave", [MPOL_LOCAL] = "local", [MPOL_PREFERRED_MANY] = "prefer (many)", }; @@ -3355,6 +3547,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_PREFERRED_INTERLEAVE: nodes = pol->nodes; break; default: