Message ID | 20240119175730.15484-4-gregory.price@memverge.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm/mempolicy: weighted interleave mempolicy and sysfs extension |
Gregory Price <gourry.memverge@gmail.com> writes: > When a system has multiple NUMA nodes and it becomes bandwidth hungry, > using the current MPOL_INTERLEAVE could be an wise option. > > However, if those NUMA nodes consist of different types of memory such > as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin > based interleave policy does not optimally distribute data to make use > of their different bandwidth characteristics. > > Instead, interleave is more effective when the allocation policy follows > each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution. > > This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, > enabling weighted interleave between NUMA nodes. Weighted interleave > allows for proportional distribution of memory across multiple numa > nodes, preferably apportioned to match the bandwidth of each node. > > For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), > with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate > weight distribution is (2:1). > > Weights for each node can be assigned via the new sysfs extension: > /sys/kernel/mm/mempolicy/weighted_interleave/ > > For now, the default value of all nodes will be `1`, which matches > the behavior of standard 1:1 round-robin interleave. An extension > will be added in the future to allow default values to be registered > at kernel and device bringup time. > > The policy allocates a number of pages equal to the set weights. For > example, if the weights are (2,1), then 2 pages will be allocated on > node0 for every 1 page allocated on node1. > > The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) > and mbind(2). > > There are 3 integration points: > > weighted_interleave_nodes: > Counts the number of allocations as they occur, and applies the > weight for the current node. When the weight reaches 0, switch > to the next node. > > weighted_interleave_nid: > Gets the total weight of the nodemask as well as each individual > node weight, then calculates the node based on the given index. > > bulk_array_weighted_interleave: > Gets the total weight of the nodemask as well as each individual > node weight, then calculates the number of "interleave rounds" as > well as any delta ("partial round"). Calculates the number of > pages for each node and allocates them. > > If a node was scheduled for interleave via interleave_nodes, the > current weight (pol->cur_weight) will be allocated first, before > the remaining bulk calculation is done. > > One piece of complexity is the interaction between a recent refactor > which split the logic to acquire the "ilx" (interleave index) of an > allocation and the actually application of the interleave. The > calculation of the `interleave index` is done by `get_vma_policy()`, > while the actual selection of the node will be later appliex by the > relevant weighted_interleave function. 
> > Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com> > Signed-off-by: Gregory Price <gregory.price@memverge.com> > Co-developed-by: Rakie Kim <rakie.kim@sk.com> > Signed-off-by: Rakie Kim <rakie.kim@sk.com> > Co-developed-by: Honggyu Kim <honggyu.kim@sk.com> > Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> > Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com> > Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> > Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> > Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> > Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> > Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> > --- > .../admin-guide/mm/numa_memory_policy.rst | 9 + > include/linux/mempolicy.h | 5 + > include/uapi/linux/mempolicy.h | 1 + > mm/mempolicy.c | 234 +++++++++++++++++- > 4 files changed, 246 insertions(+), 3 deletions(-) > > diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst > index eca38fa81e0f..a70f20ce1ffb 100644 > --- a/Documentation/admin-guide/mm/numa_memory_policy.rst > +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst > @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY > can fall back to all existing numa nodes. This is effectively > MPOL_PREFERRED allowed for a mask rather than a single node. > > +MPOL_WEIGHTED_INTERLEAVE > + This mode operates the same as MPOL_INTERLEAVE, except that > + interleaving behavior is executed based on weights set in > + /sys/kernel/mm/mempolicy/weighted_interleave/ > + > + Weighted interleave allocates pages on nodes according to a > + weight. For example if nodes [0,1] are weighted [5,2], 5 pages > + will be allocated on node0 for every 2 pages allocated on node1. > + > NUMA memory policy supports the following optional mode flags: > > MPOL_F_STATIC_NODES > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 931b118336f4..c1a083eb0dd5 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -54,6 +54,11 @@ struct mempolicy { > nodemask_t cpuset_mems_allowed; /* relative to these nodes */ > nodemask_t user_nodemask; /* nodemask passed by user */ > } w; > + > + /* Weighted interleave settings */ > + struct { > + u8 cur_weight; > + } wil; > }; > > /* > diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h > index a8963f7ef4c2..1f9bb10d1a47 100644 > --- a/include/uapi/linux/mempolicy.h > +++ b/include/uapi/linux/mempolicy.h > @@ -23,6 +23,7 @@ enum { > MPOL_INTERLEAVE, > MPOL_LOCAL, > MPOL_PREFERRED_MANY, > + MPOL_WEIGHTED_INTERLEAVE, > MPOL_MAX, /* always last member of enum */ > }; > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 427bddf115df..aa3b2389d3e0 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -19,6 +19,13 @@ > * for anonymous memory. For process policy an process counter > * is used. > * > + * weighted interleave > + * Allocate memory interleaved over a set of nodes based on > + * a set of weights (per-node), with normal fallback if it > + * fails. Otherwise operates the same as interleave. > + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated > + * on node 0 for every 1 page allocated on node 1. > + * > * bind Only allocate memory on a specific set of nodes, > * no fallback. 
> * FIXME: memory is allocated starting with the first node > @@ -313,6 +320,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, > policy->mode = mode; > policy->flags = flags; > policy->home_node = NUMA_NO_NODE; > + policy->wil.cur_weight = 0; > > return policy; > } > @@ -425,6 +433,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { > .create = mpol_new_nodemask, > .rebind = mpol_rebind_preferred, > }, > + [MPOL_WEIGHTED_INTERLEAVE] = { > + .create = mpol_new_nodemask, > + .rebind = mpol_rebind_nodemask, > + }, > }; > > static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, > @@ -846,7 +858,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, > > old = current->mempolicy; > current->mempolicy = new; > - if (new && new->mode == MPOL_INTERLEAVE) > + if (new && (new->mode == MPOL_INTERLEAVE || > + new->mode == MPOL_WEIGHTED_INTERLEAVE)) > current->il_prev = MAX_NUMNODES-1; > task_unlock(current); > mpol_put(old); > @@ -872,6 +885,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes) > case MPOL_INTERLEAVE: > case MPOL_PREFERRED: > case MPOL_PREFERRED_MANY: > + case MPOL_WEIGHTED_INTERLEAVE: > *nodes = pol->nodes; > break; > case MPOL_LOCAL: > @@ -956,6 +970,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, > } else if (pol == current->mempolicy && > pol->mode == MPOL_INTERLEAVE) { > *policy = next_node_in(current->il_prev, pol->nodes); > + } else if (pol == current->mempolicy && > + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) { > + if (pol->wil.cur_weight) > + *policy = current->il_prev; > + else > + *policy = next_node_in(current->il_prev, > + pol->nodes); > } else { > err = -EINVAL; > goto out; > @@ -1785,7 +1806,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma, > pol = __get_vma_policy(vma, addr, ilx); > if (!pol) > pol = get_task_policy(current); > - if (pol->mode == MPOL_INTERLEAVE) { > + if (pol->mode == MPOL_INTERLEAVE || > + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { Should change the comments above get_vma_policy() definition too. > *ilx += vma->vm_pgoff >> order; > *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); > } > @@ -1835,6 +1857,28 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > return zone >= dynamic_policy_zone; > } > > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy) > +{ > + unsigned int next; > + struct task_struct *me = current; > + u8 __rcu *table; > + > + next = next_node_in(me->il_prev, policy->nodes); > + if (next == MAX_NUMNODES) > + return next; > + > + rcu_read_lock(); > + table = rcu_dereference(iw_table); > + if (!policy->wil.cur_weight) > + policy->wil.cur_weight = table ? 
table[next] : 1; > + rcu_read_unlock(); > + > + policy->wil.cur_weight--; > + if (!policy->wil.cur_weight) > + me->il_prev = next; > + return next; > +} > + > /* Do dynamic interleaving for a process */ > static unsigned int interleave_nodes(struct mempolicy *policy) > { > @@ -1869,6 +1913,9 @@ unsigned int mempolicy_slab_node(void) > case MPOL_INTERLEAVE: > return interleave_nodes(policy); > > + case MPOL_WEIGHTED_INTERLEAVE: > + return weighted_interleave_nodes(policy); > + > case MPOL_BIND: > case MPOL_PREFERRED_MANY: > { > @@ -1907,6 +1954,39 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol, > return nodes_weight(*mask); > } > > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) > +{ > + nodemask_t nodemask; > + unsigned int target, nr_nodes; > + u8 __rcu *table; > + unsigned int weight_total = 0; > + u8 weight; > + int nid; > + > + nr_nodes = read_once_policy_nodemask(pol, &nodemask); > + if (!nr_nodes) > + return numa_node_id(); > + > + rcu_read_lock(); > + table = rcu_dereference(iw_table); > + /* calculate the total weight */ > + for_each_node_mask(nid, nodemask) > + weight_total += table ? table[nid] : 1; > + > + /* Calculate the node offset based on totals */ > + target = ilx % weight_total; > + nid = first_node(nodemask); > + while (target) { > + weight = table ? table[nid] : 1; > + if (target < weight) > + break; > + target -= weight; > + nid = next_node_in(nid, nodemask); > + } > + rcu_read_unlock(); > + return nid; > +} > + > /* > * Do static interleaving for interleave index @ilx. Returns the ilx'th > * node in pol->nodes (starting from ilx=0), wrapping around if ilx > @@ -1967,6 +2047,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, > *nid = (ilx == NO_INTERLEAVE_INDEX) ? > interleave_nodes(pol) : interleave_nid(pol, ilx); > break; > + case MPOL_WEIGHTED_INTERLEAVE: > + *nid = (ilx == NO_INTERLEAVE_INDEX) ? > + weighted_interleave_nodes(pol) : > + weighted_interleave_nid(pol, ilx); > + break; > } > > return nodemask; > @@ -2028,6 +2113,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) > case MPOL_PREFERRED_MANY: > case MPOL_BIND: > case MPOL_INTERLEAVE: > + case MPOL_WEIGHTED_INTERLEAVE: > *mask = mempolicy->nodes; > break; > > @@ -2127,7 +2213,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, > * If the policy is interleave or does not allow the current > * node in its nodemask, we allocate the standard way. > */ > - if (pol->mode != MPOL_INTERLEAVE && > + if ((pol->mode != MPOL_INTERLEAVE && > + pol->mode != MPOL_WEIGHTED_INTERLEAVE) && > (!nodemask || node_isset(nid, *nodemask))) { > /* > * First, try to allocate THP only on local node, but > @@ -2263,6 +2350,135 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp, > return total_allocated; > } > > +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, > + struct mempolicy *pol, unsigned long nr_pages, > + struct page **page_array) > +{ > + struct task_struct *me = current; > + unsigned long total_allocated = 0; > + unsigned long nr_allocated; > + unsigned long rounds; > + unsigned long node_pages, delta; > + u8 weight, resume_weight; > + u8 __rcu *table; > + u8 *weights; > + unsigned int weight_total = 0; > + unsigned long rem_pages = nr_pages; > + nodemask_t nodes; > + int nnodes, node, weight_nodes, resume_node; > + int prev_node = NUMA_NO_NODE; It appears that we should initialize prev_node with me->il_prev? Details are as below. 
> + bool delta_depleted = false; > + int i; > + > + if (!nr_pages) > + return 0; > + > + nnodes = read_once_policy_nodemask(pol, &nodes); > + if (!nnodes) > + return 0; > + > + /* Continue allocating from most recent node and adjust the nr_pages */ > + if (pol->wil.cur_weight) { > + node = next_node_in(me->il_prev, nodes); > + node_pages = pol->wil.cur_weight; > + if (node_pages > rem_pages) > + node_pages = rem_pages; > + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, > + NULL, page_array); > + page_array += nr_allocated; > + total_allocated += nr_allocated; > + /* if that's all the pages, no need to interleave */ > + if (rem_pages <= pol->wil.cur_weight) { > + pol->wil.cur_weight -= rem_pages; If "pol->wil.cur_weight == 0" here, we need to change me->il_prev? > + return total_allocated; > + } > + /* Otherwise we adjust nr_pages down, and continue from there */ > + rem_pages -= pol->wil.cur_weight; > + pol->wil.cur_weight = 0; > + prev_node = node; > + } else { prev_node = me->il_prev; } > + > + /* fetch the weights for this operation and calculate total weight */ > + weights = kmalloc(nnodes, GFP_KERNEL); > + if (!weights) > + return total_allocated; > + > + rcu_read_lock(); > + table = rcu_dereference(iw_table); > + weight_nodes = 0; We can replace "weight_nodes" with "i" and use a "for" loop? > + while (weight_nodes < nnodes) { > + node = next_node_in(prev_node, nodes); IIUC, "node" will not change in the loop, so all "weight" below will be the same value. To keep it simple, I think we can just copy weights from the global iw_table and consider the default value? > + weight = table ? table[node] : 1; > + weights[weight_nodes++] = weight; > + weight_total += weight; > + } > + rcu_read_unlock(); > + > + /* > + * Now we can continue allocating from 0 instead of an offset > + * We calculate the number of rounds and any partial rounds so > + * that we minimize the number of calls to __alloc_pages_bulk > + * This requires us to track which node we should resume from. > + * > + * if (rounds > 0) and (delta == 0), resume_node will always be > + * the current me->il_prev > + * > + * if (delta > 0) and delta is depleted exactly on a node-weight > + * boundary, resume node will be the node last allocated from when > + * delta reached 0. > + * > + * if (delta > 0) and delta is not depleted on a node-weight boundary, > + * resume node will be the node prior to the node last allocated from. > + * > + * (rounds == 0) and (delta == 0) is not possible (earlier exit) > + */ > + rounds = rem_pages / weight_total; > + delta = rem_pages % weight_total; > + /* If no delta, we'll resume from current prev_node and first weight */ > + for (i = 0; i < nnodes; i++) { > + node = next_node_in(prev_node, nodes); > + weight = weights[i]; > + node_pages = weight * rounds; > + /* If a delta exists, add this node's portion of the delta */ > + if (delta >= weight) { > + node_pages += weight; > + delta -= weight; > + resume_node = node; > + resume_weight = i < (nnodes - 1) ? weights[i+1] : > + weights[0]; > + /* stop tracking iff (delta == weight) */ > + delta_depleted = !delta; > + } else if (delta) { /* <= weight */ The comment is unnecessary and wrong. 
> + /* if delta depleted, resume from this node */ > + node_pages += delta; > + delta = 0; > + resume_node = prev_node; > + resume_weight = weight - (node_pages % weight); > + delta_depleted = true; /* stop tracking */ > + } else if (!delta_depleted) { > + /* if there was no delta, track last allocated node */ > + resume_node = node; > + resume_weight = i < (nnodes - 1) ? weights[i+1] : > + weights[0]; > + } Can the above code be simplified as something like below? resume_node = prev_node; resume_weight = 0; for (...) { ... if (delta > weight) { node_pages += weight; delta -= weight; } else if (delta) { node_pages += delta; /* if delta depleted, resume from this node */ if (delta < weight) { resume_node = prev_node; resume_weight = weight - delta; } else { resume_node = node; } delta = 0; } ... } -- Best Regards, Huang, Ying > + /* node_pages can be 0 if an allocation fails and rounds == 0 */ > + if (!node_pages) > + break; > + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, > + NULL, page_array); > + page_array += nr_allocated; > + total_allocated += nr_allocated; > + if (total_allocated == nr_pages) > + break; > + prev_node = node; > + } > + /* resume allocating from the calculated node and weight */ > + me->il_prev = resume_node; > + pol->wil.cur_weight = resume_weight; > + kfree(weights); > + return total_allocated; > +} > + > static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid, > struct mempolicy *pol, unsigned long nr_pages, > struct page **page_array) > @@ -2303,6 +2519,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, > return alloc_pages_bulk_array_interleave(gfp, pol, > nr_pages, page_array); > > + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE) > + return alloc_pages_bulk_array_weighted_interleave( > + gfp, pol, nr_pages, page_array); > + > if (pol->mode == MPOL_PREFERRED_MANY) > return alloc_pages_bulk_array_preferred_many(gfp, > numa_node_id(), pol, nr_pages, page_array); > @@ -2378,6 +2598,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) > case MPOL_INTERLEAVE: > case MPOL_PREFERRED: > case MPOL_PREFERRED_MANY: > + case MPOL_WEIGHTED_INTERLEAVE: > return !!nodes_equal(a->nodes, b->nodes); > case MPOL_LOCAL: > return true; > @@ -2514,6 +2735,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, > polnid = interleave_nid(pol, ilx); > break; > > + case MPOL_WEIGHTED_INTERLEAVE: > + polnid = weighted_interleave_nid(pol, ilx); > + break; > + > case MPOL_PREFERRED: > if (node_isset(curnid, pol->nodes)) > goto out; > @@ -2888,6 +3113,7 @@ static const char * const policy_modes[] = > [MPOL_PREFERRED] = "prefer", > [MPOL_BIND] = "bind", > [MPOL_INTERLEAVE] = "interleave", > + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave", > [MPOL_LOCAL] = "local", > [MPOL_PREFERRED_MANY] = "prefer (many)", > }; > @@ -2947,6 +3173,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) > } > break; > case MPOL_INTERLEAVE: > + case MPOL_WEIGHTED_INTERLEAVE: > /* > * Default to online nodes with memory if no nodelist > */ > @@ -3057,6 +3284,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) > case MPOL_PREFERRED_MANY: > case MPOL_BIND: > case MPOL_INTERLEAVE: > + case MPOL_WEIGHTED_INTERLEAVE: > nodes = pol->nodes; > break; > default:
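A minimal user-space sketch of how the interface described above could be exercised, assuming MPOL_WEIGHTED_INTERLEAVE resolves to 6 (its position in the uapi enum added by this patch), that per-node weight files such as /sys/kernel/mm/mempolicy/weighted_interleave/node0 exist as added earlier in the series, and that weights 2 and 1 have already been written for node0 and node1. None of this is taken verbatim from the patch; set_mempolicy() is the syscall wrapper declared in libnuma's numaif.h (link with -lnuma):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifndef MPOL_WEIGHTED_INTERLEAVE
#define MPOL_WEIGHTED_INTERLEAVE 6	/* assumed from the uapi enum in this patch */
#endif

int main(void)
{
	/* interleave across nodes 0 and 1; the 2:1 ratio comes from the
	 * sysfs weights, not from set_mempolicy() itself */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);
	size_t len = 64UL << 20;
	char *buf;

	if (set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask, 64) != 0) {
		perror("set_mempolicy");
		return 1;
	}

	/* fault in enough anonymous memory to see the split */
	buf = malloc(len);
	if (!buf)
		return 1;
	memset(buf, 0xaa, len);
	getchar();	/* inspect /proc/self/numa_maps before exiting */
	free(buf);
	return 0;
}

With the example weights, /proc/<pid>/numa_maps for the buffer should show roughly two pages on node 0 for every page on node 1.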
On Tue, Jan 23, 2024 at 11:02:03AM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
>
> > +	int prev_node = NUMA_NO_NODE;
>
> It appears that we should initialize prev_node with me->il_prev?
> Details are as below.
>

yeah good catch, was a rebase error from my tested code, where this is
the case. patching now.

> > +	if (rem_pages <= pol->wil.cur_weight) {
> > +		pol->wil.cur_weight -= rem_pages;
>
> If "pol->wil.cur_weight == 0" here, we need to change me->il_prev?
>

you are right, and also need to fetch the next cur_weight. Seems I
missed this specific case in my tests. (had this tested with a single
node but not 2, so it looked right).

Added to my test suite.

> We can replace "weight_nodes" with "i" and use a "for" loop?
>
> > +	while (weight_nodes < nnodes) {
> > +		node = next_node_in(prev_node, nodes);
>
> IIUC, "node" will not change in the loop, so all "weight" below will be
> the same value. To keep it simple, I think we can just copy weights
> from the global iw_table and consider the default value?
>

another rebase error here from my tested code, this should have been

	node = prev_node;
	while (...)
		node = next_node_in(node, nodes);

I can change it to a for loop as suggested, but for more info on why I
did it this way, see the chunk below

> > +		} else if (!delta_depleted) {
> > +			/* if there was no delta, track last allocated node */
> > +			resume_node = node;
> > +			resume_weight = i < (nnodes - 1) ? weights[i+1] :
> > +				weights[0];
			  ^ this line acquires the weight of the *NEXT* node
			    another chunk prior to this does the same
			    thing. I suppose i can use next_node_in()
			    instead and just copy the entire weight array
			    though, if that is preferable.

> > +		}
>
> Can the above code be simplified as something like below?
>
> resume_node = prev_node;
> resume_weight = 0;
> for (...) {
>	...
>	if (delta > weight) {
>		node_pages += weight;
>		delta -= weight;
>	} else if (delta) {
>		node_pages += delta;
>		/* if delta depleted, resume from this node */
>		if (delta < weight) {
>			resume_node = prev_node;
>			resume_weight = weight - delta;
>		} else {
>			resume_node = node;
>		}
>		delta = 0;
>	}
>	...
> }
>

I'll take another look at it, but this logic is annoying because of the
corner case: me->il_prev can be NUMA_NO_NODE or an actual numa node.

If it's NUMA_NO_NODE, then the logic you have above will say "the next
node has no remaining weights assigned" and skip it on the next call to
weighted_interleave_nid or weighted_interleave_nodes.

This is incorrect - we want the weight of the first node to be
resume_weight, which is what this chunk does:

	if (delta >= weight) {
		/* if delta == weight, get next node weight */
		resume_weight = i < (nnodes - 1) ? weights[i+1] : weights[0];
	} else if (delta) { /* delta < weight */
		/* there's a remaining weight, use that for the resume weight */
		resume_weight = weight - (node_pages % weight);
	} else if (!delta_depleted) {
		/* there was never a delta, track the last node and get the
		 * weight of the node AFTER that node, that's the resume weight */
		resume_weight = i < (nnodes - 1) ? weights[i+1] : weights[0];
	}

If il_prev is an actual node, and delta == 0, we want to return with
(il_prev = prev_node) but with the weight set to the weight of the
first node we're about to allocate from.

This is the reason for the annoying logic here: We have to come out of
this loop with the actual node and the actual weight.

I'll try to clean it up further and get my test suite to pass.

~Gregory
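Since the resume bookkeeping above is easier to see with concrete numbers, here is a small stand-alone C simulation of the rounds/delta split together with the simplified resume tracking being converged on (Huang's loop plus the resume_weight initialization Gregory corrects in the following message). It is not the kernel code: node ids are assumed contiguous from 0, prev_node/me->il_prev is modeled as -1, and the (3,1) weights and 10-page request are made up for illustration.

#include <stdio.h>

/*
 * Stand-alone simulation of the bulk weighted-interleave split.
 * Convention (as in the thread): "resume_node" is the node *before* the one
 * the next allocation starts from (so -1 plays the role of prev_node /
 * me->il_prev), and "resume_weight" is the number of pages still owed to
 * that next node. Node ids are assumed to be 0..nnodes-1.
 */
static void split(const unsigned int *weights, int nnodes, unsigned long rem_pages)
{
	unsigned long weight_total = 0, rounds, delta, node_pages;
	int resume_node = -1;
	unsigned int resume_weight;
	int i;

	for (i = 0; i < nnodes; i++)
		weight_total += weights[i];

	rounds = rem_pages / weight_total;
	delta = rem_pages % weight_total;

	/* default: the request ends exactly on a round boundary, so resume
	 * from the first node with its full weight */
	resume_weight = weights[0];

	for (i = 0; i < nnodes; i++) {
		unsigned int weight = weights[i];

		node_pages = (unsigned long)weight * rounds;
		if (delta >= weight) {
			/* this node absorbs one full extra turn */
			node_pages += weight;
			delta -= weight;
			resume_node = i;
			resume_weight = weights[(i + 1) % nnodes];
		} else if (delta) {
			/* delta runs out mid-turn: the next allocation still
			 * owes this node the rest of its weight */
			node_pages += delta;
			resume_node = i - 1;
			resume_weight = weight - delta;
			delta = 0;
		}
		printf("node %d: %lu pages\n", i, node_pages);
	}
	printf("resume: il_prev=%d cur_weight=%u\n", resume_node, resume_weight);
}

int main(void)
{
	unsigned int weights[] = { 3, 1 };	/* illustrative weights */

	split(weights, 2, 10);			/* 10 pages to distribute */
	return 0;
}

For weights (3,1) and 10 pages it prints 8 pages for node 0 and 2 for node 1, with a resume state of il_prev=-1 / cur_weight=1: the next allocation still owes node 0 one page before moving on to node 1, matching the "delta not depleted on a node-weight boundary" case described in the patch comment.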
On Mon, Jan 22, 2024 at 11:54:34PM -0500, Gregory Price wrote:
> >
> > Can the above code be simplified as something like below?
> >
> > resume_node = prev_node;
--- resume_weight = 0;
+++ resume_weight = weights[node];
> > for (...) {
> >	...
> > }
> >
>
> I'll take another look at it, but this logic is annoying because of the
> corner case: me->il_prev can be NUMA_NO_NODE or an actual numa node.
>

After a quick look, as long as no one objects to (me->il_prev) remaining
NUMA_NO_NODE while having a weight assigned to pol->wil.cur_weight, then
this looks like it can be simplified as above.

I don't think it's harmful, but i'll have to take a quick look at what
happens on rebind to make sure we don't have a stale weight.

~Gregory
Gregory Price <gregory.price@memverge.com> writes: > On Tue, Jan 23, 2024 at 11:02:03AM +0800, Huang, Ying wrote: >> Gregory Price <gourry.memverge@gmail.com> writes: >> >> > + int prev_node = NUMA_NO_NODE; >> >> It appears that we should initialize prev_node with me->il_prev? >> Details are as below. >> > > yeah good catch, was a rebase error from my tested code, where this is > the case. patching now. > >> > + if (rem_pages <= pol->wil.cur_weight) { >> > + pol->wil.cur_weight -= rem_pages; >> >> If "pol->wil.cur_weight == 0" here, we need to change me->il_prev? >> > you are right, and also need to fetch the next cur_weight. Seems I > missed this specific case in my tests. (had this tested with a single > node but not 2, so it looked right). > > Added to my test suite. > >> We can replace "weight_nodes" with "i" and use a "for" loop? >> >> > + while (weight_nodes < nnodes) { >> > + node = next_node_in(prev_node, nodes); >> >> IIUC, "node" will not change in the loop, so all "weight" below will be >> the same value. To keep it simple, I think we can just copy weights >> from the global iw_table and consider the default value? >> > > another rebase error here from my tested code, this should have been > node = prev_node; > while (...) > node = next_node_in(node, nodes); > > I can change it to a for loop as suggested, but for more info on why I > did it this way, see the chunk below > >> > + } else if (!delta_depleted) { >> > + /* if there was no delta, track last allocated node */ >> > + resume_node = node; >> > + resume_weight = i < (nnodes - 1) ? weights[i+1] : >> > + weights[0]; > ^ this line acquires the weight of the *NEXT* node > another chunk prior to this does the same > thing. I suppose i can use next_node_in() > instead and just copy the entire weigh array > though, if that is preferable. Yes. I think copy the entire weight array make code logic simpler. -- Best Regards, Huang, Ying
Gregory Price <gregory.price@memverge.com> writes: > On Mon, Jan 22, 2024 at 11:54:34PM -0500, Gregory Price wrote: >> > >> > Can the above code be simplified as something like below? >> > >> > resume_node = prev_node; > --- resume_weight = 0; > +++ resume_weight = weights[node]; >> > for (...) { >> > ... >> > } >> > >> >> I'll take another look at it, but this logic is annoying because of the >> corner case: me->il_prev can be NUMA_NO_NODE or an actual numa node. >> > > After a quick look, as long as no one objects to (me->il_prev) remaining > NUMA_NO_NODE MAX_NUMNODES-1 ? > while having a weight assigned to pol->wil.cur_weight, I think that it is OK. And, IIUC, pol->wil.cur_weight can be 0, as in weighted_interleave_nodes(), if it's 0, it will be assigned to default weight for the node. > then > this looks like it can be simplified as above. > > I don't think it's harmful, but i'll have to take a quick look at what > happens on rebind to make sure we don't have a stale weight. Make sense. -- Best Regards, Huang, Ying
Gregory Price <gourry.memverge@gmail.com> writes: > When a system has multiple NUMA nodes and it becomes bandwidth hungry, > using the current MPOL_INTERLEAVE could be an wise option. > > However, if those NUMA nodes consist of different types of memory such > as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin > based interleave policy does not optimally distribute data to make use > of their different bandwidth characteristics. > > Instead, interleave is more effective when the allocation policy follows > each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution. > > This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, > enabling weighted interleave between NUMA nodes. Weighted interleave > allows for proportional distribution of memory across multiple numa > nodes, preferably apportioned to match the bandwidth of each node. > > For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1), > with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate > weight distribution is (2:1). > > Weights for each node can be assigned via the new sysfs extension: > /sys/kernel/mm/mempolicy/weighted_interleave/ > > For now, the default value of all nodes will be `1`, which matches > the behavior of standard 1:1 round-robin interleave. An extension > will be added in the future to allow default values to be registered > at kernel and device bringup time. > > The policy allocates a number of pages equal to the set weights. For > example, if the weights are (2,1), then 2 pages will be allocated on > node0 for every 1 page allocated on node1. > > The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2) > and mbind(2). > > There are 3 integration points: > > weighted_interleave_nodes: > Counts the number of allocations as they occur, and applies the > weight for the current node. When the weight reaches 0, switch > to the next node. > > weighted_interleave_nid: > Gets the total weight of the nodemask as well as each individual > node weight, then calculates the node based on the given index. > > bulk_array_weighted_interleave: > Gets the total weight of the nodemask as well as each individual > node weight, then calculates the number of "interleave rounds" as > well as any delta ("partial round"). Calculates the number of > pages for each node and allocates them. > > If a node was scheduled for interleave via interleave_nodes, the > current weight (pol->cur_weight) will be allocated first, before > the remaining bulk calculation is done. > > One piece of complexity is the interaction between a recent refactor > which split the logic to acquire the "ilx" (interleave index) of an > allocation and the actually application of the interleave. The > calculation of the `interleave index` is done by `get_vma_policy()`, > while the actual selection of the node will be later appliex by the > relevant weighted_interleave function. 
> > Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com> > Signed-off-by: Gregory Price <gregory.price@memverge.com> > Co-developed-by: Rakie Kim <rakie.kim@sk.com> > Signed-off-by: Rakie Kim <rakie.kim@sk.com> > Co-developed-by: Honggyu Kim <honggyu.kim@sk.com> > Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> > Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com> > Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com> > Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> > Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com> > Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> > Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com> > --- > .../admin-guide/mm/numa_memory_policy.rst | 9 + > include/linux/mempolicy.h | 5 + > include/uapi/linux/mempolicy.h | 1 + > mm/mempolicy.c | 234 +++++++++++++++++- > 4 files changed, 246 insertions(+), 3 deletions(-) > > diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst > index eca38fa81e0f..a70f20ce1ffb 100644 > --- a/Documentation/admin-guide/mm/numa_memory_policy.rst > +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst > @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY > can fall back to all existing numa nodes. This is effectively > MPOL_PREFERRED allowed for a mask rather than a single node. > > +MPOL_WEIGHTED_INTERLEAVE > + This mode operates the same as MPOL_INTERLEAVE, except that > + interleaving behavior is executed based on weights set in > + /sys/kernel/mm/mempolicy/weighted_interleave/ > + > + Weighted interleave allocates pages on nodes according to a > + weight. For example if nodes [0,1] are weighted [5,2], 5 pages > + will be allocated on node0 for every 2 pages allocated on node1. > + > NUMA memory policy supports the following optional mode flags: > > MPOL_F_STATIC_NODES > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h > index 931b118336f4..c1a083eb0dd5 100644 > --- a/include/linux/mempolicy.h > +++ b/include/linux/mempolicy.h > @@ -54,6 +54,11 @@ struct mempolicy { > nodemask_t cpuset_mems_allowed; /* relative to these nodes */ > nodemask_t user_nodemask; /* nodemask passed by user */ > } w; > + > + /* Weighted interleave settings */ > + struct { > + u8 cur_weight; > + } wil; > }; > > /* > diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h > index a8963f7ef4c2..1f9bb10d1a47 100644 > --- a/include/uapi/linux/mempolicy.h > +++ b/include/uapi/linux/mempolicy.h > @@ -23,6 +23,7 @@ enum { > MPOL_INTERLEAVE, > MPOL_LOCAL, > MPOL_PREFERRED_MANY, > + MPOL_WEIGHTED_INTERLEAVE, > MPOL_MAX, /* always last member of enum */ > }; > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 427bddf115df..aa3b2389d3e0 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -19,6 +19,13 @@ > * for anonymous memory. For process policy an process counter > * is used. > * > + * weighted interleave > + * Allocate memory interleaved over a set of nodes based on > + * a set of weights (per-node), with normal fallback if it > + * fails. Otherwise operates the same as interleave. > + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated > + * on node 0 for every 1 page allocated on node 1. > + * > * bind Only allocate memory on a specific set of nodes, > * no fallback. 
> * FIXME: memory is allocated starting with the first node > @@ -313,6 +320,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, > policy->mode = mode; > policy->flags = flags; > policy->home_node = NUMA_NO_NODE; > + policy->wil.cur_weight = 0; > > return policy; > } > @@ -425,6 +433,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { > .create = mpol_new_nodemask, > .rebind = mpol_rebind_preferred, > }, > + [MPOL_WEIGHTED_INTERLEAVE] = { > + .create = mpol_new_nodemask, > + .rebind = mpol_rebind_nodemask, > + }, > }; > > static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, > @@ -846,7 +858,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, > > old = current->mempolicy; > current->mempolicy = new; > - if (new && new->mode == MPOL_INTERLEAVE) > + if (new && (new->mode == MPOL_INTERLEAVE || > + new->mode == MPOL_WEIGHTED_INTERLEAVE)) > current->il_prev = MAX_NUMNODES-1; > task_unlock(current); > mpol_put(old); > @@ -872,6 +885,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes) > case MPOL_INTERLEAVE: > case MPOL_PREFERRED: > case MPOL_PREFERRED_MANY: > + case MPOL_WEIGHTED_INTERLEAVE: > *nodes = pol->nodes; > break; > case MPOL_LOCAL: > @@ -956,6 +970,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, > } else if (pol == current->mempolicy && > pol->mode == MPOL_INTERLEAVE) { > *policy = next_node_in(current->il_prev, pol->nodes); > + } else if (pol == current->mempolicy && > + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) { > + if (pol->wil.cur_weight) > + *policy = current->il_prev; > + else > + *policy = next_node_in(current->il_prev, > + pol->nodes); Per my understanding, we should always use "*policy = next_node_in()" here, as in weighted_interleave_nodes(). > } else { > err = -EINVAL; > goto out; > @@ -1785,7 +1806,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma, > pol = __get_vma_policy(vma, addr, ilx); > if (!pol) > pol = get_task_policy(current); > - if (pol->mode == MPOL_INTERLEAVE) { > + if (pol->mode == MPOL_INTERLEAVE || > + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { > *ilx += vma->vm_pgoff >> order; > *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); > } > @@ -1835,6 +1857,28 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) > return zone >= dynamic_policy_zone; > } > > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy) > +{ > + unsigned int next; > + struct task_struct *me = current; > + u8 __rcu *table; > + > + next = next_node_in(me->il_prev, policy->nodes); > + if (next == MAX_NUMNODES) > + return next; > + > + rcu_read_lock(); > + table = rcu_dereference(iw_table); > + if (!policy->wil.cur_weight) > + policy->wil.cur_weight = table ? table[next] : 1; > + rcu_read_unlock(); > + > + policy->wil.cur_weight--; > + if (!policy->wil.cur_weight) > + me->il_prev = next; > + return next; > +} > + [snip] -- Best Regards, Huang, Ying
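The weighted_interleave_nid() path quoted above maps an interleave index onto a node by reducing the index modulo the total weight and then walking the per-node weights. The stand-alone sketch below reproduces just that arithmetic so the resulting distribution can be checked outside the kernel; node ids are assumed contiguous from 0 and the [5,2] weights are taken from the documentation example in the patch.

#include <stdio.h>

/*
 * Mirrors the index-to-node mapping in weighted_interleave_nid():
 * target = ilx % weight_total, then walk the per-node weights until the
 * target falls inside a node's slice. Node ids here are simply 0..n-1.
 */
static int weighted_nid(const unsigned int *weights, int nnodes, unsigned long ilx)
{
	unsigned long weight_total = 0, target;
	int nid = 0;

	for (int i = 0; i < nnodes; i++)
		weight_total += weights[i];

	target = ilx % weight_total;
	while (target >= weights[nid]) {
		target -= weights[nid];
		nid = (nid + 1) % nnodes;
	}
	return nid;
}

int main(void)
{
	/* documentation example: nodes [0,1] weighted [5,2] */
	unsigned int weights[] = { 5, 2 };

	for (unsigned long ilx = 0; ilx < 10; ilx++)
		printf("ilx %lu -> node %d\n", ilx,
		       weighted_nid(weights, 2, ilx));
	return 0;
}

It prints node 0 for indexes 0-4 and node 1 for indexes 5-6 before the pattern repeats, matching the 5:2 split described in numa_memory_policy.rst.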
On Tue, Jan 23, 2024 at 04:35:19PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
>
> > On Mon, Jan 22, 2024 at 11:54:34PM -0500, Gregory Price wrote:
> >> >
> >> > Can the above code be simplified as something like below?
> >> >
> >> > resume_node = prev_node;
> > --- resume_weight = 0;
> > +++ resume_weight = weights[node];
> >> > for (...) {
> >> >	...
> >> > }
> >> >
> >>
> >> I'll take another look at it, but this logic is annoying because of the
> >> corner case: me->il_prev can be NUMA_NO_NODE or an actual numa node.
> >>
> >
> > After a quick look, as long as no one objects to (me->il_prev) remaining
> > NUMA_NO_NODE
>
> MAX_NUMNODES-1 ?
>

When setting a new policy, the il_prev gets set to NUMA_NO_NODE. It's
not harmful and is just (-1), which is functionally the same as
(MAX_NUMNODES-1) for the purpose of iterating the nodemask with
next_node_in(). So it's fine to set (resume_node = me->il_prev) as
discussed.

I have a cleaned up function I'll push when i fix up a few other spots.

> > while having a weight assigned to pol->wil.cur_weight,
>
> I think that it is OK.
>
> And, IIUC, pol->wil.cur_weight can be 0, as in
> weighted_interleave_nodes(), if it's 0, it will be assigned to default
> weight for the node.
>

cur_weight is different than the global weights. cur_weight tells us
how many pages are remaining to allocate for the current node.

(cur_weight = 0) can happen in two scenarios:
 - initial setting of mempolicy (NUMA_NO_NODE w/ cur_weight=0)
 - weighted_interleave_nodes decrements it down to 0

Now that i'm looking at it - the second condition should not exist, and
we can eliminate it. The logic in weighted_interleave_nodes is actually
annoyingly unclear at the moment, so I'm going to re-factor it a bit to
be more explicit.

~Gregory
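The point above, that an il_prev of NUMA_NO_NODE (-1) and of MAX_NUMNODES-1 start the nodemask iteration in the same place, can be sanity-checked with a tiny user-space stand-in for next_node_in(); the real function operates on kernel nodemask_t bitmaps, and MAX_NODES and the mask below are only illustrative.

#include <stdio.h>

#define MAX_NODES 8

/* user-space stand-in for next_node_in(): the next set bit after 'node',
 * wrapping around to the lowest set bit in the mask */
static int next_node_in(int node, unsigned int mask)
{
	for (int i = 1; i <= MAX_NODES; i++) {
		int candidate = (node + i) % MAX_NODES;

		if (mask & (1u << candidate))
			return candidate;
	}
	return -1;	/* empty mask */
}

int main(void)
{
	unsigned int mask = (1u << 0) | (1u << 1);	/* nodes {0,1} */

	/* both -1 (NUMA_NO_NODE) and MAX_NODES-1 resume the scan at node 0 */
	printf("from -1: %d, from MAX_NODES-1: %d\n",
	       next_node_in(-1, mask), next_node_in(MAX_NODES - 1, mask));
	return 0;
}

Both starting points land on node 0 first, which is why either value works as a "resume from the beginning" marker as long as a cur_weight is carried alongside it.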
Gregory Price <gregory.price@memverge.com> writes: > On Tue, Jan 23, 2024 at 04:35:19PM +0800, Huang, Ying wrote: >> Gregory Price <gregory.price@memverge.com> writes: >> >> > On Mon, Jan 22, 2024 at 11:54:34PM -0500, Gregory Price wrote: >> >> > >> >> > Can the above code be simplified as something like below? >> >> > >> >> > resume_node = prev_node; >> > --- resume_weight = 0; >> > +++ resume_weight = weights[node]; >> >> > for (...) { >> >> > ... >> >> > } >> >> > >> >> >> >> I'll take another look at it, but this logic is annoying because of the >> >> corner case: me->il_prev can be NUMA_NO_NODE or an actual numa node. >> >> >> > >> > After a quick look, as long as no one objects to (me->il_prev) remaining >> > NUMA_NO_NODE >> >> MAX_NUMNODES-1 ? >> > > When setting a new policy, the il_prev gets set to NUMA_NO_NODE. It's IIUC, it is set to MAX_NUMNODES-1 as below, @@ -846,7 +858,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, old = current->mempolicy; current->mempolicy = new; - if (new && new->mode == MPOL_INTERLEAVE) + if (new && (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_WEIGHTED_INTERLEAVE)) current->il_prev = MAX_NUMNODES-1; task_unlock(current); mpol_put(old); I don't think we need to change this. > not harmful and is just (-1), which is functionally the same as > (MAX_NUMNODES-1) for the purpose of iterating the nodemask with > next_node_in(). So it's fine to set (resume_node = me->il_prev) > as discussed. > > I have a cleaned up function I'll push when i fix up a few other spots. > >> > while having a weight assigned to pol->wil.cur_weight, >> >> I think that it is OK. >> >> And, IIUC, pol->wil.cur_weight can be 0, as in >> weighted_interleave_nodes(), if it's 0, it will be assigned to default >> weight for the node. >> > > cur_weight is different than the global weights. cur_weight tells us > how many pages are remaining to allocate for the current node. > > (cur_weight = 0) can happen in two scenarios: > - initial setting of mempolicy (NUMA_NO_NODE w/ cur_weight=0) > - weighted_interleave_nodes decrements it down to 0 > > Now that i'm looking at it - the second condition should not exist, and > we can eliminate it. The logic in weighted_interleave_nodes is actually > annoyingly unclear at the moment, so I'm going to re-factor it a bit to > be more explicit. I am OK with either way. Just a reminder, the first condition may be true in alloc_pages_bulk_array_weighted_interleave() and perhaps some other places. -- Best Regards, Huang, Ying
On Wed, Jan 24, 2024 at 09:51:20AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
>
> +	if (new && (new->mode == MPOL_INTERLEAVE ||
> +		    new->mode == MPOL_WEIGHTED_INTERLEAVE))
> 		current->il_prev = MAX_NUMNODES-1;
> 	task_unlock(current);
> 	mpol_put(old);
>
> I don't think we need to change this.
>

Ah you're right, it's set to MAX_NUMNODES-1 here, but NUMA_NO_NODE can
be passed in as an argument to alloc_pages_bulk_array_mempolicy, like
here:

vm_area_alloc_pages()
	if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE)
		nr = alloc_pages_bulk_array_mempolicy(bulk_gfp,
				nr_pages_request,
				pages + nr_allocated);

> > (cur_weight = 0) can happen in two scenarios:
> > - initial setting of mempolicy (NUMA_NO_NODE w/ cur_weight=0)
> > - weighted_interleave_nodes decrements it down to 0
> >
> > Now that i'm looking at it - the second condition should not exist, and
> > we can eliminate it. The logic in weighted_interleave_nodes is actually
> > annoyingly unclear at the moment, so I'm going to re-factor it a bit to
> > be more explicit.
>
> I am OK with either way. Just a reminder, the first condition may be
> true in alloc_pages_bulk_array_weighted_interleave() and perhaps some
> other places.
>

Yeah, the bulk allocator handles it correctly, it's just a matter of
clarity for weighted_interleave_nodes.

What isn't necessarily handled correctly is the rebind code. Rebind due
to a cgroup/mems_allowed change can cause a stale weight to be carried.
Basically cur_weight is not cleared, but the node it applied to may no
longer be the next node when next_node_in() is called.

The race condition is 1) exceedingly rare, and 2) not necessarily
harmful, just inaccurate. The worst case scenario is that a node
receives up to 255 additional allocations once after a rebind (but more
likely 10-20).

I was considering forcing the interleave forward like this:

@@ -356,6 +361,10 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)

 	tmp = *nodes;
 	pol->nodes = tmp;
+
+	/* Weighted interleave policies are forced forward to the next node */
+	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
+		pol->wil.cur_weight = 0;
 }

But this creates 2 race conditions when we read cur_weight and nodemask
in the allocator path.

Example 1:
1) bulk allocator READ_ONCE(mask), READ_ONCE(cur_weight)
2) rebind changes nodemask and { cur_weight = 0; }
3) bulk allocator sets pol->wil.cur_weight

In this scenario, resume_weight is stale coming out of bulk allocations
if the resume_node has been removed from the node mask.

Example 2:
1) rebind changes nodemask
2) bulk allocator READ_ONCE(mask), READ_ONCE(cur_weight)
3) rebind sets { cur_weight = 0; }

In this scenario, cur_weight is stale going into bulk allocations.

Neither of these can force a violation of mems_allowed, just a
mis-application of a weight.

I'll need to think on this a bit. We can either leave this as-is,
meaning the first allocation after a rebind may apply the wrong weight
to a node, or we can try to track the current-interleave-node and
validate next_node_in(mask) == current-interleave-node before leaving
the allocator path (this may also be just as racy).

turns out concurrent counting is still hard :]

~Gregory
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index eca38fa81e0f..a70f20ce1ffb 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY can fall back to all existing numa nodes. This is effectively MPOL_PREFERRED allowed for a mask rather than a single node. +MPOL_WEIGHTED_INTERLEAVE + This mode operates the same as MPOL_INTERLEAVE, except that + interleaving behavior is executed based on weights set in + /sys/kernel/mm/mempolicy/weighted_interleave/ + + Weighted interleave allocates pages on nodes according to a + weight. For example if nodes [0,1] are weighted [5,2], 5 pages + will be allocated on node0 for every 2 pages allocated on node1. + NUMA memory policy supports the following optional mode flags: MPOL_F_STATIC_NODES diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 931b118336f4..c1a083eb0dd5 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -54,6 +54,11 @@ struct mempolicy { nodemask_t cpuset_mems_allowed; /* relative to these nodes */ nodemask_t user_nodemask; /* nodemask passed by user */ } w; + + /* Weighted interleave settings */ + struct { + u8 cur_weight; + } wil; }; /* diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index a8963f7ef4c2..1f9bb10d1a47 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -23,6 +23,7 @@ enum { MPOL_INTERLEAVE, MPOL_LOCAL, MPOL_PREFERRED_MANY, + MPOL_WEIGHTED_INTERLEAVE, MPOL_MAX, /* always last member of enum */ }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 427bddf115df..aa3b2389d3e0 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -19,6 +19,13 @@ * for anonymous memory. For process policy an process counter * is used. * + * weighted interleave + * Allocate memory interleaved over a set of nodes based on + * a set of weights (per-node), with normal fallback if it + * fails. Otherwise operates the same as interleave. + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated + * on node 0 for every 1 page allocated on node 1. + * * bind Only allocate memory on a specific set of nodes, * no fallback. 
* FIXME: memory is allocated starting with the first node @@ -313,6 +320,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags, policy->mode = mode; policy->flags = flags; policy->home_node = NUMA_NO_NODE; + policy->wil.cur_weight = 0; return policy; } @@ -425,6 +433,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = { .create = mpol_new_nodemask, .rebind = mpol_rebind_preferred, }, + [MPOL_WEIGHTED_INTERLEAVE] = { + .create = mpol_new_nodemask, + .rebind = mpol_rebind_nodemask, + }, }; static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist, @@ -846,7 +858,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags, old = current->mempolicy; current->mempolicy = new; - if (new && new->mode == MPOL_INTERLEAVE) + if (new && (new->mode == MPOL_INTERLEAVE || + new->mode == MPOL_WEIGHTED_INTERLEAVE)) current->il_prev = MAX_NUMNODES-1; task_unlock(current); mpol_put(old); @@ -872,6 +885,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: *nodes = pol->nodes; break; case MPOL_LOCAL: @@ -956,6 +970,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, } else if (pol == current->mempolicy && pol->mode == MPOL_INTERLEAVE) { *policy = next_node_in(current->il_prev, pol->nodes); + } else if (pol == current->mempolicy && + (pol->mode == MPOL_WEIGHTED_INTERLEAVE)) { + if (pol->wil.cur_weight) + *policy = current->il_prev; + else + *policy = next_node_in(current->il_prev, + pol->nodes); } else { err = -EINVAL; goto out; @@ -1785,7 +1806,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma, pol = __get_vma_policy(vma, addr, ilx); if (!pol) pol = get_task_policy(current); - if (pol->mode == MPOL_INTERLEAVE) { + if (pol->mode == MPOL_INTERLEAVE || + pol->mode == MPOL_WEIGHTED_INTERLEAVE) { *ilx += vma->vm_pgoff >> order; *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); } @@ -1835,6 +1857,28 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone) return zone >= dynamic_policy_zone; } +static unsigned int weighted_interleave_nodes(struct mempolicy *policy) +{ + unsigned int next; + struct task_struct *me = current; + u8 __rcu *table; + + next = next_node_in(me->il_prev, policy->nodes); + if (next == MAX_NUMNODES) + return next; + + rcu_read_lock(); + table = rcu_dereference(iw_table); + if (!policy->wil.cur_weight) + policy->wil.cur_weight = table ? 
table[next] : 1; + rcu_read_unlock(); + + policy->wil.cur_weight--; + if (!policy->wil.cur_weight) + me->il_prev = next; + return next; +} + /* Do dynamic interleaving for a process */ static unsigned int interleave_nodes(struct mempolicy *policy) { @@ -1869,6 +1913,9 @@ unsigned int mempolicy_slab_node(void) case MPOL_INTERLEAVE: return interleave_nodes(policy); + case MPOL_WEIGHTED_INTERLEAVE: + return weighted_interleave_nodes(policy); + case MPOL_BIND: case MPOL_PREFERRED_MANY: { @@ -1907,6 +1954,39 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol, return nodes_weight(*mask); } +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx) +{ + nodemask_t nodemask; + unsigned int target, nr_nodes; + u8 __rcu *table; + unsigned int weight_total = 0; + u8 weight; + int nid; + + nr_nodes = read_once_policy_nodemask(pol, &nodemask); + if (!nr_nodes) + return numa_node_id(); + + rcu_read_lock(); + table = rcu_dereference(iw_table); + /* calculate the total weight */ + for_each_node_mask(nid, nodemask) + weight_total += table ? table[nid] : 1; + + /* Calculate the node offset based on totals */ + target = ilx % weight_total; + nid = first_node(nodemask); + while (target) { + weight = table ? table[nid] : 1; + if (target < weight) + break; + target -= weight; + nid = next_node_in(nid, nodemask); + } + rcu_read_unlock(); + return nid; +} + /* * Do static interleaving for interleave index @ilx. Returns the ilx'th * node in pol->nodes (starting from ilx=0), wrapping around if ilx @@ -1967,6 +2047,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol, *nid = (ilx == NO_INTERLEAVE_INDEX) ? interleave_nodes(pol) : interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + *nid = (ilx == NO_INTERLEAVE_INDEX) ? + weighted_interleave_nodes(pol) : + weighted_interleave_nid(pol, ilx); + break; } return nodemask; @@ -2028,6 +2113,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: *mask = mempolicy->nodes; break; @@ -2127,7 +2213,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order, * If the policy is interleave or does not allow the current * node in its nodemask, we allocate the standard way. 
*/ - if (pol->mode != MPOL_INTERLEAVE && + if ((pol->mode != MPOL_INTERLEAVE && + pol->mode != MPOL_WEIGHTED_INTERLEAVE) && (!nodemask || node_isset(nid, *nodemask))) { /* * First, try to allocate THP only on local node, but @@ -2263,6 +2350,135 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp, return total_allocated; } +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp, + struct mempolicy *pol, unsigned long nr_pages, + struct page **page_array) +{ + struct task_struct *me = current; + unsigned long total_allocated = 0; + unsigned long nr_allocated; + unsigned long rounds; + unsigned long node_pages, delta; + u8 weight, resume_weight; + u8 __rcu *table; + u8 *weights; + unsigned int weight_total = 0; + unsigned long rem_pages = nr_pages; + nodemask_t nodes; + int nnodes, node, weight_nodes, resume_node; + int prev_node = NUMA_NO_NODE; + bool delta_depleted = false; + int i; + + if (!nr_pages) + return 0; + + nnodes = read_once_policy_nodemask(pol, &nodes); + if (!nnodes) + return 0; + + /* Continue allocating from most recent node and adjust the nr_pages */ + if (pol->wil.cur_weight) { + node = next_node_in(me->il_prev, nodes); + node_pages = pol->wil.cur_weight; + if (node_pages > rem_pages) + node_pages = rem_pages; + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + /* if that's all the pages, no need to interleave */ + if (rem_pages <= pol->wil.cur_weight) { + pol->wil.cur_weight -= rem_pages; + return total_allocated; + } + /* Otherwise we adjust nr_pages down, and continue from there */ + rem_pages -= pol->wil.cur_weight; + pol->wil.cur_weight = 0; + prev_node = node; + } + + /* fetch the weights for this operation and calculate total weight */ + weights = kmalloc(nnodes, GFP_KERNEL); + if (!weights) + return total_allocated; + + rcu_read_lock(); + table = rcu_dereference(iw_table); + weight_nodes = 0; + while (weight_nodes < nnodes) { + node = next_node_in(prev_node, nodes); + weight = table ? table[node] : 1; + weights[weight_nodes++] = weight; + weight_total += weight; + } + rcu_read_unlock(); + + /* + * Now we can continue allocating from 0 instead of an offset + * We calculate the number of rounds and any partial rounds so + * that we minimize the number of calls to __alloc_pages_bulk + * This requires us to track which node we should resume from. + * + * if (rounds > 0) and (delta == 0), resume_node will always be + * the current me->il_prev + * + * if (delta > 0) and delta is depleted exactly on a node-weight + * boundary, resume node will be the node last allocated from when + * delta reached 0. + * + * if (delta > 0) and delta is not depleted on a node-weight boundary, + * resume node will be the node prior to the node last allocated from. + * + * (rounds == 0) and (delta == 0) is not possible (earlier exit) + */ + rounds = rem_pages / weight_total; + delta = rem_pages % weight_total; + /* If no delta, we'll resume from current prev_node and first weight */ + for (i = 0; i < nnodes; i++) { + node = next_node_in(prev_node, nodes); + weight = weights[i]; + node_pages = weight * rounds; + /* If a delta exists, add this node's portion of the delta */ + if (delta >= weight) { + node_pages += weight; + delta -= weight; + resume_node = node; + resume_weight = i < (nnodes - 1) ? 
weights[i+1] : + weights[0]; + /* stop tracking iff (delta == weight) */ + delta_depleted = !delta; + } else if (delta) { /* <= weight */ + /* if delta depleted, resume from this node */ + node_pages += delta; + delta = 0; + resume_node = prev_node; + resume_weight = weight - (node_pages % weight); + delta_depleted = true; /* stop tracking */ + } else if (!delta_depleted) { + /* if there was no delta, track last allocated node */ + resume_node = node; + resume_weight = i < (nnodes - 1) ? weights[i+1] : + weights[0]; + } + /* node_pages can be 0 if an allocation fails and rounds == 0 */ + if (!node_pages) + break; + nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages, + NULL, page_array); + page_array += nr_allocated; + total_allocated += nr_allocated; + if (total_allocated == nr_pages) + break; + prev_node = node; + } + /* resume allocating from the calculated node and weight */ + me->il_prev = resume_node; + pol->wil.cur_weight = resume_weight; + kfree(weights); + return total_allocated; +} + static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid, struct mempolicy *pol, unsigned long nr_pages, struct page **page_array) @@ -2303,6 +2519,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp, return alloc_pages_bulk_array_interleave(gfp, pol, nr_pages, page_array); + if (pol->mode == MPOL_WEIGHTED_INTERLEAVE) + return alloc_pages_bulk_array_weighted_interleave( + gfp, pol, nr_pages, page_array); + if (pol->mode == MPOL_PREFERRED_MANY) return alloc_pages_bulk_array_preferred_many(gfp, numa_node_id(), pol, nr_pages, page_array); @@ -2378,6 +2598,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b) case MPOL_INTERLEAVE: case MPOL_PREFERRED: case MPOL_PREFERRED_MANY: + case MPOL_WEIGHTED_INTERLEAVE: return !!nodes_equal(a->nodes, b->nodes); case MPOL_LOCAL: return true; @@ -2514,6 +2735,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma, polnid = interleave_nid(pol, ilx); break; + case MPOL_WEIGHTED_INTERLEAVE: + polnid = weighted_interleave_nid(pol, ilx); + break; + case MPOL_PREFERRED: if (node_isset(curnid, pol->nodes)) goto out; @@ -2888,6 +3113,7 @@ static const char * const policy_modes[] = [MPOL_PREFERRED] = "prefer", [MPOL_BIND] = "bind", [MPOL_INTERLEAVE] = "interleave", + [MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave", [MPOL_LOCAL] = "local", [MPOL_PREFERRED_MANY] = "prefer (many)", }; @@ -2947,6 +3173,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol) } break; case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: /* * Default to online nodes with memory if no nodelist */ @@ -3057,6 +3284,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol) case MPOL_PREFERRED_MANY: case MPOL_BIND: case MPOL_INTERLEAVE: + case MPOL_WEIGHTED_INTERLEAVE: nodes = pol->nodes; break; default: