Message ID: 20240130182046.74278-4-gregory.price@memverge.com
State:      New, archived
Series:     mm/mempolicy: weighted interleave mempolicy and sysfs extension
On Tue, Jan 30, 2024 at 01:20:46PM -0500, Gregory Price wrote:
> +	/* Continue allocating from most recent node and adjust the nr_pages */
> +	node = me->il_prev;
> +	weight = me->il_weight;
> +	if (weight && node_isset(node, nodes)) {
> +		node_pages = min(rem_pages, weight);
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		/* if that's all the pages, no need to interleave */
> +		if (rem_pages < weight) {
> +			/* stay on current node, adjust il_weight */
> +			me->il_weight -= rem_pages;
> +			return total_allocated;
> +		} else if (rem_pages == weight) {
> +			/* move to next node / weight */
> +			me->il_prev = next_node_in(node, nodes);
> +			me->il_weight = get_il_weight(next_node);

Sigh, I managed to miss a small update that killed next_node in favor of
operating directly on il_prev.

Can you squash this fix into the patch?  Otherwise I can submit it as a
separate patch.

~Gregory

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7cd92f4ec0d7..2c1aef8eab70 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2382,7 +2382,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	unsigned int weight_total = 0;
 	unsigned long rem_pages = nr_pages;
 	nodemask_t nodes;
-	int nnodes, node, next_node;
+	int nnodes, node;
 	int resume_node = MAX_NUMNODES - 1;
 	u8 resume_weight = 0;
 	int prev_node;
@@ -2412,7 +2412,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 		} else if (rem_pages == weight) {
 			/* move to next node / weight */
 			me->il_prev = next_node_in(node, nodes);
-			me->il_weight = get_il_weight(next_node);
+			me->il_weight = get_il_weight(me->il_prev);
 			return total_allocated;
 		}
 		/* Otherwise we adjust remaining pages, continue from there */
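[Note for readers following the thread: get_il_weight() is not defined in
any hunk shown here; it comes from an earlier patch in this series.  A
minimal sketch of what such a helper plausibly looks like, assuming the
iw_table access pattern used in weighted_interleave_nid() below; the exact
definition is an assumption, not the series' verbatim code:]

/* Hypothetical sketch only: fetch the effective interleave weight for a
 * node, treating a missing table or a zero entry as the default of 1.
 */
static u8 get_il_weight(int node)
{
	u8 *table;
	u8 weight;

	rcu_read_lock();
	table = rcu_dereference(iw_table);
	/* detect system default usage */
	weight = table ? table[node] : 1;
	weight = weight ? weight : 1;
	rcu_read_unlock();
	return weight;
}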
Gregory Price <gourry.memverge@gmail.com> writes:

> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> using the current MPOL_INTERLEAVE could be a wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
> based interleave policy does not optimally distribute data to make use
> of their different bandwidth characteristics.
>
> Instead, interleave is more effective when the allocation policy follows
> each NUMA node's bandwidth weight rather than a simple 1:1 distribution.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
> enabling weighted interleave between NUMA nodes.  Weighted interleave
> allows for proportional distribution of memory across multiple NUMA
> nodes, preferably apportioned to match the bandwidth of each node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
> weight distribution is (2:1).
>
> Weights for each node can be assigned via the new sysfs extension:
> /sys/kernel/mm/mempolicy/weighted_interleave/
>
> For now, the default value of all nodes will be `1`, which matches
> the behavior of standard 1:1 round-robin interleave.  An extension
> will be added in the future to allow default values to be registered
> at kernel and device bringup time.
>
> The policy allocates a number of pages equal to the set weights.  For
> example, if the weights are (2,1), then 2 pages will be allocated on
> node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
>
> Some high level notes about the pieces of weighted interleave:
>
> current->il_prev:
>     Default interleave uses this to track the last used node.
>     Weighted interleave uses this to track the *current* node, and
>     when weight reaches 0 it will be used to acquire the next node.
>
> current->il_weight:
>     The active weight of the current node (current->il_prev)
>     When this reaches 0, current->il_prev is set to the next node
>     and current->il_weight is set to the next weight.

I still think that my description of the 2 fields above is easier to
understand:

  For weighted interleave, current->il_prev is the node from which we
  allocated a page in the previous allocation.  current->il_weight is
  the remaining weight for current->il_prev after the previous
  allocation.

But I will not force you to use this.  Use it only if you think it is
better.

> weighted_interleave_nodes:
>     Counts the number of allocations as they occur, and applies the
>     weight for the current node.  When the weight reaches 0, switch
>     to the next node.  Operates only on task->mempolicy.
>
> weighted_interleave_nid:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the node based on the given index.
>     Operates on VMA policies.
>
> bulk_array_weighted_interleave:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the number of "interleave rounds" as
>     well as any delta ("partial round").  Calculates the number of
>     pages for each node and allocates them.
>
>     If a node was scheduled for interleave via interleave_nodes, the
>     current weight will be allocated first.
>
>     Operates only on the task->mempolicy.
>
> One piece of complexity is the interaction between a recent refactor
> which split the logic to acquire the "ilx" (interleave index) of an
> allocation and the actual application of the interleave.  If a call
> to alloc_pages_mpol() were made with a weighted-interleave policy and
> ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would
> operate on a VMA policy - violating the description above.
>
> An inspection of all callers of alloc_pages_mpol() shows that all
> external callers set ilx to `0`, an index value, or will call
> get_vma_policy() to acquire the ilx.
>
> For example, mm/shmem.c may call into alloc_pages_mpol.  The call
> stacks all set (pgoff_t ilx) or end up in `get_vma_policy()`.  This
> enforces the `weighted_interleave_nodes()` and
> `weighted_interleave_nid()` policy requirements (task/vma
> respectively).
>
> Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> Co-developed-by: Rakie Kim <rakie.kim@sk.com>
> Signed-off-by: Rakie Kim <rakie.kim@sk.com>
> Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
> Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
> Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
> Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
> ---
>  .../admin-guide/mm/numa_memory_policy.rst |   9 +
>  include/linux/sched.h                     |   1 +
>  include/uapi/linux/mempolicy.h            |   1 +
>  mm/mempolicy.c                            | 231 +++++++++++++++++-
>  4 files changed, 238 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..a70f20ce1ffb 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
>  	can fall back to all existing numa nodes.  This is effectively
>  	MPOL_PREFERRED allowed for a mask rather than a single node.
>
> +MPOL_WEIGHTED_INTERLEAVE
> +	This mode operates the same as MPOL_INTERLEAVE, except that
> +	interleaving behavior is executed based on weights set in
> +	/sys/kernel/mm/mempolicy/weighted_interleave/
> +
> +	Weighted interleave allocates pages on nodes according to a
> +	weight.  For example if nodes [0,1] are weighted [5,2], 5 pages
> +	will be allocated on node0 for every 2 pages allocated on node1.
> +
>  NUMA memory policy supports the following optional mode flags:
>
>  MPOL_F_STATIC_NODES
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ffe8f618ab86..b9ce285d8c9c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1259,6 +1259,7 @@ struct task_struct {
>  	/* Protected by alloc_lock: */
>  	struct mempolicy	*mempolicy;
>  	short			il_prev;
> +	u8			il_weight;
>  	short			pref_node_fork;
>  #endif
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..1f9bb10d1a47 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,6 +23,7 @@ enum {
>  	MPOL_INTERLEAVE,
>  	MPOL_LOCAL,
>  	MPOL_PREFERRED_MANY,
> +	MPOL_WEIGHTED_INTERLEAVE,
>  	MPOL_MAX,	/* always last member of enum */
>  };
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 3bdfaf03b660..7cd92f4ec0d7 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -19,6 +19,13 @@
>   * for anonymous memory. For process policy an process counter
>   * is used.
>   *
> + * weighted interleave
> + *                Allocate memory interleaved over a set of nodes based on
> + *                a set of weights (per-node), with normal fallback if it
> + *                fails.  Otherwise operates the same as interleave.
> + *                Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
> + *                on node 0 for every 1 page allocated on node 1.
> + *
>   * bind           Only allocate memory on a specific set of nodes,
>   *                no fallback.
>   *                FIXME: memory is allocated starting with the first node
> @@ -441,6 +448,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>  		.create = mpol_new_nodemask,
>  		.rebind = mpol_rebind_preferred,
>  	},
> +	[MPOL_WEIGHTED_INTERLEAVE] = {
> +		.create = mpol_new_nodemask,
> +		.rebind = mpol_rebind_nodemask,
> +	},
>  };
>
>  static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> @@ -862,8 +873,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>
>  	old = current->mempolicy;
>  	current->mempolicy = new;
> -	if (new && new->mode == MPOL_INTERLEAVE)
> +	if (new && (new->mode == MPOL_INTERLEAVE ||
> +		    new->mode == MPOL_WEIGHTED_INTERLEAVE)) {
>  		current->il_prev = MAX_NUMNODES-1;
> +		current->il_weight = 0;
> +	}
>  	task_unlock(current);
>  	mpol_put(old);
>  	ret = 0;
> @@ -888,6 +902,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
>  	case MPOL_INTERLEAVE:
>  	case MPOL_PREFERRED:
>  	case MPOL_PREFERRED_MANY:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		*nodes = pol->nodes;
>  		break;
>  	case MPOL_LOCAL:
> @@ -972,6 +987,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  		} else if (pol == current->mempolicy &&
>  				pol->mode == MPOL_INTERLEAVE) {
>  			*policy = next_node_in(current->il_prev, pol->nodes);
> +		} else if (pol == current->mempolicy &&
> +				pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> +			if (current->il_weight)
> +				*policy = current->il_prev;
> +			else
> +				*policy = next_node_in(current->il_prev,
> +						       pol->nodes);
>  		} else {
>  			err = -EINVAL;
>  			goto out;
> @@ -1336,7 +1358,8 @@ static long do_mbind(unsigned long start, unsigned long len,
>  	 * VMAs, the nodes will still be interleaved from the targeted
>  	 * nodemask, but one by one may be selected differently.
>  	 */
> -	if (new->mode == MPOL_INTERLEAVE) {
> +	if (new->mode == MPOL_INTERLEAVE ||
> +	    new->mode == MPOL_WEIGHTED_INTERLEAVE) {
>  		struct page *page;
>  		unsigned int order;
>  		unsigned long addr = -EFAULT;
> @@ -1784,7 +1807,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * @vma: virtual memory area whose policy is sought
>   * @addr: address in @vma for shared policy lookup
>   * @order: 0, or appropriate huge_page_order for interleaving
> - * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
> + * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or
> + *       MPOL_WEIGHTED_INTERLEAVE
>   *
>   * Returns effective policy for a VMA at specified address.
>   * Falls back to current->mempolicy or system default policy, as necessary.
> @@ -1801,7 +1825,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  	pol = __get_vma_policy(vma, addr, ilx);
>  	if (!pol)
>  		pol = get_task_policy(current);
> -	if (pol->mode == MPOL_INTERLEAVE) {
> +	if (pol->mode == MPOL_INTERLEAVE ||
> +	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
>  		*ilx += vma->vm_pgoff >> order;
>  		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
>  	}
> @@ -1851,6 +1876,22 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>  	return zone >= dynamic_policy_zone;
>  }
>
> +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> +{
> +	unsigned int node = current->il_prev;
> +
> +	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> +		node = next_node_in(node, policy->nodes);
> +		/* can only happen if nodemask is being rebound */
> +		if (node == MAX_NUMNODES)
> +			return node;

I feel a little unsafe reading policy->nodes at the same time it is
being written during rebind.  Is it better to use a seqlock to guarantee
its consistency?  It's unnecessary to be a part of this series though.

> +		current->il_prev = node;
> +		current->il_weight = get_il_weight(node);
> +	}
> +	current->il_weight--;
> +	return node;
> +}
> +
>  /* Do dynamic interleaving for a process */
>  static unsigned int interleave_nodes(struct mempolicy *policy)
>  {
> @@ -1885,6 +1926,9 @@ unsigned int mempolicy_slab_node(void)
>  	case MPOL_INTERLEAVE:
>  		return interleave_nodes(policy);
>
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		return weighted_interleave_nodes(policy);
> +
>  	case MPOL_BIND:
>  	case MPOL_PREFERRED_MANY:
>  	{
> @@ -1923,6 +1967,45 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
>  	return nodes_weight(*mask);
>  }
>
> +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> +{
> +	nodemask_t nodemask;
> +	unsigned int target, nr_nodes;
> +	u8 __rcu *table;
> +	unsigned int weight_total = 0;
> +	u8 weight;
> +	int nid;
> +
> +	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
> +	if (!nr_nodes)
> +		return numa_node_id();
> +
> +	rcu_read_lock();
> +	table = rcu_dereference(iw_table);
> +	/* calculate the total weight */
> +	for_each_node_mask(nid, nodemask) {
> +		/* detect system default usage */
> +		weight = table ? table[nid] : 1;
> +		weight = weight ? weight : 1;
> +		weight_total += weight;
> +	}
> +
> +	/* Calculate the node offset based on totals */
> +	target = ilx % weight_total;
> +	nid = first_node(nodemask);
> +	while (target) {
> +		/* detect system default usage */
> +		weight = table ? table[nid] : 1;
> +		weight = weight ? weight : 1;

I found this pattern duplicated in several places in this patch.  Can we
define a function like the one below to remove the duplication?

u8 __get_il_weight(u8 *table, int nid)
{
	u8 weight;

	weight = table ? table[nid] : 1;
	return weight ? : 1;
}

This can be used in alloc_pages_bulk_array_weighted_interleave() to copy
from the global to the local weights array too.  But this isn't a big
deal.  I will leave it to you to decide.

> +		if (target < weight)
> +			break;
> +		target -= weight;
> +		nid = next_node_in(nid, nodemask);
> +	}
> +	rcu_read_unlock();
> +	return nid;
> +}
> +
>  /*
>   * Do static interleaving for interleave index @ilx.  Returns the ilx'th
>   * node in pol->nodes (starting from ilx=0), wrapping around if ilx
> @@ -1983,6 +2066,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>  		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
>  			interleave_nodes(pol) : interleave_nid(pol, ilx);
>  		break;
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
> +			weighted_interleave_nodes(pol) :
> +			weighted_interleave_nid(pol, ilx);
> +		break;
>  	}
>
>  	return nodemask;
> @@ -2044,6 +2132,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  	case MPOL_PREFERRED_MANY:
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		*mask = mempolicy->nodes;
>  		break;
>
> @@ -2144,6 +2233,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  	 * node in its nodemask, we allocate the standard way.
>  	 */
>  	if (pol->mode != MPOL_INTERLEAVE &&
> +	    pol->mode != MPOL_WEIGHTED_INTERLEAVE &&
>  	    (!nodemask || node_isset(nid, *nodemask))) {
>  		/*
>  		 * First, try to allocate THP only on local node, but
> @@ -2279,6 +2369,127 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
>  	return total_allocated;
>  }
>
> +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> +		struct mempolicy *pol, unsigned long nr_pages,
> +		struct page **page_array)
> +{
> +	struct task_struct *me = current;
> +	unsigned long total_allocated = 0;
> +	unsigned long nr_allocated = 0;
> +	unsigned long rounds;
> +	unsigned long node_pages, delta;
> +	u8 __rcu *table, *weights, weight;
> +	unsigned int weight_total = 0;
> +	unsigned long rem_pages = nr_pages;
> +	nodemask_t nodes;
> +	int nnodes, node, next_node;
> +	int resume_node = MAX_NUMNODES - 1;
> +	u8 resume_weight = 0;
> +	int prev_node;
> +	int i;
> +
> +	if (!nr_pages)
> +		return 0;
> +
> +	nnodes = read_once_policy_nodemask(pol, &nodes);
> +	if (!nnodes)
> +		return 0;
> +
> +	/* Continue allocating from most recent node and adjust the nr_pages */
> +	node = me->il_prev;
> +	weight = me->il_weight;
> +	if (weight && node_isset(node, nodes)) {
> +		node_pages = min(rem_pages, weight);
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		/* if that's all the pages, no need to interleave */
> +		if (rem_pages < weight) {
> +			/* stay on current node, adjust il_weight */
> +			me->il_weight -= rem_pages;
> +			return total_allocated;
> +		} else if (rem_pages == weight) {
> +			/* move to next node / weight */
> +			me->il_prev = next_node_in(node, nodes);
> +			me->il_weight = get_il_weight(next_node);
> +			return total_allocated;
> +		}
> +		/* Otherwise we adjust remaining pages, continue from there */
> +		rem_pages -= weight;
> +	}
> +	/* clear active weight in case of an allocation failure */
> +	me->il_weight = 0;
> +	prev_node = node;
> +
> +	/* create a local copy of node weights to operate on outside rcu */
> +	weights = kzalloc(nr_node_ids, GFP_KERNEL);
> +	if (!weights)
> +		return total_allocated;
> +
> +	rcu_read_lock();
> +	table = rcu_dereference(iw_table);
> +	if (table)
> +		memcpy(weights, table, nr_node_ids);
> +	rcu_read_unlock();
> +
> +	/* calculate total, detect system default usage */
> +	for_each_node_mask(node, nodes) {
> +		if (!weights[node])
> +			weights[node] = 1;
> +		weight_total += weights[node];
> +	}
> +
> +	/*
> +	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
> +	 * Track which node weighted interleave should resume from.
> +	 *
> +	 * if (rounds > 0) and (delta == 0), resume_node will always be
> +	 * the node following prev_node and its weight.
> +	 */
> +	rounds = rem_pages / weight_total;
> +	delta = rem_pages % weight_total;
> +	resume_node = next_node_in(prev_node, nodes);
> +	resume_weight = weights[resume_node];
> +	for (i = 0; i < nnodes; i++) {
> +		node = next_node_in(prev_node, nodes);
> +		weight = weights[node];
> +		node_pages = weight * rounds;
> +		/* If a delta exists, add this node's portion of the delta */
> +		if (delta > weight) {
> +			node_pages += weight;
> +			delta -= weight;
> +		} else if (delta) {
> +			node_pages += delta;
> +			/* delta may deplete on a boundary or w/ a remainder */
> +			if (delta == weight) {
> +				/* boundary: resume from next node/weight */
> +				resume_node = next_node_in(node, nodes);
> +				resume_weight = weights[resume_node];
> +			} else {
> +				/* remainder: resume this node w/ remainder */
> +				resume_node = node;
> +				resume_weight = weight - delta;
> +			}

If we are comfortable leaving resume_weight == 0, then the above branch
can be simplified to:

	resume_node = node;
	resume_weight = weight - delta;

But this is a style issue again.  I will leave it to you to decide.

So, except for the issue you pointed out already, the whole series looks
good to me!  Thanks!  Feel free to add

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

to the whole series.

> +			delta = 0;
> +		}
> +		/* node_pages can be 0 if an allocation fails and rounds == 0 */
> +		if (!node_pages)
> +			break;
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		if (total_allocated == nr_pages)
> +			break;
> +		prev_node = node;
> +	}
> +	me->il_prev = resume_node;
> +	me->il_weight = resume_weight;
> +	kfree(weights);
> +	return total_allocated;
> +}
> +
>  static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
>  		struct mempolicy *pol, unsigned long nr_pages,
>  		struct page **page_array)
> @@ -2319,6 +2530,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
>  		return alloc_pages_bulk_array_interleave(gfp, pol,
>  							 nr_pages, page_array);
>
> +	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
> +		return alloc_pages_bulk_array_weighted_interleave(
> +				gfp, pol, nr_pages, page_array);
> +
>  	if (pol->mode == MPOL_PREFERRED_MANY)
>  		return alloc_pages_bulk_array_preferred_many(gfp,
>  				numa_node_id(), pol, nr_pages, page_array);
> @@ -2394,6 +2609,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
>  	case MPOL_INTERLEAVE:
>  	case MPOL_PREFERRED:
>  	case MPOL_PREFERRED_MANY:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		return !!nodes_equal(a->nodes, b->nodes);
>  	case MPOL_LOCAL:
>  		return true;
> @@ -2530,6 +2746,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
>  		polnid = interleave_nid(pol, ilx);
>  		break;
>
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		polnid = weighted_interleave_nid(pol, ilx);
> +		break;
> +
>  	case MPOL_PREFERRED:
>  		if (node_isset(curnid, pol->nodes))
>  			goto out;
> @@ -2904,6 +3124,7 @@ static const char * const policy_modes[] =
>  	[MPOL_PREFERRED]  = "prefer",
>  	[MPOL_BIND]       = "bind",
>  	[MPOL_INTERLEAVE] = "interleave",
> +	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
>  	[MPOL_LOCAL]      = "local",
>  	[MPOL_PREFERRED_MANY]  = "prefer (many)",
>  };
> @@ -2963,6 +3184,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
>  		}
>  		break;
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		/*
>  		 * Default to online nodes with memory if no nodelist
>  		 */
> @@ -3073,6 +3295,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
>  	case MPOL_PREFERRED_MANY:
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		nodes = pol->nodes;
>  		break;
>  	default:

--
Best Regards,
Huang, Ying
On Wed, Jan 31, 2024 at 02:43:12PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> >
> > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> > +{
> > +	unsigned int node = current->il_prev;
> > +
> > +	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> > +		node = next_node_in(node, policy->nodes);
> > +		/* can only happen if nodemask is being rebound */
> > +		if (node == MAX_NUMNODES)
> > +			return node;
>
> I feel a little unsafe reading policy->nodes at the same time it is
> being written during rebind.  Is it better to use a seqlock to
> guarantee its consistency?  It's unnecessary to be a part of this
> series though.
>

I think this is handled already?  It is definitely an explicit race
condition that is documented elsewhere:

/*
 * mpol_rebind_policy - Migrate a policy to a different set of nodes
 *
 * Per-vma policies are protected by mmap_lock. Allocations using per-task
 * policies are protected by task->mems_allowed_seq to prevent a premature
 * OOM/allocation failure due to parallel nodemask modification.
 */

example from slub:

	do {
		cpuset_mems_cookie = read_mems_allowed_begin();
		zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
		...
	} while (read_mems_allowed_retry(cpuset_mems_cookie));

A quick perusal through other allocators shows similar checks:

	page_alloc.c - check_retry_cpuset()
	filemap.c    - filemap_alloc_folio()

If we ever want mempolicy to be swappable from outside the current task
context, this will have to change most likely - but that's another
feature for another day.

> > +	while (target) {
> > +		/* detect system default usage */
> > +		weight = table ? table[nid] : 1;
> > +		weight = weight ? weight : 1;
>
> I found this pattern duplicated in several places in this patch.  Can
> we define a function like the one below to remove the duplication?
>
> u8 __get_il_weight(u8 *table, int nid)
> {
> 	u8 weight;
>
> 	weight = table ? table[nid] : 1;
> 	return weight ? : 1;
> }
>

When we implement the system-default array, this will change to:

	weight = sysfs_table ? sysfs_table[nid] : default_table[nid];

This cleanup will get picked up in that patch set, since this code is
going to change anyway.

> > +			if (delta == weight) {
> > +				/* boundary: resume from next node/weight */
> > +				resume_node = next_node_in(node, nodes);
> > +				resume_weight = weights[resume_node];
> > +			} else {
> > +				/* remainder: resume this node w/ remainder */
> > +				resume_node = node;
> > +				resume_weight = weight - delta;
> > +			}
>
> If we are comfortable leaving resume_weight == 0, then the above branch
> can be simplified to:
>
> 	resume_node = node;
> 	resume_weight = weight - delta;
>
> But this is a style issue again.  I will leave it to you to decide.

Good point, and in fact there's a similar branch in the first half of
the function that can be simplified.  Will follow up with a style patch.

 mm/mempolicy.c | 21 ++++-----------------
 1 file changed, 4 insertions(+), 17 deletions(-)

My favorite style of patch :D

Andrew, if you happen to be monitoring: this is the patch (not tested
yet, but it's pretty obvious; otherwise I'll submit it individually
tomorrow).

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2c1aef8eab70..b0ca9bcdd64c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 		page_array += nr_allocated;
 		total_allocated += nr_allocated;
 		/* if that's all the pages, no need to interleave */
-		if (rem_pages < weight) {
-			/* stay on current node, adjust il_weight */
+		if (rem_pages <= weight) {
 			me->il_weight -= rem_pages;
 			return total_allocated;
-		} else if (rem_pages == weight) {
-			/* move to next node / weight */
-			me->il_prev = next_node_in(node, nodes);
-			me->il_weight = get_il_weight(me->il_prev);
-			return total_allocated;
 		}
 		/* Otherwise we adjust remaining pages, continue from there */
 		rem_pages -= weight;
@@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 			node_pages += weight;
 			delta -= weight;
 		} else if (delta) {
+			/* when delta is deleted, resume from that node */
 			node_pages += delta;
-			/* delta may deplete on a boundary or w/ a remainder */
-			if (delta == weight) {
-				/* boundary: resume from next node/weight */
-				resume_node = next_node_in(node, nodes);
-				resume_weight = weights[resume_node];
-			} else {
-				/* remainder: resume this node w/ remainder */
-				resume_node = node;
-				resume_weight = weight - delta;
-			}
+			resume_node = node;
+			resume_weight = weight - delta;
 			delta = 0;
 		}
 		/* node_pages can be 0 if an allocation fails and rounds == 0 */

> So, except for the issue you pointed out already, the whole series
> looks good to me!  Thanks!  Feel free to add
>
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
>
> to the whole series.
>

Thank you so much for your patience with me!  I appreciate all the help.
I am looking forward to this feature very much!

~Gregory
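[Note for readers: the rounds/delta bookkeeping being simplified above is
easy to sanity-check outside the kernel.  A standalone, userspace-only
sketch of the same arithmetic follows; node iteration is simplified to a
plain array, so this models the math, not the kernel code:]

/* Example: weights {2, 1}, rem_pages = 10 -> weight_total = 3,
 * rounds = 3, delta = 1, so node0 gets 2*3 + 1 = 7 pages and
 * node1 gets 1*3 = 3 pages.
 */
#include <stdio.h>

int main(void)
{
	unsigned char weights[] = { 2, 1 };	/* per-node weights */
	unsigned long rem_pages = 10;
	unsigned long weight_total = 0, rounds, delta, node_pages;
	int nnodes = 2, node;

	for (node = 0; node < nnodes; node++)
		weight_total += weights[node];

	rounds = rem_pages / weight_total;	/* full interleave rounds */
	delta = rem_pages % weight_total;	/* partial-round remainder */

	for (node = 0; node < nnodes; node++) {
		node_pages = weights[node] * rounds;
		if (delta > weights[node]) {
			node_pages += weights[node];
			delta -= weights[node];
		} else if (delta) {
			node_pages += delta;	/* resume from here next time */
			delta = 0;
		}
		printf("node%d: %lu pages\n", node, node_pages);
	}
	return 0;
}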
Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Jan 31, 2024 at 02:43:12PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> >
>> > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>> > +{
>> > +	unsigned int node = current->il_prev;
>> > +
>> > +	if (!current->il_weight || !node_isset(node, policy->nodes)) {
>> > +		node = next_node_in(node, policy->nodes);
>> > +		/* can only happen if nodemask is being rebound */
>> > +		if (node == MAX_NUMNODES)
>> > +			return node;
>>
>> I feel a little unsafe reading policy->nodes at the same time it is
>> being written during rebind.  Is it better to use a seqlock to
>> guarantee its consistency?  It's unnecessary to be a part of this
>> series though.
>>
>
> I think this is handled already?  It is definitely an explicit race
> condition that is documented elsewhere:
>
> /*
>  * mpol_rebind_policy - Migrate a policy to a different set of nodes
>  *
>  * Per-vma policies are protected by mmap_lock. Allocations using per-task
>  * policies are protected by task->mems_allowed_seq to prevent a premature
>  * OOM/allocation failure due to parallel nodemask modification.
>  */

Thanks for pointing this out!

If we use the task->mems_allowed_seq reader side in
weighted_interleave_nodes(), we can guarantee the consistency of
policy->nodes.  That may not be warranted, because it's not a big deal
to allocate 1 page on the wrong node.

It makes more sense to do that in
alloc_pages_bulk_array_weighted_interleave(), because a lot of pages may
be allocated there.

> example from slub:
>
> 	do {
> 		cpuset_mems_cookie = read_mems_allowed_begin();
> 		zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
> 		...
> 	} while (read_mems_allowed_retry(cpuset_mems_cookie));
>
> A quick perusal through other allocators shows similar checks:
>
> 	page_alloc.c - check_retry_cpuset()
> 	filemap.c    - filemap_alloc_folio()
>
> If we ever want mempolicy to be swappable from outside the current task
> context, this will have to change most likely - but that's another
> feature for another day.

--
Best Regards,
Huang, Ying
On Wed, Jan 31, 2024 at 05:19:51PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
>
> > On Wed, Jan 31, 2024 at 02:43:12PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry.memverge@gmail.com> writes:
> >> >
> >> > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> >> > +{
> >> > +	unsigned int node = current->il_prev;
> >> > +
> >> > +	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> >> > +		node = next_node_in(node, policy->nodes);
> >> > +		/* can only happen if nodemask is being rebound */
> >> > +		if (node == MAX_NUMNODES)
> >> > +			return node;
> >>
> >> I feel a little unsafe reading policy->nodes at the same time it is
> >> being written during rebind.  Is it better to use a seqlock to
> >> guarantee its consistency?  It's unnecessary to be a part of this
> >> series though.
> >>
> >
> > I think this is handled already?  It is definitely an explicit race
> > condition that is documented elsewhere:
> >
> > /*
> >  * mpol_rebind_policy - Migrate a policy to a different set of nodes
> >  *
> >  * Per-vma policies are protected by mmap_lock. Allocations using per-task
> >  * policies are protected by task->mems_allowed_seq to prevent a premature
> >  * OOM/allocation failure due to parallel nodemask modification.
> >  */
>
> Thanks for pointing this out!
>
> If we use the task->mems_allowed_seq reader side in
> weighted_interleave_nodes(), we can guarantee the consistency of
> policy->nodes.  That may not be warranted, because it's not a big deal
> to allocate 1 page on the wrong node.
>
> It makes more sense to do that in
> alloc_pages_bulk_array_weighted_interleave(), because a lot of pages
> may be allocated there.
>

That's probably worth just adding now; I'll do it and squash the style
updates into the branch.

Sorry Andrew, I guess 1 last version is inbound :]  I'll pick up the
reviewed tags along the way.

~Gregory
On Wed, Jan 31, 2024 at 05:19:51PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
>
> >
> > I think this is handled already?  It is definitely an explicit race
> > condition that is documented elsewhere:
> >
> > /*
> >  * mpol_rebind_policy - Migrate a policy to a different set of nodes
> >  *
> >  * Per-vma policies are protected by mmap_lock. Allocations using per-task
> >  * policies are protected by task->mems_allowed_seq to prevent a premature
> >  * OOM/allocation failure due to parallel nodemask modification.
> >  */
>
> Thanks for pointing this out!
>
> If we use the task->mems_allowed_seq reader side in
> weighted_interleave_nodes(), we can guarantee the consistency of
> policy->nodes.  That may not be warranted, because it's not a big deal
> to allocate 1 page on the wrong node.
>
> It makes more sense to do that in
> alloc_pages_bulk_array_weighted_interleave(), because a lot of pages
> may be allocated there.
>

To save another version if there are issues, here are the 3 diffs that
I have left.  If you are good with these changes, I'll squash the first
two into the third commit, keep the last one as a separate commit (it
changes the interleave_nodes() logic too), and submit v5 w/ your
reviewed tag on all of them.


Fix one (pedantic?) warning from syzbot:
----------------------------------------

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b1437396c357..dfd097009606 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2391,7 +2391,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
-	u8 __rcu *table, *weights, weight;
+	u8 __rcu *table, __rcu *weights, weight;
 	unsigned int weight_total = 0;
 	unsigned long rem_pages = nr_pages;
 	nodemask_t nodes;


Simplifying resume_node/weight logic:
-------------------------------------

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2c1aef8eab70..b0ca9bcdd64c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 		page_array += nr_allocated;
 		total_allocated += nr_allocated;
 		/* if that's all the pages, no need to interleave */
-		if (rem_pages < weight) {
-			/* stay on current node, adjust il_weight */
+		if (rem_pages <= weight) {
 			me->il_weight -= rem_pages;
 			return total_allocated;
-		} else if (rem_pages == weight) {
-			/* move to next node / weight */
-			me->il_prev = next_node_in(node, nodes);
-			me->il_weight = get_il_weight(me->il_prev);
-			return total_allocated;
 		}
 		/* Otherwise we adjust remaining pages, continue from there */
 		rem_pages -= weight;
@@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 			node_pages += weight;
 			delta -= weight;
 		} else if (delta) {
+			/* when delta is deleted, resume from that node */
 			node_pages += delta;
-			/* delta may deplete on a boundary or w/ a remainder */
-			if (delta == weight) {
-				/* boundary: resume from next node/weight */
-				resume_node = next_node_in(node, nodes);
-				resume_weight = weights[resume_node];
-			} else {
-				/* remainder: resume this node w/ remainder */
-				resume_node = node;
-				resume_weight = weight - delta;
-			}
+			resume_node = node;
+			resume_weight = weight - delta;
 			delta = 0;
 		}
 		/* node_pages can be 0 if an allocation fails and rounds == 0 */


task->mems_allowed_seq protection (added as 4th patch)
------------------------------------------------------

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b0ca9bcdd64c..b1437396c357 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1879,10 +1879,15 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int node = current->il_prev;
+	unsigned int cpuset_mems_cookie;
 
+retry:
+	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+	cpuset_mems_cookie = read_mems_allowed_begin();
 	if (!current->il_weight || !node_isset(node, policy->nodes)) {
 		node = next_node_in(node, policy->nodes);
-		/* can only happen if nodemask is being rebound */
+		if (read_mems_allowed_retry(cpuset_mems_cookie))
+			goto retry;
 		if (node == MAX_NUMNODES)
 			return node;
 		current->il_prev = node;
@@ -1896,10 +1901,17 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int nid;
+	unsigned int cpuset_mems_cookie;
+
+	/* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+	do {
+		cpuset_mems_cookie = read_mems_allowed_begin();
+		nid = next_node_in(current->il_prev, policy->nodes);
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 
-	nid = next_node_in(current->il_prev, policy->nodes);
 	if (nid < MAX_NUMNODES)
 		current->il_prev = nid;
+
 	return nid;
 }
 
@@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 		struct page **page_array)
 {
 	struct task_struct *me = current;
+	unsigned int cpuset_mems_cookie;
 	unsigned long total_allocated = 0;
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
@@ -2388,10 +2401,17 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	int prev_node;
 	int i;
 
+
 	if (!nr_pages)
 		return 0;
 
-	nnodes = read_once_policy_nodemask(pol, &nodes);
+	/* read the nodes onto the stack, retry if done during rebind */
+	do {
+		cpuset_mems_cookie = read_mems_allowed_begin();
+		nnodes = read_once_policy_nodemask(pol, &nodes);
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+	/* if the nodemask has become invalid, we cannot do anything */
 	if (!nnodes)
 		return 0;
Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Jan 31, 2024 at 05:19:51PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>>
>> >
>> > I think this is handled already?  It is definitely an explicit race
>> > condition that is documented elsewhere:
>> >
>> > /*
>> >  * mpol_rebind_policy - Migrate a policy to a different set of nodes
>> >  *
>> >  * Per-vma policies are protected by mmap_lock. Allocations using per-task
>> >  * policies are protected by task->mems_allowed_seq to prevent a premature
>> >  * OOM/allocation failure due to parallel nodemask modification.
>> >  */
>>
>> Thanks for pointing this out!
>>
>> If we use the task->mems_allowed_seq reader side in
>> weighted_interleave_nodes(), we can guarantee the consistency of
>> policy->nodes.  That may not be warranted, because it's not a big deal
>> to allocate 1 page on the wrong node.
>>
>> It makes more sense to do that in
>> alloc_pages_bulk_array_weighted_interleave(), because a lot of pages
>> may be allocated there.
>>
>
> To save another version if there are issues, here are the 3 diffs that
> I have left.  If you are good with these changes, I'll squash the first
> two into the third commit, keep the last one as a separate commit (it
> changes the interleave_nodes() logic too), and submit v5 w/ your
> reviewed tag on all of them.
>
>
> Fix one (pedantic?) warning from syzbot:
> ----------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b1437396c357..dfd097009606 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2391,7 +2391,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  	unsigned long nr_allocated = 0;
>  	unsigned long rounds;
>  	unsigned long node_pages, delta;
> -	u8 __rcu *table, *weights, weight;
> +	u8 __rcu *table, __rcu *weights, weight;

The __rcu usage can be checked with `sparse` directly.  For example,

  make C=1 mm/mempolicy.o

More details can be found in

  https://www.kernel.org/doc/html/latest/dev-tools/sparse.html

Per my understanding, we shouldn't use "__rcu" here.  Please search
"__rcu" in the following document.

  https://www.kernel.org/doc/html/latest/RCU/checklist.html

>  	unsigned int weight_total = 0;
>  	unsigned long rem_pages = nr_pages;
>  	nodemask_t nodes;
>
>
> Simplifying resume_node/weight logic:
> -------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 2c1aef8eab70..b0ca9bcdd64c 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  		page_array += nr_allocated;
>  		total_allocated += nr_allocated;
>  		/* if that's all the pages, no need to interleave */
> -		if (rem_pages < weight) {
> -			/* stay on current node, adjust il_weight */
> +		if (rem_pages <= weight) {
>  			me->il_weight -= rem_pages;
>  			return total_allocated;
> -		} else if (rem_pages == weight) {
> -			/* move to next node / weight */
> -			me->il_prev = next_node_in(node, nodes);
> -			me->il_weight = get_il_weight(me->il_prev);
> -			return total_allocated;
>  		}
>  		/* Otherwise we adjust remaining pages, continue from there */
>  		rem_pages -= weight;
> @@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  			node_pages += weight;
>  			delta -= weight;
>  		} else if (delta) {
> +			/* when delta is deleted, resume from that node */
                                         ~~~~~~~
                                         depleted?

>  			node_pages += delta;
> -			/* delta may deplete on a boundary or w/ a remainder */
> -			if (delta == weight) {
> -				/* boundary: resume from next node/weight */
> -				resume_node = next_node_in(node, nodes);
> -				resume_weight = weights[resume_node];
> -			} else {
> -				/* remainder: resume this node w/ remainder */
> -				resume_node = node;
> -				resume_weight = weight - delta;
> -			}
> +			resume_node = node;
> +			resume_weight = weight - delta;
>  			delta = 0;
>  		}
>  		/* node_pages can be 0 if an allocation fails and rounds == 0 */
>
>
> task->mems_allowed_seq protection (added as 4th patch)
> ------------------------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b0ca9bcdd64c..b1437396c357 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1879,10 +1879,15 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>  static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>  {
>  	unsigned int node = current->il_prev;
> +	unsigned int cpuset_mems_cookie;
>
> +retry:
> +	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
> +	cpuset_mems_cookie = read_mems_allowed_begin();
>  	if (!current->il_weight || !node_isset(node, policy->nodes)) {
>  		node = next_node_in(node, policy->nodes);

node will be changed in the loop.  So we need to change the logic here.

> -		/* can only happen if nodemask is being rebound */
> +		if (read_mems_allowed_retry(cpuset_mems_cookie))
> +			goto retry;
>  		if (node == MAX_NUMNODES)
>  			return node;
>  		current->il_prev = node;
> @@ -1896,10 +1901,17 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>  static unsigned int interleave_nodes(struct mempolicy *policy)
>  {
>  	unsigned int nid;
> +	unsigned int cpuset_mems_cookie;
> +
> +	/* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
> +	do {
> +		cpuset_mems_cookie = read_mems_allowed_begin();
> +		nid = next_node_in(current->il_prev, policy->nodes);
> +	} while (read_mems_allowed_retry(cpuset_mems_cookie));
>
> -	nid = next_node_in(current->il_prev, policy->nodes);
>  	if (nid < MAX_NUMNODES)
>  		current->il_prev = nid;
> +
>  	return nid;
>  }
>
> @@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  		struct page **page_array)
>  {
>  	struct task_struct *me = current;
> +	unsigned int cpuset_mems_cookie;
>  	unsigned long total_allocated = 0;
>  	unsigned long nr_allocated = 0;
>  	unsigned long rounds;
> @@ -2388,10 +2401,17 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  	int prev_node;
>  	int i;
>
> +

Change by accident?

>  	if (!nr_pages)
>  		return 0;
>
> -	nnodes = read_once_policy_nodemask(pol, &nodes);
> +	/* read the nodes onto the stack, retry if done during rebind */
> +	do {
> +		cpuset_mems_cookie = read_mems_allowed_begin();
> +		nnodes = read_once_policy_nodemask(pol, &nodes);
> +	} while (read_mems_allowed_retry(cpuset_mems_cookie));
> +
> +	/* if the nodemask has become invalid, we cannot do anything */
>  	if (!nnodes)
>  		return 0;

--
Best Regards,
Huang, Ying
On Thu, Feb 01, 2024 at 09:55:07AM +0800, Huang, Ying wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index b1437396c357..dfd097009606 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2391,7 +2391,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> >  	unsigned long nr_allocated = 0;
> >  	unsigned long rounds;
> >  	unsigned long node_pages, delta;
> > -	u8 __rcu *table, *weights, weight;
> > +	u8 __rcu *table, __rcu *weights, weight;
>
> The __rcu usage can be checked with `sparse` directly.  For example,
>
>   make C=1 mm/mempolicy.o
>
> More details can be found in
>
>   https://www.kernel.org/doc/html/latest/dev-tools/sparse.html
>
> Per my understanding, we shouldn't use "__rcu" here.  Please search
> "__rcu" in the following document.
>
>   https://www.kernel.org/doc/html/latest/RCU/checklist.html
>

Thanks for this, I will sort this out and respond here with changes
before v5.

> > @@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> >  			node_pages += weight;
> >  			delta -= weight;
> >  		} else if (delta) {
> > +			/* when delta is deleted, resume from that node */
>                                          ~~~~~~~
>                                          depleted?

ack.

> > +retry:
> > +	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
> > +	cpuset_mems_cookie = read_mems_allowed_begin();
> >  	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> >  		node = next_node_in(node, policy->nodes);
>
> node will be changed in the loop.  So we need to change the logic here.
>

Good catch, stupid mistake.

ack.

> > @@ -2388,10 +2401,17 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> >  	int prev_node;
> >  	int i;
> >
> > +
>
> Change by accident?
>

ack.

~Gregory
On Thu, Feb 01, 2024 at 09:55:07AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> > -	u8 __rcu *table, *weights, weight;
> > +	u8 __rcu *table, __rcu *weights, weight;
>
> The __rcu usage can be checked with `sparse` directly.  For example,
>
>   make C=1 mm/mempolicy.o
>

Fixed and squashed; all the __rcu usage I had, except on the global
pointer, has been removed.  Thanks for the reference material, I was
struggling to understand that.

> > task->mems_allowed_seq protection (added as 4th patch)
> > ------------------------------------------------------
> >
> > +	cpuset_mems_cookie = read_mems_allowed_begin();
> >  	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> >  		node = next_node_in(node, policy->nodes);
>
> node will be changed in the loop.  So we need to change the logic here.
>

New patch below; if it all looks good I'll ship it in v5.

~Gregory

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d8cc3a577986..4e5a640d10b8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 
 static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
-	unsigned int node = current->il_prev;
-
-	if (!current->il_weight || !node_isset(node, policy->nodes)) {
-		node = next_node_in(node, policy->nodes);
-		/* can only happen if nodemask is being rebound */
+	unsigned int node;
+	unsigned int cpuset_mems_cookie;
+
+retry:
+	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+	cpuset_mems_cookie = read_mems_allowed_begin();
+	if (!current->il_weight ||
+	    !node_isset(current->il_prev, policy->nodes)) {
+		node = next_node_in(current->il_prev, policy->nodes);
+		if (read_mems_allowed_retry(cpuset_mems_cookie))
+			goto retry;
 		if (node == MAX_NUMNODES)
 			return node;
 		current->il_prev = node;
@@ -1896,8 +1902,14 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int nid;
+	unsigned int cpuset_mems_cookie;
+
+	/* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+	do {
+		cpuset_mems_cookie = read_mems_allowed_begin();
+		nid = next_node_in(current->il_prev, policy->nodes);
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 
-	nid = next_node_in(current->il_prev, policy->nodes);
 	if (nid < MAX_NUMNODES)
 		current->il_prev = nid;
 	return nid;
@@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 		struct page **page_array)
 {
 	struct task_struct *me = current;
+	unsigned int cpuset_mems_cookie;
 	unsigned long total_allocated = 0;
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
@@ -2391,7 +2404,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	if (!nr_pages)
 		return 0;
 
-	nnodes = read_once_policy_nodemask(pol, &nodes);
+	/* read the nodes onto the stack, retry if done during rebind */
+	do {
+		cpuset_mems_cookie = read_mems_allowed_begin();
+		nnodes = read_once_policy_nodemask(pol, &nodes);
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+	/* if the nodemask has become invalid, we cannot do anything */
 	if (!nnodes)
 		return 0;
Gregory Price <gregory.price@memverge.com> writes:

> On Thu, Feb 01, 2024 at 09:55:07AM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> > -	u8 __rcu *table, *weights, weight;
>> > +	u8 __rcu *table, __rcu *weights, weight;
>>
>> The __rcu usage can be checked with `sparse` directly.  For example,
>>
>>   make C=1 mm/mempolicy.o
>>
>
> Fixed and squashed; all the __rcu usage I had, except on the global
> pointer, has been removed.  Thanks for the reference material, I was
> struggling to understand that.
>
>> > task->mems_allowed_seq protection (added as 4th patch)
>> > ------------------------------------------------------
>> >
>> > +	cpuset_mems_cookie = read_mems_allowed_begin();
>> >  	if (!current->il_weight || !node_isset(node, policy->nodes)) {
>> >  		node = next_node_in(node, policy->nodes);
>>
>> node will be changed in the loop.  So we need to change the logic here.
>>
>
> New patch below; if it all looks good I'll ship it in v5.
>
> ~Gregory
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d8cc3a577986..4e5a640d10b8 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>
>  static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>  {
> -	unsigned int node = current->il_prev;
> -
> -	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> -		node = next_node_in(node, policy->nodes);
> -		/* can only happen if nodemask is being rebound */
> +	unsigned int node;

IIUC, "node" may be used without initialization.

--
Best Regards,
Huang, Ying

> +	unsigned int cpuset_mems_cookie;
> +
> +retry:
> +	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
> +	cpuset_mems_cookie = read_mems_allowed_begin();
> +	if (!current->il_weight ||
> +	    !node_isset(current->il_prev, policy->nodes)) {
> +		node = next_node_in(current->il_prev, policy->nodes);
> +		if (read_mems_allowed_retry(cpuset_mems_cookie))
> +			goto retry;
>  		if (node == MAX_NUMNODES)
>  			return node;
>  		current->il_prev = node;
> @@ -1896,8 +1902,14 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>  static unsigned int interleave_nodes(struct mempolicy *policy)
>  {
>  	unsigned int nid;
> +	unsigned int cpuset_mems_cookie;
> +
> +	/* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
> +	do {
> +		cpuset_mems_cookie = read_mems_allowed_begin();
> +		nid = next_node_in(current->il_prev, policy->nodes);
> +	} while (read_mems_allowed_retry(cpuset_mems_cookie));
>
> -	nid = next_node_in(current->il_prev, policy->nodes);
>  	if (nid < MAX_NUMNODES)
>  		current->il_prev = nid;
>  	return nid;
> @@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  		struct page **page_array)
>  {
>  	struct task_struct *me = current;
> +	unsigned int cpuset_mems_cookie;
>  	unsigned long total_allocated = 0;
>  	unsigned long nr_allocated = 0;
>  	unsigned long rounds;
> @@ -2391,7 +2404,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>  	if (!nr_pages)
>  		return 0;
>
> -	nnodes = read_once_policy_nodemask(pol, &nodes);
> +	/* read the nodes onto the stack, retry if done during rebind */
> +	do {
> +		cpuset_mems_cookie = read_mems_allowed_begin();
> +		nnodes = read_once_policy_nodemask(pol, &nodes);
> +	} while (read_mems_allowed_retry(cpuset_mems_cookie));
> +
> +	/* if the nodemask has become invalid, we cannot do anything */
>  	if (!nnodes)
>  		return 0;
On Thu, Feb 01, 2024 at 11:02:47AM +0800, Huang, Ying wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index d8cc3a577986..4e5a640d10b8 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> >
> >  static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> >  {
> > -	unsigned int node = current->il_prev;
> > -
> > -	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> > -		node = next_node_in(node, policy->nodes);
> > -		/* can only happen if nodemask is being rebound */
> > +	unsigned int node;
>
> IIUC, "node" may be used without initialization.
>

ok i should slow down lol.  This should take care of it.

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d8cc3a577986..ed0d5d2d456a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 
 static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
-	unsigned int node = current->il_prev;
-
-	if (!current->il_weight || !node_isset(node, policy->nodes)) {
+	unsigned int node;
+	unsigned int cpuset_mems_cookie;
+
+retry:
+	/* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+	cpuset_mems_cookie = read_mems_allowed_begin();
+	node = current->il_prev;
+	if (!current->il_weight || !node_isset(node, policy->nodes)) {
 		node = next_node_in(node, policy->nodes);
-		/* can only happen if nodemask is being rebound */
+		if (read_mems_allowed_retry(cpuset_mems_cookie))
+			goto retry;
 		if (node == MAX_NUMNODES)
 			return node;
 		current->il_prev = node;
@@ -1896,8 +1902,14 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int nid;
+	unsigned int cpuset_mems_cookie;
+
+	/* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+	do {
+		cpuset_mems_cookie = read_mems_allowed_begin();
+		nid = next_node_in(current->il_prev, policy->nodes);
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
 
-	nid = next_node_in(current->il_prev, policy->nodes);
 	if (nid < MAX_NUMNODES)
 		current->il_prev = nid;
 	return nid;
@@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 		struct page **page_array)
 {
 	struct task_struct *me = current;
+	unsigned int cpuset_mems_cookie;
 	unsigned long total_allocated = 0;
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
@@ -2391,7 +2404,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	if (!nr_pages)
 		return 0;
 
-	nnodes = read_once_policy_nodemask(pol, &nodes);
+	/* read the nodes onto the stack, retry if done during rebind */
+	do {
+		cpuset_mems_cookie = read_mems_allowed_begin();
+		nnodes = read_once_policy_nodemask(pol, &nodes);
+	} while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+	/* if the nodemask has become invalid, we cannot do anything */
 	if (!nnodes)
 		return 0;
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst index eca38fa81e0f..a70f20ce1ffb 100644 --- a/Documentation/admin-guide/mm/numa_memory_policy.rst +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY can fall back to all existing numa nodes. This is effectively MPOL_PREFERRED allowed for a mask rather than a single node. +MPOL_WEIGHTED_INTERLEAVE + This mode operates the same as MPOL_INTERLEAVE, except that + interleaving behavior is executed based on weights set in + /sys/kernel/mm/mempolicy/weighted_interleave/ + + Weighted interleave allocates pages on nodes according to a + weight. For example if nodes [0,1] are weighted [5,2], 5 pages + will be allocated on node0 for every 2 pages allocated on node1. + NUMA memory policy supports the following optional mode flags: MPOL_F_STATIC_NODES diff --git a/include/linux/sched.h b/include/linux/sched.h index ffe8f618ab86..b9ce285d8c9c 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1259,6 +1259,7 @@ struct task_struct { /* Protected by alloc_lock: */ struct mempolicy *mempolicy; short il_prev; + u8 il_weight; short pref_node_fork; #endif #ifdef CONFIG_NUMA_BALANCING diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h index a8963f7ef4c2..1f9bb10d1a47 100644 --- a/include/uapi/linux/mempolicy.h +++ b/include/uapi/linux/mempolicy.h @@ -23,6 +23,7 @@ enum { MPOL_INTERLEAVE, MPOL_LOCAL, MPOL_PREFERRED_MANY, + MPOL_WEIGHTED_INTERLEAVE, MPOL_MAX, /* always last member of enum */ }; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 3bdfaf03b660..7cd92f4ec0d7 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -19,6 +19,13 @@ * for anonymous memory. For process policy an process counter * is used. * + * weighted interleave + * Allocate memory interleaved over a set of nodes based on + * a set of weights (per-node), with normal fallback if it + * fails. Otherwise operates the same as interleave. + * Example: nodeset(0,1) & weights (2,1) - 2 pages allocated + * on node 0 for every 1 page allocated on node 1. + * * bind Only allocate memory on a specific set of nodes, * no fallback. 
 *                FIXME: memory is allocated starting with the first node
@@ -441,6 +448,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_preferred,
 	},
+	[MPOL_WEIGHTED_INTERLEAVE] = {
+		.create = mpol_new_nodemask,
+		.rebind = mpol_rebind_nodemask,
+	},
 };
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
@@ -862,8 +873,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && (new->mode == MPOL_INTERLEAVE ||
+		    new->mode == MPOL_WEIGHTED_INTERLEAVE)) {
 		current->il_prev = MAX_NUMNODES-1;
+		current->il_weight = 0;
+	}
 	task_unlock(current);
 	mpol_put(old);
 	ret = 0;
@@ -888,6 +902,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*nodes = pol->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -972,6 +987,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		} else if (pol == current->mempolicy &&
 				pol->mode == MPOL_INTERLEAVE) {
 			*policy = next_node_in(current->il_prev, pol->nodes);
+		} else if (pol == current->mempolicy &&
+				pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
+			if (current->il_weight)
+				*policy = current->il_prev;
+			else
+				*policy = next_node_in(current->il_prev,
+						       pol->nodes);
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -1336,7 +1358,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 		 * VMAs, the nodes will still be interleaved from the targeted
 		 * nodemask, but one by one may be selected differently.
 		 */
-		if (new->mode == MPOL_INTERLEAVE) {
+		if (new->mode == MPOL_INTERLEAVE ||
+		    new->mode == MPOL_WEIGHTED_INTERLEAVE) {
 			struct page *page;
 			unsigned int order;
 			unsigned long addr = -EFAULT;
@@ -1784,7 +1807,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * @vma: virtual memory area whose policy is sought
  * @addr: address in @vma for shared policy lookup
  * @order: 0, or appropriate huge_page_order for interleaving
- * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
+ * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or
+ *       MPOL_WEIGHTED_INTERLEAVE
  *
  * Returns effective policy for a VMA at specified address.
  * Falls back to current->mempolicy or system default policy, as necessary.
@@ -1801,7 +1825,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 	pol = __get_vma_policy(vma, addr, ilx);
 	if (!pol)
 		pol = get_task_policy(current);
-	if (pol->mode == MPOL_INTERLEAVE) {
+	if (pol->mode == MPOL_INTERLEAVE ||
+	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
 		*ilx += vma->vm_pgoff >> order;
 		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
 	}
@@ -1851,6 +1876,22 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }
 
+static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
+{
+	unsigned int node = current->il_prev;
+
+	if (!current->il_weight || !node_isset(node, policy->nodes)) {
+		node = next_node_in(node, policy->nodes);
+		/* can only happen if nodemask is being rebound */
+		if (node == MAX_NUMNODES)
+			return node;
+		current->il_prev = node;
+		current->il_weight = get_il_weight(node);
+	}
+	current->il_weight--;
+	return node;
+}
+
 /* Do dynamic interleaving for a process */
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
@@ -1885,6 +1926,9 @@ unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		return weighted_interleave_nodes(policy);
+
 	case MPOL_BIND:
 	case MPOL_PREFERRED_MANY:
 	{
@@ -1923,6 +1967,45 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
 	return nodes_weight(*mask);
 }
 
+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
+{
+	nodemask_t nodemask;
+	unsigned int target, nr_nodes;
+	u8 __rcu *table;
+	unsigned int weight_total = 0;
+	u8 weight;
+	int nid;
+
+	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
+	if (!nr_nodes)
+		return numa_node_id();
+
+	rcu_read_lock();
+	table = rcu_dereference(iw_table);
+	/* calculate the total weight */
+	for_each_node_mask(nid, nodemask) {
+		/* detect system default usage */
+		weight = table ? table[nid] : 1;
+		weight = weight ? weight : 1;
+		weight_total += weight;
+	}
+
+	/* Calculate the node offset based on totals */
+	target = ilx % weight_total;
+	nid = first_node(nodemask);
+	while (target) {
+		/* detect system default usage */
+		weight = table ? table[nid] : 1;
+		weight = weight ? weight : 1;
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+	rcu_read_unlock();
+	return nid;
+}
+
 /*
  * Do static interleaving for interleave index @ilx.  Returns the ilx'th
  * node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1983,6 +2066,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
 			interleave_nodes(pol) : interleave_nid(pol, ilx);
 		break;
+	case MPOL_WEIGHTED_INTERLEAVE:
+		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
+			weighted_interleave_nodes(pol) :
+			weighted_interleave_nid(pol, ilx);
+		break;
 	}
 
 	return nodemask;
@@ -2044,6 +2132,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*mask = mempolicy->nodes;
 		break;
 
@@ -2144,6 +2233,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 	 * node in its nodemask, we allocate the standard way.
 	 */
 	if (pol->mode != MPOL_INTERLEAVE &&
+	    pol->mode != MPOL_WEIGHTED_INTERLEAVE &&
 	    (!nodemask || node_isset(nid, *nodemask))) {
 		/*
 		 * First, try to allocate THP only on local node, but
@@ -2279,6 +2369,127 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 	return total_allocated;
 }
 
+static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
+		struct mempolicy *pol, unsigned long nr_pages,
+		struct page **page_array)
+{
+	struct task_struct *me = current;
+	unsigned long total_allocated = 0;
+	unsigned long nr_allocated = 0;
+	unsigned long rounds;
+	unsigned long node_pages, delta;
+	u8 __rcu *table, *weights, weight;
+	unsigned int weight_total = 0;
+	unsigned long rem_pages = nr_pages;
+	nodemask_t nodes;
+	int nnodes, node, next_node;
+	int resume_node = MAX_NUMNODES - 1;
+	u8 resume_weight = 0;
+	int prev_node;
+	int i;
+
+	if (!nr_pages)
+		return 0;
+
+	nnodes = read_once_policy_nodemask(pol, &nodes);
+	if (!nnodes)
+		return 0;
+
+	/* Continue allocating from most recent node and adjust the nr_pages */
+	node = me->il_prev;
+	weight = me->il_weight;
+	if (weight && node_isset(node, nodes)) {
+		node_pages = min(rem_pages, weight);
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		/* if that's all the pages, no need to interleave */
+		if (rem_pages < weight) {
+			/* stay on current node, adjust il_weight */
+			me->il_weight -= rem_pages;
+			return total_allocated;
+		} else if (rem_pages == weight) {
+			/* move to next node / weight */
+			me->il_prev = next_node_in(node, nodes);
+			me->il_weight = get_il_weight(next_node);
+			return total_allocated;
+		}
+		/* Otherwise we adjust remaining pages, continue from there */
+		rem_pages -= weight;
+	}
+	/* clear active weight in case of an allocation failure */
+	me->il_weight = 0;
+	prev_node = node;
+
+	/* create a local copy of node weights to operate on outside rcu */
+	weights = kzalloc(nr_node_ids, GFP_KERNEL);
+	if (!weights)
+		return total_allocated;
+
+	rcu_read_lock();
+	table = rcu_dereference(iw_table);
+	if (table)
+		memcpy(weights, table, nr_node_ids);
+	rcu_read_unlock();
+
+	/* calculate total, detect system default usage */
+	for_each_node_mask(node, nodes) {
+		if (!weights[node])
+			weights[node] = 1;
+		weight_total += weights[node];
+	}
+
+	/*
+	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
+	 * Track which node weighted interleave should resume from.
+	 *
+	 * if (rounds > 0) and (delta == 0), resume_node will always be
+	 * the node following prev_node and its weight.
+	 */
+	rounds = rem_pages / weight_total;
+	delta = rem_pages % weight_total;
+	resume_node = next_node_in(prev_node, nodes);
+	resume_weight = weights[resume_node];
+	for (i = 0; i < nnodes; i++) {
+		node = next_node_in(prev_node, nodes);
+		weight = weights[node];
+		node_pages = weight * rounds;
+		/* If a delta exists, add this node's portion of the delta */
+		if (delta > weight) {
+			node_pages += weight;
+			delta -= weight;
+		} else if (delta) {
+			node_pages += delta;
+			/* delta may deplete on a boundary or w/ a remainder */
+			if (delta == weight) {
+				/* boundary: resume from next node/weight */
+				resume_node = next_node_in(node, nodes);
+				resume_weight = weights[resume_node];
+			} else {
+				/* remainder: resume this node w/ remainder */
+				resume_node = node;
+				resume_weight = weight - delta;
+			}
+			delta = 0;
+		}
+		/* node_pages can be 0 if an allocation fails and rounds == 0 */
+		if (!node_pages)
+			break;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		if (total_allocated == nr_pages)
+			break;
+		prev_node = node;
+	}
+	me->il_prev = resume_node;
+	me->il_weight = resume_weight;
+	kfree(weights);
+	return total_allocated;
+}
+
 static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
@@ -2319,6 +2530,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
 		return alloc_pages_bulk_array_interleave(gfp, pol, nr_pages,
							 page_array);
 
+	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
+		return alloc_pages_bulk_array_weighted_interleave(
+				gfp, pol, nr_pages, page_array);
+
 	if (pol->mode == MPOL_PREFERRED_MANY)
 		return alloc_pages_bulk_array_preferred_many(gfp,
 				numa_node_id(), pol, nr_pages, page_array);
@@ -2394,6 +2609,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2530,6 +2746,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
 		polnid = interleave_nid(pol, ilx);
 		break;
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		polnid = weighted_interleave_nid(pol, ilx);
+		break;
+
 	case MPOL_PREFERRED:
 		if (node_isset(curnid, pol->nodes))
 			goto out;
@@ -2904,6 +3124,7 @@ static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
+	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
 	[MPOL_LOCAL]      = "local",
 	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
@@ -2963,6 +3184,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		}
 		break;
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		/*
 		 * Default to online nodes with memory if no nodelist
 		 */
@@ -3073,6 +3295,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		nodes = pol->nodes;
 		break;
 	default:
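[Editor's note: the rounds/delta bookkeeping in alloc_pages_bulk_array_weighted_interleave() is the subtlest part of the patch, so here is a small userspace model of just that arithmetic. The two-node weight set and page count are made up; it prints the per-node page counts and the resume point the kernel loop would record in il_prev/il_weight.]

/* Model of the bulk allocator's full-round + partial-round split:
 * rounds = rem_pages / weight_total gives every node weight*rounds
 * pages; delta = rem_pages % weight_total is then handed out in node
 * order, and the node/weight where delta runs out becomes the resume
 * point for the task's next allocation.
 */
#include <stdio.h>

int main(void)
{
	unsigned char weights[] = { 3, 1 };	/* illustrative weights */
	int nnodes = 2;
	unsigned long rem_pages = 9;
	unsigned long weight_total = 0, rounds, delta, node_pages;
	int node, resume_node;
	unsigned char resume_weight;

	for (node = 0; node < nnodes; node++)
		weight_total += weights[node];

	rounds = rem_pages / weight_total;	/* 9 / 4 = 2 full rounds */
	delta = rem_pages % weight_total;	/* 9 % 4 = 1 page left over */
	resume_node = 0;		/* node after prev_node (= last node) */
	resume_weight = weights[resume_node];

	for (node = 0; node < nnodes; node++) {
		unsigned char weight = weights[node];

		node_pages = weight * rounds;
		if (delta > weight) {
			node_pages += weight;
			delta -= weight;
		} else if (delta) {
			node_pages += delta;
			if (delta == weight) {
				/* boundary: resume from the next node */
				resume_node = (node + 1) % nnodes;
				resume_weight = weights[resume_node];
			} else {
				/* remainder: resume this node, less delta */
				resume_node = node;
				resume_weight = weight - delta;
			}
			delta = 0;
		}
		printf("node%d: %lu pages\n", node, node_pages);
	}
	/* expect node0: 7, node1: 2; resume at node0 with weight 2 */
	printf("resume: node%d, weight %u\n", resume_node, resume_weight);
	return 0;
}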
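[Editor's note: for completeness, a hedged sketch of how the new mode would be exercised from userspace once this series lands. MPOL_WEIGHTED_INTERLEAVE's value of 6 follows the uapi enum above; the per-node sysfs file name nodeN and the 2:1 weights are assumptions for illustration, and the sysfs writes need root.]

/* Sketch: set node weights via the new sysfs files, then request
 * weighted interleave over nodes 0-1 with set_mempolicy(2).
 * Assumes a kernel with this series applied.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>

#define MPOL_WEIGHTED_INTERLEAVE 6	/* from the uapi enum above */

static void set_weight(int node, int weight)
{
	char path[128];
	FILE *f;

	/* e.g. /sys/kernel/mm/mempolicy/weighted_interleave/node0 */
	snprintf(path, sizeof(path),
		 "/sys/kernel/mm/mempolicy/weighted_interleave/node%d", node);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		exit(1);
	}
	fprintf(f, "%d", weight);
	fclose(f);
}

int main(void)
{
	unsigned long nodemask = 0x3;	/* nodes 0 and 1 */

	set_weight(0, 2);	/* 2 pages on node 0 ... */
	set_weight(1, 1);	/* ... per 1 page on node 1 */

	if (syscall(SYS_set_mempolicy, MPOL_WEIGHTED_INTERLEAVE,
		    &nodemask, 8 * sizeof(nodemask)) != 0) {
		perror("set_mempolicy");
		return 1;
	}
	/* subsequent anonymous-memory faults now interleave 2:1 */
	return 0;
}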