[v4,3/3] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

Message ID 20240130182046.74278-4-gregory.price@memverge.com (mailing list archive)
State New
Series mm/mempolicy: weighted interleave mempolicy and sysfs extension

Commit Message

Gregory Price Jan. 30, 2024, 6:20 p.m. UTC
When a system has multiple NUMA nodes and it becomes bandwidth hungry,
using the current MPOL_INTERLEAVE could be a wise option.

However, if those NUMA nodes consist of different types of memory such
as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
based interleave policy does not optimally distribute data to make use
of their different bandwidth characteristics.

Instead, interleave is more effective when the allocation policy follows
each NUMA node's bandwidth weight rather than a simple 1:1 distribution.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
enabling weighted interleave between NUMA nodes.  Weighted interleave
allows for proportional distribution of memory across multiple NUMA
nodes, preferably apportioned to match the bandwidth of each node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
weight distribution is (2:1).
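
The 2:1 ratio in this example is what you get by dividing each node's
bandwidth by the greatest common divisor of all of them. A minimal
user-space sketch of that arithmetic, assuming bandwidths are given as
whole GB/s; reduce_bandwidth_to_weights() and gcd_u() are hypothetical
helpers for illustration and are not part of this patch:

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor (Euclid's algorithm). gcd_u(0, x) == x. */
static unsigned int gcd_u(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Hypothetical helper (not in the patch): reduce per-node bandwidths
 * to the smallest integer weights that keep the same ratio.
 */
static void reduce_bandwidth_to_weights(const unsigned int *bw,
					unsigned int *weights, size_t n)
{
	unsigned int g = 0;
	size_t i;

	for (i = 0; i < n; i++)
		g = gcd_u(g, bw[i]);
	assert(g != 0);	/* at least one node must report bandwidth */

	for (i = 0; i < n; i++)
		weights[i] = bw[i] / g;
}
```

For the bandwidths (100, 50) this yields the weights (2, 1). The patch
stores weights as u8, so a real tool would also need to clamp or rescale
results above 255; that step is left out here.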

Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/

For now, the default value of all nodes will be `1`, which matches
the behavior of standard 1:1 round-robin interleave. An extension
will be added in the future to allow default values to be registered
at kernel and device bringup time.

The policy allocates a number of pages equal to the set weights. For
example, if the weights are (2,1), then 2 pages will be allocated on
node0 for every 1 page allocated on node1.
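
The page-by-page behavior described here can be modeled in user space.
This is a simplified sketch, not the kernel code: struct il_state stands
in for current->il_prev and current->il_weight, the weights array is
illustrative, every node is assumed to be in the nodemask, and
next_weighted_node() is a hypothetical name:

```c
#include <assert.h>

#define NR_NODES 2

/* Illustrative weights only: node0 = 2, node1 = 1. */
static const unsigned char weights[NR_NODES] = { 2, 1 };

/* Stand-ins for current->il_prev and current->il_weight. */
struct il_state {
	int prev;
	unsigned char weight;
};

/* Pick the node for the next page, one allocation at a time. */
static int next_weighted_node(struct il_state *st)
{
	if (!st->weight) {
		/* weight exhausted: advance to the next node, reload weight */
		st->prev = (st->prev + 1) % NR_NODES;
		st->weight = weights[st->prev];
	}
	st->weight--;
	return st->prev;
}
```

Starting from { .prev = NR_NODES - 1, .weight = 0 }, which mirrors the
reset done in do_set_mempolicy(), successive calls return
0, 0, 1, 0, 0, 1, and so on.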

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).

Some high level notes about the pieces of weighted interleave:

current->il_prev:
    Default interleave uses this to track the last used node.
    Weighted interleave uses this to track the *current* node, and
    when weight reaches 0 it will be used to acquire the next node.

current->il_weight:
    The active weight of the current node (current->il_prev).
    When this reaches 0, current->il_prev is set to the next node
    and current->il_weight is set to the next weight.

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node.  When the weight reaches 0, switch
    to the next node.  Operates only on task->mempolicy.

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Operates on VMA policies.

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round").  Calculates the number of
    pages for each node and allocates them.

    If a node was already selected via weighted_interleave_nodes, its
    remaining weight is allocated first.

    Operates only on the task->mempolicy.
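
The index-based lookup described for weighted_interleave_nid can be
sketched in user space as follows. This is a simplified model under
stated assumptions: nodes are the dense range 0..nr_nodes-1 rather than
a nodemask, every weight is at least 1, and weighted_index_to_node() is
a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Map an interleave index to a node: reduce the index modulo the
 * total weight, then walk the nodes, subtracting each node's weight
 * until the remainder falls inside the current node's share.
 */
static size_t weighted_index_to_node(const unsigned char *weights,
				     size_t nr_nodes, unsigned long ilx)
{
	unsigned long total = 0, target;
	size_t nid;

	for (nid = 0; nid < nr_nodes; nid++)
		total += weights[nid];
	assert(total > 0);	/* needs at least one node with weight >= 1 */

	target = ilx % total;
	for (nid = 0; nid < nr_nodes; nid++) {
		if (target < weights[nid])
			break;
		target -= weights[nid];
	}
	return nid;
}
```

With weights (5, 2), indexes 0 through 4 map to node0, indexes 5 and 6
map to node1, and index 7 wraps back to node0. Because the result
depends only on the index, the same offset in a VMA always maps to the
same node, which is why this path suits VMA policies.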

One piece of complexity is the interaction with a recent refactor that
split acquiring the "ilx" (interleave index) of an allocation from
actually applying the interleave. If alloc_pages_mpol() were called
with a weighted-interleave policy and ilx set to NO_INTERLEAVE_INDEX,
weighted_interleave_nodes() would operate on a VMA policy, violating
the description above.

An inspection of all callers of alloc_pages_mpol() shows that all
external callers set ilx to `0`, an index value, or will call
get_vma_policy() to acquire the ilx.

For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks
all set (pgoff_t ilx) or end up in `get_vma_policy()`.  This enforces
the `weighted_interleave_nodes()` and `weighted_interleave_nid()`
policy requirements (task/vma respectively).

Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |   9 +
 include/linux/sched.h                         |   1 +
 include/uapi/linux/mempolicy.h                |   1 +
 mm/mempolicy.c                                | 231 +++++++++++++++++-
 4 files changed, 238 insertions(+), 4 deletions(-)

Comments

Gregory Price Jan. 31, 2024, 5:12 a.m. UTC | #1
On Tue, Jan 30, 2024 at 01:20:46PM -0500, Gregory Price wrote:
> +	/* Continue allocating from most recent node and adjust the nr_pages */
> +	node = me->il_prev;
> +	weight = me->il_weight;
> +	if (weight && node_isset(node, nodes)) {
> +		node_pages = min(rem_pages, weight);
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		/* if that's all the pages, no need to interleave */
> +		if (rem_pages < weight) {
> +			/* stay on current node, adjust il_weight */
> +			me->il_weight -= rem_pages;
> +			return total_allocated;
> +		} else if (rem_pages == weight) {
> +			/* move to next node / weight */
> +			me->il_prev = next_node_in(node, nodes);
> +			me->il_weight = get_il_weight(next_node);

Sigh, I managed to miss a small update that killed next_node in favor of
operating directly on il_prev. Can you squash this fix into the patch?
Otherwise I can submit a separate patch.

~Gregory


diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7cd92f4ec0d7..2c1aef8eab70 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2382,7 +2382,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
        unsigned int weight_total = 0;
        unsigned long rem_pages = nr_pages;
        nodemask_t nodes;
-       int nnodes, node, next_node;
+       int nnodes, node;
        int resume_node = MAX_NUMNODES - 1;
        u8 resume_weight = 0;
        int prev_node;
@@ -2412,7 +2412,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                } else if (rem_pages == weight) {
                        /* move to next node / weight */
                        me->il_prev = next_node_in(node, nodes);
-                       me->il_weight = get_il_weight(next_node);
+                       me->il_weight = get_il_weight(me->il_prev);
                        return total_allocated;
                }
                /* Otherwise we adjust remaining pages, continue from there */

Huang, Ying Jan. 31, 2024, 6:43 a.m. UTC | #2
Gregory Price <gourry.memverge@gmail.com> writes:

> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> using the current MPOL_INTERLEAVE could be an wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
> based interleave policy does not optimally distribute data to make use
> of their different bandwidth characteristics.
>
> Instead, interleave is more effective when the allocation policy follows
> each NUMA nodes' bandwidth weight rather than a simple 1:1 distribution.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
> enabling weighted interleave between NUMA nodes.  Weighted interleave
> allows for proportional distribution of memory across multiple numa
> nodes, preferably apportioned to match the bandwidth of each node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
> weight distribution is (2:1).
>
> Weights for each node can be assigned via the new sysfs extension:
> /sys/kernel/mm/mempolicy/weighted_interleave/
>
> For now, the default value of all nodes will be `1`, which matches
> the behavior of standard 1:1 round-robin interleave. An extension
> will be added in the future to allow default values to be registered
> at kernel and device bringup time.
>
> The policy allocates a number of pages equal to the set weights. For
> example, if the weights are (2,1), then 2 pages will be allocated on
> node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
>
> Some high level notes about the pieces of weighted interleave:
>
> current->il_prev:
>     Default interleave uses this to track the last used node.
>     Weighted interleave uses this to track the *current* node, and
>     when weight reaches 0 it will be used to acquire the next node.
>
> current->il_weight:
>     The active weight of the current node (current->il_prev)
>     When this reaches 0, current->il_prev is set to the next node
>     and current->il_weight is set to the next weight.

I still think that my description of the 2 fields above is easier to
understand.  For weighted interleave,

current->il_prev is the node from which the previous allocation took
its page.

current->il_weight is the weight remaining for current->il_prev after
the previous allocation.

But I will not force you to use this.  Use it only if you think that
it is better.

> weighted_interleave_nodes:
>     Counts the number of allocations as they occur, and applies the
>     weight for the current node.  When the weight reaches 0, switch
>     to the next node.  Operates only on task->mempolicy.
>
> weighted_interleave_nid:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the node based on the given index.
>     Operates on VMA policies.
>
> bulk_array_weighted_interleave:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the number of "interleave rounds" as
>     well as any delta ("partial round").  Calculates the number of
>     pages for each node and allocates them.
>
>     If a node was scheduled for interleave via interleave_nodes, the
>     current weight will be allocated first.
>
>     Operates only on the task->mempolicy.
>
> One piece of complexity is the interaction between a recent refactor
> which split the logic to acquire the "ilx" (interleave index) of an
> allocation and the actual application of the interleave. If a call
> to alloc_pages_mpol() were made with a weighted-interleave policy and
> ilx set to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would
> operate on a VMA policy - violating the description above.
>
> An inspection of all callers of alloc_pages_mpol() shows that all
> external callers set ilx to `0`, an index value, or will call
> get_vma_policy() to acquire the ilx.
>
> For example, mm/shmem.c may call into alloc_pages_mpol. The call stacks
> all set (pgoff_t ilx) or end up in `get_vma_policy()`.  This enforces
> the `weighted_interleave_nodes()` and `weighted_interleave_nid()`
> policy requirements (task/vma respectively).
>
> Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> Co-developed-by: Rakie Kim <rakie.kim@sk.com>
> Signed-off-by: Rakie Kim <rakie.kim@sk.com>
> Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
> Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
> Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
> Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
> ---
>  .../admin-guide/mm/numa_memory_policy.rst     |   9 +
>  include/linux/sched.h                         |   1 +
>  include/uapi/linux/mempolicy.h                |   1 +
>  mm/mempolicy.c                                | 231 +++++++++++++++++-
>  4 files changed, 238 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..a70f20ce1ffb 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
>  	can fall back to all existing numa nodes. This is effectively
>  	MPOL_PREFERRED allowed for a mask rather than a single node.
>  
> +MPOL_WEIGHTED_INTERLEAVE
> +	This mode operates the same as MPOL_INTERLEAVE, except that
> +	interleaving behavior is executed based on weights set in
> +	/sys/kernel/mm/mempolicy/weighted_interleave/
> +
> +	Weighted interleave allocates pages on nodes according to a
> +	weight.  For example if nodes [0,1] are weighted [5,2], 5 pages
> +	will be allocated on node0 for every 2 pages allocated on node1.
> +
>  NUMA memory policy supports the following optional mode flags:
>  
>  MPOL_F_STATIC_NODES
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ffe8f618ab86..b9ce285d8c9c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1259,6 +1259,7 @@ struct task_struct {
>  	/* Protected by alloc_lock: */
>  	struct mempolicy		*mempolicy;
>  	short				il_prev;
> +	u8				il_weight;
>  	short				pref_node_fork;
>  #endif
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..1f9bb10d1a47 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,6 +23,7 @@ enum {
>  	MPOL_INTERLEAVE,
>  	MPOL_LOCAL,
>  	MPOL_PREFERRED_MANY,
> +	MPOL_WEIGHTED_INTERLEAVE,
>  	MPOL_MAX,	/* always last member of enum */
>  };
>  
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 3bdfaf03b660..7cd92f4ec0d7 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -19,6 +19,13 @@
>   *                for anonymous memory. For process policy an process counter
>   *                is used.
>   *
> + * weighted interleave
> + *                Allocate memory interleaved over a set of nodes based on
> + *                a set of weights (per-node), with normal fallback if it
> + *                fails.  Otherwise operates the same as interleave.
> + *                Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
> + *                on node 0 for every 1 page allocated on node 1.
> + *
>   * bind           Only allocate memory on a specific set of nodes,
>   *                no fallback.
>   *                FIXME: memory is allocated starting with the first node
> @@ -441,6 +448,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>  		.create = mpol_new_nodemask,
>  		.rebind = mpol_rebind_preferred,
>  	},
> +	[MPOL_WEIGHTED_INTERLEAVE] = {
> +		.create = mpol_new_nodemask,
> +		.rebind = mpol_rebind_nodemask,
> +	},
>  };
>  
>  static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> @@ -862,8 +873,11 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>  
>  	old = current->mempolicy;
>  	current->mempolicy = new;
> -	if (new && new->mode == MPOL_INTERLEAVE)
> +	if (new && (new->mode == MPOL_INTERLEAVE ||
> +		    new->mode == MPOL_WEIGHTED_INTERLEAVE)) {
>  		current->il_prev = MAX_NUMNODES-1;
> +		current->il_weight = 0;
> +	}
>  	task_unlock(current);
>  	mpol_put(old);
>  	ret = 0;
> @@ -888,6 +902,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
>  	case MPOL_INTERLEAVE:
>  	case MPOL_PREFERRED:
>  	case MPOL_PREFERRED_MANY:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		*nodes = pol->nodes;
>  		break;
>  	case MPOL_LOCAL:
> @@ -972,6 +987,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  		} else if (pol == current->mempolicy &&
>  				pol->mode == MPOL_INTERLEAVE) {
>  			*policy = next_node_in(current->il_prev, pol->nodes);
> +		} else if (pol == current->mempolicy &&
> +				pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
> +			if (current->il_weight)
> +				*policy = current->il_prev;
> +			else
> +				*policy = next_node_in(current->il_prev,
> +						       pol->nodes);
>  		} else {
>  			err = -EINVAL;
>  			goto out;
> @@ -1336,7 +1358,8 @@ static long do_mbind(unsigned long start, unsigned long len,
>  		 * VMAs, the nodes will still be interleaved from the targeted
>  		 * nodemask, but one by one may be selected differently.
>  		 */
> -		if (new->mode == MPOL_INTERLEAVE) {
> +		if (new->mode == MPOL_INTERLEAVE ||
> +		    new->mode == MPOL_WEIGHTED_INTERLEAVE) {
>  			struct page *page;
>  			unsigned int order;
>  			unsigned long addr = -EFAULT;
> @@ -1784,7 +1807,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
>   * @vma: virtual memory area whose policy is sought
>   * @addr: address in @vma for shared policy lookup
>   * @order: 0, or appropriate huge_page_order for interleaving
> - * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
> + * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or
> + *       MPOL_WEIGHTED_INTERLEAVE
>   *
>   * Returns effective policy for a VMA at specified address.
>   * Falls back to current->mempolicy or system default policy, as necessary.
> @@ -1801,7 +1825,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  	pol = __get_vma_policy(vma, addr, ilx);
>  	if (!pol)
>  		pol = get_task_policy(current);
> -	if (pol->mode == MPOL_INTERLEAVE) {
> +	if (pol->mode == MPOL_INTERLEAVE ||
> +	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
>  		*ilx += vma->vm_pgoff >> order;
>  		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
>  	}
> @@ -1851,6 +1876,22 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>  	return zone >= dynamic_policy_zone;
>  }
>  
> +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> +{
> +	unsigned int node = current->il_prev;
> +
> +	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> +		node = next_node_in(node, policy->nodes);
> +		/* can only happen if nodemask is being rebound */
> +		if (node == MAX_NUMNODES)
> +			return node;

It seems a little unsafe to read policy->nodes while a rebind is
writing it.  Would it be better to use a seqlock to guarantee its
consistency?  This need not be part of this series, though.

> +		current->il_prev = node;
> +		current->il_weight = get_il_weight(node);
> +	}
> +	current->il_weight--;
> +	return node;
> +}
> +
>  /* Do dynamic interleaving for a process */
>  static unsigned int interleave_nodes(struct mempolicy *policy)
>  {
> @@ -1885,6 +1926,9 @@ unsigned int mempolicy_slab_node(void)
>  	case MPOL_INTERLEAVE:
>  		return interleave_nodes(policy);
>  
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		return weighted_interleave_nodes(policy);
> +
>  	case MPOL_BIND:
>  	case MPOL_PREFERRED_MANY:
>  	{
> @@ -1923,6 +1967,45 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
>  	return nodes_weight(*mask);
>  }
>  
> +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> +{
> +	nodemask_t nodemask;
> +	unsigned int target, nr_nodes;
> +	u8 __rcu *table;
> +	unsigned int weight_total = 0;
> +	u8 weight;
> +	int nid;
> +
> +	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
> +	if (!nr_nodes)
> +		return numa_node_id();
> +
> +	rcu_read_lock();
> +	table = rcu_dereference(iw_table);
> +	/* calculate the total weight */
> +	for_each_node_mask(nid, nodemask) {
> +		/* detect system default usage */
> +		weight = table ? table[nid] : 1;
> +		weight = weight ? weight : 1;
> +		weight_total += weight;
> +	}
> +
> +	/* Calculate the node offset based on totals */
> +	target = ilx % weight_total;
> +	nid = first_node(nodemask);
> +	while (target) {
> +		/* detect system default usage */
> +		weight = table ? table[nid] : 1;
> +		weight = weight ? weight : 1;

I found the pattern above duplicated in this patch.  Can we define a
function like the one below to remove the duplication?

u8 __get_il_weight(u8 *table, int nid)
{
        u8 weight;

        weight = table ? table[nid] : 1;
        return weight ? : 1;
}

This can be used in alloc_pages_bulk_array_weighted_interleave() to copy
from the global to the local weights array too.

But this isn't a big deal.  I will leave it to you to decide.

> +		if (target < weight)
> +			break;
> +		target -= weight;
> +		nid = next_node_in(nid, nodemask);
> +	}
> +	rcu_read_unlock();
> +	return nid;
> +}
> +
>  /*
>   * Do static interleaving for interleave index @ilx.  Returns the ilx'th
>   * node in pol->nodes (starting from ilx=0), wrapping around if ilx
> @@ -1983,6 +2066,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>  		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
>  			interleave_nodes(pol) : interleave_nid(pol, ilx);
>  		break;
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
> +			weighted_interleave_nodes(pol) :
> +			weighted_interleave_nid(pol, ilx);
> +		break;
>  	}
>  
>  	return nodemask;
> @@ -2044,6 +2132,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  	case MPOL_PREFERRED_MANY:
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		*mask = mempolicy->nodes;
>  		break;
>  
> @@ -2144,6 +2233,7 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  		 * node in its nodemask, we allocate the standard way.
>  		 */
>  		if (pol->mode != MPOL_INTERLEAVE &&
> +		    pol->mode != MPOL_WEIGHTED_INTERLEAVE &&
>  		    (!nodemask || node_isset(nid, *nodemask))) {
>  			/*
>  			 * First, try to allocate THP only on local node, but
> @@ -2279,6 +2369,127 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
>  	return total_allocated;
>  }
>  
> +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> +		struct mempolicy *pol, unsigned long nr_pages,
> +		struct page **page_array)
> +{
> +	struct task_struct *me = current;
> +	unsigned long total_allocated = 0;
> +	unsigned long nr_allocated = 0;
> +	unsigned long rounds;
> +	unsigned long node_pages, delta;
> +	u8 __rcu *table, *weights, weight;
> +	unsigned int weight_total = 0;
> +	unsigned long rem_pages = nr_pages;
> +	nodemask_t nodes;
> +	int nnodes, node, next_node;
> +	int resume_node = MAX_NUMNODES - 1;
> +	u8 resume_weight = 0;
> +	int prev_node;
> +	int i;
> +
> +	if (!nr_pages)
> +		return 0;
> +
> +	nnodes = read_once_policy_nodemask(pol, &nodes);
> +	if (!nnodes)
> +		return 0;
> +
> +	/* Continue allocating from most recent node and adjust the nr_pages */
> +	node = me->il_prev;
> +	weight = me->il_weight;
> +	if (weight && node_isset(node, nodes)) {
> +		node_pages = min(rem_pages, weight);
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		/* if that's all the pages, no need to interleave */
> +		if (rem_pages < weight) {
> +			/* stay on current node, adjust il_weight */
> +			me->il_weight -= rem_pages;
> +			return total_allocated;
> +		} else if (rem_pages == weight) {
> +			/* move to next node / weight */
> +			me->il_prev = next_node_in(node, nodes);
> +			me->il_weight = get_il_weight(next_node);
> +			return total_allocated;
> +		}
> +		/* Otherwise we adjust remaining pages, continue from there */
> +		rem_pages -= weight;
> +	}
> +	/* clear active weight in case of an allocation failure */
> +	me->il_weight = 0;
> +	prev_node = node;
> +
> +	/* create a local copy of node weights to operate on outside rcu */
> +	weights = kzalloc(nr_node_ids, GFP_KERNEL);
> +	if (!weights)
> +		return total_allocated;
> +
> +	rcu_read_lock();
> +	table = rcu_dereference(iw_table);
> +	if (table)
> +		memcpy(weights, table, nr_node_ids);
> +	rcu_read_unlock();
> +
> +	/* calculate total, detect system default usage */
> +	for_each_node_mask(node, nodes) {
> +		if (!weights[node])
> +			weights[node] = 1;
> +		weight_total += weights[node];
> +	}
> +
> +	/*
> +	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
> +	 * Track which node weighted interleave should resume from.
> +	 *
> +	 * if (rounds > 0) and (delta == 0), resume_node will always be
> +	 * the node following prev_node and its weight.
> +	 */
> +	rounds = rem_pages / weight_total;
> +	delta = rem_pages % weight_total;
> +	resume_node = next_node_in(prev_node, nodes);
> +	resume_weight = weights[resume_node];
> +	for (i = 0; i < nnodes; i++) {
> +		node = next_node_in(prev_node, nodes);
> +		weight = weights[node];
> +		node_pages = weight * rounds;
> +		/* If a delta exists, add this node's portion of the delta */
> +		if (delta > weight) {
> +			node_pages += weight;
> +			delta -= weight;
> +		} else if (delta) {
> +			node_pages += delta;
> +			/* delta may deplete on a boundary or w/ a remainder */
> +			if (delta == weight) {
> +				/* boundary: resume from next node/weight */
> +				resume_node = next_node_in(node, nodes);
> +				resume_weight = weights[resume_node];
> +			} else {
> +				/* remainder: resume this node w/ remainder */
> +				resume_node = node;
> +				resume_weight = weight - delta;
> +			}

If we are comfortable leaving resume_weight == 0, then the above
branch can be simplified to:

        resume_node = node;
        resume_weight = weight - delta;

But, this is a style issue again.  I will leave it to you to decide.
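
To see why leaving resume_weight == 0 is safe, here is a user-space
sketch of the rounds/delta split using the simplified branch. It is a
model under stated assumptions, not the kernel function: nodes are the
dense range 0..nr_nodes-1, every weight is at least 1, no allocation
fails, and split_pages() and struct resume_state are hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

struct resume_state {
	size_t node;		/* node to resume on (il_prev) */
	unsigned char weight;	/* weight left on that node (il_weight) */
};

/*
 * Split rem_pages across the nodes, starting at node `first`, and
 * store the per-node page counts in pages[]. Returns where the next
 * allocation should resume.
 */
static struct resume_state split_pages(const unsigned char *weights,
				       size_t nr_nodes, size_t first,
				       unsigned long rem_pages,
				       unsigned long *pages)
{
	unsigned long total = 0, rounds, delta;
	struct resume_state rs;
	size_t i, node = first;

	for (i = 0; i < nr_nodes; i++)
		total += weights[i];
	assert(total > 0);

	rounds = rem_pages / total;	/* full passes over all nodes */
	delta = rem_pages % total;	/* pages in the partial round */

	/* delta == 0: resume on the first node with its full weight */
	rs.node = first;
	rs.weight = weights[first];

	for (i = 0; i < nr_nodes; i++) {
		unsigned char w = weights[node];

		pages[node] = w * rounds;
		if (delta > w) {
			pages[node] += w;
			delta -= w;
		} else if (delta) {
			/* delta runs out here; the weight left may be 0 */
			pages[node] += delta;
			rs.node = node;
			rs.weight = (unsigned char)(w - delta);
			delta = 0;
		}
		node = (node + 1) % nr_nodes;
	}
	return rs;
}
```

With weights (2, 1) and 8 pages starting at node0, node0 gets 6 pages,
node1 gets 2, and the resume state is node0 with weight 0. The next
single-page allocation then sees a zero weight and advances to node1,
which is the same result the explicit boundary branch produced.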

So, except for the issue you already pointed out, the whole series looks
good to me!  Thanks!  Feel free to add

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

to the whole series.

> +			delta = 0;
> +		}
> +		/* node_pages can be 0 if an allocation fails and rounds == 0 */
> +		if (!node_pages)
> +			break;
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		if (total_allocated == nr_pages)
> +			break;
> +		prev_node = node;
> +	}
> +	me->il_prev = resume_node;
> +	me->il_weight = resume_weight;
> +	kfree(weights);
> +	return total_allocated;
> +}
> +
>  static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
>  		struct mempolicy *pol, unsigned long nr_pages,
>  		struct page **page_array)
> @@ -2319,6 +2530,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
>  		return alloc_pages_bulk_array_interleave(gfp, pol,
>  							 nr_pages, page_array);
>  
> +	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
> +		return alloc_pages_bulk_array_weighted_interleave(
> +				  gfp, pol, nr_pages, page_array);
> +
>  	if (pol->mode == MPOL_PREFERRED_MANY)
>  		return alloc_pages_bulk_array_preferred_many(gfp,
>  				numa_node_id(), pol, nr_pages, page_array);
> @@ -2394,6 +2609,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
>  	case MPOL_INTERLEAVE:
>  	case MPOL_PREFERRED:
>  	case MPOL_PREFERRED_MANY:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		return !!nodes_equal(a->nodes, b->nodes);
>  	case MPOL_LOCAL:
>  		return true;
> @@ -2530,6 +2746,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
>  		polnid = interleave_nid(pol, ilx);
>  		break;
>  
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		polnid = weighted_interleave_nid(pol, ilx);
> +		break;
> +
>  	case MPOL_PREFERRED:
>  		if (node_isset(curnid, pol->nodes))
>  			goto out;
> @@ -2904,6 +3124,7 @@ static const char * const policy_modes[] =
>  	[MPOL_PREFERRED]  = "prefer",
>  	[MPOL_BIND]       = "bind",
>  	[MPOL_INTERLEAVE] = "interleave",
> +	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
>  	[MPOL_LOCAL]      = "local",
>  	[MPOL_PREFERRED_MANY]  = "prefer (many)",
>  };
> @@ -2963,6 +3184,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
>  		}
>  		break;
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		/*
>  		 * Default to online nodes with memory if no nodelist
>  		 */
> @@ -3073,6 +3295,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
>  	case MPOL_PREFERRED_MANY:
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		nodes = pol->nodes;
>  		break;
>  	default:

--
Best Regards,
Huang, Ying

Gregory Price Jan. 31, 2024, 7:43 a.m. UTC | #3
On Wed, Jan 31, 2024 at 02:43:12PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> >  
> > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> > +{
> > +	unsigned int node = current->il_prev;
> > +
> > +	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> > +		node = next_node_in(node, policy->nodes);
> > +		/* can only happen if nodemask is being rebound */
> > +		if (node == MAX_NUMNODES)
> > +			return node;
> 
> I feel a little unsafe to read policy->nodes at same time of writing in
> rebound.  Is it better to use a seqlock to guarantee its consistency?
> It's unnecessary to be a part of this series though.
> 

I think this is handled already? It is definitely an explicit race
condition that is documented elsewhere:

/*
 * mpol_rebind_policy - Migrate a policy to a different set of nodes
 *
 * Per-vma policies are protected by mmap_lock. Allocations using per-task
 * policies are protected by task->mems_allowed_seq to prevent a premature
 * OOM/allocation failure due to parallel nodemask modification.
 */

example from slub:

do {
	cpuset_mems_cookie = read_mems_allowed_begin();
	zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
	...
} while (read_mems_allowed_retry(cpuset_mems_cookie));

A quick perusal of other allocators shows similar checks:

page_alloc.c  -  check_retry_cpuset()
filemap.c     -  filemap_alloc_folio()
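
The begin/retry loop quoted from slub follows the usual sequence-counter
pattern. A minimal user-space sketch of that pattern with C11 atomics;
seq, nodemask, rebind_nodemask(), read_begin() and read_retry() are
hypothetical stand-ins, not the kernel's mems_allowed_seq API, and the
default sequentially consistent ordering is used for simplicity:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for task->mems_allowed_seq and a nodemask. */
static atomic_uint seq;
static atomic_ulong nodemask = 0x3;

/* Writer: bump the counter before and after changing the mask. */
static void rebind_nodemask(unsigned long new_mask)
{
	atomic_fetch_add(&seq, 1);	/* odd: update in progress */
	atomic_store(&nodemask, new_mask);
	atomic_fetch_add(&seq, 1);	/* even: update complete */
}

/* Reader: take a cookie once no update is in progress. */
static unsigned int read_begin(void)
{
	unsigned int s;

	while ((s = atomic_load(&seq)) & 1)
		;	/* spin while a writer is mid-update */
	return s;
}

/* Reader: true if the mask may have changed since read_begin(). */
static bool read_retry(unsigned int cookie)
{
	return atomic_load(&seq) != cookie;
}
```

A reader wraps its work in
do { cookie = read_begin(); ... } while (read_retry(cookie)); so a
node picked from a half-updated mask is discarded and the selection is
redone.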

If we ever want mempolicy to be swappable from outside the current task
context, this will have to change most likely - but that's another
feature for another day.

> > +	while (target) {
> > +		/* detect system default usage */
> > +		weight = table ? table[nid] : 1;
> > +		weight = weight ? weight : 1;
> 
> I found duplicated pattern as above in this patch.  Can we define a
> function like below to remove the duplication?
> 
> u8 __get_il_weight(u8 *table, int nid)
> {
>         u8 weight;
> 
>         weight = table ? table[nid] : 1;
>         return weight ? : 1;
> }
> 

When we implement the system-default array, this will change to:

weight = sysfs_table ? sysfs_table[nid] : default_table[nid];

This cleanup will get picked up in that patch set since this code is
going to change anyway.

> > +			if (delta == weight) {
> > +				/* boundary: resume from next node/weight */
> > +				resume_node = next_node_in(node, nodes);
> > +				resume_weight = weights[resume_node];
> > +			} else {
> > +				/* remainder: resume this node w/ remainder */
> > +				resume_node = node;
> > +				resume_weight = weight - delta;
> > +			}
> 
> If we are comfortable to leave resume_weight == 0, then the above
> branch can be simplified to.
> 
>         resume_node = node;
>         resume_weight = weight - delta;
> 
> But, this is a style issue again.  I will leave it to you to decide.

Good point, and in fact there's a similar branch in the first half of
the function that can be simplified.  Will follow up with a style patch.

 mm/mempolicy.c | 21 ++++-----------------
 1 file changed, 4 insertions(+), 17 deletions(-)

My favorite style of patch :D


Andrew, if you happen to be monitoring, this is the patch (not tested
yet, but it's pretty obvious; otherwise I'll submit it individually
tomorrow).


diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2c1aef8eab70..b0ca9bcdd64c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                page_array += nr_allocated;
                total_allocated += nr_allocated;
                /* if that's all the pages, no need to interleave */
-               if (rem_pages < weight) {
-                       /* stay on current node, adjust il_weight */
+               if (rem_pages <= weight) {
                        me->il_weight -= rem_pages;
                        return total_allocated;
-               } else if (rem_pages == weight) {
-                       /* move to next node / weight */
-                       me->il_prev = next_node_in(node, nodes);
-                       me->il_weight = get_il_weight(me->il_prev);
-                       return total_allocated;
                }
                /* Otherwise we adjust remaining pages, continue from there */
                rem_pages -= weight;
@@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                        node_pages += weight;
                        delta -= weight;
                } else if (delta) {
+                       /* when delta is deleted, resume from that node */
                        node_pages += delta;
-                       /* delta may deplete on a boundary or w/ a remainder */
-                       if (delta == weight) {
-                               /* boundary: resume from next node/weight */
-                               resume_node = next_node_in(node, nodes);
-                               resume_weight = weights[resume_node];
-                       } else {
-                               /* remainder: resume this node w/ remainder */
-                               resume_node = node;
-                               resume_weight = weight - delta;
-                       }
+                       resume_node = node;
+                       resume_weight = weight - delta;
                        delta = 0;
                }
                /* node_pages can be 0 if an allocation fails and rounds == 0 */


> 
> So, except the issue you pointed out already.  All series looks good to
> me!  Thanks!  Feel free to add
> 
> Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
> 
> to the whole series.
> 

Thank you so much for your patience with me! I appreciate all the help.

I am looking forward to this feature very much!

~Gregory
Huang, Ying Jan. 31, 2024, 9:19 a.m. UTC | #4
Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Jan 31, 2024 at 02:43:12PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> >  
>> > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>> > +{
>> > +	unsigned int node = current->il_prev;
>> > +
>> > +	if (!current->il_weight || !node_isset(node, policy->nodes)) {
>> > +		node = next_node_in(node, policy->nodes);
>> > +		/* can only happen if nodemask is being rebound */
>> > +		if (node == MAX_NUMNODES)
>> > +			return node;
>> 
>> I feel a little unsafe to read policy->nodes at same time of writing in
>> rebound.  Is it better to use a seqlock to guarantee its consistency?
>> It's unnecessary to be a part of this series though.
>> 
>
> I think this is handled already? It is definitely an explicit race
> condition that is documented elsewhere:
>
> /*
>  * mpol_rebind_policy - Migrate a policy to a different set of nodes
>  *
>  * Per-vma policies are protected by mmap_lock. Allocations using per-task
>  * policies are protected by task->mems_allowed_seq to prevent a premature
>  * OOM/allocation failure due to parallel nodemask modification.
>  */

Thanks for pointing this out!

If we use the task->mems_allowed_seq reader side in
weighted_interleave_nodes(), we can guarantee the consistency of
policy->nodes.  That may not be worth it, because it's not a big deal to
allocate 1 page on the wrong node.

It makes more sense to do that in
alloc_pages_bulk_array_weighted_interleave(), because a lot of pages may
be allocated there.
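
As a rough illustration of the reader-side pattern for the bulk path,
here is a userspace model built on C11 atomics.  It is only a sketch:
the sq_* names are hypothetical, the nodemask is squeezed into one word
(the real nodemask can span several words, which is why a sequence
count is needed at all), and it assumes a single writer:

```c
#include <assert.h>
#include <stdatomic.h>

struct sq_policy {
	atomic_uint seq;	/* even: stable, odd: rebind in progress */
	atomic_ulong nodes;	/* nodemask modelled as a one-word bitmap */
};

static void sq_policy_init(struct sq_policy *p, unsigned long nodes)
{
	atomic_init(&p->seq, 0);
	atomic_init(&p->nodes, nodes);
}

/* reader: wait until no rebind is in flight, return the cookie */
static unsigned int sq_read_begin(struct sq_policy *p)
{
	unsigned int s;

	while ((s = atomic_load(&p->seq)) & 1)
		;
	return s;
}

/* reader: nonzero if a rebind started or finished since the cookie */
static int sq_read_retry(struct sq_policy *p, unsigned int cookie)
{
	return atomic_load(&p->seq) != cookie;
}

/* writer: bump the count around the update (single writer assumed) */
static void sq_rebind(struct sq_policy *p, unsigned long new_nodes)
{
	assert((atomic_load(&p->seq) & 1) == 0);
	atomic_fetch_add(&p->seq, 1);
	atomic_store(&p->nodes, new_nodes);
	atomic_fetch_add(&p->seq, 1);
}

/* take one consistent snapshot up front, then work from the copy */
static unsigned long sq_snapshot_nodes(struct sq_policy *p)
{
	unsigned int cookie;
	unsigned long nodes;

	do {
		cookie = sq_read_begin(p);
		nodes = atomic_load(&p->nodes);
	} while (sq_read_retry(p, cookie));
	return nodes;
}
```

The point for the bulk allocator is that the retry loop wraps only the
snapshot; the many allocations that follow use the stack copy and never
need to retry.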

> example from slub:
>
> do {
> 	cpuset_mems_cookie = read_mems_allowed_begin();
> 	zonelist = node_zonelist(mempolicy_slab_node(), pc->flags);
> 	...
> } while (read_mems_allowed_retry(cpuset_mems_cookie));
>
> A quick perusal through other allocators shows similar checks.
>
> page_alloc.c  -  check_retry_cpuset()
> filemap.c     -  filemap_alloc_folio()
>
> If we ever want mempolicy to be swappable from outside the current task
> context, this will have to change most likely - but that's another
> feature for another day.
>

--
Best Regards,
Huang, Ying
Gregory Price Jan. 31, 2024, 4:35 p.m. UTC | #5
On Wed, Jan 31, 2024 at 05:19:51PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Wed, Jan 31, 2024 at 02:43:12PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry.memverge@gmail.com> writes:
> >> >  
> >> > +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> >> > +{
> >> > +	unsigned int node = current->il_prev;
> >> > +
> >> > +	if (!current->il_weight || !node_isset(node, policy->nodes)) {
> >> > +		node = next_node_in(node, policy->nodes);
> >> > +		/* can only happen if nodemask is being rebound */
> >> > +		if (node == MAX_NUMNODES)
> >> > +			return node;
> >> 
> >> I feel a little unsafe to read policy->nodes at same time of writing in
> >> rebound.  Is it better to use a seqlock to guarantee its consistency?
> >> It's unnecessary to be a part of this series though.
> >> 
> >
> > I think this is handled already? It is definitely an explicit race
> > condition that is documented elsewhere:
> >
> > /*
> >  * mpol_rebind_policy - Migrate a policy to a different set of nodes
> >  *
> >  * Per-vma policies are protected by mmap_lock. Allocations using per-task
> >  * policies are protected by task->mems_allowed_seq to prevent a premature
> >  * OOM/allocation failure due to parallel nodemask modification.
> >  */
> 
> Thanks for pointing this out!
> 
> If we use the task->mems_allowed_seq reader side in
> weighted_interleave_nodes(), we can guarantee the consistency of
> policy->nodes.  That may not be worth it, because it's not a big deal to
> allocate 1 page on the wrong node.
> 
> It makes more sense to do that in
> alloc_pages_bulk_array_weighted_interleave(), because a lot of pages may
> be allocated there.
>

That's probably worth just adding now, I'll do it and squash the style
updates into the branch.  Sorry Andrew, I guess 1 last version is
inbound :]

I'll pick up the reviewed tags along the way.

~Gregory
Gregory Price Jan. 31, 2024, 5:29 p.m. UTC | #6
On Wed, Jan 31, 2024 at 05:19:51PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> >
> > I think this is handled already? It is definitely an explicit race
> > condition that is documented elsewhere:
> >
> > /*
> >  * mpol_rebind_policy - Migrate a policy to a different set of nodes
> >  *
> >  * Per-vma policies are protected by mmap_lock. Allocations using per-task
> >  * policies are protected by task->mems_allowed_seq to prevent a premature
> >  * OOM/allocation failure due to parallel nodemask modification.
> >  */
> 
> Thanks for pointing this out!
> 
> If we use the task->mems_allowed_seq reader side in
> weighted_interleave_nodes(), we can guarantee the consistency of
> policy->nodes.  That may not be worth it, because it's not a big deal to
> allocate 1 page on the wrong node.
> 
> It makes more sense to do that in
> alloc_pages_bulk_array_weighted_interleave(), because a lot of pages may
> be allocated there.
> 

To save the versioning if there are issues, here are the 3 diffs that
I have left. If you are good with these changes, I'll squash the first
2 into the third commit, keep the last one as a separate commit (it
changes the interleave_nodes() logic too), and submit v5 w/ your
reviewed tag on all of them.


Fix one (pedantic?) warning from syzbot:
----------------------------------------

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b1437396c357..dfd097009606 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2391,7 +2391,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
        unsigned long nr_allocated = 0;
        unsigned long rounds;
        unsigned long node_pages, delta;
-       u8 __rcu *table, *weights, weight;
+       u8 __rcu *table, __rcu *weights, weight;
        unsigned int weight_total = 0;
        unsigned long rem_pages = nr_pages;
        nodemask_t nodes;



Simplifying resume_node/weight logic:
-------------------------------------

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2c1aef8eab70..b0ca9bcdd64c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                page_array += nr_allocated;
                total_allocated += nr_allocated;
                /* if that's all the pages, no need to interleave */
-               if (rem_pages < weight) {
-                       /* stay on current node, adjust il_weight */
+               if (rem_pages <= weight) {
                        me->il_weight -= rem_pages;
                        return total_allocated;
-               } else if (rem_pages == weight) {
-                       /* move to next node / weight */
-                       me->il_prev = next_node_in(node, nodes);
-                       me->il_weight = get_il_weight(me->il_prev);
-                       return total_allocated;
                }
                /* Otherwise we adjust remaining pages, continue from there */
                rem_pages -= weight;
@@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                        node_pages += weight;
                        delta -= weight;
                } else if (delta) {
+                       /* when delta is deleted, resume from that node */
                        node_pages += delta;
-                       /* delta may deplete on a boundary or w/ a remainder */
-                       if (delta == weight) {
-                               /* boundary: resume from next node/weight */
-                               resume_node = next_node_in(node, nodes);
-                               resume_weight = weights[resume_node];
-                       } else {
-                               /* remainder: resume this node w/ remainder */
-                               resume_node = node;
-                               resume_weight = weight - delta;
-                       }
+                       resume_node = node;
+                       resume_weight = weight - delta;
                        delta = 0;
                }
                /* node_pages can be 0 if an allocation fails and rounds == 0 */





task->mems_allowed_seq protection (added as 4th patch)
------------------------------------------------------

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b0ca9bcdd64c..b1437396c357 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1879,10 +1879,15 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
        unsigned int node = current->il_prev;
+       unsigned int cpuset_mems_cookie;

+retry:
+       /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+       cpuset_mems_cookie = read_mems_allowed_begin();
        if (!current->il_weight || !node_isset(node, policy->nodes)) {
                node = next_node_in(node, policy->nodes);
-               /* can only happen if nodemask is being rebound */
+               if (read_mems_allowed_retry(cpuset_mems_cookie))
+                       goto retry;
                if (node == MAX_NUMNODES)
                        return node;
                current->il_prev = node;
@@ -1896,10 +1901,17 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
        unsigned int nid;
+       unsigned int cpuset_mems_cookie;
+
+       /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+       do {
+               cpuset_mems_cookie = read_mems_allowed_begin();
+               nid = next_node_in(current->il_prev, policy->nodes);
+       } while (read_mems_allowed_retry(cpuset_mems_cookie));

-       nid = next_node_in(current->il_prev, policy->nodes);
        if (nid < MAX_NUMNODES)
                current->il_prev = nid;
+
        return nid;
 }

@@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                struct page **page_array)
 {
        struct task_struct *me = current;
+       unsigned int cpuset_mems_cookie;
        unsigned long total_allocated = 0;
        unsigned long nr_allocated = 0;
        unsigned long rounds;
@@ -2388,10 +2401,17 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
        int prev_node;
        int i;

+
        if (!nr_pages)
                return 0;

-       nnodes = read_once_policy_nodemask(pol, &nodes);
+       /* read the nodes onto the stack, retry if done during rebind */
+       do {
+               cpuset_mems_cookie = read_mems_allowed_begin();
+               nnodes = read_once_policy_nodemask(pol, &nodes);
+       } while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+       /* if the nodemask has become invalid, we cannot do anything */
        if (!nnodes)
                return 0;
Huang, Ying Feb. 1, 2024, 1:55 a.m. UTC | #7
Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Jan 31, 2024 at 05:19:51PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> >
>> > I think this is handled already? It is definitely an explicit race
>> > condition that is documented elsewhere:
>> >
>> > /*
>> >  * mpol_rebind_policy - Migrate a policy to a different set of nodes
>> >  *
>> >  * Per-vma policies are protected by mmap_lock. Allocations using per-task
>> >  * policies are protected by task->mems_allowed_seq to prevent a premature
>> >  * OOM/allocation failure due to parallel nodemask modification.
>> >  */
>> 
>> Thanks for pointing this out!
>> 
>> If we use the task->mems_allowed_seq reader side in
>> weighted_interleave_nodes(), we can guarantee the consistency of
>> policy->nodes.  That may not be worth it, because it's not a big deal to
>> allocate 1 page on the wrong node.
>> 
>> It makes more sense to do that in
>> alloc_pages_bulk_array_weighted_interleave(), because a lot of pages may
>> be allocated there.
>> 
>
> To save the versioning if there are issues, here are the 3 diffs that
> I have left. If you are good with these changes, I'll squash the first
> 2 into the third commit, keep the last one as a separate commit (it
> changes the interleave_nodes() logic too), and submit v5 w/ your
> reviewed tag on all of them.
>
>
> Fix one (pedantic?) warning from syzbot:
> ----------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b1437396c357..dfd097009606 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2391,7 +2391,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>         unsigned long nr_allocated = 0;
>         unsigned long rounds;
>         unsigned long node_pages, delta;
> -       u8 __rcu *table, *weights, weight;
> +       u8 __rcu *table, __rcu *weights, weight;

The __rcu usage can be checked with `sparse` directly.  For example,

make C=1 mm/mempolicy.o

More details can be found in

https://www.kernel.org/doc/html/latest/dev-tools/sparse.html

Per my understanding, we shouldn't use "__rcu" here.  Please search for
"__rcu" in the following document.

https://www.kernel.org/doc/html/latest/RCU/checklist.html

>         unsigned int weight_total = 0;
>         unsigned long rem_pages = nr_pages;
>         nodemask_t nodes;
>
>
>
> Simplifying resume_node/weight logic:
> -------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 2c1aef8eab70..b0ca9bcdd64c 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2405,15 +2405,9 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>                 page_array += nr_allocated;
>                 total_allocated += nr_allocated;
>                 /* if that's all the pages, no need to interleave */
> -               if (rem_pages < weight) {
> -                       /* stay on current node, adjust il_weight */
> +               if (rem_pages <= weight) {
>                         me->il_weight -= rem_pages;
>                         return total_allocated;
> -               } else if (rem_pages == weight) {
> -                       /* move to next node / weight */
> -                       me->il_prev = next_node_in(node, nodes);
> -                       me->il_weight = get_il_weight(me->il_prev);
> -                       return total_allocated;
>                 }
>                 /* Otherwise we adjust remaining pages, continue from there */
>                 rem_pages -= weight;
> @@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>                         node_pages += weight;
>                         delta -= weight;
>                 } else if (delta) {
> +                       /* when delta is deleted, resume from that node */
                                           ~~~~~~~
                                           depleted?


>                         node_pages += delta;
> -                       /* delta may deplete on a boundary or w/ a remainder */
> -                       if (delta == weight) {
> -                               /* boundary: resume from next node/weight */
> -                               resume_node = next_node_in(node, nodes);
> -                               resume_weight = weights[resume_node];
> -                       } else {
> -                               /* remainder: resume this node w/ remainder */
> -                               resume_node = node;
> -                               resume_weight = weight - delta;
> -                       }
> +                       resume_node = node;
> +                       resume_weight = weight - delta;
>                         delta = 0;
>                 }
>                 /* node_pages can be 0 if an allocation fails and rounds == 0 */
>
>
>
>
>
> task->mems_allowed_seq protection (added as 4th patch)
> ------------------------------------------------------
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index b0ca9bcdd64c..b1437396c357 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1879,10 +1879,15 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>  static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>  {
>         unsigned int node = current->il_prev;
> +       unsigned int cpuset_mems_cookie;
>
> +retry:
> +       /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
> +       cpuset_mems_cookie = read_mems_allowed_begin();
>         if (!current->il_weight || !node_isset(node, policy->nodes)) {
>                 node = next_node_in(node, policy->nodes);

node will be changed in the loop.  So we need to change the logic here.

> -               /* can only happen if nodemask is being rebound */
> +               if (read_mems_allowed_retry(cpuset_mems_cookie))
> +                       goto retry;
>                 if (node == MAX_NUMNODES)
>                         return node;
>                 current->il_prev = node;
> @@ -1896,10 +1901,17 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>  static unsigned int interleave_nodes(struct mempolicy *policy)
>  {
>         unsigned int nid;
> +       unsigned int cpuset_mems_cookie;
> +
> +       /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
> +       do {
> +               cpuset_mems_cookie = read_mems_allowed_begin();
> +               nid = next_node_in(current->il_prev, policy->nodes);
> +       } while (read_mems_allowed_retry(cpuset_mems_cookie));
>
> -       nid = next_node_in(current->il_prev, policy->nodes);
>         if (nid < MAX_NUMNODES)
>                 current->il_prev = nid;
> +
>         return nid;
>  }
>
> @@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>                 struct page **page_array)
>  {
>         struct task_struct *me = current;
> +       unsigned int cpuset_mems_cookie;
>         unsigned long total_allocated = 0;
>         unsigned long nr_allocated = 0;
>         unsigned long rounds;
> @@ -2388,10 +2401,17 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>         int prev_node;
>         int i;
>
> +

Change by accident?

>         if (!nr_pages)
>                 return 0;
>
> -       nnodes = read_once_policy_nodemask(pol, &nodes);
> +       /* read the nodes onto the stack, retry if done during rebind */
> +       do {
> +               cpuset_mems_cookie = read_mems_allowed_begin();
> +               nnodes = read_once_policy_nodemask(pol, &nodes);
> +       } while (read_mems_allowed_retry(cpuset_mems_cookie));
> +
> +       /* if the nodemask has become invalid, we cannot do anything */
>         if (!nnodes)
>                 return 0;

--
Best Regards,
Huang, Ying
Gregory Price Feb. 1, 2024, 2:01 a.m. UTC | #8
On Thu, Feb 01, 2024 at 09:55:07AM +0800, Huang, Ying wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index b1437396c357..dfd097009606 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2391,7 +2391,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> >         unsigned long nr_allocated = 0;
> >         unsigned long rounds;
> >         unsigned long node_pages, delta;
> > -       u8 __rcu *table, *weights, weight;
> > +       u8 __rcu *table, __rcu *weights, weight;
> 
> The __rcu usage can be checked with `sparse` directly.  For example,
> 
> make C=1 mm/mempolicy.o
> 
> More details can be found in
> 
> https://www.kernel.org/doc/html/latest/dev-tools/sparse.html
> 
> Per my understanding, we shouldn't use "__rcu" here.  Please search for
> "__rcu" in the following document.
> 
> https://www.kernel.org/doc/html/latest/RCU/checklist.html
> 

Thanks for this, I will sort this out and respond here with changes
before v5.

> > @@ -2460,17 +2454,10 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> >                         node_pages += weight;
> >                         delta -= weight;
> >                 } else if (delta) {
> > +                       /* when delta is deleted, resume from that node */
>                                            ~~~~~~~
>                                            depleted?

ack.

> > +retry:
> > +       /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
> > +       cpuset_mems_cookie = read_mems_allowed_begin();
> >         if (!current->il_weight || !node_isset(node, policy->nodes)) {
> >                 node = next_node_in(node, policy->nodes);
> 
> node will be changed in the loop.  So we need to change the logic here.
> 

Good catch, stupid mistake. ack.
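
For anyone following along, a stripped-down userspace model of the
problem.  The rb_* names are hypothetical, the nodemask is a plain
bitmap, and the "retries" argument stands in for how many times
read_mems_allowed_retry() would report a concurrent rebind:

```c
#include <assert.h>

#define RB_MAX_NODES 8

/* next set bit after prev, wrapping around; RB_MAX_NODES if mask is empty */
static unsigned int rb_next_node_in(unsigned int prev, unsigned int mask)
{
	assert(prev < RB_MAX_NODES);
	for (unsigned int i = 1; i <= RB_MAX_NODES; i++) {
		unsigned int n = (prev + i) % RB_MAX_NODES;

		if (mask & (1u << n))
			return n;
	}
	return RB_MAX_NODES;
}

/* buggy shape: the local is seeded once, so every retry advances again */
static unsigned int rb_pick_buggy(unsigned int il_prev, unsigned int mask,
				  int retries)
{
	unsigned int node = il_prev;

	do {
		node = rb_next_node_in(node, mask);
	} while (retries-- > 0);
	return node;
}

/* fixed shape: each attempt restarts from the unchanged il_prev */
static unsigned int rb_pick_fixed(unsigned int il_prev, unsigned int mask,
				  int retries)
{
	unsigned int node;

	do {
		node = rb_next_node_in(il_prev, mask);
	} while (retries-- > 0);
	return node;
}
```

With nodes 0-2 set and il_prev == 0, one retry makes the buggy shape
skip node 1 entirely, while the fixed shape still picks it.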

> > @@ -2388,10 +2401,17 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> >         int prev_node;
> >         int i;
> >
> > +
> 
> Change by accident?
>

ack.

~Gregory
Gregory Price Feb. 1, 2024, 2:18 a.m. UTC | #9
On Thu, Feb 01, 2024 at 09:55:07AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> > -       u8 __rcu *table, *weights, weight;
> > +       u8 __rcu *table, __rcu *weights, weight;
> 
> The __rcu usage can be checked with `sparse` directly.  For example,
> 
> make C=1 mm/mempolicy.o
> 

Fixed and squashed: all the __rcu usage I had, except for the global
pointer, has been removed.  Thanks for the reference material, I was
struggling to understand that.

> > task->mems_allowed_seq protection (added as 4th patch)
> > ------------------------------------------------------
> >
> > +       cpuset_mems_cookie = read_mems_allowed_begin();
> >         if (!current->il_weight || !node_isset(node, policy->nodes)) {
> >                 node = next_node_in(node, policy->nodes);
> 
> node will be changed in the loop.  So we need to change the logic here.
> 

new patch, if it all looks good i'll ship it in v5

~Gregory


diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d8cc3a577986..4e5a640d10b8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)

 static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
-       unsigned int node = current->il_prev;
-
-       if (!current->il_weight || !node_isset(node, policy->nodes)) {
-               node = next_node_in(node, policy->nodes);
-               /* can only happen if nodemask is being rebound */
+       unsigned int node;
+       unsigned int cpuset_mems_cookie;
+
+retry:
+       /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+       cpuset_mems_cookie = read_mems_allowed_begin();
+       if (!current->il_weight ||
+           !node_isset(current->il_prev, policy->nodes)) {
+               node = next_node_in(current->il_prev, policy->nodes);
+               if (read_mems_allowed_retry(cpuset_mems_cookie))
+                       goto retry;
                if (node == MAX_NUMNODES)
                        return node;
                current->il_prev = node;
@@ -1896,8 +1902,14 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
        unsigned int nid;
+       unsigned int cpuset_mems_cookie;
+
+       /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+       do {
+               cpuset_mems_cookie = read_mems_allowed_begin();
+               nid = next_node_in(current->il_prev, policy->nodes);
+       } while (read_mems_allowed_retry(cpuset_mems_cookie));

-       nid = next_node_in(current->il_prev, policy->nodes);
        if (nid < MAX_NUMNODES)
                current->il_prev = nid;
        return nid;
@@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                struct page **page_array)
 {
        struct task_struct *me = current;
+       unsigned int cpuset_mems_cookie;
        unsigned long total_allocated = 0;
        unsigned long nr_allocated = 0;
        unsigned long rounds;
@@ -2391,7 +2404,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
        if (!nr_pages)
                return 0;

-       nnodes = read_once_policy_nodemask(pol, &nodes);
+       /* read the nodes onto the stack, retry if done during rebind */
+       do {
+               cpuset_mems_cookie = read_mems_allowed_begin();
+               nnodes = read_once_policy_nodemask(pol, &nodes);
+       } while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+       /* if the nodemask has become invalid, we cannot do anything */
        if (!nnodes)
                return 0;
Huang, Ying Feb. 1, 2024, 3:02 a.m. UTC | #10
Gregory Price <gregory.price@memverge.com> writes:

> On Thu, Feb 01, 2024 at 09:55:07AM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> > -       u8 __rcu *table, *weights, weight;
>> > +       u8 __rcu *table, __rcu *weights, weight;
>> 
>> The __rcu usage can be checked with `sparse` directly.  For example,
>> 
>> make C=1 mm/mempolicy.o
>> 
>
> Fixed and squashed: all the __rcu usage I had, except for the global
> pointer, has been removed.  Thanks for the reference material, I was
> struggling to understand that.
>
>> > task->mems_allowed_seq protection (added as 4th patch)
>> > ------------------------------------------------------
>> >
>> > +       cpuset_mems_cookie = read_mems_allowed_begin();
>> >         if (!current->il_weight || !node_isset(node, policy->nodes)) {
>> >                 node = next_node_in(node, policy->nodes);
>> 
>> node will be changed in the loop.  So we need to change the logic here.
>> 
>
> new patch, if it all looks good i'll ship it in v5
>
> ~Gregory
>
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index d8cc3a577986..4e5a640d10b8 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>
>  static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>  {
> -       unsigned int node = current->il_prev;
> -
> -       if (!current->il_weight || !node_isset(node, policy->nodes)) {
> -               node = next_node_in(node, policy->nodes);
> -               /* can only happen if nodemask is being rebound */
> +       unsigned int node;

IIUC, "node" may be used without initialization.

--
Best Regards,
Huang, Ying

> +       unsigned int cpuset_mems_cookie;
> +
> +retry:
> +       /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
> +       cpuset_mems_cookie = read_mems_allowed_begin();
> +       if (!current->il_weight ||
> +           !node_isset(current->il_prev, policy->nodes)) {
> +               node = next_node_in(current->il_prev, policy->nodes);
> +               if (read_mems_allowed_retry(cpuset_mems_cookie))
> +                       goto retry;
>                 if (node == MAX_NUMNODES)
>                         return node;
>                 current->il_prev = node;
> @@ -1896,8 +1902,14 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
>  static unsigned int interleave_nodes(struct mempolicy *policy)
>  {
>         unsigned int nid;
> +       unsigned int cpuset_mems_cookie;
> +
> +       /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
> +       do {
> +               cpuset_mems_cookie = read_mems_allowed_begin();
> +               nid = next_node_in(current->il_prev, policy->nodes);
> +       } while (read_mems_allowed_retry(cpuset_mems_cookie));
>
> -       nid = next_node_in(current->il_prev, policy->nodes);
>         if (nid < MAX_NUMNODES)
>                 current->il_prev = nid;
>         return nid;
> @@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>                 struct page **page_array)
>  {
>         struct task_struct *me = current;
> +       unsigned int cpuset_mems_cookie;
>         unsigned long total_allocated = 0;
>         unsigned long nr_allocated = 0;
>         unsigned long rounds;
> @@ -2391,7 +2404,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>         if (!nr_pages)
>                 return 0;
>
> -       nnodes = read_once_policy_nodemask(pol, &nodes);
> +       /* read the nodes onto the stack, retry if done during rebind */
> +       do {
> +               cpuset_mems_cookie = read_mems_allowed_begin();
> +               nnodes = read_once_policy_nodemask(pol, &nodes);
> +       } while (read_mems_allowed_retry(cpuset_mems_cookie));
> +
> +       /* if the nodemask has become invalid, we cannot do anything */
>         if (!nnodes)
>                 return 0;
Gregory Price Feb. 1, 2024, 3:10 a.m. UTC | #11
On Thu, Feb 01, 2024 at 11:02:47AM +0800, Huang, Ying wrote:
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index d8cc3a577986..4e5a640d10b8 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
> >
> >  static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> >  {
> > -       unsigned int node = current->il_prev;
> > -
> > -       if (!current->il_weight || !node_isset(node, policy->nodes)) {
> > -               node = next_node_in(node, policy->nodes);
> > -               /* can only happen if nodemask is being rebound */
> > +       unsigned int node;
> 
> IIUC, "node" may be used without initialization.
> 

ok i should slow down lol.  This should take care of it.


diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d8cc3a577986..ed0d5d2d456a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1878,11 +1878,17 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)

 static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
-       unsigned int node = current->il_prev;
-
-       if (!current->il_weight || !node_isset(node, policy->nodes)) {
+       unsigned int node;
+       unsigned int cpuset_mems_cookie;
+
+retry:
+       /* to prevent miscount use tsk->mems_allowed_seq to detect rebind */
+       cpuset_mems_cookie = read_mems_allowed_begin();
+       node = current->il_prev;
+       if (!node || !node_isset(node, policy->nodes)) {
                node = next_node_in(node, policy->nodes);
-               /* can only happen if nodemask is being rebound */
+               if (read_mems_allowed_retry(cpuset_mems_cookie))
+                       goto retry;
                if (node == MAX_NUMNODES)
                        return node;
                current->il_prev = node;
@@ -1896,8 +1902,14 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
        unsigned int nid;
+       unsigned int cpuset_mems_cookie;
+
+       /* to prevent miscount, use tsk->mems_allowed_seq to detect rebind */
+       do {
+               cpuset_mems_cookie = read_mems_allowed_begin();
+               nid = next_node_in(current->il_prev, policy->nodes);
+       } while (read_mems_allowed_retry(cpuset_mems_cookie));

-       nid = next_node_in(current->il_prev, policy->nodes);
        if (nid < MAX_NUMNODES)
                current->il_prev = nid;
        return nid;
@@ -2374,6 +2386,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
                struct page **page_array)
 {
        struct task_struct *me = current;
+       unsigned int cpuset_mems_cookie;
        unsigned long total_allocated = 0;
        unsigned long nr_allocated = 0;
        unsigned long rounds;
@@ -2391,7 +2404,13 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
        if (!nr_pages)
                return 0;

-       nnodes = read_once_policy_nodemask(pol, &nodes);
+       /* read the nodes onto the stack, retry if done during rebind */
+       do {
+               cpuset_mems_cookie = read_mems_allowed_begin();
+               nnodes = read_once_policy_nodemask(pol, &nodes);
+       } while (read_mems_allowed_retry(cpuset_mems_cookie));
+
+       /* if the nodemask has become invalid, we cannot do anything */
        if (!nnodes)
                return 0;
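
For readers following the thread, the per-task state that
weighted_interleave_nodes() maintains can be modelled in user space.
This is only a sketch under stated assumptions, not code from the patch:
struct il_state, weighted_next() and the fixed two-node weights table are
hypothetical stand-ins for current->il_prev, current->il_weight and
get_il_weight(), every node is assumed to be in the policy nodemask, and
the cpuset rebind/retry handling discussed above is left out.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 2	/* hypothetical node count for this model */

/* Hypothetical stand-in for the il_prev / il_weight fields of task_struct. */
struct il_state {
	int prev;		/* node used by the previous allocation */
	unsigned char weight;	/* pages still owed to that node */
};

/* Assumed weights (2:1); the patch obtains these via get_il_weight(). */
static const unsigned char node_weights[NR_NODES] = { 2, 1 };

/* Pick the node for the next page and consume one unit of its weight. */
static int weighted_next(struct il_state *s)
{
	assert(s != NULL);
	if (!s->weight) {
		/* weight exhausted: advance to the next node, reload weight */
		s->prev = (s->prev + 1) % NR_NODES;
		s->weight = node_weights[s->prev];
	}
	s->weight--;
	return s->prev;
}
```

Starting from { .prev = NR_NODES - 1, .weight = 0 }, which mirrors the
reset done in do_set_mempolicy(), successive calls yield nodes
0, 0, 1, 0, 0, 1, ... for the assumed 2:1 weights.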

Patch

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..a70f20ce1ffb 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,15 @@  MPOL_PREFERRED_MANY
 	can fall back to all existing numa nodes. This is effectively
 	MPOL_PREFERRED allowed for a mask rather than a single node.
 
+MPOL_WEIGHTED_INTERLEAVE
+	This mode operates the same as MPOL_INTERLEAVE, except that
+	interleaving behavior is executed based on weights set in
+	/sys/kernel/mm/mempolicy/weighted_interleave/
+
+	Weighted interleave allocates pages on nodes according to a
+	weight.  For example, if nodes [0,1] are weighted [5,2], 5 pages
+	will be allocated on node0 for every 2 pages allocated on node1.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffe8f618ab86..b9ce285d8c9c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1259,6 +1259,7 @@  struct task_struct {
 	/* Protected by alloc_lock: */
 	struct mempolicy		*mempolicy;
 	short				il_prev;
+	u8				il_weight;
 	short				pref_node_fork;
 #endif
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..1f9bb10d1a47 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,6 +23,7 @@  enum {
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
 	MPOL_PREFERRED_MANY,
+	MPOL_WEIGHTED_INTERLEAVE,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3bdfaf03b660..7cd92f4ec0d7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -19,6 +19,13 @@ 
  *                for anonymous memory. For process policy an process counter
  *                is used.
  *
+ * weighted interleave
+ *                Allocate memory interleaved over a set of nodes based on
+ *                a set of weights (per-node), with normal fallback if it
+ *                fails.  Otherwise operates the same as interleave.
+ *                Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
+ *                on node 0 for every 1 page allocated on node 1.
+ *
  * bind           Only allocate memory on a specific set of nodes,
  *                no fallback.
  *                FIXME: memory is allocated starting with the first node
@@ -441,6 +448,10 @@  static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_preferred,
 	},
+	[MPOL_WEIGHTED_INTERLEAVE] = {
+		.create = mpol_new_nodemask,
+		.rebind = mpol_rebind_nodemask,
+	},
 };
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
@@ -862,8 +873,11 @@  static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && (new->mode == MPOL_INTERLEAVE ||
+		    new->mode == MPOL_WEIGHTED_INTERLEAVE)) {
 		current->il_prev = MAX_NUMNODES-1;
+		current->il_weight = 0;
+	}
 	task_unlock(current);
 	mpol_put(old);
 	ret = 0;
@@ -888,6 +902,7 @@  static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*nodes = pol->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -972,6 +987,13 @@  static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		} else if (pol == current->mempolicy &&
 				pol->mode == MPOL_INTERLEAVE) {
 			*policy = next_node_in(current->il_prev, pol->nodes);
+		} else if (pol == current->mempolicy &&
+				pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
+			if (current->il_weight)
+				*policy = current->il_prev;
+			else
+				*policy = next_node_in(current->il_prev,
+						       pol->nodes);
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -1336,7 +1358,8 @@  static long do_mbind(unsigned long start, unsigned long len,
 		 * VMAs, the nodes will still be interleaved from the targeted
 		 * nodemask, but one by one may be selected differently.
 		 */
-		if (new->mode == MPOL_INTERLEAVE) {
+		if (new->mode == MPOL_INTERLEAVE ||
+		    new->mode == MPOL_WEIGHTED_INTERLEAVE) {
 			struct page *page;
 			unsigned int order;
 			unsigned long addr = -EFAULT;
@@ -1784,7 +1807,8 @@  struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * @vma: virtual memory area whose policy is sought
  * @addr: address in @vma for shared policy lookup
  * @order: 0, or appropriate huge_page_order for interleaving
- * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
+ * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or
+ *       MPOL_WEIGHTED_INTERLEAVE
  *
  * Returns effective policy for a VMA at specified address.
  * Falls back to current->mempolicy or system default policy, as necessary.
@@ -1801,7 +1825,8 @@  struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 	pol = __get_vma_policy(vma, addr, ilx);
 	if (!pol)
 		pol = get_task_policy(current);
-	if (pol->mode == MPOL_INTERLEAVE) {
+	if (pol->mode == MPOL_INTERLEAVE ||
+	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
 		*ilx += vma->vm_pgoff >> order;
 		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
 	}
@@ -1851,6 +1876,22 @@  bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }
 
+static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
+{
+	unsigned int node = current->il_prev;
+
+	if (!current->il_weight || !node_isset(node, policy->nodes)) {
+		node = next_node_in(node, policy->nodes);
+		/* can only happen if nodemask is being rebound */
+		if (node == MAX_NUMNODES)
+			return node;
+		current->il_prev = node;
+		current->il_weight = get_il_weight(node);
+	}
+	current->il_weight--;
+	return node;
+}
+
 /* Do dynamic interleaving for a process */
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
@@ -1885,6 +1926,9 @@  unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		return weighted_interleave_nodes(policy);
+
 	case MPOL_BIND:
 	case MPOL_PREFERRED_MANY:
 	{
@@ -1923,6 +1967,45 @@  static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
 	return nodes_weight(*mask);
 }
 
+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
+{
+	nodemask_t nodemask;
+	unsigned int target, nr_nodes;
+	u8 __rcu *table;
+	unsigned int weight_total = 0;
+	u8 weight;
+	int nid;
+
+	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
+	if (!nr_nodes)
+		return numa_node_id();
+
+	rcu_read_lock();
+	table = rcu_dereference(iw_table);
+	/* calculate the total weight */
+	for_each_node_mask(nid, nodemask) {
+		/* detect system default usage */
+		weight = table ? table[nid] : 1;
+		weight = weight ? weight : 1;
+		weight_total += weight;
+	}
+
+	/* Calculate the node offset based on totals */
+	target = ilx % weight_total;
+	nid = first_node(nodemask);
+	while (target) {
+		/* detect system default usage */
+		weight = table ? table[nid] : 1;
+		weight = weight ? weight : 1;
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+	rcu_read_unlock();
+	return nid;
+}
+
 /*
  * Do static interleaving for interleave index @ilx.  Returns the ilx'th
  * node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1983,6 +2066,11 @@  static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
 			interleave_nodes(pol) : interleave_nid(pol, ilx);
 		break;
+	case MPOL_WEIGHTED_INTERLEAVE:
+		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
+			weighted_interleave_nodes(pol) :
+			weighted_interleave_nid(pol, ilx);
+		break;
 	}
 
 	return nodemask;
@@ -2044,6 +2132,7 @@  bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*mask = mempolicy->nodes;
 		break;
 
@@ -2144,6 +2233,7 @@  struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 		 * node in its nodemask, we allocate the standard way.
 		 */
 		if (pol->mode != MPOL_INTERLEAVE &&
+		    pol->mode != MPOL_WEIGHTED_INTERLEAVE &&
 		    (!nodemask || node_isset(nid, *nodemask))) {
 			/*
 			 * First, try to allocate THP only on local node, but
@@ -2279,6 +2369,127 @@  static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 	return total_allocated;
 }
 
+static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
+		struct mempolicy *pol, unsigned long nr_pages,
+		struct page **page_array)
+{
+	struct task_struct *me = current;
+	unsigned long total_allocated = 0;
+	unsigned long nr_allocated = 0;
+	unsigned long rounds;
+	unsigned long node_pages, delta;
+	u8 __rcu *table, *weights, weight;
+	unsigned int weight_total = 0;
+	unsigned long rem_pages = nr_pages;
+	nodemask_t nodes;
+	int nnodes, node, next_node;
+	int resume_node = MAX_NUMNODES - 1;
+	u8 resume_weight = 0;
+	int prev_node;
+	int i;
+
+	if (!nr_pages)
+		return 0;
+
+	nnodes = read_once_policy_nodemask(pol, &nodes);
+	if (!nnodes)
+		return 0;
+
+	/* Continue allocating from most recent node and adjust the nr_pages */
+	node = me->il_prev;
+	weight = me->il_weight;
+	if (weight && node_isset(node, nodes)) {
+		node_pages = min(rem_pages, weight);
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		/* if that's all the pages, no need to interleave */
+		if (rem_pages < weight) {
+			/* stay on current node, adjust il_weight */
+			me->il_weight -= rem_pages;
+			return total_allocated;
+		} else if (rem_pages == weight) {
+			/* move to next node / weight */
+			me->il_prev = next_node = next_node_in(node, nodes);
+			me->il_weight = get_il_weight(next_node);
+			return total_allocated;
+		}
+		/* Otherwise we adjust remaining pages, continue from there */
+		rem_pages -= weight;
+	}
+	/* clear active weight in case of an allocation failure */
+	me->il_weight = 0;
+	prev_node = node;
+
+	/* create a local copy of node weights to operate on outside rcu */
+	weights = kzalloc(nr_node_ids, GFP_KERNEL);
+	if (!weights)
+		return total_allocated;
+
+	rcu_read_lock();
+	table = rcu_dereference(iw_table);
+	if (table)
+		memcpy(weights, table, nr_node_ids);
+	rcu_read_unlock();
+
+	/* calculate total, detect system default usage */
+	for_each_node_mask(node, nodes) {
+		if (!weights[node])
+			weights[node] = 1;
+		weight_total += weights[node];
+	}
+
+	/*
+	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
+	 * Track which node weighted interleave should resume from.
+	 *
+	 * if (rounds > 0) and (delta == 0), resume_node will always be
+	 * the node following prev_node and its weight.
+	 */
+	rounds = rem_pages / weight_total;
+	delta = rem_pages % weight_total;
+	resume_node = next_node_in(prev_node, nodes);
+	resume_weight = weights[resume_node];
+	for (i = 0; i < nnodes; i++) {
+		node = next_node_in(prev_node, nodes);
+		weight = weights[node];
+		node_pages = weight * rounds;
+		/* If a delta exists, add this node's portion of the delta */
+		if (delta > weight) {
+			node_pages += weight;
+			delta -= weight;
+		} else if (delta) {
+			node_pages += delta;
+			/* delta may deplete on a boundary or w/ a remainder */
+			if (delta == weight) {
+				/* boundary: resume from next node/weight */
+				resume_node = next_node_in(node, nodes);
+				resume_weight = weights[resume_node];
+			} else {
+				/* remainder: resume this node w/ remainder */
+				resume_node = node;
+				resume_weight = weight - delta;
+			}
+			delta = 0;
+		}
+		/* node_pages can be 0 if an allocation fails and rounds == 0 */
+		if (!node_pages)
+			break;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		if (total_allocated == nr_pages)
+			break;
+		prev_node = node;
+	}
+	me->il_prev = resume_node;
+	me->il_weight = resume_weight;
+	kfree(weights);
+	return total_allocated;
+}
+
 static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
@@ -2319,6 +2530,10 @@  unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
 		return alloc_pages_bulk_array_interleave(gfp, pol,
 							 nr_pages, page_array);
 
+	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
+		return alloc_pages_bulk_array_weighted_interleave(
+				  gfp, pol, nr_pages, page_array);
+
 	if (pol->mode == MPOL_PREFERRED_MANY)
 		return alloc_pages_bulk_array_preferred_many(gfp,
 				numa_node_id(), pol, nr_pages, page_array);
@@ -2394,6 +2609,7 @@  bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2530,6 +2746,10 @@  int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
 		polnid = interleave_nid(pol, ilx);
 		break;
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		polnid = weighted_interleave_nid(pol, ilx);
+		break;
+
 	case MPOL_PREFERRED:
 		if (node_isset(curnid, pol->nodes))
 			goto out;
@@ -2904,6 +3124,7 @@  static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
+	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
 	[MPOL_LOCAL]      = "local",
 	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
@@ -2963,6 +3184,7 @@  int mpol_parse_str(char *str, struct mempolicy **mpol)
 		}
 		break;
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		/*
 		 * Default to online nodes with memory if no nodelist
 		 */
@@ -3073,6 +3295,7 @@  void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		nodes = pol->nodes;
 		break;
 	default:
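
The static mapping performed by weighted_interleave_nid() in the patch
above can be illustrated with a small user-space sketch. It is not the
kernel code: weighted_nid() is a hypothetical helper, the nodemask is
assumed to contain every node 0..nr_nodes-1, and the weights are passed
in as a plain array instead of being read from the RCU-protected
iw_table. As in the patch, a weight of 0 is treated as 1.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Map interleave index ilx onto a node, given per-node weights.
 * Assumes all nodes 0..nr_nodes-1 are in the policy nodemask.
 */
static size_t weighted_nid(const unsigned char *weights, size_t nr_nodes,
			   unsigned long ilx)
{
	unsigned long total = 0, target;
	size_t nid;

	assert(weights != NULL && nr_nodes > 0);

	/* total weight of one full interleave round */
	for (nid = 0; nid < nr_nodes; nid++)
		total += weights[nid] ? weights[nid] : 1;

	/* offset within the round, then walk nodes consuming their weights */
	target = ilx % total;
	for (nid = 0; nid < nr_nodes; nid++) {
		unsigned long w = weights[nid] ? weights[nid] : 1;

		if (target < w)
			break;
		target -= w;
	}
	return nid;
}
```

With weights [5,2], as in the documentation example, indices 0-4 map to
node 0, indices 5-6 map to node 1, and index 7 wraps back to node 0.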