diff mbox series

[v5,02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

Message ID 20231223181101.1954-3-gregory.price@memverge.com (mailing list archive)
State New
Headers show
Series mempolicy2, mbind2, and weighted interleave | expand

Commit Message

Gregory Price Dec. 23, 2023, 6:10 p.m. UTC
When a system has multiple NUMA nodes and it becomes bandwidth hungry,
the current MPOL_INTERLEAVE could be an wise option.

However, if those NUMA nodes consist of different types of memory such
as having local DRAM and CXL memory together, the current round-robin
based interleaving policy doesn't maximize the overall bandwidth because
of their different bandwidth characteristics.

Instead, the interleaving can be more efficient when the allocation
policy follows each NUMA nodes' bandwidth weight rather than having 1:1
round-robin allocation.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, which
enables weighted interleaving between NUMA nodes.  Weighted interleave
allows for a proportional distribution of memory across multiple numa
nodes, preferablly apportioned to match the bandwidth capacity of each
node from the perspective of the accessing node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
appropriate weight distribution is (2:1).

Weights will be acquired from the global weight matrix exposed by the
sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/

The policy will then allocate the number of pages according to the
set weights.  For example, if the weights are (2,1), then 2 pages
will be allocated on node0 for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).

There are 3 integration points:

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node.  When the weight reaches 0, switch
    to the next node. Applied by `mempolicy_slab_node()` and
    `policy_nodemask()`

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Applied by `policy_nodemask()` and `mpol_misplaced()`

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round").  Calculates the number of
    pages for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the
    current weight (pol->cur_weight) will be allocated first, before
    the remaining bulk calculation is done. This simplifies the
    calculation at the cost of an additional allocation call.

One piece of complexity is the interaction between a recent refactor
which split the logic to acquire the "ilx" (interleave index) of an
allocation and the actually application of the interleave.  The
calculation of the `interleave index` is done by `get_vma_policy()`,
while the actual selection of the node will be later appliex by the
relevant weighted_interleave function.

If CONFIG_SYSFS is disabled, the weight table will be initialized
to set all nodes to weight 1, but the weighting code is still called.
This is so that task-local weights (future patch) can still be
engaged cleanly without ifdef spaghetti.

Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |  11 +
 include/linux/mempolicy.h                     |   5 +
 include/uapi/linux/mempolicy.h                |   1 +
 mm/mempolicy.c                                | 197 +++++++++++++++++-
 4 files changed, 211 insertions(+), 3 deletions(-)

Comments

Gregory Price Dec. 26, 2023, 7:01 a.m. UTC | #1
On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> > +{
> > +	nodemask_t nodemask = pol->nodes;
> > +	unsigned int target, weight_total = 0;
> > +	int nid;
> > +	unsigned char weights[MAX_NUMNODES];
> 
> MAX_NUMNODSE could be as large as 1024.  1KB stack space may be too
> large?
> 

I've been struggling with a good solution to this.  We need a local copy
of weights to prevent weights from changing out from under us during
allocation (which may take quite some time), but it seemed unwise to
to allocate 1KB heap in this particular path.

Is my concern unfounded?  If so, I can go ahead and add the allocation
code.

> > +	unsigned char weight;
> > +
> > +	barrier();
> 
> Memory barrier needs comments.
> 

Barrier is to stabilize nodemask on the stack, but yes i'll carry the
comment from interleave_nid into this barrier as well.

> > +
> > +	/* first ensure we have a valid nodemask */
> > +	nid = first_node(nodemask);
> > +	if (nid == MAX_NUMNODES)
> > +		return nid;
> 
> It appears that this isn't necessary, because we can check whether
> weight_total == 0 after the next loop.
> 

fair, will snip.

> > +
> > +	/* Then collect weights on stack and calculate totals */
> > +	for_each_node_mask(nid, nodemask) {
> > +		weight = iw_table[nid];
> > +		weight_total += weight;
> > +		weights[nid] = weight;
> > +	}
> > +
> > +	/* Finally, calculate the node offset based on totals */
> > +	target = (unsigned int)ilx % weight_total;
> 
> Why use type casting?
> 

Artifact of old prototypes, snipped.

> > +
> > +	/* Stabilize the nodemask on the stack */
> > +	barrier();
> 
> I don't think barrier() is needed to wait for memory operations for
> stack.  It's usually used for cross-processor memory order.
>

This is present in the old interleave code.  To the best of my
understanding, the concern is for mempolicy->nodemask rebinding that can
occur when cgroups.cpusets.mems_allowed changes.

so we can't iterate over (mempolicy->nodemask), we have to take a local
copy.

My *best* understanding of the barrier here is to prevent the compiler
from reordering operations such that it attempts to optimize out the
local copy (or do lazy-fetch).

It is present in the original interleave code, so I pulled it forward to
this, but I have not tested whether this is a bit paranoid or not.

from `interleave_nid`:

 /*
  * The barrier will stabilize the nodemask in a register or on
  * the stack so that it will stop changing under the code.
  *
  * Between first_node() and next_node(), pol->nodes could be changed
  * by other threads. So we put pol->nodes in a local stack.
  */
 barrier();

> > +		/* Otherwise we adjust nr_pages down, and continue from there */
> > +		rem_pages -= pol->wil.cur_weight;
> > +		pol->wil.cur_weight = 0;
> > +		prev_node = node;
> 
> If pol->wil.cur_weight == 0, prev_node will be used without being
> initialized below.
> 

pol->wil.cur_weight is not used below.

> > +	}
> > +
> > +	/* Now we can continue allocating as if from 0 instead of an offset */
> > +	rounds = rem_pages / weight_total;
> > +	delta = rem_pages % weight_total;
> > +	for (i = 0; i < nnodes; i++) {
> > +		node = next_node_in(prev_node, nodes);
> > +		weight = weights[node];
> > +		node_pages = weight * rounds;
> > +		if (delta) {
> > +			if (delta > weight) {
> > +				node_pages += weight;
> > +				delta -= weight;
> > +			} else {
> > +				node_pages += delta;
> > +				delta = 0;
> > +			}
> > +		}
> > +		/* We may not make it all the way around */
> > +		if (!node_pages)
> > +			break;
> > +		/* If an over-allocation would occur, floor it */
> > +		if (node_pages + total_allocated > nr_pages) {
> 
> Why is this possible?
> 

this may have been a paranoid artifact from an early prototype, will
snip and validate.

~Gregory
Gregory Price Dec. 26, 2023, 8:06 a.m. UTC | #2
On Tue, Dec 26, 2023 at 02:01:57AM -0500, Gregory Price wrote:
> > 
> > If pol->wil.cur_weight == 0, prev_node will be used without being
> > initialized below.
> > 
> 
> pol->wil.cur_weight is not used below.
> 

disregard, i misread your comment. prev_node should be initialized, to 
NO_NUMA_NODE.  Will fix.

~Gregory
Gregory Price Dec. 26, 2023, 11:32 a.m. UTC | #3
On Tue, Dec 26, 2023 at 02:01:57AM -0500, Gregory Price wrote:
> On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
> > Gregory Price <gourry.memverge@gmail.com> writes:
> 
> Barrier is to stabilize nodemask on the stack, but yes i'll carry the
> comment from interleave_nid into this barrier as well.
> 
> > > +
> > > +	/* first ensure we have a valid nodemask */
> > > +	nid = first_node(nodemask);
> > > +	if (nid == MAX_NUMNODES)
> > > +		return nid;
> > 
> > It appears that this isn't necessary, because we can check whether
> > weight_total == 0 after the next loop.
> > 
> 
> fair, will snip.
> 

Follow up - this is only possible if the nodemask is invalid / has no
nodes, so it's better to check for this explicitly.  If nodemask is
valid, then it's not possible to have a weight_total of 0, because
weights cannot be 0.

~Gregory
Huang, Ying Dec. 27, 2023, 8:32 a.m. UTC | #4
Gregory Price <gourry.memverge@gmail.com> writes:

> When a system has multiple NUMA nodes and it becomes bandwidth hungry,
> the current MPOL_INTERLEAVE could be an wise option.
>
> However, if those NUMA nodes consist of different types of memory such
> as having local DRAM and CXL memory together, the current round-robin
> based interleaving policy doesn't maximize the overall bandwidth because
> of their different bandwidth characteristics.
>
> Instead, the interleaving can be more efficient when the allocation
> policy follows each NUMA nodes' bandwidth weight rather than having 1:1
> round-robin allocation.
>
> This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE, which
> enables weighted interleaving between NUMA nodes.  Weighted interleave
> allows for a proportional distribution of memory across multiple numa
> nodes, preferablly apportioned to match the bandwidth capacity of each
> node from the perspective of the accessing node.
>
> For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
> with a relative bandwidth of (100GB/s, 50GB/s) respectively, the
> appropriate weight distribution is (2:1).
>
> Weights will be acquired from the global weight matrix exposed by the
> sysfs extension: /sys/kernel/mm/mempolicy/weighted_interleave/
>
> The policy will then allocate the number of pages according to the
> set weights.  For example, if the weights are (2,1), then 2 pages
> will be allocated on node0 for every 1 page allocated on node1.
>
> The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
> and mbind(2).
>
> There are 3 integration points:
>
> weighted_interleave_nodes:
>     Counts the number of allocations as they occur, and applies the
>     weight for the current node.  When the weight reaches 0, switch
>     to the next node. Applied by `mempolicy_slab_node()` and
>     `policy_nodemask()`
>
> weighted_interleave_nid:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the node based on the given index.
>     Applied by `policy_nodemask()` and `mpol_misplaced()`
>
> bulk_array_weighted_interleave:
>     Gets the total weight of the nodemask as well as each individual
>     node weight, then calculates the number of "interleave rounds" as
>     well as any delta ("partial round").  Calculates the number of
>     pages for each node and allocates them.
>
>     If a node was scheduled for interleave via interleave_nodes, the
>     current weight (pol->cur_weight) will be allocated first, before
>     the remaining bulk calculation is done. This simplifies the
>     calculation at the cost of an additional allocation call.
>
> One piece of complexity is the interaction between a recent refactor
> which split the logic to acquire the "ilx" (interleave index) of an
> allocation and the actually application of the interleave.  The
> calculation of the `interleave index` is done by `get_vma_policy()`,
> while the actual selection of the node will be later appliex by the
> relevant weighted_interleave function.
>
> If CONFIG_SYSFS is disabled, the weight table will be initialized
> to set all nodes to weight 1, but the weighting code is still called.
> This is so that task-local weights (future patch) can still be
> engaged cleanly without ifdef spaghetti.
>
> Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> Co-developed-by: Rakie Kim <rakie.kim@sk.com>
> Signed-off-by: Rakie Kim <rakie.kim@sk.com>
> Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
> Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
> Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
> Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
> ---
>  .../admin-guide/mm/numa_memory_policy.rst     |  11 +
>  include/linux/mempolicy.h                     |   5 +
>  include/uapi/linux/mempolicy.h                |   1 +
>  mm/mempolicy.c                                | 197 +++++++++++++++++-
>  4 files changed, 211 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index eca38fa81e0f..d2c8e712785b 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -250,6 +250,17 @@ MPOL_PREFERRED_MANY
>  	can fall back to all existing numa nodes. This is effectively
>  	MPOL_PREFERRED allowed for a mask rather than a single node.
>  
> +MPOL_WEIGHTED_INTERLEAVE
> +	This mode operates the same as MPOL_INTERLEAVE, except that
> +	interleaving behavior is executed based on weights set in
> +	/sys/kernel/mm/mempolicy/weighted_interleave/
> +
> +	Weighted interleave allocations pages on nodes according to
> +	their weight.  For example if nodes [0,1] are weighted [5,2]
> +	respectively, 5 pages will be allocated on node0 for every
> +	2 pages allocated on node1.  This can better distribute data
> +	according to bandwidth on heterogeneous memory systems.
> +
>  NUMA memory policy supports the following optional mode flags:
>  
>  MPOL_F_STATIC_NODES
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 931b118336f4..ba09167e80f7 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -54,6 +54,11 @@ struct mempolicy {
>  		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
>  		nodemask_t user_nodemask;	/* nodemask passed by user */
>  	} w;
> +
> +	/* Weighted interleave settings */
> +	struct {
> +		unsigned char cur_weight;
> +	} wil;
>  };
>  
>  /*
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index a8963f7ef4c2..1f9bb10d1a47 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -23,6 +23,7 @@ enum {
>  	MPOL_INTERLEAVE,
>  	MPOL_LOCAL,
>  	MPOL_PREFERRED_MANY,
> +	MPOL_WEIGHTED_INTERLEAVE,
>  	MPOL_MAX,	/* always last member of enum */
>  };
>  
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0e77633b07a5..0a180c670f0c 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -305,6 +305,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
>  	policy->mode = mode;
>  	policy->flags = flags;
>  	policy->home_node = NUMA_NO_NODE;
> +	policy->wil.cur_weight = 0;
>  
>  	return policy;
>  }
> @@ -417,6 +418,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
>  		.create = mpol_new_nodemask,
>  		.rebind = mpol_rebind_preferred,
>  	},
> +	[MPOL_WEIGHTED_INTERLEAVE] = {
> +		.create = mpol_new_nodemask,
> +		.rebind = mpol_rebind_nodemask,
> +	},
>  };
>  
>  static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
> @@ -838,7 +843,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>  
>  	old = current->mempolicy;
>  	current->mempolicy = new;
> -	if (new && new->mode == MPOL_INTERLEAVE)
> +	if (new && (new->mode == MPOL_INTERLEAVE ||
> +		    new->mode == MPOL_WEIGHTED_INTERLEAVE))
>  		current->il_prev = MAX_NUMNODES-1;
>  	task_unlock(current);
>  	mpol_put(old);
> @@ -864,6 +870,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
>  	case MPOL_INTERLEAVE:
>  	case MPOL_PREFERRED:
>  	case MPOL_PREFERRED_MANY:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		*nodes = pol->nodes;
>  		break;
>  	case MPOL_LOCAL:
> @@ -948,6 +955,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  		} else if (pol == current->mempolicy &&
>  				pol->mode == MPOL_INTERLEAVE) {
>  			*policy = next_node_in(current->il_prev, pol->nodes);
> +		} else if (pol == current->mempolicy &&
> +				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
> +			if (pol->wil.cur_weight)
> +				*policy = current->il_prev;
> +			else
> +				*policy = next_node_in(current->il_prev,
> +						       pol->nodes);
>  		} else {
>  			err = -EINVAL;
>  			goto out;
> @@ -1777,7 +1791,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
>  	pol = __get_vma_policy(vma, addr, ilx);
>  	if (!pol)
>  		pol = get_task_policy(current);
> -	if (pol->mode == MPOL_INTERLEAVE) {
> +	if (pol->mode == MPOL_INTERLEAVE ||
> +	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
>  		*ilx += vma->vm_pgoff >> order;
>  		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
>  	}
> @@ -1827,6 +1842,24 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
>  	return zone >= dynamic_policy_zone;
>  }
>  
> +static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
> +{
> +	unsigned int next;
> +	struct task_struct *me = current;
> +
> +	next = next_node_in(me->il_prev, policy->nodes);
> +	if (next == MAX_NUMNODES)
> +		return next;
> +
> +	if (!policy->wil.cur_weight)
> +		policy->wil.cur_weight = iw_table[next];
> +
> +	policy->wil.cur_weight--;
> +	if (!policy->wil.cur_weight)
> +		me->il_prev = next;
> +	return next;
> +}
> +
>  /* Do dynamic interleaving for a process */
>  static unsigned int interleave_nodes(struct mempolicy *policy)
>  {
> @@ -1861,6 +1894,9 @@ unsigned int mempolicy_slab_node(void)
>  	case MPOL_INTERLEAVE:
>  		return interleave_nodes(policy);
>  
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		return weighted_interleave_nodes(policy);
> +
>  	case MPOL_BIND:
>  	case MPOL_PREFERRED_MANY:
>  	{
> @@ -1885,6 +1921,41 @@ unsigned int mempolicy_slab_node(void)
>  	}
>  }
>  
> +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> +{
> +	nodemask_t nodemask = pol->nodes;
> +	unsigned int target, weight_total = 0;
> +	int nid;
> +	unsigned char weights[MAX_NUMNODES];

MAX_NUMNODSE could be as large as 1024.  1KB stack space may be too
large?

> +	unsigned char weight;
> +
> +	barrier();

Memory barrier needs comments.

> +
> +	/* first ensure we have a valid nodemask */
> +	nid = first_node(nodemask);
> +	if (nid == MAX_NUMNODES)
> +		return nid;

It appears that this isn't necessary, because we can check whether
weight_total == 0 after the next loop.

> +
> +	/* Then collect weights on stack and calculate totals */
> +	for_each_node_mask(nid, nodemask) {
> +		weight = iw_table[nid];
> +		weight_total += weight;
> +		weights[nid] = weight;
> +	}
> +
> +	/* Finally, calculate the node offset based on totals */
> +	target = (unsigned int)ilx % weight_total;

Why use type casting?

> +	nid = first_node(nodemask);
> +	while (target) {
> +		weight = weights[nid];
> +		if (target < weight)
> +			break;
> +		target -= weight;
> +		nid = next_node_in(nid, nodemask);
> +	}
> +	return nid;
> +}
> +
>  /*
>   * Do static interleaving for interleave index @ilx.  Returns the ilx'th
>   * node in pol->nodes (starting from ilx=0), wrapping around if ilx
> @@ -1953,6 +2024,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
>  		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
>  			interleave_nodes(pol) : interleave_nid(pol, ilx);
>  		break;
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
> +			weighted_interleave_nodes(pol) :
> +			weighted_interleave_nid(pol, ilx);
> +		break;
>  	}
>  
>  	return nodemask;
> @@ -2014,6 +2090,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>  	case MPOL_PREFERRED_MANY:
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		*mask = mempolicy->nodes;
>  		break;
>  
> @@ -2113,7 +2190,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
>  		 * If the policy is interleave or does not allow the current
>  		 * node in its nodemask, we allocate the standard way.
>  		 */
> -		if (pol->mode != MPOL_INTERLEAVE &&
> +		if ((pol->mode != MPOL_INTERLEAVE &&
> +		    pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
>  		    (!nodemask || node_isset(nid, *nodemask))) {
>  			/*
>  			 * First, try to allocate THP only on local node, but
> @@ -2249,6 +2327,106 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
>  	return total_allocated;
>  }
>  
> +static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> +		struct mempolicy *pol, unsigned long nr_pages,
> +		struct page **page_array)
> +{
> +	struct task_struct *me = current;
> +	unsigned long total_allocated = 0;
> +	unsigned long nr_allocated;
> +	unsigned long rounds;
> +	unsigned long node_pages, delta;
> +	unsigned char weight;
> +	unsigned char weights[MAX_NUMNODES];
> +	unsigned int weight_total = 0;
> +	unsigned long rem_pages = nr_pages;
> +	nodemask_t nodes = pol->nodes;
> +	int nnodes, node, prev_node;
> +	int i;
> +
> +	/* Stabilize the nodemask on the stack */
> +	barrier();

I don't think barrier() is needed to wait for memory operations for
stack.  It's usually used for cross-processor memory order.

> +
> +	nnodes = nodes_weight(nodes);
> +
> +	/* Collect weights and save them on stack so they don't change */
> +	for_each_node_mask(node, nodes) {
> +		weight = iw_table[node];
> +		weight_total += weight;
> +		weights[node] = weight;
> +	}
> +
> +	/* Continue allocating from most recent node and adjust the nr_pages */
> +	if (pol->wil.cur_weight) {
> +		node = next_node_in(me->il_prev, nodes);
> +		node_pages = pol->wil.cur_weight;
> +		if (node_pages > rem_pages)
> +			node_pages = rem_pages;
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		/* if that's all the pages, no need to interleave */
> +		if (rem_pages <= pol->wil.cur_weight) {
> +			pol->wil.cur_weight -= rem_pages;
> +			return total_allocated;
> +		}
> +		/* Otherwise we adjust nr_pages down, and continue from there */
> +		rem_pages -= pol->wil.cur_weight;
> +		pol->wil.cur_weight = 0;
> +		prev_node = node;

If pol->wil.cur_weight == 0, prev_node will be used without being
initialized below.

> +	}
> +
> +	/* Now we can continue allocating as if from 0 instead of an offset */
> +	rounds = rem_pages / weight_total;
> +	delta = rem_pages % weight_total;
> +	for (i = 0; i < nnodes; i++) {
> +		node = next_node_in(prev_node, nodes);
> +		weight = weights[node];
> +		node_pages = weight * rounds;
> +		if (delta) {
> +			if (delta > weight) {
> +				node_pages += weight;
> +				delta -= weight;
> +			} else {
> +				node_pages += delta;
> +				delta = 0;
> +			}
> +		}
> +		/* We may not make it all the way around */
> +		if (!node_pages)
> +			break;
> +		/* If an over-allocation would occur, floor it */
> +		if (node_pages + total_allocated > nr_pages) {

Why is this possible?

> +			node_pages = nr_pages - total_allocated;
> +			delta = 0;
> +		}
> +		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> +						  NULL, page_array);
> +		page_array += nr_allocated;
> +		total_allocated += nr_allocated;
> +		prev_node = node;
> +	}
> +
> +	/*
> +	 * Finally, we need to update me->il_prev and pol->wil.cur_weight
> +	 * if there were overflow pages, but not equivalent to the node
> +	 * weight, set the cur_weight to node_weight - delta and the
> +	 * me->il_prev to the previous node. Otherwise if it was perfect
> +	 * we can simply set il_prev to node and cur_weight to 0
> +	 */
> +	if (node_pages) {
> +		me->il_prev = prev_node;
> +		node_pages %= weight;
> +		pol->wil.cur_weight = weight - node_pages;
> +	} else {
> +		me->il_prev = node;
> +		pol->wil.cur_weight = 0;
> +	}
> +
> +	return total_allocated;
> +}
> +
>  static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
>  		struct mempolicy *pol, unsigned long nr_pages,
>  		struct page **page_array)
> @@ -2289,6 +2467,11 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
>  		return alloc_pages_bulk_array_interleave(gfp, pol,
>  							 nr_pages, page_array);
>  
> +	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
> +		return alloc_pages_bulk_array_weighted_interleave(gfp, pol,
> +								  nr_pages,
> +								  page_array);
> +
>  	if (pol->mode == MPOL_PREFERRED_MANY)
>  		return alloc_pages_bulk_array_preferred_many(gfp,
>  				numa_node_id(), pol, nr_pages, page_array);
> @@ -2364,6 +2547,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
>  	case MPOL_INTERLEAVE:
>  	case MPOL_PREFERRED:
>  	case MPOL_PREFERRED_MANY:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		return !!nodes_equal(a->nodes, b->nodes);
>  	case MPOL_LOCAL:
>  		return true;
> @@ -2500,6 +2684,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
>  		polnid = interleave_nid(pol, ilx);
>  		break;
>  
> +	case MPOL_WEIGHTED_INTERLEAVE:
> +		polnid = weighted_interleave_nid(pol, ilx);
> +		break;
> +
>  	case MPOL_PREFERRED:
>  		if (node_isset(curnid, pol->nodes))
>  			goto out;
> @@ -2874,6 +3062,7 @@ static const char * const policy_modes[] =
>  	[MPOL_PREFERRED]  = "prefer",
>  	[MPOL_BIND]       = "bind",
>  	[MPOL_INTERLEAVE] = "interleave",
> +	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
>  	[MPOL_LOCAL]      = "local",
>  	[MPOL_PREFERRED_MANY]  = "prefer (many)",
>  };
> @@ -2933,6 +3122,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
>  		}
>  		break;
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		/*
>  		 * Default to online nodes with memory if no nodelist
>  		 */
> @@ -3043,6 +3233,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
>  	case MPOL_PREFERRED_MANY:
>  	case MPOL_BIND:
>  	case MPOL_INTERLEAVE:
> +	case MPOL_WEIGHTED_INTERLEAVE:
>  		nodes = pol->nodes;
>  		break;
>  	default:

--
Best Regards,
Huang, Ying
Huang, Ying Jan. 2, 2024, 8:42 a.m. UTC | #5
Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> > +{
>> > +	nodemask_t nodemask = pol->nodes;
>> > +	unsigned int target, weight_total = 0;
>> > +	int nid;
>> > +	unsigned char weights[MAX_NUMNODES];
>> 
>> MAX_NUMNODSE could be as large as 1024.  1KB stack space may be too
>> large?
>> 
>
> I've been struggling with a good solution to this.  We need a local copy
> of weights to prevent weights from changing out from under us during
> allocation (which may take quite some time), but it seemed unwise to
> to allocate 1KB heap in this particular path.
>
> Is my concern unfounded?  If so, I can go ahead and add the allocation
> code.

Please take a look at NODEMASK_ALLOC().

>> > +	unsigned char weight;
>> > +
>> > +	barrier();
>> 
>> Memory barrier needs comments.
>> 
>
> Barrier is to stabilize nodemask on the stack, but yes i'll carry the
> comment from interleave_nid into this barrier as well.

Please see below.

>> > +
>> > +	/* first ensure we have a valid nodemask */
>> > +	nid = first_node(nodemask);
>> > +	if (nid == MAX_NUMNODES)
>> > +		return nid;
>> 
>> It appears that this isn't necessary, because we can check whether
>> weight_total == 0 after the next loop.
>> 
>
> fair, will snip.
>
>> > +
>> > +	/* Then collect weights on stack and calculate totals */
>> > +	for_each_node_mask(nid, nodemask) {
>> > +		weight = iw_table[nid];
>> > +		weight_total += weight;
>> > +		weights[nid] = weight;
>> > +	}
>> > +
>> > +	/* Finally, calculate the node offset based on totals */
>> > +	target = (unsigned int)ilx % weight_total;
>> 
>> Why use type casting?
>> 
>
> Artifact of old prototypes, snipped.
>
>> > +
>> > +	/* Stabilize the nodemask on the stack */
>> > +	barrier();
>> 
>> I don't think barrier() is needed to wait for memory operations for
>> stack.  It's usually used for cross-processor memory order.
>>
>
> This is present in the old interleave code.  To the best of my
> understanding, the concern is for mempolicy->nodemask rebinding that can
> occur when cgroups.cpusets.mems_allowed changes.
>
> so we can't iterate over (mempolicy->nodemask), we have to take a local
> copy.
>
> My *best* understanding of the barrier here is to prevent the compiler
> from reordering operations such that it attempts to optimize out the
> local copy (or do lazy-fetch).
>
> It is present in the original interleave code, so I pulled it forward to
> this, but I have not tested whether this is a bit paranoid or not.
>
> from `interleave_nid`:
>
>  /*
>   * The barrier will stabilize the nodemask in a register or on
>   * the stack so that it will stop changing under the code.
>   *
>   * Between first_node() and next_node(), pol->nodes could be changed
>   * by other threads. So we put pol->nodes in a local stack.
>   */
>  barrier();

Got it.  This is kind of READ_ONCE() for nodemask.  To avoid to add
comments all over the place.  Can we implement a wrapper for it?  For
example, memcpy_once().  __read_once_size() in
tools/include/linux/compiler.h can be used as reference.

Because node_weights[] may be changed simultaneously too.  We may need
to consider similar issue for it too.  But RCU seems more appropriate
for node_weights[].

>> > +		/* Otherwise we adjust nr_pages down, and continue from there */
>> > +		rem_pages -= pol->wil.cur_weight;
>> > +		pol->wil.cur_weight = 0;
>> > +		prev_node = node;
>> 
>> If pol->wil.cur_weight == 0, prev_node will be used without being
>> initialized below.
>> 
>
> pol->wil.cur_weight is not used below.
>
>> > +	}
>> > +
>> > +	/* Now we can continue allocating as if from 0 instead of an offset */
>> > +	rounds = rem_pages / weight_total;
>> > +	delta = rem_pages % weight_total;
>> > +	for (i = 0; i < nnodes; i++) {
>> > +		node = next_node_in(prev_node, nodes);
>> > +		weight = weights[node];
>> > +		node_pages = weight * rounds;
>> > +		if (delta) {
>> > +			if (delta > weight) {
>> > +				node_pages += weight;
>> > +				delta -= weight;
>> > +			} else {
>> > +				node_pages += delta;
>> > +				delta = 0;
>> > +			}
>> > +		}
>> > +		/* We may not make it all the way around */
>> > +		if (!node_pages)
>> > +			break;
>> > +		/* If an over-allocation would occur, floor it */
>> > +		if (node_pages + total_allocated > nr_pages) {
>> 
>> Why is this possible?
>> 
>
> this may have been a paranoid artifact from an early prototype, will
> snip and validate.

--
Best Regards,
Huang, Ying
Gregory Price Jan. 2, 2024, 8:30 p.m. UTC | #6
On Tue, Jan 02, 2024 at 04:42:42PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry.memverge@gmail.com> writes:
> >> 
> >> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> >> > +{
> >> > +	nodemask_t nodemask = pol->nodes;
> >> > +	unsigned int target, weight_total = 0;
> >> > +	int nid;
> >> > +	unsigned char weights[MAX_NUMNODES];
> >> 
> >> MAX_NUMNODSE could be as large as 1024.  1KB stack space may be too
> >> large?
> >> 
> >
> > I've been struggling with a good solution to this.  We need a local copy
> > of weights to prevent weights from changing out from under us during
> > allocation (which may take quite some time), but it seemed unwise to
> > to allocate 1KB heap in this particular path.
> >
> > Is my concern unfounded?  If so, I can go ahead and add the allocation
> > code.
> 
> Please take a look at NODEMASK_ALLOC().
>

This is not my question. NODEMASK_ALLOC calls kmalloc/kfree. 

Some of the allocations on the stack can be replaced with a scratch
allocation, that's no big deal.

I'm specifically concerned about:
	weighted_interleave_nid
	alloc_pages_bulk_array_weighted_interleave

I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
contexts. If kmalloc/kfree is safe fine, this problem is trivial.

If not, there is no good solution to this without pre-allocating a
scratch area per-task.

> >> I don't think barrier() is needed to wait for memory operations for
> >> stack.  It's usually used for cross-processor memory order.
> >>
> >
> > This is present in the old interleave code.  To the best of my
> > understanding, the concern is for mempolicy->nodemask rebinding that can
> > occur when cgroups.cpusets.mems_allowed changes.
> >
> > so we can't iterate over (mempolicy->nodemask), we have to take a local
> > copy.
> >
> > My *best* understanding of the barrier here is to prevent the compiler
> > from reordering operations such that it attempts to optimize out the
> > local copy (or do lazy-fetch).
> >
> > It is present in the original interleave code, so I pulled it forward to
> > this, but I have not tested whether this is a bit paranoid or not.
> >
> > from `interleave_nid`:
> >
> >  /*
> >   * The barrier will stabilize the nodemask in a register or on
> >   * the stack so that it will stop changing under the code.
> >   *
> >   * Between first_node() and next_node(), pol->nodes could be changed
> >   * by other threads. So we put pol->nodes in a local stack.
> >   */
> >  barrier();
> 
> Got it.  This is kind of READ_ONCE() for nodemask.  To avoid to add
> comments all over the place.  Can we implement a wrapper for it?  For
> example, memcpy_once().  __read_once_size() in
> tools/include/linux/compiler.h can be used as reference.
> 
> Because node_weights[] may be changed simultaneously too.  We may need
> to consider similar issue for it too.  But RCU seems more appropriate
> for node_weights[].
> 

Weights are collected individually onto the stack because we have to sum
them up before we actually apply the weights.

A stale weight is not offensive.  RCU is not needed and doesn't help.

The reason the barrier is needed is not weights, it's the nodemask.

So you basically just want to replace barrier() with this and drop the
copy/pasted comments:

static void read_once_policy_nodemask(struct mempolicy *pol, nodemask_t *mask)
{
        /*
         * The barrier will stabilize the nodemask in a register or on
         * the stack so that it will stop changing under the code.
         *
         * Between first_node() and next_node(), pol->nodes could be changed
         * by other threads. So we put pol->nodes in a local stack.
         */
        barrier();
        __builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
        barrier();
}

- nodemask_t nodemask = pol->nodemask
- barrier()
+ nodemask_t nodemask;
+ read_once_policy_nodemask(pol, &nodemask)

Is that right?

~Gregory
Huang, Ying Jan. 3, 2024, 5:46 a.m. UTC | #7
Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Jan 02, 2024 at 04:42:42PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> > On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
>> >> Gregory Price <gourry.memverge@gmail.com> writes:
>> >> 
>> >> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> >> > +{
>> >> > +	nodemask_t nodemask = pol->nodes;
>> >> > +	unsigned int target, weight_total = 0;
>> >> > +	int nid;
>> >> > +	unsigned char weights[MAX_NUMNODES];
>> >> 
>> >> MAX_NUMNODSE could be as large as 1024.  1KB stack space may be too
>> >> large?
>> >> 
>> >
>> > I've been struggling with a good solution to this.  We need a local copy
>> > of weights to prevent weights from changing out from under us during
>> > allocation (which may take quite some time), but it seemed unwise to
>> > to allocate 1KB heap in this particular path.
>> >
>> > Is my concern unfounded?  If so, I can go ahead and add the allocation
>> > code.
>> 
>> Please take a look at NODEMASK_ALLOC().
>>
>
> This is not my question. NODEMASK_ALLOC calls kmalloc/kfree. 
>
> Some of the allocations on the stack can be replaced with a scratch
> allocation, that's no big deal.
>
> I'm specifically concerned about:
> 	weighted_interleave_nid
> 	alloc_pages_bulk_array_weighted_interleave
>
> I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
> contexts. If kmalloc/kfree is safe fine, this problem is trivial.
>
> If not, there is no good solution to this without pre-allocating a
> scratch area per-task.

You need to audit whether it's safe for all callers.  I guess that you
need to allocate pages after calling, so you can use the same GFP flags
here.

>> >> I don't think barrier() is needed to wait for memory operations for
>> >> stack.  It's usually used for cross-processor memory order.
>> >>
>> >
>> > This is present in the old interleave code.  To the best of my
>> > understanding, the concern is for mempolicy->nodemask rebinding that can
>> > occur when cgroups.cpusets.mems_allowed changes.
>> >
>> > so we can't iterate over (mempolicy->nodemask), we have to take a local
>> > copy.
>> >
>> > My *best* understanding of the barrier here is to prevent the compiler
>> > from reordering operations such that it attempts to optimize out the
>> > local copy (or do lazy-fetch).
>> >
>> > It is present in the original interleave code, so I pulled it forward to
>> > this, but I have not tested whether this is a bit paranoid or not.
>> >
>> > from `interleave_nid`:
>> >
>> >  /*
>> >   * The barrier will stabilize the nodemask in a register or on
>> >   * the stack so that it will stop changing under the code.
>> >   *
>> >   * Between first_node() and next_node(), pol->nodes could be changed
>> >   * by other threads. So we put pol->nodes in a local stack.
>> >   */
>> >  barrier();
>> 
>> Got it.  This is kind of READ_ONCE() for nodemask.  To avoid to add
>> comments all over the place.  Can we implement a wrapper for it?  For
>> example, memcpy_once().  __read_once_size() in
>> tools/include/linux/compiler.h can be used as reference.
>> 
>> Because node_weights[] may be changed simultaneously too.  We may need
>> to consider similar issue for it too.  But RCU seems more appropriate
>> for node_weights[].
>> 
>
> Weights are collected individually onto the stack because we have to sum
> them up before we actually apply the weights.
>
> A stale weight is not offensive.  RCU is not needed and doesn't help.

When you copy weights from iw_table[] to stack, it's possible for
compiler to cache its contents in register, or merge, split the memory
operations.  At the same time, iw_table[] may be changed simultaneously
via sysfs interface.  So, we need a mechanism to guarantee that we read
the latest contents consistently.

> The reason the barrier is needed is not weights, it's the nodemask.

Yes.  So I said that we need similar stuff for weights.

> So you basically just want to replace barrier() with this and drop the
> copy/pasted comments:
>
> static void read_once_policy_nodemask(struct mempolicy *pol, nodemask_t *mask)
> {
>         /*
>          * The barrier will stabilize the nodemask in a register or on
>          * the stack so that it will stop changing under the code.
>          *
>          * Between first_node() and next_node(), pol->nodes could be changed
>          * by other threads. So we put pol->nodes in a local stack.
>          */
>         barrier();
>         __builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
>         barrier();
> }
>
> - nodemask_t nodemask = pol->nodemask
> - barrier()
> + nodemask_t nodemask;
> + read_once_policy_nodemask(pol, &nodemask)
>
> Is that right?

Yes.  Something like that.  Or even more general (need to be optimized?),

static inline static void memcpy_once(void *dst, void *src, size_t n)
{
        barrier();
        memcpy(dst, src, n);
        barrier();
}

        memcpy_once(&nodemask, &pol->nodemask, sizeof(nodemask));

The comments can be based on that of READ_ONCE().

--
Best Regards,
Huang, Ying
Gregory Price Jan. 3, 2024, 10:09 p.m. UTC | #8
On Wed, Jan 03, 2024 at 01:46:56PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> > I'm specifically concerned about:
> > 	weighted_interleave_nid
> > 	alloc_pages_bulk_array_weighted_interleave
> >
> > I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
> > contexts. If kmalloc/kfree is safe fine, this problem is trivial.
> >
> > If not, there is no good solution to this without pre-allocating a
> > scratch area per-task.
> 
> You need to audit whether it's safe for all callers.  I guess that you
> need to allocate pages after calling, so you can use the same GFP flags
> here.
> 

After picking away i realized that this code is usually going to get
called during page fault handling - duh.  So kmalloc is almost never
safe (or can fail), and we it's nasty to try to handle those errors.

Instead of doing that, I simply chose to implement the scratch space
in the mempolicy structure

mempolicy->wil.scratch_weights[MAX_NUMNODES].

We eat an extra 1kb of memory in the mempolicy, but it gives us a safe
scratch space we can use any time the task is allocating memory, and
prevents the need for any fancy error handling.  That seems like a
perfectly reasonable tradeoff.

> >
> > Weights are collected individually onto the stack because we have to sum
> > them up before we actually apply the weights.
> >
> > A stale weight is not offensive.  RCU is not needed and doesn't help.
> 
> When you copy weights from iw_table[] to stack, it's possible for
> compiler to cache its contents in register, or merge, split the memory
> operations.  At the same time, iw_table[] may be changed simultaneously
> via sysfs interface.  So, we need a mechanism to guarantee that we read
> the latest contents consistently.
> 

Fair enough, I went ahead and added a similar interaction.

~Gregoryg
Huang, Ying Jan. 4, 2024, 5:39 a.m. UTC | #9
Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Jan 03, 2024 at 01:46:56PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> > I'm specifically concerned about:
>> > 	weighted_interleave_nid
>> > 	alloc_pages_bulk_array_weighted_interleave
>> >
>> > I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
>> > contexts. If kmalloc/kfree is safe fine, this problem is trivial.
>> >
>> > If not, there is no good solution to this without pre-allocating a
>> > scratch area per-task.
>> 
>> You need to audit whether it's safe for all callers.  I guess that you
>> need to allocate pages after calling, so you can use the same GFP flags
>> here.
>> 
>
> After picking away i realized that this code is usually going to get
> called during page fault handling - duh.  So kmalloc is almost never
> safe (or can fail), and we it's nasty to try to handle those errors.

Why not just OOM for allocation failure?

> Instead of doing that, I simply chose to implement the scratch space
> in the mempolicy structure
>
> mempolicy->wil.scratch_weights[MAX_NUMNODES].
>
> We eat an extra 1kb of memory in the mempolicy, but it gives us a safe
> scratch space we can use any time the task is allocating memory, and
> prevents the need for any fancy error handling.  That seems like a
> perfectly reasonable tradeoff.

I don't think that this is a good idea.  The weight array is temporary.

--
Best Regards,
Huang, Ying
Gregory Price Jan. 4, 2024, 6:59 p.m. UTC | #10
On Thu, Jan 04, 2024 at 01:39:31PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Wed, Jan 03, 2024 at 01:46:56PM +0800, Huang, Ying wrote:
> >> Gregory Price <gregory.price@memverge.com> writes:
> >> > I'm specifically concerned about:
> >> > 	weighted_interleave_nid
> >> > 	alloc_pages_bulk_array_weighted_interleave
> >> >
> >> > I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
> >> > contexts. If kmalloc/kfree is safe fine, this problem is trivial.
> >> >
> >> > If not, there is no good solution to this without pre-allocating a
> >> > scratch area per-task.
> >> 
> >> You need to audit whether it's safe for all callers.  I guess that you
> >> need to allocate pages after calling, so you can use the same GFP flags
> >> here.
> >> 
> >
> > After picking away i realized that this code is usually going to get
> > called during page fault handling - duh.  So kmalloc is almost never
> > safe (or can fail), and we it's nasty to try to handle those errors.
> 
> Why not just OOM for allocation failure?
>

2 notes:

1) callers of weighted_interleave_nid do not expect OOM conditions, they
   expect a node selection.  On error, we would simply return the local
   numa node without indication of failure.

2) callers of alloc_pages_bulk_array_weighted_interleave receive the
   total number of pages allocated, and they are expected to detect
   pages allocated != pages requested, and then handle whether to
   OOM or simply retry (allocation may fail for a variety of reasons).

By introducing an allocation into this area, if an allocation failure
occurs, we would essentially need to silently squash it and return
either local_node (interleave_nid) or return 0 (bulk allocator) and
allow the allocation logic to handle any subsequent OOM condition.

That felt less desirable than just allocating a scratch space up front
in the mempolicy and avoiding the issue altogether.

> > Instead of doing that, I simply chose to implement the scratch space
> > in the mempolicy structure
> >
> > mempolicy->wil.scratch_weights[MAX_NUMNODES].
> >
> > We eat an extra 1kb of memory in the mempolicy, but it gives us a safe
> > scratch space we can use any time the task is allocating memory, and
> > prevents the need for any fancy error handling.  That seems like a
> > perfectly reasonable tradeoff.
> 
> I don't think that this is a good idea.  The weight array is temporary.
> 

It's temporary, but it's also only used in the context of the task while
the alloc lock is held.

If you think it's fine to introduce another potential OOM generating
spot, then I'll just go ahead and allocate the memory on the fly.

I do want to point out, though, that weighted_interleave_nid is called
per allocated page.  So now we're not just collecting weights to
calculate the offset, we're doing an allocation (that can fail) per page
allocated for that region.

The bulk allocator amortizes the cost of this allocation by doing it
once while allocating a chunk of pages - but the weighted_interleave_nid
function is called per-page.

By comparison, the memory cost to just allocate a static scratch area in
the mempolicy struct is only incurred by tasks with a mempolicy.


So we're talking ~1MB for 1024 threads with mempolicies to avoid error
conditions mid-page-allocation and to reduce the cost associated with
applying weighted interleave.

~Gregory
Huang, Ying Jan. 5, 2024, 6:51 a.m. UTC | #11
Gregory Price <gregory.price@memverge.com> writes:

> On Thu, Jan 04, 2024 at 01:39:31PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> > On Wed, Jan 03, 2024 at 01:46:56PM +0800, Huang, Ying wrote:
>> >> Gregory Price <gregory.price@memverge.com> writes:
>> >> > I'm specifically concerned about:
>> >> > 	weighted_interleave_nid
>> >> > 	alloc_pages_bulk_array_weighted_interleave
>> >> >
>> >> > I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
>> >> > contexts. If kmalloc/kfree is safe fine, this problem is trivial.
>> >> >
>> >> > If not, there is no good solution to this without pre-allocating a
>> >> > scratch area per-task.
>> >> 
>> >> You need to audit whether it's safe for all callers.  I guess that you
>> >> need to allocate pages after calling, so you can use the same GFP flags
>> >> here.
>> >> 
>> >
>> > After picking away i realized that this code is usually going to get
>> > called during page fault handling - duh.  So kmalloc is almost never
>> > safe (or can fail), and we it's nasty to try to handle those errors.
>> 
>> Why not just OOM for allocation failure?
>>
>
> 2 notes:
>
> 1) callers of weighted_interleave_nid do not expect OOM conditions, they
>    expect a node selection.  On error, we would simply return the local
>    numa node without indication of failure.
>
> 2) callers of alloc_pages_bulk_array_weighted_interleave receive the
>    total number of pages allocated, and they are expected to detect
>    pages allocated != pages requested, and then handle whether to
>    OOM or simply retry (allocation may fail for a variety of reasons).
>
> By introducing an allocation into this area, if an allocation failure
> occurs, we would essentially need to silently squash it and return
> either local_node (interleave_nid) or return 0 (bulk allocator) and
> allow the allocation logic to handle any subsequent OOM condition.
>
> That felt less desirable than just allocating a scratch space up front
> in the mempolicy and avoiding the issue altogether.
>
>> > Instead of doing that, I simply chose to implement the scratch space
>> > in the mempolicy structure
>> >
>> > mempolicy->wil.scratch_weights[MAX_NUMNODES].
>> >
>> > We eat an extra 1kb of memory in the mempolicy, but it gives us a safe
>> > scratch space we can use any time the task is allocating memory, and
>> > prevents the need for any fancy error handling.  That seems like a
>> > perfectly reasonable tradeoff.
>> 
>> I don't think that this is a good idea.  The weight array is temporary.
>> 
>
> It's temporary, but it's also only used in the context of the task while
> the alloc lock is held.
>
> If you think it's fine to introduce another potential OOM generating
> spot, then I'll just go ahead and allocate the memory on the fly.
>
> I do want to point out, though, that weighted_interleave_nid is called
> per allocated page.  So now we're not just collecting weights to
> calculate the offset, we're doing an allocation (that can fail) per page
> allocated for that region.
>
> The bulk allocator amortizes the cost of this allocation by doing it
> once while allocating a chunk of pages - but the weighted_interleave_nid
> function is called per-page.
>
> By comparison, the memory cost to just allocate a static scratch area in
> the mempolicy struct is only incurred by tasks with a mempolicy.
>
>
> So we're talking ~1MB for 1024 threads with mempolicies to avoid error
> conditions mid-page-allocation and to reduce the cost associated with
> applying weighted interleave.

Think about this again.  Why do we need weights array on stack?  I think
this is used to keep weights consistent.  If so, we don't need weights
array on stack.  Just use RCU to access global weights array.

--
Best Regards,
Huang, Ying
Gregory Price Jan. 5, 2024, 7:25 a.m. UTC | #12
On Fri, Jan 05, 2024 at 02:51:40PM +0800, Huang, Ying wrote:
> >
> > So we're talking ~1MB for 1024 threads with mempolicies to avoid error
> > conditions mid-page-allocation and to reduce the cost associated with
> > applying weighted interleave.
> 
> Think about this again.  Why do we need weights array on stack?  I think
> this is used to keep weights consistent.  If so, we don't need weights
> array on stack.  Just use RCU to access global weights array.
> 

From the bulk allocation code:

__alloc_pages_bulk(gfp, node, NULL, node_pages, NULL, page_array);

This function can block. You cannot block during an RCU read context.

~Gregory
Huang, Ying Jan. 8, 2024, 7:08 a.m. UTC | #13
Gregory Price <gregory.price@memverge.com> writes:

> On Fri, Jan 05, 2024 at 02:51:40PM +0800, Huang, Ying wrote:
>> >
>> > So we're talking ~1MB for 1024 threads with mempolicies to avoid error
>> > conditions mid-page-allocation and to reduce the cost associated with
>> > applying weighted interleave.
>> 
>> Think about this again.  Why do we need weights array on stack?  I think
>> this is used to keep weights consistent.  If so, we don't need weights
>> array on stack.  Just use RCU to access global weights array.
>> 
>
> From the bulk allocation code:
>
> __alloc_pages_bulk(gfp, node, NULL, node_pages, NULL, page_array);
>
> This function can block. You cannot block during an RCU read context.

Yes.  You are right.  For __alloc_pages_bulk(), it should be OK to
allocate the weights array.  For weighted_interleave_nid(), we can use
RCU to avoid memory allocation in relative fast code path.

BTW, we can use nr_node_ids instead of MAX_NUMNODES if applicable.

--
Best Regards,
Huang, Ying
diff mbox series

Patch

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..d2c8e712785b 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,17 @@  MPOL_PREFERRED_MANY
 	can fall back to all existing numa nodes. This is effectively
 	MPOL_PREFERRED allowed for a mask rather than a single node.
 
+MPOL_WEIGHTED_INTERLEAVE
+	This mode operates the same as MPOL_INTERLEAVE, except that
+	interleaving behavior is executed based on weights set in
+	/sys/kernel/mm/mempolicy/weighted_interleave/
+
+	Weighted interleave allocations pages on nodes according to
+	their weight.  For example if nodes [0,1] are weighted [5,2]
+	respectively, 5 pages will be allocated on node0 for every
+	2 pages allocated on node1.  This can better distribute data
+	according to bandwidth on heterogeneous memory systems.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..ba09167e80f7 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -54,6 +54,11 @@  struct mempolicy {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
 	} w;
+
+	/* Weighted interleave settings */
+	struct {
+		unsigned char cur_weight;
+	} wil;
 };
 
 /*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..1f9bb10d1a47 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,6 +23,7 @@  enum {
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
 	MPOL_PREFERRED_MANY,
+	MPOL_WEIGHTED_INTERLEAVE,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0e77633b07a5..0a180c670f0c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -305,6 +305,7 @@  static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	policy->mode = mode;
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
+	policy->wil.cur_weight = 0;
 
 	return policy;
 }
@@ -417,6 +418,10 @@  static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_preferred,
 	},
+	[MPOL_WEIGHTED_INTERLEAVE] = {
+		.create = mpol_new_nodemask,
+		.rebind = mpol_rebind_nodemask,
+	},
 };
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
@@ -838,7 +843,8 @@  static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && (new->mode == MPOL_INTERLEAVE ||
+		    new->mode == MPOL_WEIGHTED_INTERLEAVE))
 		current->il_prev = MAX_NUMNODES-1;
 	task_unlock(current);
 	mpol_put(old);
@@ -864,6 +870,7 @@  static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*nodes = pol->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -948,6 +955,13 @@  static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		} else if (pol == current->mempolicy &&
 				pol->mode == MPOL_INTERLEAVE) {
 			*policy = next_node_in(current->il_prev, pol->nodes);
+		} else if (pol == current->mempolicy &&
+				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
+			if (pol->wil.cur_weight)
+				*policy = current->il_prev;
+			else
+				*policy = next_node_in(current->il_prev,
+						       pol->nodes);
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -1777,7 +1791,8 @@  struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 	pol = __get_vma_policy(vma, addr, ilx);
 	if (!pol)
 		pol = get_task_policy(current);
-	if (pol->mode == MPOL_INTERLEAVE) {
+	if (pol->mode == MPOL_INTERLEAVE ||
+	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
 		*ilx += vma->vm_pgoff >> order;
 		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
 	}
@@ -1827,6 +1842,24 @@  bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }
 
+static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
+{
+	unsigned int next;
+	struct task_struct *me = current;
+
+	next = next_node_in(me->il_prev, policy->nodes);
+	if (next == MAX_NUMNODES)
+		return next;
+
+	if (!policy->wil.cur_weight)
+		policy->wil.cur_weight = iw_table[next];
+
+	policy->wil.cur_weight--;
+	if (!policy->wil.cur_weight)
+		me->il_prev = next;
+	return next;
+}
+
 /* Do dynamic interleaving for a process */
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
@@ -1861,6 +1894,9 @@  unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		return weighted_interleave_nodes(policy);
+
 	case MPOL_BIND:
 	case MPOL_PREFERRED_MANY:
 	{
@@ -1885,6 +1921,41 @@  unsigned int mempolicy_slab_node(void)
 	}
 }
 
+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
+{
+	nodemask_t nodemask = pol->nodes;
+	unsigned int target, weight_total = 0;
+	int nid;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char weight;
+
+	barrier();
+
+	/* first ensure we have a valid nodemask */
+	nid = first_node(nodemask);
+	if (nid == MAX_NUMNODES)
+		return nid;
+
+	/* Then collect weights on stack and calculate totals */
+	for_each_node_mask(nid, nodemask) {
+		weight = iw_table[nid];
+		weight_total += weight;
+		weights[nid] = weight;
+	}
+
+	/* Finally, calculate the node offset based on totals */
+	target = (unsigned int)ilx % weight_total;
+	nid = first_node(nodemask);
+	while (target) {
+		weight = weights[nid];
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+	return nid;
+}
+
 /*
  * Do static interleaving for interleave index @ilx.  Returns the ilx'th
  * node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1953,6 +2024,11 @@  static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
 			interleave_nodes(pol) : interleave_nid(pol, ilx);
 		break;
+	case MPOL_WEIGHTED_INTERLEAVE:
+		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
+			weighted_interleave_nodes(pol) :
+			weighted_interleave_nid(pol, ilx);
+		break;
 	}
 
 	return nodemask;
@@ -2014,6 +2090,7 @@  bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*mask = mempolicy->nodes;
 		break;
 
@@ -2113,7 +2190,8 @@  struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 		 * If the policy is interleave or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
-		if (pol->mode != MPOL_INTERLEAVE &&
+		if ((pol->mode != MPOL_INTERLEAVE &&
+		    pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
 		    (!nodemask || node_isset(nid, *nodemask))) {
 			/*
 			 * First, try to allocate THP only on local node, but
@@ -2249,6 +2327,106 @@  static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 	return total_allocated;
 }
 
+static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
+		struct mempolicy *pol, unsigned long nr_pages,
+		struct page **page_array)
+{
+	struct task_struct *me = current;
+	unsigned long total_allocated = 0;
+	unsigned long nr_allocated;
+	unsigned long rounds;
+	unsigned long node_pages, delta;
+	unsigned char weight;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned int weight_total = 0;
+	unsigned long rem_pages = nr_pages;
+	nodemask_t nodes = pol->nodes;
+	int nnodes, node, prev_node;
+	int i;
+
+	/* Stabilize the nodemask on the stack */
+	barrier();
+
+	nnodes = nodes_weight(nodes);
+
+	/* Collect weights and save them on stack so they don't change */
+	for_each_node_mask(node, nodes) {
+		weight = iw_table[node];
+		weight_total += weight;
+		weights[node] = weight;
+	}
+
+	/* Continue allocating from most recent node and adjust the nr_pages */
+	if (pol->wil.cur_weight) {
+		node = next_node_in(me->il_prev, nodes);
+		node_pages = pol->wil.cur_weight;
+		if (node_pages > rem_pages)
+			node_pages = rem_pages;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		/* if that's all the pages, no need to interleave */
+		if (rem_pages <= pol->wil.cur_weight) {
+			pol->wil.cur_weight -= rem_pages;
+			return total_allocated;
+		}
+		/* Otherwise we adjust nr_pages down, and continue from there */
+		rem_pages -= pol->wil.cur_weight;
+		pol->wil.cur_weight = 0;
+		prev_node = node;
+	}
+
+	/* Now we can continue allocating as if from 0 instead of an offset */
+	rounds = rem_pages / weight_total;
+	delta = rem_pages % weight_total;
+	for (i = 0; i < nnodes; i++) {
+		node = next_node_in(prev_node, nodes);
+		weight = weights[node];
+		node_pages = weight * rounds;
+		if (delta) {
+			if (delta > weight) {
+				node_pages += weight;
+				delta -= weight;
+			} else {
+				node_pages += delta;
+				delta = 0;
+			}
+		}
+		/* We may not make it all the way around */
+		if (!node_pages)
+			break;
+		/* If an over-allocation would occur, floor it */
+		if (node_pages + total_allocated > nr_pages) {
+			node_pages = nr_pages - total_allocated;
+			delta = 0;
+		}
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		prev_node = node;
+	}
+
+	/*
+	 * Finally, we need to update me->il_prev and pol->wil.cur_weight
+	 * if there were overflow pages, but not equivalent to the node
+	 * weight, set the cur_weight to node_weight - delta and the
+	 * me->il_prev to the previous node. Otherwise if it was perfect
+	 * we can simply set il_prev to node and cur_weight to 0
+	 */
+	if (node_pages) {
+		me->il_prev = prev_node;
+		node_pages %= weight;
+		pol->wil.cur_weight = weight - node_pages;
+	} else {
+		me->il_prev = node;
+		pol->wil.cur_weight = 0;
+	}
+
+	return total_allocated;
+}
+
 static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
@@ -2289,6 +2467,11 @@  unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
 		return alloc_pages_bulk_array_interleave(gfp, pol,
 							 nr_pages, page_array);
 
+	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
+		return alloc_pages_bulk_array_weighted_interleave(gfp, pol,
+								  nr_pages,
+								  page_array);
+
 	if (pol->mode == MPOL_PREFERRED_MANY)
 		return alloc_pages_bulk_array_preferred_many(gfp,
 				numa_node_id(), pol, nr_pages, page_array);
@@ -2364,6 +2547,7 @@  bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2500,6 +2684,10 @@  int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
 		polnid = interleave_nid(pol, ilx);
 		break;
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		polnid = weighted_interleave_nid(pol, ilx);
+		break;
+
 	case MPOL_PREFERRED:
 		if (node_isset(curnid, pol->nodes))
 			goto out;
@@ -2874,6 +3062,7 @@  static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
+	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
 	[MPOL_LOCAL]      = "local",
 	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
@@ -2933,6 +3122,7 @@  int mpol_parse_str(char *str, struct mempolicy **mpol)
 		}
 		break;
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		/*
 		 * Default to online nodes with memory if no nodelist
 		 */
@@ -3043,6 +3233,7 @@  void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		nodes = pol->nodes;
 		break;
 	default: