[v4,11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave

Message ID 20231218194631.21667-12-gregory.price@memverge.com (mailing list archive)
State New, archived
Series mempolicy2, mbind2, and weighted interleave

Commit Message

Gregory Price Dec. 18, 2023, 7:46 p.m. UTC
Extend set_mempolicy2 and mbind2 to support weighted interleave, and
demonstrate the extensibility of the mpol_args structure.

To support weighted interleave we add interleave weight fields to the
following structures:

Kernel Internal:  (include/linux/mempolicy.h)
struct mempolicy {
	/* task-local weights to apply to weighted interleave */
	unsigned char weights[MAX_NUMNODES];
}
struct mempolicy_args {
	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
	unsigned char *il_weights;	/* of size MAX_NUMNODES */
}

UAPI: (/include/uapi/linux/mempolicy.h)
struct mpol_args {
	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
	unsigned char *il_weights;	/* of size pol_max_nodes */
}

The task-local weights are a single, one-dimensional array of weights
that apply to all possible nodes on the system.  If a node is set in
the mempolicy nodemask, the weight in `il_weights` must be >= 1,
otherwise set_mempolicy2() will return -EINVAL.  If a node is not
set in pol_nodemask, the weight will default to `1` in the task policy.
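
As a rough userspace illustration of the rule above (not the kernel code
itself), the check and the defaulting can be sketched as follows.  The
names sketch_set_weights and SKETCH_MAX_NODES, and the bool array standing
in for the nodemask, are assumptions made purely for this example.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_NODES 8	/* hypothetical stand-in for MAX_NUMNODES */

/*
 * Build a task-local weight table from a caller-supplied array.
 * Returns 0 on success, or -EINVAL if a node in the mask has weight 0.
 */
static int sketch_set_weights(unsigned char dst[SKETCH_MAX_NODES],
			      const bool node_set[SKETCH_MAX_NODES],
			      const unsigned char il_weights[SKETCH_MAX_NODES])
{
	int node;

	/* Reject a zero weight on any node that is part of the policy. */
	for (node = 0; node < SKETCH_MAX_NODES; node++)
		if (node_set[node] && !il_weights[node])
			return -EINVAL;

	/* Every node starts at the minimum weight of 1 ... */
	memset(dst, 1, SKETCH_MAX_NODES);

	/* ... and only nodes in the mask take the caller's value. */
	for (node = 0; node < SKETCH_MAX_NODES; node++)
		if (node_set[node])
			dst[node] = il_weights[node];

	return 0;
}
```

Weights supplied for nodes outside the mask are ignored in this sketch, so
such nodes always end up with the default of 1.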

The default value of `1` is required to handle the situation where a
task migrates to a set of nodes for which weights were not set (up to
and including the local NUMA node).  For example, a migrated task whose
nodemask changes entirely will have all its weights defaulted back
to `1`.  If the nodemask instead changes to include a mix of nodes that
were not previously accounted for, the weighted interleave may be
suboptimal.

If migrations are expected, a task should prefer not to use task-local
interleave weights, and instead utilize the global settings for natural
re-weighting on migration.

To support global vs local weighting, we add the kernel-internal flag:
MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */

This flag is set when il_weights is omitted by set_mempolicy2(), or
when MPOL_WEIGHTED_INTERLEAVE is set by set_mempolicy(). This internal
mode_flag dictates whether global weights or task-local weights are
utilized by the various weighted interleave functions:

* weighted_interleave_nodes
* weighted_interleave_nid
* alloc_pages_bulk_array_weighted_interleave

if (pol->flags & MPOL_F_GWEIGHT)
	pol_weights = iw_table;
else
	pol_weights = pol->wil.weights;
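
To make the effect of this selection concrete, the following userspace
sketch mimics the per-task round-robin step.  All sketch_* names,
SKETCH_MAX_NODES, and the bool array standing in for the nodemask are
assumptions for illustration only; the parts taken from the patch are the
choice between global and task-local weights and the fallback to a weight
of 1 when the selected weight is 0.

```c
#include <stdbool.h>

#define SKETCH_MAX_NODES 8	/* hypothetical stand-in for MAX_NUMNODES */

/* Stand-in for the global iw_table. */
static unsigned char sketch_global[SKETCH_MAX_NODES];

struct sketch_policy {
	bool use_global;			/* mirrors MPOL_F_GWEIGHT */
	bool node_set[SKETCH_MAX_NODES];	/* stand-in for the nodemask */
	unsigned char local[SKETCH_MAX_NODES];	/* task-local weights */
	unsigned char cur_weight;		/* picks left on current node */
	int prev;				/* last fully consumed node */
};

/* Next node in the mask after p->prev, wrapping around; -1 if empty. */
static int sketch_next_node(const struct sketch_policy *p)
{
	for (int i = 1; i <= SKETCH_MAX_NODES; i++) {
		int n = (p->prev + i) % SKETCH_MAX_NODES;

		if (p->node_set[n])
			return n;
	}
	return -1;
}

/* Pick the node for the next allocation and advance the policy state. */
static int sketch_pick_node(struct sketch_policy *p)
{
	int next = sketch_next_node(p);

	if (next < 0)
		return -1;

	if (!p->cur_weight) {
		unsigned char w = p->use_global ? sketch_global[next]
						: p->local[next];

		p->cur_weight = w ? w : 1;	/* minimum weight is 1 */
	}

	/* Move on to the following node once this weight is used up. */
	if (--p->cur_weight == 0)
		p->prev = next;

	return next;
}
```

With nodes [0,1] weighted [5,2] and prev initialised to -1, seven
consecutive calls pick node 0 five times and node 1 twice.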

To simplify creation and duplication of mempolicies, the weights are
added as a structure directly within mempolicy. This allows the
existing logic in __mpol_dup to copy the weights without additional
allocations:

if (old == current->mempolicy) {
	task_lock(current);
	*new = *old;
	task_unlock(current);
} else
	*new = *old;
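
The reason no extra allocation is needed is that C struct assignment
copies embedded arrays by value.  A minimal userspace sketch, using a
hypothetical sketch_mempolicy type that only loosely resembles the real
struct mempolicy:

```c
#define SKETCH_MAX_NODES 8	/* hypothetical stand-in for MAX_NUMNODES */

/* Hypothetical, cut-down policy with the weights embedded in place. */
struct sketch_mempolicy {
	unsigned short mode;
	struct {
		unsigned char cur_weight;
		unsigned char weights[SKETCH_MAX_NODES];
	} wil;
};

/* Duplicate a policy; the embedded weights array is copied with it. */
static void sketch_dup(struct sketch_mempolicy *new,
		       const struct sketch_mempolicy *old)
{
	*new = *old;
}
```

Had the weights been a pointer to a separately allocated buffer, the same
assignment would only have copied the pointer, and both policies would
share one buffer.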

Suggested-by: Rakie Kim <rakie.kim@sk.com>
Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Suggested-by: Honggyu Kim <honggyu.kim@sk.com>
Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |  10 ++
 include/linux/mempolicy.h                     |   2 +
 include/uapi/linux/mempolicy.h                |   2 +
 mm/mempolicy.c                                | 129 +++++++++++++++++-
 4 files changed, 139 insertions(+), 4 deletions(-)

Comments

Huang, Ying Dec. 19, 2023, 3:07 a.m. UTC | #1
Gregory Price <gourry.memverge@gmail.com> writes:

> Extend set_mempolicy2 and mbind2 to support weighted interleave, and
> demonstrate the extensibility of the mpol_args structure.
>
> To support weighted interleave we add interleave weight fields to the
> following structures:
>
> Kernel Internal:  (include/linux/mempolicy.h)
> struct mempolicy {
> 	/* task-local weights to apply to weighted interleave */
> 	unsigned char weights[MAX_NUMNODES];
> }
> struct mempolicy_args {
> 	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
> 	unsigned char *il_weights;	/* of size MAX_NUMNODES */
> }
>
> UAPI: (/include/uapi/linux/mempolicy.h)
> struct mpol_args {
> 	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
> 	unsigned char *il_weights;	/* of size pol_max_nodes */
> }
>
> The task-local weights are a single, one-dimensional array of weights
> that apply to all possible nodes on the system.  If a node is set in
> the mempolicy nodemask, the weight in `il_weights` must be >= 1,
> otherwise set_mempolicy2() will return -EINVAL.  If a node is not
> set in pol_nodemask, the weight will default to `1` in the task policy.
>
> The default value of `1` is required to handle the situation where a
> task migrates to a set of nodes for which weights were not set (up to
> and including the local NUMA node).  For example, a migrated task whose
> nodemask changes entirely will have all its weights defaulted back
> to `1`.  If the nodemask instead changes to include a mix of nodes that
> were not previously accounted for, the weighted interleave may be
> suboptimal.
>
> If migrations are expected, a task should prefer not to use task-local
> interleave weights, and instead utilize the global settings for natural
> re-weighting on migration.
>
> To support global vs local weighting, we add the kernel-internal flag:
> MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */
>
> This flag is set when il_weights is omitted by set_mempolicy2(), or
> when MPOL_WEIGHTED_INTERLEAVE is set by set_mempolicy(). This internal
> mode_flag dictates whether global weights or task-local weights are
> utilized by the various weighted interleave functions:
>
> * weighted_interleave_nodes
> * weighted_interleave_nid
> * alloc_pages_bulk_array_weighted_interleave
>
> if (pol->flags & MPOL_F_GWEIGHT)
> 	pol_weights = iw_table;
> else
> 	pol_weights = pol->wil.weights;
>
> To simplify creation and duplication of mempolicies, the weights are
> added as a structure directly within mempolicy. This allows the
> existing logic in __mpol_dup to copy the weights without additional
> allocations:
>
> if (old == current->mempolicy) {
> 	task_lock(current);
> 	*new = *old;
> 	task_unlock(current);
> } else
> 	*new = *old;
>
> Suggested-by: Rakie Kim <rakie.kim@sk.com>
> Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Suggested-by: Honggyu Kim <honggyu.kim@sk.com>
> Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> Co-developed-by: Rakie Kim <rakie.kim@sk.com>
> Signed-off-by: Rakie Kim <rakie.kim@sk.com>
> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
> Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> ---
>  .../admin-guide/mm/numa_memory_policy.rst     |  10 ++
>  include/linux/mempolicy.h                     |   2 +
>  include/uapi/linux/mempolicy.h                |   2 +
>  mm/mempolicy.c                                | 129 +++++++++++++++++-
>  4 files changed, 139 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index 99e1f732cade..0e91efe9e769 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -254,6 +254,8 @@ MPOL_WEIGHTED_INTERLEAVE
>  	This mode operates the same as MPOL_INTERLEAVE, except that
>  	interleaving behavior is executed based on weights set in
>  	/sys/kernel/mm/mempolicy/weighted_interleave/
> +	when configured to utilize global weights, or based on task-local
> +	weights configured with set_mempolicy2(2) or mbind2(2).
>  
>  	Weighted interleave allocations pages on nodes according to
>  	their weight.  For example if nodes [0,1] are weighted [5,2]
> @@ -261,6 +263,13 @@ MPOL_WEIGHTED_INTERLEAVE
>  	2 pages allocated on node1.  This can better distribute data
>  	according to bandwidth on heterogeneous memory systems.
>  
> +	When utilizing task-local weights, weights are not rebalanced
> +	in the event of a task migration.  If a weight has not been
> +	explicitly set for a node set in the new nodemask, the
> +	value of that weight defaults to "1".  For this reason, if
> +	migrations are expected or possible, users should consider
> +	utilizing global interleave weights.
> +
>  NUMA memory policy supports the following optional mode flags:
>  
>  MPOL_F_STATIC_NODES
> @@ -514,6 +523,7 @@ Extended Mempolicy Arguments::
>  		__u16 mode_flags;
>  		__s32 home_node; /* mbind2: policy home node */
>  		__aligned_u64 pol_nodes; /* nodemask pointer */
> +		__aligned_u64 il_weights;  /* u8 buf of size pol_maxnodes */
>  		__u64 pol_maxnodes;
>  		__s32 policy_node; /* get_mempolicy2: policy node information */
>  	};
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index aeac19dfc2b6..387c5c418a66 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -58,6 +58,7 @@ struct mempolicy {
>  	/* Weighted interleave settings */
>  	struct {
>  		unsigned char cur_weight;
> +		unsigned char weights[MAX_NUMNODES];
>  	} wil;
>  };
>  
> @@ -70,6 +71,7 @@ struct mempolicy_args {
>  	unsigned short mode_flags;	/* policy mode flags */
>  	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
>  	nodemask_t *policy_nodes;	/* get/set/mbind */
> +	unsigned char *il_weights;	/* for mode MPOL_WEIGHTED_INTERLEAVE */
>  	int policy_node;		/* get: policy node information */
>  };
>  
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index ec1402dae35b..16fedf966166 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -33,6 +33,7 @@ struct mpol_args {
>  	__u16 mode_flags;
>  	__s32 home_node;	/* mbind2: policy home node */
>  	__aligned_u64 pol_nodes;
> +	__aligned_u64 il_weights; /* size: pol_maxnodes * sizeof(char) */
>  	__u64 pol_maxnodes;
>  	__s32 policy_node;	/* get_mempolicy: policy node info */
>  };

You break the ABI you introduced earlier in the patchset.  Although the
change is made within a single patchset, I don't think that it's a good
idea.  I suggest finalizing the ABI in the first place.  Otherwise,
people checking the git log will be confused by the ABI breakage.
Finalizing it first makes the series easier to review too.

> @@ -75,6 +76,7 @@ struct mpol_args {
>  #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
>  #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
>  #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
> +#define MPOL_F_GWEIGHT	(1 << 5) /* Utilize global weights */
>  
>  /*
>   * These bit locations are exposed in the vm.zone_reclaim_mode sysctl

--
Best Regards,
Huang, Ying

[snip]
Gregory Price Dec. 19, 2023, 6:12 p.m. UTC | #2
On Tue, Dec 19, 2023 at 11:07:10AM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> > index ec1402dae35b..16fedf966166 100644
> > --- a/include/uapi/linux/mempolicy.h
> > +++ b/include/uapi/linux/mempolicy.h
> > @@ -33,6 +33,7 @@ struct mpol_args {
> >  	__u16 mode_flags;
> >  	__s32 home_node;	/* mbind2: policy home node */
> >  	__aligned_u64 pol_nodes;
> > +	__aligned_u64 il_weights; /* size: pol_maxnodes * sizeof(char) */
> >  	__u64 pol_maxnodes;
> >  	__s32 policy_node;	/* get_mempolicy: policy node info */
> >  };
> 
> You break the ABI you introduced earlier in the patchset.  Although the
> change is made within a single patchset, I don't think that it's a good
> idea.  I suggest finalizing the ABI in the first place.  Otherwise,
> people checking the git log will be confused by the ABI breakage.
> Finalizing it first makes the series easier to review too.
> 

This is a result of fixing alignment/holes (suggested by Arnd) and my
not dropping policy_node, which I'd originally planned to do.

I figured that whenever we decided to move forward, mempolicy2 and
mbind2 syscalls would end up squashed into a single commit for the
purpose of ensuring the feature goes in as a whole.  I can fix this
though.

~Gregory
Dan Carpenter Jan. 3, 2024, 11:16 a.m. UTC | #3
Hi Gregory,

kernel test robot noticed the following build warnings:

https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Gregory-Price/mm-mempolicy-implement-the-sysfs-based-weighted_interleave-interface/20231219-074837
base:   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools.git perf-tools
patch link:    https://lore.kernel.org/r/20231218194631.21667-12-gregory.price%40memverge.com
patch subject: [PATCH v4 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave
config: x86_64-randconfig-161-20231219 (https://download.01.org/0day-ci/archive/20231220/202312200223.7X9rUFgu-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202312200223.7X9rUFgu-lkp@intel.com/

smatch warnings:
mm/mempolicy.c:2044 __do_sys_get_mempolicy2() warn: maybe return -EFAULT instead of the bytes remaining?

vim +2044 mm/mempolicy.c

a2af87404eb73e Gregory Price     2023-12-18  1992  SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
a2af87404eb73e Gregory Price     2023-12-18  1993  		unsigned long, addr, unsigned long, flags)
a2af87404eb73e Gregory Price     2023-12-18  1994  {
a2af87404eb73e Gregory Price     2023-12-18  1995  	struct mpol_args kargs;
a2af87404eb73e Gregory Price     2023-12-18  1996  	struct mempolicy_args margs;
a2af87404eb73e Gregory Price     2023-12-18  1997  	int err;
a2af87404eb73e Gregory Price     2023-12-18  1998  	nodemask_t policy_nodemask;
a2af87404eb73e Gregory Price     2023-12-18  1999  	unsigned long __user *nodes_ptr;
8bfd7ddc0dd439 Gregory Price     2023-12-18  2000  	unsigned char __user *weights_ptr;
8bfd7ddc0dd439 Gregory Price     2023-12-18  2001  	unsigned char weights[MAX_NUMNODES];
a2af87404eb73e Gregory Price     2023-12-18  2002  
a2af87404eb73e Gregory Price     2023-12-18  2003  	if (flags & ~(MPOL_F_ADDR))
a2af87404eb73e Gregory Price     2023-12-18  2004  		return -EINVAL;
a2af87404eb73e Gregory Price     2023-12-18  2005  
a2af87404eb73e Gregory Price     2023-12-18  2006  	/* initialize any memory liable to be copied to userland */
a2af87404eb73e Gregory Price     2023-12-18  2007  	memset(&margs, 0, sizeof(margs));
8bfd7ddc0dd439 Gregory Price     2023-12-18  2008  	memset(weights, 0, sizeof(weights));
a2af87404eb73e Gregory Price     2023-12-18  2009  
a2af87404eb73e Gregory Price     2023-12-18  2010  	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
a2af87404eb73e Gregory Price     2023-12-18  2011  	if (err)
a2af87404eb73e Gregory Price     2023-12-18  2012  		return -EINVAL;
a2af87404eb73e Gregory Price     2023-12-18  2013  
8bfd7ddc0dd439 Gregory Price     2023-12-18  2014  	if (kargs.il_weights)
8bfd7ddc0dd439 Gregory Price     2023-12-18  2015  		margs.il_weights = weights;
8bfd7ddc0dd439 Gregory Price     2023-12-18  2016  	else
8bfd7ddc0dd439 Gregory Price     2023-12-18  2017  		margs.il_weights = NULL;
8bfd7ddc0dd439 Gregory Price     2023-12-18  2018  
a2af87404eb73e Gregory Price     2023-12-18  2019  	margs.policy_nodes = kargs.pol_nodes ? &policy_nodemask : NULL;
a2af87404eb73e Gregory Price     2023-12-18  2020  	if (flags & MPOL_F_ADDR)
a2af87404eb73e Gregory Price     2023-12-18  2021  		err = do_get_vma_mempolicy(untagged_addr(addr), NULL, &margs);
a2af87404eb73e Gregory Price     2023-12-18  2022  	else
a2af87404eb73e Gregory Price     2023-12-18  2023  		err = do_get_task_mempolicy(&margs);
a2af87404eb73e Gregory Price     2023-12-18  2024  
a2af87404eb73e Gregory Price     2023-12-18  2025  	if (err)
a2af87404eb73e Gregory Price     2023-12-18  2026  		return err;
a2af87404eb73e Gregory Price     2023-12-18  2027  
a2af87404eb73e Gregory Price     2023-12-18  2028  	kargs.mode = margs.mode;
a2af87404eb73e Gregory Price     2023-12-18  2029  	kargs.mode_flags = margs.mode_flags;
a2af87404eb73e Gregory Price     2023-12-18  2030  	kargs.policy_node = margs.policy_node;
a2af87404eb73e Gregory Price     2023-12-18  2031  	kargs.home_node = margs.home_node;
a2af87404eb73e Gregory Price     2023-12-18  2032  	if (kargs.pol_nodes) {
a2af87404eb73e Gregory Price     2023-12-18  2033  		nodes_ptr = u64_to_user_ptr(kargs.pol_nodes);
a2af87404eb73e Gregory Price     2023-12-18  2034  		err = copy_nodes_to_user(nodes_ptr, kargs.pol_maxnodes,
a2af87404eb73e Gregory Price     2023-12-18  2035  					 margs.policy_nodes);
a2af87404eb73e Gregory Price     2023-12-18  2036  		if (err)
a2af87404eb73e Gregory Price     2023-12-18  2037  			return err;

This looks wrong as well.

a2af87404eb73e Gregory Price     2023-12-18  2038  	}
a2af87404eb73e Gregory Price     2023-12-18  2039  
8bfd7ddc0dd439 Gregory Price     2023-12-18  2040  	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) {
8bfd7ddc0dd439 Gregory Price     2023-12-18  2041  		weights_ptr = u64_to_user_ptr(kargs.il_weights);
8bfd7ddc0dd439 Gregory Price     2023-12-18  2042  		err = copy_to_user(weights_ptr, weights, kargs.pol_maxnodes);
8bfd7ddc0dd439 Gregory Price     2023-12-18  2043  		if (err)
8bfd7ddc0dd439 Gregory Price     2023-12-18 @2044  			return err;

This should return -EFAULT same as the copy_to_user() on the next line.

8bfd7ddc0dd439 Gregory Price     2023-12-18  2045  	}
8bfd7ddc0dd439 Gregory Price     2023-12-18  2046  
a2af87404eb73e Gregory Price     2023-12-18  2047  	return copy_to_user(uargs, &kargs, usize) ? -EFAULT : 0;
a2af87404eb73e Gregory Price     2023-12-18  2048  }
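
For context on the warning at line 2044: copy_to_user() returns the number
of bytes it could not copy, not an errno, so returning its result directly
hands a positive byte count back to the caller.  The following userspace
sketch shows the usual translation.  sketch_copy_out() is a hypothetical
stand-in that only imitates that return convention, and
sketch_put_weights() is likewise an illustrative name.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for copy_to_user(): copies at most
 * `writable` bytes and returns the number of bytes NOT copied.
 */
static unsigned long sketch_copy_out(void *dst, const void *src,
				     unsigned long n, unsigned long writable)
{
	unsigned long done = n < writable ? n : writable;

	memcpy(dst, src, done);
	return n - done;
}

/* Translate a short copy into -EFAULT rather than a byte count. */
static int sketch_put_weights(unsigned char *dst, const unsigned char *src,
			      unsigned long n, unsigned long writable)
{
	if (sketch_copy_out(dst, src, n, writable))
		return -EFAULT;
	return 0;
}
```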

Patch

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 99e1f732cade..0e91efe9e769 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -254,6 +254,8 @@  MPOL_WEIGHTED_INTERLEAVE
 	This mode operates the same as MPOL_INTERLEAVE, except that
 	interleaving behavior is executed based on weights set in
 	/sys/kernel/mm/mempolicy/weighted_interleave/
+	when configured to utilize global weights, or based on task-local
+	weights configured with set_mempolicy2(2) or mbind2(2).
 
 	Weighted interleave allocations pages on nodes according to
 	their weight.  For example if nodes [0,1] are weighted [5,2]
@@ -261,6 +263,13 @@  MPOL_WEIGHTED_INTERLEAVE
 	2 pages allocated on node1.  This can better distribute data
 	according to bandwidth on heterogeneous memory systems.
 
+	When utilizing task-local weights, weights are not rebalanced
+	in the event of a task migration.  If a weight has not been
+	explicitly set for a node set in the new nodemask, the
+	value of that weight defaults to "1".  For this reason, if
+	migrations are expected or possible, users should consider
+	utilizing global interleave weights.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
@@ -514,6 +523,7 @@  Extended Mempolicy Arguments::
 		__u16 mode_flags;
 		__s32 home_node; /* mbind2: policy home node */
 		__aligned_u64 pol_nodes; /* nodemask pointer */
+		__aligned_u64 il_weights;  /* u8 buf of size pol_maxnodes */
 		__u64 pol_maxnodes;
 		__s32 policy_node; /* get_mempolicy2: policy node information */
 	};
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index aeac19dfc2b6..387c5c418a66 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -58,6 +58,7 @@  struct mempolicy {
 	/* Weighted interleave settings */
 	struct {
 		unsigned char cur_weight;
+		unsigned char weights[MAX_NUMNODES];
 	} wil;
 };
 
@@ -70,6 +71,7 @@  struct mempolicy_args {
 	unsigned short mode_flags;	/* policy mode flags */
 	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
 	nodemask_t *policy_nodes;	/* get/set/mbind */
+	unsigned char *il_weights;	/* for mode MPOL_WEIGHTED_INTERLEAVE */
 	int policy_node;		/* get: policy node information */
 };
 
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index ec1402dae35b..16fedf966166 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -33,6 +33,7 @@  struct mpol_args {
 	__u16 mode_flags;
 	__s32 home_node;	/* mbind2: policy home node */
 	__aligned_u64 pol_nodes;
+	__aligned_u64 il_weights; /* size: pol_maxnodes * sizeof(char) */
 	__u64 pol_maxnodes;
 	__s32 policy_node;	/* get_mempolicy: policy node info */
 };
@@ -75,6 +76,7 @@  struct mpol_args {
 #define MPOL_F_SHARED  (1 << 0)	/* identify shared policies */
 #define MPOL_F_MOF	(1 << 3) /* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4) /* Migrate On protnone Reference On Node */
+#define MPOL_F_GWEIGHT	(1 << 5) /* Utilize global weights */
 
 /*
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0882fa4aa516..1d73ad29e36c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -271,6 +271,7 @@  static struct mempolicy *mpol_new(struct mempolicy_args *args)
 	unsigned short mode = args->mode;
 	unsigned short flags = args->mode_flags;
 	nodemask_t *nodes = args->policy_nodes;
+	int node;
 
 	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
@@ -297,6 +298,19 @@  static struct mempolicy *mpol_new(struct mempolicy_args *args)
 		    (flags & MPOL_F_STATIC_NODES) ||
 		    (flags & MPOL_F_RELATIVE_NODES))
 			return ERR_PTR(-EINVAL);
+	} else if (mode == MPOL_WEIGHTED_INTERLEAVE) {
+		/* weighted interleave requires a nodemask and weights > 0 */
+		if (nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		if (args->il_weights) {
+			node = first_node(*nodes);
+			while (node != MAX_NUMNODES) {
+				if (!args->il_weights[node])
+					return ERR_PTR(-EINVAL);
+				node = next_node(node, *nodes);
+			}
+		} else if (!(args->mode_flags & MPOL_F_GWEIGHT))
+			return ERR_PTR(-EINVAL);
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 
@@ -309,6 +323,17 @@  static struct mempolicy *mpol_new(struct mempolicy_args *args)
 	policy->home_node = args->home_node;
 	policy->wil.cur_weight = 0;
 
+	if (policy->mode == MPOL_WEIGHTED_INTERLEAVE && args->il_weights) {
+		policy->wil.cur_weight = 0;
+		/* Minimum weight value is always 1 */
+		memset(policy->wil.weights, 1, MAX_NUMNODES);
+		node = first_node(*nodes);
+		while (node != MAX_NUMNODES) {
+			policy->wil.weights[node] = args->il_weights[node];
+			node = next_node(node, *nodes);
+		}
+	}
+
 	return policy;
 }
 
@@ -937,6 +962,17 @@  static void do_get_mempolicy_nodemask(struct mempolicy *pol, nodemask_t *nmask)
 	}
 }
 
+static void do_get_mempolicy_il_weights(struct mempolicy *pol,
+					unsigned char weights[MAX_NUMNODES])
+{
+	if (pol->mode != MPOL_WEIGHTED_INTERLEAVE)
+		memset(weights, 0, MAX_NUMNODES);
+	else if (pol->flags & MPOL_F_GWEIGHT)
+		memcpy(weights, iw_table, MAX_NUMNODES);
+	else
+		memcpy(weights, pol->wil.weights, MAX_NUMNODES);
+}
+
 /* Retrieve NUMA policy for a VMA assocated with a given address  */
 static long do_get_vma_mempolicy(unsigned long addr, int *addr_node,
 				 struct mempolicy_args *args)
@@ -973,6 +1009,9 @@  static long do_get_vma_mempolicy(unsigned long addr, int *addr_node,
 	if (args->policy_nodes)
 		do_get_mempolicy_nodemask(pol, args->policy_nodes);
 
+	if (args->il_weights)
+		do_get_mempolicy_il_weights(pol, args->il_weights);
+
 	if (pol != &default_policy) {
 		mpol_put(pol);
 		mpol_cond_put(pol);
@@ -999,6 +1038,9 @@  static long do_get_task_mempolicy(struct mempolicy_args *args)
 	if (args->policy_nodes)
 		do_get_mempolicy_nodemask(pol, args->policy_nodes);
 
+	if (args->il_weights)
+		do_get_mempolicy_il_weights(pol, args->il_weights);
+
 	return 0;
 }
 
@@ -1521,6 +1563,9 @@  static long kernel_mbind(unsigned long start, unsigned long len,
 	if (err)
 		return err;
 
+	if (mode & MPOL_WEIGHTED_INTERLEAVE)
+		mode_flags |= MPOL_F_GWEIGHT;
+
 	memset(&margs, 0, sizeof(margs));
 	margs.mode = lmode;
 	margs.mode_flags = mode_flags;
@@ -1611,6 +1656,8 @@  SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigned long, len,
 	struct mempolicy_args margs;
 	nodemask_t policy_nodes;
 	unsigned long __user *nodes_ptr;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char __user *weights_ptr;
 	int err;
 
 	if (!start || !len)
@@ -1643,6 +1690,23 @@  SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigned long, len,
 		return err;
 	margs.policy_nodes = &policy_nodes;
 
+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		if (weights_ptr) {
+			err = copy_struct_from_user(weights,
+						    sizeof(weights),
+						    weights_ptr,
+						    kargs.pol_maxnodes);
+			if (err)
+				return err;
+			margs.il_weights = weights;
+		} else {
+			margs.il_weights = NULL;
+			margs.mode_flags |= MPOL_F_GWEIGHT;
+		}
+	} else
+		margs.il_weights = NULL;
+
 	return do_mbind(untagged_addr(start), len, &margs, flags);
 }
 
@@ -1664,6 +1728,9 @@  static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 	if (err)
 		return err;
 
+	if (mode & MPOL_WEIGHTED_INTERLEAVE)
+		mode_flags |= MPOL_F_GWEIGHT;
+
 	memset(&args, 0, sizeof(args));
 	args.mode = lmode;
 	args.mode_flags = mode_flags;
@@ -1687,6 +1754,8 @@  SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	int err;
 	nodemask_t policy_nodemask;
 	unsigned long __user *nodes_ptr;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char __user *weights_ptr;
 
 	if (flags)
 		return -EINVAL;
@@ -1712,6 +1781,20 @@  SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	} else
 		margs.policy_nodes = NULL;
 
+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		err = copy_struct_from_user(weights,
+					    sizeof(weights),
+					    weights_ptr,
+					    kargs.pol_maxnodes);
+		if (err)
+			return err;
+		margs.il_weights = weights;
+	} else {
+		margs.il_weights = NULL;
+		margs.mode_flags |= MPOL_F_GWEIGHT;
+	}
+
 	return do_set_mempolicy(&margs);
 }
 
@@ -1914,17 +1997,25 @@  SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	int err;
 	nodemask_t policy_nodemask;
 	unsigned long __user *nodes_ptr;
+	unsigned char __user *weights_ptr;
+	unsigned char weights[MAX_NUMNODES];
 
 	if (flags & ~(MPOL_F_ADDR))
 		return -EINVAL;
 
 	/* initialize any memory liable to be copied to userland */
 	memset(&margs, 0, sizeof(margs));
+	memset(weights, 0, sizeof(weights));
 
 	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
 	if (err)
 		return -EINVAL;
 
+	if (kargs.il_weights)
+		margs.il_weights = weights;
+	else
+		margs.il_weights = NULL;
+
 	margs.policy_nodes = kargs.pol_nodes ? &policy_nodemask : NULL;
 	if (flags & MPOL_F_ADDR)
 		err = do_get_vma_mempolicy(untagged_addr(addr), NULL, &margs);
@@ -1946,6 +2037,13 @@  SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 			return err;
 	}
 
+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		err = copy_to_user(weights_ptr, weights, kargs.pol_maxnodes);
+		if (err)
+			return err;
+	}
+
 	return copy_to_user(uargs, &kargs, usize) ? -EFAULT : 0;
 }
 
@@ -2062,13 +2160,18 @@  static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int next;
 	struct task_struct *me = current;
+	unsigned char next_weight;
 
 	next = next_node_in(me->il_prev, policy->nodes);
 	if (next == MAX_NUMNODES)
 		return next;
 
-	if (!policy->wil.cur_weight)
-		policy->wil.cur_weight = iw_table[next];
+	if (!policy->wil.cur_weight) {
+		next_weight = (policy->flags & MPOL_F_GWEIGHT) ?
+				iw_table[next] :
+				policy->wil.weights[next];
+		policy->wil.cur_weight = next_weight ? next_weight : 1;
+	}
 
 	policy->wil.cur_weight--;
 	if (!policy->wil.cur_weight)
@@ -2142,6 +2245,7 @@  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	nodemask_t nodemask = pol->nodes;
 	unsigned int target, weight_total = 0;
 	int nid;
+	unsigned char *pol_weights;
 	unsigned char weights[MAX_NUMNODES];
 	unsigned char weight;
 
@@ -2153,8 +2257,13 @@  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 		return nid;
 
 	/* Then collect weights on stack and calculate totals */
+	if (pol->flags & MPOL_F_GWEIGHT)
+		pol_weights = iw_table;
+	else
+		pol_weights = pol->wil.weights;
+
 	for_each_node_mask(nid, nodemask) {
-		weight = iw_table[nid];
+		weight = pol_weights[nid];
 		weight_total += weight;
 		weights[nid] = weight;
 	}
@@ -2552,6 +2661,7 @@  static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	unsigned long nr_allocated;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
+	unsigned char *pol_weights;
 	unsigned char weight;
 	unsigned char weights[MAX_NUMNODES];
 	unsigned int weight_total = 0;
@@ -2565,9 +2675,14 @@  static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 
 	nnodes = nodes_weight(nodes);
 
+	if (pol->flags & MPOL_F_GWEIGHT)
+		pol_weights = iw_table;
+	else
+		pol_weights = pol->wil.weights;
+
 	/* Collect weights and save them on stack so they don't change */
 	for_each_node_mask(node, nodes) {
-		weight = iw_table[node];
+		weight = pol_weights[node];
 		weight_total += weight;
 		weights[node] = weight;
 	}
@@ -3092,6 +3207,7 @@  void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 {
 	int ret;
 	struct mempolicy_args margs;
+	unsigned char weights[MAX_NUMNODES];
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
 	rwlock_init(&sp->lock);
@@ -3109,6 +3225,11 @@  void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 		margs.mode_flags = mpol->flags;
 		margs.policy_nodes = &mpol->w.user_nodemask;
 		margs.home_node = NUMA_NO_NODE;
+		if (margs.mode == MPOL_WEIGHTED_INTERLEAVE &&
+		    !(margs.mode_flags & MPOL_F_GWEIGHT)) {
+			memcpy(weights, mpol->wil.weights, sizeof(weights));
+			margs.il_weights = weights;
+		}
 
 		/* contextualize the tmpfs mount point mempolicy to this file */
 		npol = mpol_new(&margs);