Message ID | 20231218194631.21667-12-gregory.price@memverge.com |
---|---|
State | New, archived |
Series | mempolicy2, mbind2, and weighted interleave |
Gregory Price <gourry.memverge@gmail.com> writes:

> Extend set_mempolicy2 and mbind2 to support weighted interleave, and
> demonstrate the extensibility of the mpol_args structure.
>
> To support weighted interleave we add interleave weight fields to the
> following structures:
>
> Kernel Internal: (include/linux/mempolicy.h)
> struct mempolicy {
> 	/* task-local weights to apply to weighted interleave */
> 	unsigned char weights[MAX_NUMNODES];
> }
> struct mempolicy_args {
> 	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
> 	unsigned char *il_weights;	/* of size MAX_NUMNODES */
> }
>
> UAPI: (/include/uapi/linux/mempolicy.h)
> struct mpol_args {
> 	/* Optional: interleave weights for MPOL_WEIGHTED_INTERLEAVE */
> 	unsigned char *il_weights;	/* of size pol_max_nodes */
> }
>
> The task-local weights are a single, one-dimensional array of weights
> that apply to all possible nodes on the system. If a node is set in
> the mempolicy nodemask, the weight in `il_weights` must be >= 1,
> otherwise set_mempolicy2() will return -EINVAL. If a node is not
> set in pol_nodemask, the weight will default to `1` in the task policy.
>
> The default value of `1` is required to handle the situation where a
> task migrates to a set of nodes for which weights were not set (up to
> and including the local numa node). For example, a migrated task whose
> nodemask changes entirely will have all its weights defaulted back
> to `1`; or, if the nodemask changes to include a mix of nodes that
> were not previously accounted for, the weighted interleave may be
> suboptimal.
>
> If migrations are expected, a task should prefer not to use task-local
> interleave weights, and instead utilize the global settings for natural
> re-weighting on migration.
>
> To support global vs local weighting, we add the kernel-internal flag:
> MPOL_F_GWEIGHT (1 << 5) /* Utilize global weights */
>
> This flag is set when il_weights is omitted by set_mempolicy2(), or
> when MPOL_WEIGHTED_INTERLEAVE is set by set_mempolicy(). This internal
> mode_flag dictates whether global weights or task-local weights are
> utilized by the various weighted interleave functions:
>
> * weighted_interleave_nodes
> * weighted_interleave_nid
> * alloc_pages_bulk_array_weighted_interleave
>
> if (pol->flags & MPOL_F_GWEIGHT)
> 	pol_weights = iw_table;
> else
> 	pol_weights = pol->wil.weights;
>
> To simplify creation and duplication of mempolicies, the weights are
> added as a structure directly within mempolicy. This allows the
> existing logic in __mpol_dup to copy the weights without additional
> allocations:
>
> if (old == current->mempolicy) {
> 	task_lock(current);
> 	*new = *old;
> 	task_unlock(current);
> } else
> 	*new = *old
>
> Suggested-by: Rakie Kim <rakie.kim@sk.com>
> Suggested-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Suggested-by: Honggyu Kim <honggyu.kim@sk.com>
> Suggested-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> Co-developed-by: Rakie Kim <rakie.kim@sk.com>
> Signed-off-by: Rakie Kim <rakie.kim@sk.com>
> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
> Co-developed-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> Signed-off-by: Vinicius Tavares Petrucci <vtavarespetr@micron.com>
> ---
>  .../admin-guide/mm/numa_memory_policy.rst |  10 ++
>  include/linux/mempolicy.h                 |   2 +
>  include/uapi/linux/mempolicy.h            |   2 +
>  mm/mempolicy.c                            | 129 +++++++++++++++++-
>  4 files changed, 139 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index 99e1f732cade..0e91efe9e769 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -254,6 +254,8 @@ MPOL_WEIGHTED_INTERLEAVE
>  	This mode operates the same as MPOL_INTERLEAVE, except that
>  	interleaving behavior is executed based on weights set in
>  	/sys/kernel/mm/mempolicy/weighted_interleave/
> +	when configured to utilize global weights, or based on task-local
> +	weights configured with set_mempolicy2(2) or mbind2(2).
>
>  	Weighted interleave allocations pages on nodes according to
>  	their weight. For example if nodes [0,1] are weighted [5,2]
> @@ -261,6 +263,13 @@ MPOL_WEIGHTED_INTERLEAVE
>  	2 pages allocated on node1. This can better distribute data
>  	according to bandwidth on heterogeneous memory systems.
>
> +	When utilizing task-local weights, weights are not rebalanced
> +	in the event of a task migration. If a weight has not been
> +	explicitly set for a node set in the new nodemask, the
> +	value of that weight defaults to "1". For this reason, if
> +	migrations are expected or possible, users should consider
> +	utilizing global interleave weights.
> +
>  NUMA memory policy supports the following optional mode flags:
>
>  MPOL_F_STATIC_NODES
> @@ -514,6 +523,7 @@ Extended Mempolicy Arguments::
>  	__u16 mode_flags;
>  	__s32 home_node;		/* mbind2: policy home node */
>  	__aligned_u64 pol_nodes;	/* nodemask pointer */
> +	__aligned_u64 il_weights;	/* u8 buf of size pol_maxnodes */
>  	__u64 pol_maxnodes;
>  	__s32 policy_node;		/* get_mempolicy2: policy node information */
>  };
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index aeac19dfc2b6..387c5c418a66 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -58,6 +58,7 @@ struct mempolicy {
>  	/* Weighted interleave settings */
>  	struct {
>  		unsigned char cur_weight;
> +		unsigned char weights[MAX_NUMNODES];
>  	} wil;
>  };
>
> @@ -70,6 +71,7 @@ struct mempolicy_args {
>  	unsigned short mode_flags;	/* policy mode flags */
>  	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
>  	nodemask_t *policy_nodes;	/* get/set/mbind */
> +	unsigned char *il_weights;	/* for mode MPOL_WEIGHTED_INTERLEAVE */
>  	int policy_node;		/* get: policy node information */
>  };
>
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index ec1402dae35b..16fedf966166 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -33,6 +33,7 @@ struct mpol_args {
>  	__u16 mode_flags;
>  	__s32 home_node;	/* mbind2: policy home node */
>  	__aligned_u64 pol_nodes;
> +	__aligned_u64 il_weights;	/* size: pol_maxnodes * sizeof(char) */
>  	__u64 pol_maxnodes;
>  	__s32 policy_node;	/* get_mempolicy: policy node info */
>  };

You break the ABI you introduced earlier in the patchset. Although both
changes happen within one patchset, I don't think that it's a good idea.
I suggest finalizing the ABI in the first place. Otherwise, people who
check the git log will be confused by the ABI break. This would make the
series easier to review, too.

> @@ -75,6 +76,7 @@ struct mpol_args {
>  #define MPOL_F_SHARED	(1 << 0)	/* identify shared policies */
>  #define MPOL_F_MOF	(1 << 3)	/* this policy wants migrate on fault */
>  #define MPOL_F_MORON	(1 << 4)	/* Migrate On protnone Reference On Node */
> +#define MPOL_F_GWEIGHT	(1 << 5)	/* Utilize global weights */
>
>  /*
>   * These bit locations are exposed in the vm.zone_reclaim_mode sysctl

--
Best Regards,
Huang, Ying

[snip]
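The weighted allocation pattern described in the commit message and the
documentation hunk above can be simulated entirely in userspace. The
stand-alone C program below is an illustrative sketch of the per-round
weight countdown; the variable names are invented for the example and
only loosely mirror the kernel's cur_weight logic:

#include <stdio.h>

/*
 * Simulate weighted interleave across two nodes weighted [5,2], as in
 * the example above: out of every 7 pages, 5 land on node 0 and 2 land
 * on node 1.  Illustrative only; this is not kernel code.
 */
int main(void)
{
	unsigned char weights[2] = { 5, 2 };	/* per-node weights */
	unsigned char cur_weight = 0;		/* pages left on current node */
	int node = 0;

	for (int page = 0; page < 14; page++) {
		if (!cur_weight)		/* refill from the weight table */
			cur_weight = weights[node] ? weights[node] : 1;
		printf("page %2d -> node %d\n", page, node);
		if (!--cur_weight)		/* weight exhausted: advance */
			node = (node + 1) % 2;
	}
	return 0;
}

Two full rounds print five pages on node 0 followed by two on node 1,
twice, matching the 5:2 split the documentation describes.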
On Tue, Dec 19, 2023 at 11:07:10AM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
>
> > diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> > index ec1402dae35b..16fedf966166 100644
> > --- a/include/uapi/linux/mempolicy.h
> > +++ b/include/uapi/linux/mempolicy.h
> > @@ -33,6 +33,7 @@ struct mpol_args {
> >  	__u16 mode_flags;
> >  	__s32 home_node;	/* mbind2: policy home node */
> >  	__aligned_u64 pol_nodes;
> > +	__aligned_u64 il_weights;	/* size: pol_maxnodes * sizeof(char) */
> >  	__u64 pol_maxnodes;
> >  	__s32 policy_node;	/* get_mempolicy: policy node info */
> >  };
>
> You break the ABI you introduced earlier in the patchset. Although both
> changes happen within one patchset, I don't think that it's a good idea.
> I suggest finalizing the ABI in the first place. Otherwise, people who
> check the git log will be confused by the ABI break. This would make the
> series easier to review, too.
>

This is a result of fixing alignment/holes (suggested by Arnd) and of my
not dropping policy_node, which I had originally planned to do.

I figured that whenever we decided to move forward, the mempolicy2 and
mbind2 syscalls would end up squashed into a single commit to ensure the
feature goes in as a whole.

I can fix this, though.

~Gregory
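For context on the compatibility property under discussion: set_mempolicy2()
and friends take an explicit size argument and read the struct with
copy_struct_from_user(), which copies min(ksize, usize) bytes and rejects
nonzero trailing bytes. Old binaries therefore keep working only while every
existing field keeps its offset. A minimal sketch of why inserting il_weights
mid-struct is the break Huang, Ying points out (leading fields elided;
illustrative only, not a proposed layout):

#include <linux/types.h>

struct mpol_args_old {			/* ABI earlier in the series */
	/* ... leading fields unchanged ... */
	__aligned_u64 pol_nodes;
	__u64 pol_maxnodes;
	__s32 policy_node;
};

struct mpol_args_new {			/* this patch */
	/* ... leading fields unchanged ... */
	__aligned_u64 pol_nodes;
	__aligned_u64 il_weights;	/* inserted here...            */
	__u64 pol_maxnodes;		/* ...which shifts this offset */
	__s32 policy_node;		/* ...and this one             */
};

A binary built against the old layout passes a smaller usize, so the kernel
would now read that caller's pol_maxnodes bytes as il_weights. Appending new
fields at the tail of the struct avoids the problem, which is what
"finalize the ABI" amounts to here.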
Hi Gregory,

kernel test robot noticed the following build warnings:

[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Gregory-Price/mm-mempolicy-implement-the-sysfs-based-weighted_interleave-interface/20231219-074837
base:   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools.git perf-tools
patch link:    https://lore.kernel.org/r/20231218194631.21667-12-gregory.price%40memverge.com
patch subject: [PATCH v4 11/11] mm/mempolicy: extend set_mempolicy2 and mbind2 to support weighted interleave
config: x86_64-randconfig-161-20231219 (https://download.01.org/0day-ci/archive/20231220/202312200223.7X9rUFgu-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)

If you fix the issue in a separate patch/commit (i.e. not just a new version
of the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
| Closes: https://lore.kernel.org/r/202312200223.7X9rUFgu-lkp@intel.com/

smatch warnings:
mm/mempolicy.c:2044 __do_sys_get_mempolicy2() warn: maybe return -EFAULT instead of the bytes remaining?

vim +2044 mm/mempolicy.c

a2af87404eb73e Gregory Price 2023-12-18  1992  SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
a2af87404eb73e Gregory Price 2023-12-18  1993  		unsigned long, addr, unsigned long, flags)
a2af87404eb73e Gregory Price 2023-12-18  1994  {
a2af87404eb73e Gregory Price 2023-12-18  1995  	struct mpol_args kargs;
a2af87404eb73e Gregory Price 2023-12-18  1996  	struct mempolicy_args margs;
a2af87404eb73e Gregory Price 2023-12-18  1997  	int err;
a2af87404eb73e Gregory Price 2023-12-18  1998  	nodemask_t policy_nodemask;
a2af87404eb73e Gregory Price 2023-12-18  1999  	unsigned long __user *nodes_ptr;
8bfd7ddc0dd439 Gregory Price 2023-12-18  2000  	unsigned char __user *weights_ptr;
8bfd7ddc0dd439 Gregory Price 2023-12-18  2001  	unsigned char weights[MAX_NUMNODES];
a2af87404eb73e Gregory Price 2023-12-18  2002  
a2af87404eb73e Gregory Price 2023-12-18  2003  	if (flags & ~(MPOL_F_ADDR))
a2af87404eb73e Gregory Price 2023-12-18  2004  		return -EINVAL;
a2af87404eb73e Gregory Price 2023-12-18  2005  
a2af87404eb73e Gregory Price 2023-12-18  2006  	/* initialize any memory liable to be copied to userland */
a2af87404eb73e Gregory Price 2023-12-18  2007  	memset(&margs, 0, sizeof(margs));
8bfd7ddc0dd439 Gregory Price 2023-12-18  2008  	memset(weights, 0, sizeof(weights));
a2af87404eb73e Gregory Price 2023-12-18  2009  
a2af87404eb73e Gregory Price 2023-12-18  2010  	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
a2af87404eb73e Gregory Price 2023-12-18  2011  	if (err)
a2af87404eb73e Gregory Price 2023-12-18  2012  		return -EINVAL;
a2af87404eb73e Gregory Price 2023-12-18  2013  
8bfd7ddc0dd439 Gregory Price 2023-12-18  2014  	if (kargs.il_weights)
8bfd7ddc0dd439 Gregory Price 2023-12-18  2015  		margs.il_weights = weights;
8bfd7ddc0dd439 Gregory Price 2023-12-18  2016  	else
8bfd7ddc0dd439 Gregory Price 2023-12-18  2017  		margs.il_weights = NULL;
8bfd7ddc0dd439 Gregory Price 2023-12-18  2018  
a2af87404eb73e Gregory Price 2023-12-18  2019  	margs.policy_nodes = kargs.pol_nodes ? &policy_nodemask : NULL;
a2af87404eb73e Gregory Price 2023-12-18  2020  	if (flags & MPOL_F_ADDR)
a2af87404eb73e Gregory Price 2023-12-18  2021  		err = do_get_vma_mempolicy(untagged_addr(addr), NULL, &margs);
a2af87404eb73e Gregory Price 2023-12-18  2022  	else
a2af87404eb73e Gregory Price 2023-12-18  2023  		err = do_get_task_mempolicy(&margs);
a2af87404eb73e Gregory Price 2023-12-18  2024  
a2af87404eb73e Gregory Price 2023-12-18  2025  	if (err)
a2af87404eb73e Gregory Price 2023-12-18  2026  		return err;
a2af87404eb73e Gregory Price 2023-12-18  2027  
a2af87404eb73e Gregory Price 2023-12-18  2028  	kargs.mode = margs.mode;
a2af87404eb73e Gregory Price 2023-12-18  2029  	kargs.mode_flags = margs.mode_flags;
a2af87404eb73e Gregory Price 2023-12-18  2030  	kargs.policy_node = margs.policy_node;
a2af87404eb73e Gregory Price 2023-12-18  2031  	kargs.home_node = margs.home_node;
a2af87404eb73e Gregory Price 2023-12-18  2032  	if (kargs.pol_nodes) {
a2af87404eb73e Gregory Price 2023-12-18  2033  		nodes_ptr = u64_to_user_ptr(kargs.pol_nodes);
a2af87404eb73e Gregory Price 2023-12-18  2034  		err = copy_nodes_to_user(nodes_ptr, kargs.pol_maxnodes,
a2af87404eb73e Gregory Price 2023-12-18  2035  					 margs.policy_nodes);
a2af87404eb73e Gregory Price 2023-12-18  2036  		if (err)
a2af87404eb73e Gregory Price 2023-12-18  2037  			return err;

This looks wrong as well.

a2af87404eb73e Gregory Price 2023-12-18  2038  	}
a2af87404eb73e Gregory Price 2023-12-18  2039  
8bfd7ddc0dd439 Gregory Price 2023-12-18  2040  	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) {
8bfd7ddc0dd439 Gregory Price 2023-12-18  2041  		weights_ptr = u64_to_user_ptr(kargs.il_weights);
8bfd7ddc0dd439 Gregory Price 2023-12-18  2042  		err = copy_to_user(weights_ptr, weights, kargs.pol_maxnodes);
8bfd7ddc0dd439 Gregory Price 2023-12-18  2043  		if (err)
8bfd7ddc0dd439 Gregory Price 2023-12-18 @2044  			return err;

This should return -EFAULT same as the copy_to_user() on the next line.

8bfd7ddc0dd439 Gregory Price 2023-12-18  2045  	}
8bfd7ddc0dd439 Gregory Price 2023-12-18  2046  
a2af87404eb73e Gregory Price 2023-12-18  2047  	return copy_to_user(uargs, &kargs, usize) ? -EFAULT : 0;
a2af87404eb73e Gregory Price 2023-12-18  2048  }
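The fix Dan Carpenter is pointing at is mechanical: copy_to_user() returns
the number of bytes it failed to copy, not a negative errno, so the result
must be translated rather than returned directly. A sketch of the corrected
hunk (untested, written against the listing above):

	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) {
		weights_ptr = u64_to_user_ptr(kargs.il_weights);
		/* copy_to_user() returns bytes remaining; map it to -EFAULT */
		if (copy_to_user(weights_ptr, weights, kargs.pol_maxnodes))
			return -EFAULT;
	}

The copy_nodes_to_user() return at line 2037, which Dan also flags, deserves
the same audit.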
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 99e1f732cade..0e91efe9e769 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -254,6 +254,8 @@ MPOL_WEIGHTED_INTERLEAVE
 	This mode operates the same as MPOL_INTERLEAVE, except that
 	interleaving behavior is executed based on weights set in
 	/sys/kernel/mm/mempolicy/weighted_interleave/
+	when configured to utilize global weights, or based on task-local
+	weights configured with set_mempolicy2(2) or mbind2(2).
 
 	Weighted interleave allocations pages on nodes according to
 	their weight. For example if nodes [0,1] are weighted [5,2]
@@ -261,6 +263,13 @@ MPOL_WEIGHTED_INTERLEAVE
 	2 pages allocated on node1. This can better distribute data
 	according to bandwidth on heterogeneous memory systems.
 
+	When utilizing task-local weights, weights are not rebalanced
+	in the event of a task migration. If a weight has not been
+	explicitly set for a node set in the new nodemask, the
+	value of that weight defaults to "1". For this reason, if
+	migrations are expected or possible, users should consider
+	utilizing global interleave weights.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
@@ -514,6 +523,7 @@ Extended Mempolicy Arguments::
 	__u16 mode_flags;
 	__s32 home_node;		/* mbind2: policy home node */
 	__aligned_u64 pol_nodes;	/* nodemask pointer */
+	__aligned_u64 il_weights;	/* u8 buf of size pol_maxnodes */
 	__u64 pol_maxnodes;
 	__s32 policy_node;		/* get_mempolicy2: policy node information */
 };
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index aeac19dfc2b6..387c5c418a66 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -58,6 +58,7 @@ struct mempolicy {
 	/* Weighted interleave settings */
 	struct {
 		unsigned char cur_weight;
+		unsigned char weights[MAX_NUMNODES];
 	} wil;
 };
 
@@ -70,6 +71,7 @@ struct mempolicy_args {
 	unsigned short mode_flags;	/* policy mode flags */
 	int home_node;			/* mbind: use MPOL_MF_HOME_NODE */
 	nodemask_t *policy_nodes;	/* get/set/mbind */
+	unsigned char *il_weights;	/* for mode MPOL_WEIGHTED_INTERLEAVE */
 	int policy_node;		/* get: policy node information */
 };
 
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index ec1402dae35b..16fedf966166 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -33,6 +33,7 @@ struct mpol_args {
 	__u16 mode_flags;
 	__s32 home_node;	/* mbind2: policy home node */
 	__aligned_u64 pol_nodes;
+	__aligned_u64 il_weights;	/* size: pol_maxnodes * sizeof(char) */
 	__u64 pol_maxnodes;
 	__s32 policy_node;	/* get_mempolicy: policy node info */
 };
@@ -75,6 +76,7 @@ struct mpol_args {
 #define MPOL_F_SHARED	(1 << 0)	/* identify shared policies */
 #define MPOL_F_MOF	(1 << 3)	/* this policy wants migrate on fault */
 #define MPOL_F_MORON	(1 << 4)	/* Migrate On protnone Reference On Node */
+#define MPOL_F_GWEIGHT	(1 << 5)	/* Utilize global weights */
 
 /*
  * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0882fa4aa516..1d73ad29e36c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -271,6 +271,7 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 	unsigned short mode = args->mode;
 	unsigned short flags = args->mode_flags;
 	nodemask_t *nodes = args->policy_nodes;
+	int node;
 
 	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
@@ -297,6 +298,19 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 		    (flags & MPOL_F_STATIC_NODES) ||
 		    (flags & MPOL_F_RELATIVE_NODES))
 			return ERR_PTR(-EINVAL);
+	} else if (mode == MPOL_WEIGHTED_INTERLEAVE) {
+		/* weighted interleave requires a nodemask and weights > 0 */
+		if (nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		if (args->il_weights) {
+			node = first_node(*nodes);
+			while (node != MAX_NUMNODES) {
+				if (!args->il_weights[node])
+					return ERR_PTR(-EINVAL);
+				node = next_node(node, *nodes);
+			}
+		} else if (!(args->mode_flags & MPOL_F_GWEIGHT))
+			return ERR_PTR(-EINVAL);
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 
@@ -309,6 +323,17 @@ static struct mempolicy *mpol_new(struct mempolicy_args *args)
 	policy->home_node = args->home_node;
 	policy->wil.cur_weight = 0;
 
+	if (policy->mode == MPOL_WEIGHTED_INTERLEAVE && args->il_weights) {
+		policy->wil.cur_weight = 0;
+		/* Minimum weight value is always 1 */
+		memset(policy->wil.weights, 1, MAX_NUMNODES);
+		node = first_node(*nodes);
+		while (node != MAX_NUMNODES) {
+			policy->wil.weights[node] = args->il_weights[node];
+			node = next_node(node, *nodes);
+		}
+	}
+
 	return policy;
 }
 
@@ -937,6 +962,17 @@ static void do_get_mempolicy_nodemask(struct mempolicy *pol, nodemask_t *nmask)
 	}
 }
 
+static void do_get_mempolicy_il_weights(struct mempolicy *pol,
+					unsigned char weights[MAX_NUMNODES])
+{
+	if (pol->mode != MPOL_WEIGHTED_INTERLEAVE)
+		memset(weights, 0, MAX_NUMNODES);
+	else if (pol->flags & MPOL_F_GWEIGHT)
+		memcpy(weights, iw_table, MAX_NUMNODES);
+	else
+		memcpy(weights, pol->wil.weights, MAX_NUMNODES);
+}
+
 /* Retrieve NUMA policy for a VMA assocated with a given address */
 static long do_get_vma_mempolicy(unsigned long addr, int *addr_node,
 				 struct mempolicy_args *args)
@@ -973,6 +1009,9 @@ static long do_get_vma_mempolicy(unsigned long addr, int *addr_node,
 	if (args->policy_nodes)
 		do_get_mempolicy_nodemask(pol, args->policy_nodes);
 
+	if (args->il_weights)
+		do_get_mempolicy_il_weights(pol, args->il_weights);
+
 	if (pol != &default_policy) {
 		mpol_put(pol);
 		mpol_cond_put(pol);
@@ -999,6 +1038,9 @@ static long do_get_task_mempolicy(struct mempolicy_args *args)
 	if (args->policy_nodes)
 		do_get_mempolicy_nodemask(pol, args->policy_nodes);
 
+	if (args->il_weights)
+		do_get_mempolicy_il_weights(pol, args->il_weights);
+
 	return 0;
 }
 
@@ -1521,6 +1563,9 @@ static long kernel_mbind(unsigned long start, unsigned long len,
 	if (err)
 		return err;
 
+	if (mode & MPOL_WEIGHTED_INTERLEAVE)
+		mode_flags |= MPOL_F_GWEIGHT;
+
 	memset(&margs, 0, sizeof(margs));
 	margs.mode = lmode;
 	margs.mode_flags = mode_flags;
@@ -1611,6 +1656,8 @@ SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigned long, len,
 	struct mempolicy_args margs;
 	nodemask_t policy_nodes;
 	unsigned long __user *nodes_ptr;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char __user *weights_ptr;
 	int err;
 
 	if (!start || !len)
@@ -1643,6 +1690,23 @@ SYSCALL_DEFINE5(mbind2, unsigned long, start, unsigned long, len,
 		return err;
 	margs.policy_nodes = &policy_nodes;
 
+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		if (weights_ptr) {
+			err = copy_struct_from_user(weights,
+						    sizeof(weights),
+						    weights_ptr,
+						    kargs.pol_maxnodes);
+			if (err)
+				return err;
+			margs.il_weights = weights;
+		} else {
+			margs.il_weights = NULL;
+			margs.mode_flags |= MPOL_F_GWEIGHT;
+		}
+	} else
+		margs.il_weights = NULL;
+
 	return do_mbind(untagged_addr(start), len, &margs, flags);
 }
 
@@ -1664,6 +1728,9 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 	if (err)
 		return err;
 
+	if (mode & MPOL_WEIGHTED_INTERLEAVE)
+		mode_flags |= MPOL_F_GWEIGHT;
+
 	memset(&args, 0, sizeof(args));
 	args.mode = lmode;
 	args.mode_flags = mode_flags;
@@ -1687,6 +1754,8 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	int err;
 	nodemask_t policy_nodemask;
 	unsigned long __user *nodes_ptr;
+	unsigned char weights[MAX_NUMNODES];
+	unsigned char __user *weights_ptr;
 
 	if (flags)
 		return -EINVAL;
@@ -1712,6 +1781,20 @@ SYSCALL_DEFINE3(set_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	} else
 		margs.policy_nodes = NULL;
 
+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		err = copy_struct_from_user(weights,
+					    sizeof(weights),
+					    weights_ptr,
+					    kargs.pol_maxnodes);
+		if (err)
+			return err;
+		margs.il_weights = weights;
+	} else {
+		margs.il_weights = NULL;
+		margs.mode_flags |= MPOL_F_GWEIGHT;
+	}
+
 	return do_set_mempolicy(&margs);
 }
 
@@ -1914,17 +1997,25 @@ SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 	int err;
 	nodemask_t policy_nodemask;
 	unsigned long __user *nodes_ptr;
+	unsigned char __user *weights_ptr;
+	unsigned char weights[MAX_NUMNODES];
 
 	if (flags & ~(MPOL_F_ADDR))
 		return -EINVAL;
 
 	/* initialize any memory liable to be copied to userland */
 	memset(&margs, 0, sizeof(margs));
+	memset(weights, 0, sizeof(weights));
 
 	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
 	if (err)
 		return -EINVAL;
 
+	if (kargs.il_weights)
+		margs.il_weights = weights;
+	else
+		margs.il_weights = NULL;
+
 	margs.policy_nodes = kargs.pol_nodes ? &policy_nodemask : NULL;
 	if (flags & MPOL_F_ADDR)
 		err = do_get_vma_mempolicy(untagged_addr(addr), NULL, &margs);
@@ -1946,6 +2037,13 @@ SYSCALL_DEFINE4(get_mempolicy2, struct mpol_args __user *, uargs, size_t, usize,
 		return err;
 	}
 
+	if (kargs.mode == MPOL_WEIGHTED_INTERLEAVE && kargs.il_weights) {
+		weights_ptr = u64_to_user_ptr(kargs.il_weights);
+		err = copy_to_user(weights_ptr, weights, kargs.pol_maxnodes);
+		if (err)
+			return err;
+	}
+
 	return copy_to_user(uargs, &kargs, usize) ? -EFAULT : 0;
 }
 
@@ -2062,13 +2160,18 @@ static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
 {
 	unsigned int next;
 	struct task_struct *me = current;
+	unsigned char next_weight;
 
 	next = next_node_in(me->il_prev, policy->nodes);
 	if (next == MAX_NUMNODES)
 		return next;
 
-	if (!policy->wil.cur_weight)
-		policy->wil.cur_weight = iw_table[next];
+	if (!policy->wil.cur_weight) {
+		next_weight = (policy->flags & MPOL_F_GWEIGHT) ?
+			      iw_table[next] :
+			      policy->wil.weights[next];
+		policy->wil.cur_weight = next_weight ? next_weight : 1;
+	}
 
 	policy->wil.cur_weight--;
 	if (!policy->wil.cur_weight)
@@ -2142,6 +2245,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	nodemask_t nodemask = pol->nodes;
 	unsigned int target, weight_total = 0;
 	int nid;
+	unsigned char *pol_weights;
 	unsigned char weights[MAX_NUMNODES];
 	unsigned char weight;
 
@@ -2153,8 +2257,13 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 		return nid;
 
 	/* Then collect weights on stack and calculate totals */
+	if (pol->flags & MPOL_F_GWEIGHT)
+		pol_weights = iw_table;
+	else
+		pol_weights = pol->wil.weights;
+
 	for_each_node_mask(nid, nodemask) {
-		weight = iw_table[nid];
+		weight = pol_weights[nid];
 		weight_total += weight;
 		weights[nid] = weight;
 	}
@@ -2552,6 +2661,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	unsigned long nr_allocated;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
+	unsigned char *pol_weights;
 	unsigned char weight;
 	unsigned char weights[MAX_NUMNODES];
 	unsigned int weight_total = 0;
@@ -2565,9 +2675,14 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 
 	nnodes = nodes_weight(nodes);
 
+	if (pol->flags & MPOL_F_GWEIGHT)
+		pol_weights = iw_table;
+	else
+		pol_weights = pol->wil.weights;
+
 	/* Collect weights and save them on stack so they don't change */
 	for_each_node_mask(node, nodes) {
-		weight = iw_table[node];
+		weight = pol_weights[node];
 		weight_total += weight;
 		weights[node] = weight;
 	}
@@ -3092,6 +3207,7 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 {
 	int ret;
 	struct mempolicy_args margs;
+	unsigned char weights[MAX_NUMNODES];
 
 	sp->root = RB_ROOT;		/* empty tree == default mempolicy */
 	rwlock_init(&sp->lock);
@@ -3109,6 +3225,11 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 	margs.mode_flags = mpol->flags;
 	margs.policy_nodes = &mpol->w.user_nodemask;
 	margs.home_node = NUMA_NO_NODE;
+	if (margs.mode == MPOL_WEIGHTED_INTERLEAVE &&
+	    !(margs.mode_flags & MPOL_F_GWEIGHT)) {
+		memcpy(weights, mpol->wil.weights, sizeof(weights));
+		margs.il_weights = weights;
+	}
 
 	/* contextualize the tmpfs mount point mempolicy to this file */
 	npol = mpol_new(&margs);
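To make the proposed interface concrete, the sketch below shows how a process
might call set_mempolicy2() with task-local weights. This series was not
merged in this form, so __NR_set_mempolicy2 stands in for a syscall number
that was never allocated; the struct layout and the MPOL_WEIGHTED_INTERLEAVE
value are reconstructed from the fragments above, and the whole program
should be read as hypothetical:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_set_mempolicy2
#define __NR_set_mempolicy2 -1		/* placeholder: never allocated upstream */
#endif

#define MPOL_WEIGHTED_INTERLEAVE 6	/* value used by the series; illustrative */

struct mpol_args {			/* layout per this patch; mode assumed __u16 */
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint64_t pol_nodes;		/* pointer to nodemask bits */
	uint64_t il_weights;		/* pointer to pol_maxnodes weights */
	uint64_t pol_maxnodes;
	int32_t  policy_node;
};

int main(void)
{
	unsigned long nodemask = (1UL << 0) | (1UL << 1);	/* nodes 0 and 1 */
	unsigned char weights[2] = { 5, 2 };			/* node0:5, node1:2 */
	struct mpol_args args;

	memset(&args, 0, sizeof(args));	/* zero padding for copy_struct_from_user() */
	args.mode = MPOL_WEIGHTED_INTERLEAVE;
	args.pol_nodes = (uint64_t)(uintptr_t)&nodemask;
	args.il_weights = (uint64_t)(uintptr_t)weights;
	args.pol_maxnodes = 2;

	if (syscall(__NR_set_mempolicy2, &args, sizeof(args), 0) < 0)
		perror("set_mempolicy2");
	return 0;
}

Because il_weights is supplied, MPOL_F_GWEIGHT stays clear and the task-local
table is used: allocations would cycle five pages on node 0 for every two on
node 1, matching the documentation hunk above.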