Message ID | 20231122211200.31620-1-gregory.price@memverge.com (mailing list archive) |
---|---|
Headers | show |
Series | mm/mempolicy: Make task->mempolicy externally modifiable via syscall and procfs | expand |
On Wed, 22 Nov 2023 16:11:49 -0500 Gregory Price <gourry.memverge@gmail.com> wrote: > The patch set changes task->mempolicy to be modifiable by tasks other > than just current. > > The ultimate goal is to make mempolicy more flexible and extensible, > such as adding interleave weights (which may need to change at runtime > due to hotplug events). Making mempolicy externally modifiable allows > for userland daemons to make runtime performance adjustments to running > tasks without that software needing to be made numa-aware. Please add to this [0/N] a full description of the security aspect: who can modify whose mempolicy, along with a full description of the reasoning behind this decision. > 3. Add external interfaces which allow for a task mempolicy to be > modified by another task. This is implemented in 4 syscalls > and a procfs interface: > sys_set_task_mempolicy > sys_get_task_mempolicy > sys_set_task_mempolicy_home_node > sys_task_mbind > /proc/[pid]/mempolicy Why is the procfs interface needed? Doesn't it simply duplicate the syscall interface? Please update [0/N] with a description of this decision. > The new syscalls are the same as their current-task counterparts, > except that they take a pid as an argument. The exception is > task_mbind, which required a new struct due to the number of args. > > The /proc/pid/mempolicy re-uses the interface mpol_parse_str format > to enable get/set of mempolicy via procsfs. > > mpol_parse_str format: > <mode>[=<flags>][:<nodelist>] > > Example usage: > > echo "default" > /proc/pid/mempolicy > echo "prefer=relative:0" > /proc/pid/mempolicy > echo "interleave:0-3" > /proc/pid/mempolicy What do we get when we read from this? Please add to changelog. > Changing the mempolicy does not induce memory migrations via the > procfs interface (which is the exact same behavior as set_mempolicy). >
On Wed, 22 Nov 2023 13:33:48 -0800 Andrew Morton <akpm@linux-foundation.org> wrote: > On Wed, 22 Nov 2023 16:11:49 -0500 Gregory Price <gourry.memverge@gmail.com> wrote: > > echo "default" > /proc/pid/mempolicy > > echo "prefer=relative:0" > /proc/pid/mempolicy > > echo "interleave:0-3" > /proc/pid/mempolicy > > What do we get when we read from this? Please add to changelog. > Also a description of the permissions for this procfs file, along with reasoning. If it has global readability, and there's something interesting in there, let's show that the security implications have been fully considered.
On Wed, Nov 22, 2023 at 01:33:48PM -0800, Andrew Morton wrote: > On Wed, 22 Nov 2023 16:11:49 -0500 Gregory Price <gourry.memverge@gmail.com> wrote: > > > The patch set changes task->mempolicy to be modifiable by tasks other > > than just current. > > > > The ultimate goal is to make mempolicy more flexible and extensible, > > such as adding interleave weights (which may need to change at runtime > > due to hotplug events). Making mempolicy externally modifiable allows > > for userland daemons to make runtime performance adjustments to running > > tasks without that software needing to be made numa-aware. > > Please add to this [0/N] a full description of the security aspect: who > can modify whose mempolicy, along with a full description of the > reasoning behind this decision. > Will do. For the sake of v0 for now: 1) the task itself (task == current) for obvious reasons: it already can 2) from external interfaces: CAP_SYS_NICE There might be an argument for CAP_SYS_ADMIN, but CAP_SYS_NICE has access to scheduling controls, and mbind uses CAP_SYS_NICE to validate whether shared pages can be migrated. The same is true of migrate_pages and other memory management controls. For this reason, I chose to gate the task syscalls behind CAP_SYS_NICE unless (task == current). I'm by no means an expert in this area, so slap away if i'm egregiously wrong here. I will add additional security context to v2 as to what impacts changing a mempolicy can have at runtime. This will mostly be related to cpusets implications, as mempolicy itself is not a "constraining" interface in terms of security. For example: one can mbind/interleave/whatever a set of nodes, and then use migrate_pages or move_pages to violate that mempolicy. This is explicitly allowed and discussed in the implementation of the existing syscalls / libnuma. However, if cpusets must be respected. This is why i refactored out replace_mempolicy and reused it, because this enforcement is already handled by checking task->mems_allowed. > > 3. Add external interfaces which allow for a task mempolicy to be > > modified by another task. This is implemented in 4 syscalls > > and a procfs interface: > > sys_set_task_mempolicy > > sys_get_task_mempolicy > > sys_set_task_mempolicy_home_node > > sys_task_mbind > > /proc/[pid]/mempolicy > > Why is the procfs interface needed? Doesn't it simply duplicate the > syscall interface? Please update [0/N] with a description of this > decision. > Honestly I wrote the procfs interface first, and then came back around to just implement the syscalls. mbind is not friendly to being procfs'd so if the preference is to have only one, not both, then it should probably be the syscalls. That said, when I introduce weighted interleave on top of this, having a simple procfs interface to those weights would be valuable, so I imagined something like `proc/mempolicy` to determine if interleave was being used and something like `proc/mpol_interleave_weights` for a clean interface to update weights. However, in the same breath, I have a prior RFC with set/get_mempolicy2 which could probably take all future mempolicy extensions and wrap them up into one pair of syscalls, instead of us ending up with 200 more sys_mempolicy_whatever as memory attached fabrics become more common. So... yeah... the is one area I think the community very much needs to comment: set/get_mempolicy2, many new mempolicy syscalls, procfs? All of the above? The procfs route provides a command-line user a nice, clean way to update policies without the need for an additional tool, but if there is an "all or nothing" preference on mempolicy controls - then procfs is probably not the way to go. This RFC at least shows there are options. I very much welcome input in this particular area. > > The new syscalls are the same as their current-task counterparts, > > except that they take a pid as an argument. The exception is > > task_mbind, which required a new struct due to the number of args. > > > > The /proc/pid/mempolicy re-uses the interface mpol_parse_str format > > to enable get/set of mempolicy via procsfs. > > > > mpol_parse_str format: > > <mode>[=<flags>][:<nodelist>] > > > > Example usage: > > > > echo "default" > /proc/pid/mempolicy > > echo "prefer=relative:0" > /proc/pid/mempolicy > > echo "interleave:0-3" > /proc/pid/mempolicy > > What do we get when we read from this? Please add to changelog. > > Also a description of the permissions for this procfs file, along with > reasoning. If it has global readability, and there's something > interesting in there, let's show that the security implications have > been fully considered. > Ah, should have included that. Will add. For the sake of v0: Current permissions: (S_IRUSR|S_IWUSR) Which presumes the owner and obviosly root. Tried parity the syscall. the total set of (current) policy outputs are: "default" "local" "prefer:node" "prefer=static:node" "prefer=relative:node" "prefer (many):nodelist" "prefer (many)=static:nodelist" "prefer (many)=relative:nodelist" "interleave:nodelist" "interleave=static:nodelist" "interleave=relative:nodelist" "bind:nodelist" "bind=static:nodelist" "bind=relative:nodelist" There doesn't seem to be much of a security implication here, at least not anything that can't already be gleaned via something like numa_maps, but it does provide *some* level of memory placement imformation, so it's still probably best gated behind owner/root. That said, changing this policy may not imply it is actually used, because individual VMA policies can override this policy. So it really doesn't provide much info at all. Something I just noticed: mpol_parse_str does not presently support the numa balancing flag, so that would have to be added to achieve parity with the set_mempolicy syscall. > > Changing the mempolicy does not induce memory migrations via the > > procfs interface (which is the exact same behavior as set_mempolicy). > > > Thanks for taking a quick look! ~Gregory
Sorry, didn't have much time to do a proper review. Couple of points here at least. On Wed 22-11-23 17:24:10, Gregory Price wrote: > On Wed, Nov 22, 2023 at 01:33:48PM -0800, Andrew Morton wrote: > > On Wed, 22 Nov 2023 16:11:49 -0500 Gregory Price <gourry.memverge@gmail.com> wrote: > > > > > The patch set changes task->mempolicy to be modifiable by tasks other > > > than just current. > > > > > > The ultimate goal is to make mempolicy more flexible and extensible, > > > such as adding interleave weights (which may need to change at runtime > > > due to hotplug events). Making mempolicy externally modifiable allows > > > for userland daemons to make runtime performance adjustments to running > > > tasks without that software needing to be made numa-aware. > > > > Please add to this [0/N] a full description of the security aspect: who > > can modify whose mempolicy, along with a full description of the > > reasoning behind this decision. > > > > Will do. For the sake of v0 for now: > > 1) the task itself (task == current) > for obvious reasons: it already can > > 2) from external interfaces: CAP_SYS_NICE Makes sense. [...] > > > 3. Add external interfaces which allow for a task mempolicy to be > > > modified by another task. This is implemented in 4 syscalls > > > and a procfs interface: > > > sys_set_task_mempolicy > > > sys_get_task_mempolicy > > > sys_set_task_mempolicy_home_node > > > sys_task_mbind > > > /proc/[pid]/mempolicy > > > > Why is the procfs interface needed? Doesn't it simply duplicate the > > syscall interface? Please update [0/N] with a description of this > > decision. > > > > Honestly I wrote the procfs interface first, and then came back around > to just implement the syscalls. mbind is not friendly to being procfs'd > so if the preference is to have only one, not both, then it should > probably be the syscalls. > > That said, when I introduce weighted interleave on top of this, having a > simple procfs interface to those weights would be valuable, so I > imagined something like `proc/mempolicy` to determine if interleave was > being used and something like `proc/mpol_interleave_weights` for a clean > interface to update weights. > > However, in the same breath, I have a prior RFC with set/get_mempolicy2 > which could probably take all future mempolicy extensions and wrap them > up into one pair of syscalls, instead of us ending up with 200 more > sys_mempolicy_whatever as memory attached fabrics become more common. > > So... yeah... the is one area I think the community very much needs to > comment: set/get_mempolicy2, many new mempolicy syscalls, procfs? All > of the above? I think we should actively avoid using proc interface. The most reasonable way would be to add get_mempolicy2 interface that would allow extensions and then create a pidfd counterpart to allow acting on a remote task. The latter would require some changes to make mempolicy code less current oriented.
On Mon, Nov 27, 2023 at 04:29:56PM +0100, Michal Hocko wrote: > Sorry, didn't have much time to do a proper review. Couple of points > here at least. > > > > > So... yeah... the is one area I think the community very much needs to > > comment: set/get_mempolicy2, many new mempolicy syscalls, procfs? All > > of the above? > > I think we should actively avoid using proc interface. The most > reasonable way would be to add get_mempolicy2 interface that would allow > extensions and then create a pidfd counterpart to allow acting on a > remote task. The latter would require some changes to make mempolicy > code less current oriented. Sounds good, I'll pull my get/set_mempolicy2 RFC on top of this. Just context: patches 1-6 refactor mempolicy to allow remote task twiddling (fixing the current-oriented issues), and patch 7 adds the pidfd interfaces you describe above. Couple Questions 1) Should we consider simply adding a pidfd arg to set/get_mempolicy2, where if (pidfd == 0), then it operates on current, otherwise it operates on the target task? That would mitigate the need for what amounts to the exact same interface. 2) Should we combine all the existing operations into set_mempolicy2 and add an operation arg. set_mempolicy2(pidfd, arg_struct, len) struct { int pidfd; /* optional */ int operation; /* describe which op_args to use */ union { struct { } set_mempolicy; struct { } set_vma_home_node; struct { } mbind; ... } op_args; } args; capturing: sys_set_mempolicy sys_set_mempolicy_home_node sys_mbind or should we just make a separate interface for mbind/home_node to limit complexity of the single syscall? Personally I like the dispatch for the extensibility nature of the arg struct, but I can understand wanting to limit complexity of a syscall interface for a variety of reasons. ~Gregory
On Mon 27-11-23 11:14:44, Gregory Price wrote: > On Mon, Nov 27, 2023 at 04:29:56PM +0100, Michal Hocko wrote: > > Sorry, didn't have much time to do a proper review. Couple of points > > here at least. > > > > > > > > So... yeah... the is one area I think the community very much needs to > > > comment: set/get_mempolicy2, many new mempolicy syscalls, procfs? All > > > of the above? > > > > I think we should actively avoid using proc interface. The most > > reasonable way would be to add get_mempolicy2 interface that would allow > > extensions and then create a pidfd counterpart to allow acting on a > > remote task. The latter would require some changes to make mempolicy > > code less current oriented. > > Sounds good, I'll pull my get/set_mempolicy2 RFC on top of this. > > Just context: patches 1-6 refactor mempolicy to allow remote task > twiddling (fixing the current-oriented issues), and patch 7 adds the pidfd > interfaces you describe above. > > > Couple Questions > > 1) Should we consider simply adding a pidfd arg to set/get_mempolicy2, > where if (pidfd == 0), then it operates on current, otherwise it > operates on the target task? That would mitigate the need for what > amounts to the exact same interface. This wouldn't fit into existing pidfd interfaces I am aware of. We assume pidfd to be real fd, no special cases. > 2) Should we combine all the existing operations into set_mempolicy2 and > add an operation arg. > > set_mempolicy2(pidfd, arg_struct, len) > > struct { > int pidfd; /* optional */ > int operation; /* describe which op_args to use */ > union { > struct { > } set_mempolicy; > struct { > } set_vma_home_node; > struct { > } mbind; > ... > } op_args; > } args; > > capturing: > sys_set_mempolicy > sys_set_mempolicy_home_node > sys_mbind > > or should we just make a separate interface for mbind/home_node to > limit complexity of the single syscall? My preference would be to go with specific syscalls. Multiplexing syscalls have turned much more complex and less flexible over time. Just have a look at futex.
On Tue, Nov 28, 2023 at 10:45:02AM +0100, Michal Hocko wrote: > > 2) Should we combine all the existing operations into set_mempolicy2 and > > add an operation arg. > > > > set_mempolicy2(pidfd, arg_struct, len) > > > > struct { > > int pidfd; /* optional */ > > int operation; /* describe which op_args to use */ > > union { > > struct { > > } set_mempolicy; > > struct { > > } set_vma_home_node; > > struct { > > } mbind; > > ... > > } op_args; > > } args; > > > > capturing: > > sys_set_mempolicy > > sys_set_mempolicy_home_node > > sys_mbind > > > > or should we just make a separate interface for mbind/home_node to > > limit complexity of the single syscall? > > My preference would be to go with specific syscalls. Multiplexing > syscalls have turned much more complex and less flexible over time. > Just have a look at futex. got it, that simplifies things a bit. I can pull my set/get mempolicy2 work forward and just keep the interfaces pretty much the same. Only difference being an argument structure that is extensible and possibly some additional refactoring in do_get_mempolicy to make things a bit cleaner. ~Gregory
On Wed 22-11-23 16:11:51, Gregory Price wrote: [...] > @@ -982,11 +991,11 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, > } > > out: > - mpol_cond_put(pol); > + mpol_put(pol); > if (vma) > mmap_read_unlock(mm); > if (pol_refcount) > - mpol_put(pol_refcount); > + mpol_cond_put(pol_refcount); Maybe I am just misreading the patch but pol_refcount should be always NULL with this patch > return err; > } > > -- > 2.39.1 >
On Wed 22-11-23 16:11:54, Gregory Price wrote: [...] > + > + /* > + * Behavior when task == current allows a task modifying itself > + * to bypass the check in get_task_mm and acquire the mm directly > + */ > + if (task == current) { > + mm = task->mm; > + mmget(mm); > + } else > + mm = get_task_mm(task); Similar question as in the previous patch. When is a kthread legit to do manipulate borrowed mm memory policy? > + > + if (!mm) > + return -ENODEV; Is this an expected error? No mm means task dying and mm has been already released. Should we simply return ESRCH wouldbe a better error code IMO. This should never happen for the current task so it sould be safe to add.
On Wed 22-11-23 16:11:55, Gregory Price wrote: [...] > + * Like get_vma_policy and get_task_policy, must hold alloc/task_lock > + * while calling this. > + */ > +static struct mempolicy *get_task_vma_policy(struct task_struct *task, > + struct vm_area_struct *vma, > + unsigned long addr, int order, > + pgoff_t *ilx) [...] You should add lockdep annotation for alloc_lock/task_lock here for clarity and also... > @@ -1844,16 +1899,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, > struct mempolicy *get_vma_policy(struct vm_area_struct *vma, > unsigned long addr, int order, pgoff_t *ilx) > { > - struct mempolicy *pol; > - > - pol = __get_vma_policy(vma, addr, ilx); > - if (!pol) > - pol = get_task_policy(current); > - if (pol->mode == MPOL_INTERLEAVE) { > - *ilx += vma->vm_pgoff >> order; > - *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); > - } > - return pol; > + return get_task_vma_policy(current, vma, addr, order, ilx); I do not think that all get_vma_policy take task_lock (just random check dequeue_hugetlb_folio_vma->huge_node->get_vma_policy AFAICS) Also I do not see policy_nodemask to be handled anywhere. That one is used along with get_vma_policy (sometimes hidden like in alloc_pages_mpol). It has a dependency on cpuset_nodemask_valid_mems_allowed. That means that e.g. mbind on a remote task would be constrained by current task cpuset when allocating migration targets for the target task. I am wondering how many other dependencies like that are lurking there.
On Tue, Nov 28, 2023 at 03:07:38PM +0100, Michal Hocko wrote: > On Wed 22-11-23 16:11:54, Gregory Price wrote: > [...] > > + > > + if (!mm) > > + return -ENODEV; > > Is this an expected error? No mm means task dying and mm has been > already released. Should we simply return ESRCH wouldbe a better error > code IMO. This should never happen for the current task so it sould be > safe to add. Hm, good point, and in fact i should make sure that the new pidfd code is not susceptible to this. This particular set of lines is not present in the new RFC (not out yet) but the access to the remote task mm is more controlled via mm_access, so I would presume that mm_access either returns error, I will make sure it cannot return null and check for this. > > -- > Michal Hocko > SUSE Labs
[restoring the CC list, I supect you didn't want this to be a private discussion] On Tue 28-11-23 09:10:18, Gregory Price wrote: > On Tue, Nov 28, 2023 at 03:07:10PM +0100, Michal Hocko wrote: > > On Wed 22-11-23 16:11:51, Gregory Price wrote: > > [...] > > > @@ -982,11 +991,11 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask, > > > } > > > > > > out: > > > - mpol_cond_put(pol); > > > + mpol_put(pol); > > > if (vma) > > > mmap_read_unlock(mm); > > > if (pol_refcount) > > > - mpol_put(pol_refcount); > > > + mpol_cond_put(pol_refcount); > > > > Maybe I am just misreading the patch but pol_refcount should be always > > NULL with this patch > > > > earlier: > > + pol = pol_refcount = __get_vma_policy(vma, addr, &ilx); > > i can split this into two lines if preferred. > > If addr is not set, then yes pol_refcount is always null. My bad, missed that. Making that two lines would be easier to read but nothing I would insist on of course.
On Tue, Nov 28, 2023 at 03:11:06PM +0100, Michal Hocko wrote: > On Wed 22-11-23 16:11:55, Gregory Price wrote: > [...] > > + * Like get_vma_policy and get_task_policy, must hold alloc/task_lock > > + * while calling this. > > + */ > > +static struct mempolicy *get_task_vma_policy(struct task_struct *task, > > + struct vm_area_struct *vma, > > + unsigned long addr, int order, > > + pgoff_t *ilx) > [...] > > You should add lockdep annotation for alloc_lock/task_lock here for clarity and > also... > > @@ -1844,16 +1899,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, > > struct mempolicy *get_vma_policy(struct vm_area_struct *vma, > > unsigned long addr, int order, pgoff_t *ilx) > > { > > - struct mempolicy *pol; > > - > > - pol = __get_vma_policy(vma, addr, ilx); > > - if (!pol) > > - pol = get_task_policy(current); > > - if (pol->mode == MPOL_INTERLEAVE) { > > - *ilx += vma->vm_pgoff >> order; > > - *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); > > - } > > - return pol; > > + return get_task_vma_policy(current, vma, addr, order, ilx); > > I do not think that all get_vma_policy take task_lock (just random check > dequeue_hugetlb_folio_vma->huge_node->get_vma_policy AFAICS) > hm, i might have gotten turned around on this one. Forgot to check for external references to get_vma_policy. I thought I considered it, but i clearly did not leave myself any notes if I did. This pattern is troublesome, we're holding the task lock during the callback stack in __get_vma_policy - just incase that returns NULL so we can return the task policy instead. If that vma is shared, it will take the vma shared policy lock (sp->lock) I almost want to change this interface to return NULL if the VMA doesn't have one, and change callers to fetch the task policy explicitly instead of implicitly returning the task policy. At least then we'd only take the task lock on an explicit access to the *Task* policy. > Also I do not see policy_nodemask to be handled anywhere. That one is > used along with get_vma_policy (sometimes hidden like in > alloc_pages_mpol). It has a dependency on > cpuset_nodemask_valid_mems_allowed. That means that e.g. mbind on a > remote task would be constrained by current task cpuset when allocating > migration targets for the target task. I am wondering how many other > dependencies like that are lurking there. bah! thought i dug all these out, but i missed alloc_migration_target_by_mpol from do_mbind. I'll need to take another look at the calls to cpusets interfaces to make sure i dig this out. The number of hidden accesses to current is really nasty :[ > -- > Michal Hocko > SUSE Labs
On Tue, Nov 28, 2023 at 03:11:06PM +0100, Michal Hocko wrote: > On Wed 22-11-23 16:11:55, Gregory Price wrote: > [...] > > + * Like get_vma_policy and get_task_policy, must hold alloc/task_lock > > + * while calling this. > > + */ > > +static struct mempolicy *get_task_vma_policy(struct task_struct *task, > > + struct vm_area_struct *vma, > > + unsigned long addr, int order, > > + pgoff_t *ilx) > [...] > > You should add lockdep annotation for alloc_lock/task_lock here for clarity and > also... > > @@ -1844,16 +1899,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, > > struct mempolicy *get_vma_policy(struct vm_area_struct *vma, > > unsigned long addr, int order, pgoff_t *ilx) > > { > > - struct mempolicy *pol; > > - > > - pol = __get_vma_policy(vma, addr, ilx); > > - if (!pol) > > - pol = get_task_policy(current); > > - if (pol->mode == MPOL_INTERLEAVE) { > > - *ilx += vma->vm_pgoff >> order; > > - *ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order); > > - } > > - return pol; > > + return get_task_vma_policy(current, vma, addr, order, ilx); > > I do not think that all get_vma_policy take task_lock (just random check > dequeue_hugetlb_folio_vma->huge_node->get_vma_policy AFAICS) > > Also I do not see policy_nodemask to be handled anywhere. That one is > used along with get_vma_policy (sometimes hidden like in > alloc_pages_mpol). It has a dependency on > cpuset_nodemask_valid_mems_allowed. That means that e.g. mbind on a > remote task would be constrained by current task cpuset when allocating > migration targets for the target task. I am wondering how many other > dependencies like that are lurking there. So after further investigation, I'm going to have to back out the changes that make home_node and mbind modifiable by an external task and revisit it at a later time. Right now, there's a very nasty rats nest of entanglement between mempolicy and vma/shmem that hides a bunch of accesses to current. It only becomes apparently when you start chasing all the callers of mpol_dup, which had another silent access to current->cpusets. mpol_dup calls the following: current_cpuset_is_being_rebound cpuset_mems_allowed(current) So we would need to do the following 1) create mpol_dup_task and make current explicit, not implicit 2) chase down all callers to mpol_dup and make sure it isn't generated from any of the task interfaces 3) if it is generated from the task interfaces, plumb a reference to current down through... somehow... if possible... Here's a ~1 hour chase that lead me to the conclusion that this will take considerably more work, and is not to be taken lightly: do_mbind mbind_range vma_modify_policy split_vma __split_vma vma_dup_policy mpol_dup vma_replace_policy mpol_dup vma->vm_ops->set_policy - see below __set_mempolicy_home_node mbind_range ... same as above ... digging into vma->vm_ops->set_policy we end up in mm/shmem.c shmem_set_policy mpol_set_shared_policy sp_alloc mpol_dup current_cpuset_is_being_rebound() cpuset_mems_allowed(current) Who knows what else is burried in the vma stack, but making vma mempolicies externally modifiable looks to be a much more monumental task than just simply making the task policy modifiable. For now i'm going to submit a V2 with home_node and mbind removed from the proposal. Those will take far more investigation. This also means that process_set_mempolicy should not be extended to allow for vma policy replacements. ~Gregory
The patch set changes task->mempolicy to be modifiable by tasks other than just current. The ultimate goal is to make mempolicy more flexible and extensible, such as adding interleave weights (which may need to change at runtime due to hotplug events). Making mempolicy externally modifiable allows for userland daemons to make runtime performance adjustments to running tasks without that software needing to be made numa-aware. This initial RFC involves 3 major updates the mempolicy. 1. Refactor modifying interfaces to accept a task as an argument, and change existing callers to send `current` in to retain the existing behavior. 2. Change locking behaviors to ensure task->mpol is referenced safely by acquiring the task_lock where required. Since allocators take the alloc lock (task lock), this successfully prevents changes from being made during allocations. 3. Add external interfaces which allow for a task mempolicy to be modified by another task. This is implemented in 4 syscalls and a procfs interface: sys_set_task_mempolicy sys_get_task_mempolicy sys_set_task_mempolicy_home_node sys_task_mbind /proc/[pid]/mempolicy The new syscalls are the same as their current-task counterparts, except that they take a pid as an argument. The exception is task_mbind, which required a new struct due to the number of args. The /proc/pid/mempolicy re-uses the interface mpol_parse_str format to enable get/set of mempolicy via procsfs. mpol_parse_str format: <mode>[=<flags>][:<nodelist>] Example usage: echo "default" > /proc/pid/mempolicy echo "prefer=relative:0" > /proc/pid/mempolicy echo "interleave:0-3" > /proc/pid/mempolicy Changing the mempolicy does not induce memory migrations via the procfs interface (which is the exact same behavior as set_mempolicy). Signed-off-by: Gregory Price <gregory.price@memverge.com> Gregory Price (11): mm/mempolicy: refactor do_set_mempolicy for code re-use mm/mempolicy: swap cond reference counting logic in do_get_mempolicy mm/mempolicy: refactor set_mempolicy stack to take a task argument mm/mempolicy: modify get_mempolicy call stack to take a task argument mm/mempolicy: modify set_mempolicy_home_node to take a task argument mm/mempolicy: modify do_mbind to operate on task argument instead of current mm/mempolicy: add task mempolicy syscall variants mm/mempolicy: export replace_mempolicy for use by procfs mm/mempolicy: build mpol_parse_str unconditionally mm/mempolicy: mpol_parse_str should ignore trailing characters in nodelist fs/proc: Add mempolicy attribute to allow read/write of task mempolicy arch/x86/entry/syscalls/syscall_32.tbl | 4 + arch/x86/entry/syscalls/syscall_64.tbl | 4 + fs/proc/Makefile | 1 + fs/proc/base.c | 1 + fs/proc/internal.h | 1 + fs/proc/mempolicy.c | 117 +++++++ include/linux/mempolicy.h | 13 +- include/linux/syscalls.h | 14 + include/uapi/asm-generic/unistd.h | 10 +- include/uapi/linux/mempolicy.h | 10 + mm/mempolicy.c | 432 +++++++++++++++++++------ 11 files changed, 502 insertions(+), 105 deletions(-) create mode 100644 fs/proc/mempolicy.c