[v2] mm: add new syscall pidfd_set_mempolicy().

Message ID 20221111084051.2121029-1-hezhongkun.hzk@bytedance.com (mailing list archive)
State New
Series [v2] mm: add new syscall pidfd_set_mempolicy().

Commit Message

Zhongkun He Nov. 11, 2022, 8:40 a.m. UTC
There are use cases where system management utilities want to
apply memory policy to processes to make better use of memory.

These processes do not set a memory policy themselves; rather,
the job of reporting memory usage and setting the policy is
offloaded to a userspace daemon.

Cpuset has the ability to dynamically modify the NUMA nodes of a
memory policy, but not the mode, which determines how memory is
allocated.

To solve the issue above, introduce a new syscall,
pidfd_set_mempolicy(2). The syscall sets the NUMA memory policy of
the thread specified in pidfd.

Page allocation through a task or vma policy occurs in the fault
path, where we hold the mmap_lock for read. Because replacing the
task or vma policy requires that the mmap_lock be held for write,
the policy can't be freed out from under us while we're using it
for page allocation. But there are some corner cases (e.g.
alloc_pages()) which do not acquire any lock for read during page
allocation. For this reason, task_work is used in mpol_put_async()
to free the mempolicy in pidfd_set_mempolicy(). Thus, race
conditions are avoided.

The API is as follows,

                long pidfd_set_mempolicy(int pidfd, int mode,
                                     const unsigned long __user *nmask,
                                     unsigned long maxnode,
                                     unsigned int flags);

Sets the [pidfd] task's "task/process memory policy". The pidfd argument
is a PID file descriptor (see pidfd_open(2) man page) that specifies the
process to which the mempolicy is to be applied. The flags argument is
reserved for future use; currently, this argument must be specified as 0.
Please see the set_mempolicy(2) man page for more details about
the other arguments.

Permission to apply memory policy to another process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check (see
ptrace(2)); in addition, because of the performance implications
of applying the policy, the caller must have the CAP_SYS_NICE
capability.
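
For illustration, a minimal userspace caller could look like the
following sketch. Note that the syscall is not in any released kernel
and glibc has no wrapper; 451 is the x86-64 number assigned by this
patch, and the nodemask below simply binds the target to node 0:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/mempolicy.h>	/* MPOL_BIND */

	#ifndef __NR_pidfd_set_mempolicy
	#define __NR_pidfd_set_mempolicy 451	/* x86-64, per this patch */
	#endif

	int main(int argc, char **argv)
	{
		if (argc < 2) {
			fprintf(stderr, "usage: %s <pid>\n", argv[0]);
			return 1;
		}

		/* Obtain a pidfd for the target process. */
		int pidfd = syscall(SYS_pidfd_open, atoi(argv[1]), 0);
		if (pidfd < 0) {
			perror("pidfd_open");
			return 1;
		}

		/* Bind the target's future allocations to node 0. */
		unsigned long nodemask = 1UL << 0;
		if (syscall(__NR_pidfd_set_mempolicy, pidfd, MPOL_BIND,
			    &nodemask, 8 * sizeof(nodemask), 0) < 0) {
			perror("pidfd_set_mempolicy");
			return 1;
		}
		return 0;
	}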

Suggested-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |  19 +-
 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd.h               |   2 +-
 arch/arm64/include/asm/unistd32.h             |   3 +-
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 include/linux/cpuset.h                        |   2 +
 include/linux/mempolicy.h                     |   2 +
 include/linux/syscalls.h                      |   4 +
 include/uapi/asm-generic/unistd.h             |   5 +-
 kernel/sys_ni.c                               |   1 +
 mm/mempolicy.c                                | 178 ++++++++++++++----
 25 files changed, 187 insertions(+), 45 deletions(-)

Comments

Andrew Morton Nov. 11, 2022, 7:27 p.m. UTC | #1
On Fri, 11 Nov 2022 16:40:51 +0800 Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:

> Page allocation through a task or vma policy occurs in the fault
> path, where we hold the mmap_lock for read. Because replacing the
> task or vma policy requires that the mmap_lock be held for write,
> the policy can't be freed out from under us while we're using it
> for page allocation. But there are some corner cases (e.g.
> alloc_pages()) which do not acquire any lock for read during page
> allocation. For this reason, task_work is used in mpol_put_async()
> to free the mempolicy in pidfd_set_mempolicy(). Thus, race
> conditions are avoided.

This sounds a bit suspicious.  Please share much more detail about
these races.  If we proceed with this design then mpol_put_async()
should have comments which fully describe the need for the async free.

How do we *know* that these races are fully prevented with this
approach?  How do we know that mpol_put_async() won't free the data
until the race window has fully passed?

Also, in some situations mpol_put_async() will free the data
synchronously anyway, so aren't these races still present?


Secondly, why was the `flags' argument added?  We might use it one day?
For what purpose?  I mean, every syscall could have a does-nothing
`flags' arg, but we don't do that.  What's the plan here?
kernel test robot Nov. 12, 2022, 2:09 a.m. UTC | #2
Hi Zhongkun,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Zhongkun-He/mm-add-new-syscall-pidfd_set_mempolicy/20221111-164156
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20221111084051.2121029-1-hezhongkun.hzk%40bytedance.com
patch subject: [PATCH v2] mm: add new syscall pidfd_set_mempolicy().
config: x86_64-randconfig-a014
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/da593185d0d1e8a20d1084142960f9ee46c5871b
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Zhongkun-He/mm-add-new-syscall-pidfd_set_mempolicy/20221111-164156
        git checkout da593185d0d1e8a20d1084142960f9ee46c5871b
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> mm/mempolicy.c:325:6: warning: no previous prototype for function 'mpol_put_async' [-Wmissing-prototypes]
   void mpol_put_async(struct task_struct *task, struct mempolicy *p)
        ^
   mm/mempolicy.c:325:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void mpol_put_async(struct task_struct *task, struct mempolicy *p)
   ^
   static 
   1 warning generated.


vim +/mpol_put_async +325 mm/mempolicy.c

   320	
   321	/*
   322	 * mpol destructor for pidfd_set_mempolicy().
   323	 * free mempolicy directly if task is null or task_work_add() failed.
   324	 */
 > 325	void mpol_put_async(struct task_struct *task, struct mempolicy *p)
   326	{
   327		enum task_work_notify_mode notify = TWA_RESUME;
   328	
   329		if (!atomic_dec_and_test(&p->refcnt))
   330			return;
   331	
   332		if (!task)
   333			goto out;
   334	
   335		init_task_work(&p->w.cb_head, mpol_free_async);
   336		if (task_work_pending(task))
   337			notify = TWA_SIGNAL; /* free memory in time */
   338	
   339		if (!task_work_add(task, &p->w.cb_head, notify))
   340			return;
   341	out:
   342		kmem_cache_free(policy_cache, p);
   343	}
   344
Zhongkun He Nov. 13, 2022, 4:41 p.m. UTC | #3
Hi Andrew, thanks for your reply.

> This sounds a bit suspicious.  Please share much more detail about
> these races.  If we proceed with this design then mpol_put_async()
> should have comments which fully describe the need for the async free.
> 
> How do we *know* that these races are fully prevented with this
> approach?  How do we know that mpol_put_async() won't free the data
> until the race window has fully passed?

A mempolicy can be either associated with a process or with a VMA.
All vma manipulation is somewhat protected by a down_read on
mmap_lock. In process context there is no locking, because only
the process accesses its own state.

Now we need to change the process-context mempolicy of the task
specified in pidfd. The mempolicy may be about to be freed by
pidfd_set_mempolicy() while alloc_pages() is using it; that is
where the race condition appears.

The process-context mempolicy is used in:
alloc_pages()
alloc_pages_bulk_array_mempolicy()
policy_mbind_nodemask()
mempolicy_slab_node()
.....

Say something like the following:

pidfd_set_mempolicy():           target task:
                                   alloc_pages():
                                     mpol = p->mempolicy;
  task_lock(task);
  old = task->mempolicy;
  task->mempolicy = new;
  task_unlock(task);
  mpol_put(old);
                                     /* old mpol has been freed */
                                     policy_node(...., mpol);
                                     __alloc_pages(mpol);
To reduce the use of locks and atomic operations (mpol_get/put)
in the hot path, task_work is used in mpol_put_async(): when the
target task exits to user mode, the process-context mempolicy is
no longer in use, and mpol_free_async() is invoked as a task_work
to free the mempolicy in the target context.


> Also, in some situations mpol_put_async() will free the data
> synchronously anyway, so aren't these races still present?

If the task has already run exit_task_work(), task_work_add() will
fail, and we can free the mempolicy directly because it is no longer
used.

> 
> Secondly, why was the `flags' argument added?  We might use it one day?
> For what purpose?  I mean, every syscall could have a does-nothing
> `flags' arg, but we don't do that.  What's the plan here?
> 
I found that some functions use 'flags' for extensibility, such as
process_madvise() and set_mempolicy_home_node(). Back to our case:
this operation has per-thread rather than per-process semantics, so
we could use flags as a switch for future extension, if any. But I'm
not sure.

Thanks.
Zhongkun He Nov. 14, 2022, 9:24 a.m. UTC | #4
Hi Andrew,
> This sounds a bit suspicious.  Please share much more detail about
> these races.  If we proceed with this design then mpol_put_async()
> should have comments which fully describe the need for the async free.
> 

Added some comments for the async free, and used TWA_SIGNAL_NO_IPI to
notify the @task.


-/*
- * mpol destructor for pidfd_set_mempolicy().
+/**
+ * mpol_put_async - free mempolicy asynchronously.
+ * @task: the target task to free mempolicy.
+ * @p : mempolicy to free
+ *
+ * @task must be specified by user.
   * free mempolicy directly if task is null or task_work_add() failed.
+ *
+ * A mempolicy can be either associated with a process or with a VMA.
+ * All vma manipulation is protected by mmap_lock. In process context
+ * there is no locking. If we need to apply a mempolicy to another
+ * task specified by pidfd, the original mempolicy may be about to be
+ * freed by pidfd_set_mempolicy() while the target task is using it.
+ * So mpol_put_async() is used to free the old mempolicy asynchronously.
   */
-void mpol_put_async(struct task_struct *task, struct mempolicy *p)
+static void mpol_put_async(struct task_struct *task, struct mempolicy *p)
  {
-       enum task_work_notify_mode notify = TWA_RESUME;
-
         if (!atomic_dec_and_test(&p->refcnt))
                 return;

@@ -333,10 +342,8 @@ void mpol_put_async(struct task_struct *task, struct mempolicy *p)
                 goto out;

         init_task_work(&p->w.cb_head, mpol_free_async);
-       if (task_work_pending(task))
-               notify = TWA_SIGNAL; /* free memory in time */

-       if (!task_work_add(task, &p->w.cb_head, notify))
+       if (!task_work_add(task, &p->w.cb_head, TWA_SIGNAL_NO_IPI))
                 return;
  out:
         kmem_cache_free(policy_cache, p);



Thanks.
Michal Hocko Nov. 14, 2022, 11:44 a.m. UTC | #5
On Mon 14-11-22 00:41:21, Zhongkun He wrote:
> Hi Andrew, thanks for your reply.
> 
> > This sounds a bit suspicious.  Please share much more detail about
> > these races.  If we proceed with this design then mpol_put_async()
> > should have comments which fully describe the need for the async free.
> > 
> > How do we *know* that these races are fully prevented with this
> > approach?  How do we know that mpol_put_async() won't free the data
> > until the race window has fully passed?
> 
> A mempolicy can be either associated with a process or with a VMA.
> All vma manipulation is somewhat protected by a down_read on
> mmap_lock. In process context there is no locking, because only
> the process accesses its own state.

We shouldn't really rely on mmap_sem for this IMO. There is alloc_lock
(aka task lock) that makes sure the policy is stable so that caller can
atomically take a reference and hold on the policy. And we do not do
that consistently and this should be fixed. E.g. just looking at some
random places like allowed_mems_nr (relying on get_task_policy) is
completely lockless and some paths (like fadvise) do not use any of the
explicit (alloc_lock) or implicit (mmap_lock) locking. That means that
the task_work based approach cannot really work in this case, right?

Playing more tricks will not really help long term. So while your patch
tries to workaround the current state of the art I do not think we
really want that. As stated previously, I would much rather see proper
reference counting instead. I thought we have agreed this would be the
first approach unless the resulting performance is really bad. Have you
concluded that to be the case? I do not see any numbers or notes in the
changelog.
Michal Hocko Nov. 14, 2022, 11:46 a.m. UTC | #6
On Mon 14-11-22 12:44:48, Michal Hocko wrote:
> On Mon 14-11-22 00:41:21, Zhongkun He wrote:
> > Hi Andrew, thanks for your reply.
> > 
> > > This sounds a bit suspicious.  Please share much more detail about
> > > these races.  If we proceed with this design then mpol_put_async()
> > > should have comments which fully describe the need for the async free.
> > > 
> > > How do we *know* that these races are fully prevented with this
> > > approach?  How do we know that mpol_put_async() won't free the data
> > > until the race window has fully passed?
> > 
> > A mempolicy can be either associated with a process or with a VMA.
> > All vma manipulation is somewhat protected by a down_read on
> > mmap_lock. In process context there is no locking, because only
> > the process accesses its own state.
> 
> We shouldn't really rely on mmap_sem for this IMO. There is alloc_lock
> (aka task lock) that makes sure the policy is stable so that caller can
> atomically take a reference and hold on the policy. And we do not do
> that consistently and this should be fixed. E.g. just looking at some
> random places like allowed_mems_nr (relying on get_task_policy) is
> completely lockless and some paths (like fadvise) do not use any of the
> explicit (alloc_lock) or implicit (mmap_lock) locking. That means that
> the task_work based approach cannot really work in this case, right?

Just to be more explicit: the task_work based approach still requires
additional synchronization among different threads unless I miss
something, so this is a really fragile synchronization model.
Zhongkun He Nov. 14, 2022, 3:12 p.m. UTC | #7
Sorry, Michal. I don't know if my expression is accurate.
> 
> We shouldn't really rely on mmap_sem for this IMO. 

  Yes, we should rely on mmap_sem for vma->vm_policy, but not for
  the process-context policy (task->mempolicy).

> There is alloc_lock
> (aka task lock) that makes sure the policy is stable so that caller can
> atomically take a reference and hold on the policy. And we do not do
> that consistently and this should be fixed.

I saw some explanations in the doc ("numa_memory_policy.rst") and
in the comments (mempolicy.h) about why locks and reference counts
are not used during page allocation:

In process context there is no locking because only the process accesses
its own state.

During run-time "usage" of the policy, we attempt to minimize atomic
operations on the reference count, as this can lead to cache lines
bouncing between cpus and NUMA nodes.  "Usage" here means one of
the following:
1) querying of the policy, either by the task itself [using
the get_mempolicy() API discussed below] or by another task using
the /proc/<pid>/numa_maps interface.

2) examination of the policy to determine the policy mode and
associated node or node lists, if any, for page allocation.
This is considered a "hot path".  Note that for MPOL_BIND, the
"usage" extends across the entire allocation process, which may
sleep during page reclamation, because the BIND policy nodemask
is used, by reference, to filter ineligible nodes.

> E.g. just looking at some
> random places like allowed_mems_nr (relying on get_task_policy) is
> completely lockless and some paths (like fadvise) do not use any of the
> explicit (alloc_lock) or implicit (mmap_lock) locking. That means that
> the task_work based approach cannot really work in this case, right?

The task_work based approach (mpol_put_async()) allows the mempolicy
release to be transferred from the pidfd_set_mempolicy() context to
the target process context. The old mempolicy dropped by
pidfd_set_mempolicy() will be freed by a task_work (mpol_free_async())
whenever the target task exits to user mode. At that point
task->mempolicy is no longer in use, thus avoiding race conditions.

pidfd process context:
void mpol_put_async()
{
	.....
	init_task_work(&p->w.cb_head, mpol_free_async);
	if (!task_work_add(task, &p->w.cb_head, TWA_SIGNAL_NO_IPI))
		return;
}

target task:
/*
 * There is no lock, and the mempolicy may be about to be freed
 * by pidfd_set_mempolicy().
 */
pol = get_task_policy()
page = __alloc_pages(..pol)
.....
exit_to_user_mode
	task_work_run()
		/*
		 * It's safe to free the old mempolicy dropped by
		 * pidfd_set_mempolicy() at this time.
		 */
		mpol_free_async()

> Playing more tricks will not really help long term. So while your patch
> tries to workaround the current state of the art I do not think we
> really want that. As stated previously, I would much rather see proper
> reference counting instead. I thought we have agreed this would be the
> first approach unless the resulting performance is really bad. Have you
> concluded that to be the case? I do not see any numbers or notes in the
> changelog.

I have tried it, but I found that using task_work to release the
old policy in the target process context can solve the problem; I'm
not sure, though. My expression may not be very clear; looking
forward to your reply.

Thanks.
Michal Hocko Nov. 14, 2022, 5:52 p.m. UTC | #8
On Mon 14-11-22 12:46:53, Michal Hocko wrote:
> On Mon 14-11-22 12:44:48, Michal Hocko wrote:
> > On Mon 14-11-22 00:41:21, Zhongkun He wrote:
> > > Hi Andrew, thanks for your reply.
> > > 
> > > > This sounds a bit suspicious.  Please share much more detail about
> > > > these races.  If we proceed with this design then mpol_put_async()
> > > > should have comments which fully describe the need for the async free.
> > > > 
> > > > How do we *know* that these races are fully prevented with this
> > > > approach?  How do we know that mpol_put_async() won't free the data
> > > > until the race window has fully passed?
> > > 
> > > A mempolicy can be either associated with a process or with a VMA.
> > > All vma manipulation is somewhat protected by a down_read on
> > > mmap_lock. In process context there is no locking, because only
> > > the process accesses its own state.
> > 
> > We shouldn't really rely on mmap_sem for this IMO. There is alloc_lock
> > (aka task lock) that makes sure the policy is stable so that caller can
> > atomically take a reference and hold on the policy. And we do not do
> > that consistently and this should be fixed. E.g. just looking at some
> > random places like allowed_mems_nr (relying on get_task_policy) is
> > completely lockless and some paths (like fadvise) do not use any of the
> > explicit (alloc_lock) or implicit (mmap_lock) locking. That means that
> > the task_work based approach cannot really work in this case, right?
> 
> Just to be more explicit: the task_work based approach still requires
> additional synchronization among different threads unless I miss
> something, so this is a really fragile synchronization model.

Scratch that. I've managed to confuse myself. Multi-threading doesn't
play any role, as the mempolicy changed by the syscall is per-task_struct,
so the task_work context is indeed mutually exclusive with any in-kernel
use of the policy.

I will need to think about it some more.
Michal Hocko Nov. 14, 2022, 6:12 p.m. UTC | #9
On Mon 14-11-22 23:12:00, Zhongkun He wrote:
> Sorry, Michal. I don't know if my expression is accurate.
> > 
> > We shouldn't really rely on mmap_sem for this IMO.
> 
>  Yes, we should rely on mmap_sem for vma->vm_policy, but not for
>  the process-context policy (task->mempolicy).

But the caller has no way to know which kind of policy is returned so
the locking cannot be conditional on the policy type.

> > There is alloc_lock
> > (aka task lock) that makes sure the policy is stable so that caller can
> > atomically take a reference and hold on the policy. And we do not do
> > that consistently and this should be fixed.
> 
> I saw some explanations in the doc ("numa_memory_policy.rst") and
> in the comments (mempolicy.h) about why locks and reference counts
> are not used during page allocation:
> 
> In process context there is no locking because only the process accesses
> its own state.
> 
> During run-time "usage" of the policy, we attempt to minimize atomic
> operations on the reference count, as this can lead to cache lines
> bouncing between cpus and NUMA nodes.

Yes this is all understood but the level of the overhead is not really
clear. So the question is whether this will induce a visible overhead.
Because from the maintainability point of view it is much less costly to
have a clear life time model. Right now we have a mix of reference
counting and per-task requirements which is rather subtle and easy to
get wrong. In an ideal world we would have get_vma_policy always
returning a reference counted policy or NULL. If we really need to
optimize for cache line bouncing we can go with per cpu reference
counters (something that was not available at the time the mempolicy
code has been introduced).

So I am not saying that the task_work based solution is not possible; I
just think that this looks like a good opportunity to get away from the
existing subtle model.
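
To make the direction concrete, here is a rough sketch of a percpu_ref
based lifetime. Nothing in this patch does this; the mempolicy_pcpu and
mpol_percpu_* names are made up for illustration, while policy_cache is
the existing slab cache in mm/mempolicy.c:

	#include <linux/percpu-refcount.h>
	#include <linux/slab.h>

	/* Hypothetical variant of struct mempolicy using percpu_ref. */
	struct mempolicy_pcpu {
		struct percpu_ref ref;	/* instead of atomic_t refcnt */
		/* ... the remaining struct mempolicy fields ... */
	};

	/* Called once the last reference drops after percpu_ref_kill(). */
	static void mpol_percpu_release(struct percpu_ref *ref)
	{
		struct mempolicy_pcpu *pol =
			container_of(ref, struct mempolicy_pcpu, ref);

		kmem_cache_free(policy_cache, pol);
	}

	static struct mempolicy_pcpu *mpol_alloc_pcpu(void)
	{
		struct mempolicy_pcpu *pol =
			kmem_cache_alloc(policy_cache, GFP_KERNEL);

		if (!pol)
			return NULL;
		if (percpu_ref_init(&pol->ref, mpol_percpu_release, 0,
				    GFP_KERNEL)) {
			kmem_cache_free(policy_cache, pol);
			return NULL;
		}
		return pol;
	}

On the hot path percpu_ref_get()/percpu_ref_put() only touch per-CPU
counters, so there is no cache line bouncing; when a policy is replaced,
percpu_ref_kill() switches the counter to atomic mode and the release
callback frees the policy once the last user is done.
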
Zhongkun He Nov. 15, 2022, 7:39 a.m. UTC | #10
>>> We shouldn't really rely on mmap_sem for this IMO.
>>
>>   Yes, we should rely on mmap_sem for vma->vm_policy, but not for
>>   the process-context policy (task->mempolicy).
> 
> But the caller has no way to know which kind of policy is returned so
> the locking cannot be conditional on the policy type.

Yes. vma->vm_policy is protected by mmap_sem, which is reliable if
we want to add a new API (pidfd_mbind()) to change the vma->vm_policy
of the task specified in pidfd, but not for pidfd_set_mempolicy()
(task->mempolicy is protected by alloc_lock).

> 
> Yes this is all understood but the level of the overhead is not really
> clear. So the question is whether this will induce a visible overhead.
OK, I will try it.

> Because from the maintainability point of view it is much less costly to
> have a clear life time model. Right now we have a mix of reference
> counting and per-task requirements which is rather subtle and easy to
> get wrong. In an ideal world we would have get_vma_policy always
> returning a reference counted policy or NULL. If we really need to
> optimize for cache line bouncing we can go with per cpu reference
> counters (something that was not available at the time the mempolicy
> code has been introduced).
> 
> So I am not saying that the task_work based solution is not possible I
> just think that this looks like a good opportunity to get from the
> existing subtle model.

OK, I got it. Thanks for your reply and suggestions.


Zhongkun.
Huang, Ying Nov. 16, 2022, 7:04 a.m. UTC | #11
Zhongkun He <hezhongkun.hzk@bytedance.com> writes:

> There are use cases where system management utilities want to
> apply memory policy to processes to make better use of memory.
>
> These processes do not set a memory policy themselves; rather,
> the job of reporting memory usage and setting the policy is
> offloaded to a userspace daemon.
>
> Cpuset has the ability to dynamically modify the NUMA nodes of a
> memory policy, but not the mode, which determines how memory is
> allocated.
>
> To solve the issue above, introduce a new syscall,
> pidfd_set_mempolicy(2). The syscall sets the NUMA memory policy of
> the thread specified in pidfd.
>
> Page allocation through a task or vma policy occurs in the fault
> path, where we hold the mmap_lock for read. Because replacing the
> task or vma policy requires that the mmap_lock be held for write,
> the policy can't be freed out from under us while we're using it
> for page allocation. But there are some corner cases (e.g.
> alloc_pages()) which do not acquire any lock for read during page
> allocation. For this reason, task_work is used in mpol_put_async()
> to free the mempolicy in pidfd_set_mempolicy(). Thus, race
> conditions are avoided.
>
> The API is as follows,
>
>                 long pidfd_set_mempolicy(int pidfd, int mode,
>                                      const unsigned long __user *nmask,
>                                      unsigned long maxnode,
>                                      unsigned int flags);
>
> Sets the [pidfd] task's "task/process memory policy". The pidfd argument
> is a PID file descriptor (see pidfd_open(2) man page) that specifies the
> process to which the mempolicy is to be applied. The flags argument is
> reserved for future use; currently, this argument must be specified as 0.
> Please see the set_mempolicy(2) man page for more details about
> the other arguments.

I suggest moving the flags in the "mode" parameter (MPOL_F_STATIC_NODES,
MPOL_F_RELATIVE_NODES, MPOL_F_NUMA_BALANCING, etc.) to the "flags"
parameter; otherwise, why add it?

And how about adding a "home_node" parameter?  I don't think that it's a
good idea to add another new syscall for pidfd_set_mempolicy_home_node()
in the future.
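
To make that suggestion concrete, a purely illustrative sketch (the
MPOL_* definitions come from the uapi <linux/mempolicy.h>; 451 is the
x86-64 number from this patch, and the kernel as posted rejects any
non-zero flags, so the second helper shows the *suggested* semantics):

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/mempolicy.h>	/* MPOL_INTERLEAVE, MPOL_F_STATIC_NODES */

	#ifndef __NR_pidfd_set_mempolicy
	#define __NR_pidfd_set_mempolicy 451	/* x86-64, per this patch */
	#endif

	/* Today the mode flags ride along inside "mode" (set_mempolicy(2)): */
	static long interleave_static_self(unsigned long *nmask,
					   unsigned long maxnode)
	{
		return syscall(__NR_set_mempolicy,
			       MPOL_INTERLEAVE | MPOL_F_STATIC_NODES,
			       nmask, maxnode);
	}

	/* The suggestion: carry them in the new syscall's "flags" instead: */
	static long interleave_static_pidfd(int pidfd, unsigned long *nmask,
					    unsigned long maxnode)
	{
		return syscall(__NR_pidfd_set_mempolicy, pidfd,
			       MPOL_INTERLEAVE, nmask, maxnode,
			       MPOL_F_STATIC_NODES);
	}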

> Permission to apply memory policy to another process is governed by a
> ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check (see
> ptrace(2)); in addition, because of the performance implications
> of applying the policy, the caller must have the CAP_SYS_NICE
> capability.
>
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
> ---
>  .../admin-guide/mm/numa_memory_policy.rst     |  19 +-
>  arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
>  arch/arm/tools/syscall.tbl                    |   1 +
>  arch/arm64/include/asm/unistd.h               |   2 +-
>  arch/arm64/include/asm/unistd32.h             |   3 +-
>  arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
>  arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
>  arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
>  arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
>  arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
>  arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
>  arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
>  arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
>  arch/s390/kernel/syscalls/syscall.tbl         |   1 +
>  arch/sh/kernel/syscalls/syscall.tbl           |   1 +
>  arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
>  arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
>  arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
>  include/linux/cpuset.h                        |   2 +
>  include/linux/mempolicy.h                     |   2 +
>  include/linux/syscalls.h                      |   4 +
>  include/uapi/asm-generic/unistd.h             |   5 +-
>  kernel/sys_ni.c                               |   1 +
>  mm/mempolicy.c                                | 178 ++++++++++++++----
>  25 files changed, 187 insertions(+), 45 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
> index 5a6afecbb0d0..b45ceb0b165c 100644
> --- a/Documentation/admin-guide/mm/numa_memory_policy.rst
> +++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
> @@ -408,9 +408,10 @@ follows:
>  Memory Policy APIs
>  ==================
>  
> -Linux supports 4 system calls for controlling memory policy.  These APIS
> -always affect only the calling task, the calling task's address space, or
> -some shared object mapped into the calling task's address space.
> +Linux supports 5 system calls for controlling memory policy.  The first four
> +APIS affect only the calling task, the calling task's address space, or
> +some shared object mapped into the calling task's address space. The last
> +one sets the mempolicy of task specified in the pidfd.

IMHO, "The first four APIS" and "The last one" isn't easy to be
understood.  How about

"sys_pidfd_set_mempolicy sets the mempolicy of task specified in the
pidfd, the others affect only the calling task, ...".

>  
>  .. note::
>     the headers that define these APIs and the parameter data types for
> @@ -473,6 +474,18 @@ closest to which page allocation will come from. Specifying the home node overri
>  the default allocation policy to allocate memory close to the local node for an
>  executing CPU.
>  
> +Set [pidfd Task] Memory Policy::
> +
> +        long sys_pidfd_set_mempolicy(int pidfd, int mode,
> +                                     const unsigned long __user *nmask,
> +                                     unsigned long maxnode,
> +                                     unsigned int flags);
> +

Why add "sys_"?  I fount that there's no "sys_" before set_mempolicy()/mbind() etc.

> +Sets the task/process memory policy for the [pidfd] task. The pidfd argument
> +is a PID file descriptor (see pidfd_open(2) man page for details) that
> +specifies the process for which the mempolicy is applied to. The flags
> +argument is reserved for future use; currently, it must be specified as 0.
> +For the description of all other arguments, see set_mempolicy(2) man page.
>  
>  Memory Policy Command Line Interface
>  ====================================
> diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
> index 8ebacf37a8cf..3aeaefe7d45b 100644
> --- a/arch/alpha/kernel/syscalls/syscall.tbl
> +++ b/arch/alpha/kernel/syscalls/syscall.tbl
> @@ -490,3 +490,4 @@
>  558	common	process_mrelease		sys_process_mrelease
>  559	common  futex_waitv                     sys_futex_waitv
>  560	common	set_mempolicy_home_node		sys_ni_syscall
> +561	common	pidfd_set_mempolicy		sys_ni_syscall
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index ac964612d8b0..a7ccab8aafef 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -464,3 +464,4 @@
>  448	common	process_mrelease		sys_process_mrelease
>  449	common	futex_waitv			sys_futex_waitv
>  450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 037feba03a51..64a514f90131 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -39,7 +39,7 @@
>  #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
>  #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
>  
> -#define __NR_compat_syscalls		451
> +#define __NR_compat_syscalls		452
>  #endif
>  
>  #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index 604a2053d006..2e178e9152e6 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -907,7 +907,8 @@ __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
>  __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
> -
> +#define __NR_pidfd_set_mempolicy 451
> +__SYSCALL(__NR_pidfd_set_mempolicy, sys_pidfd_set_mempolicy)
>  /*
>   * Please add new compat syscalls above this comment and update
>   * __NR_compat_syscalls in asm/unistd.h.
> diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
> index 72c929d9902b..6f60981592b4 100644
> --- a/arch/ia64/kernel/syscalls/syscall.tbl
> +++ b/arch/ia64/kernel/syscalls/syscall.tbl
> @@ -371,3 +371,4 @@
>  448	common	process_mrelease		sys_process_mrelease
>  449	common  futex_waitv                     sys_futex_waitv
>  450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
> index b1f3940bc298..8e50bf8ec41d 100644
> --- a/arch/m68k/kernel/syscalls/syscall.tbl
> +++ b/arch/m68k/kernel/syscalls/syscall.tbl
> @@ -450,3 +450,4 @@
>  448	common	process_mrelease		sys_process_mrelease
>  449	common  futex_waitv                     sys_futex_waitv
>  450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
> index 820145e47350..f48e32532c5f 100644
> --- a/arch/microblaze/kernel/syscalls/syscall.tbl
> +++ b/arch/microblaze/kernel/syscalls/syscall.tbl
> @@ -456,3 +456,4 @@
>  448	common	process_mrelease		sys_process_mrelease
>  449	common  futex_waitv                     sys_futex_waitv
>  450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
> index 253ff994ed2e..58e7c3140f4a 100644
> --- a/arch/mips/kernel/syscalls/syscall_n32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
> @@ -389,3 +389,4 @@
>  448	n32	process_mrelease		sys_process_mrelease
>  449	n32	futex_waitv			sys_futex_waitv
>  450	n32	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	n32	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
> index 3f1886ad9d80..70090c3ada16 100644
> --- a/arch/mips/kernel/syscalls/syscall_n64.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
> @@ -365,3 +365,4 @@
>  448	n64	process_mrelease		sys_process_mrelease
>  449	n64	futex_waitv			sys_futex_waitv
>  450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	n64	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
> index 8f243e35a7b2..b0b0bad64fa0 100644
> --- a/arch/mips/kernel/syscalls/syscall_o32.tbl
> +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
> @@ -438,3 +438,4 @@
>  448	o32	process_mrelease		sys_process_mrelease
>  449	o32	futex_waitv			sys_futex_waitv
>  450	o32	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	o32	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
> index 8a99c998da9b..4dfd328490ad 100644
> --- a/arch/parisc/kernel/syscalls/syscall.tbl
> +++ b/arch/parisc/kernel/syscalls/syscall.tbl
> @@ -448,3 +448,4 @@
>  448	common	process_mrelease		sys_process_mrelease
>  449	common	futex_waitv			sys_futex_waitv
>  450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
> index a0be127475b1..34bd3cea5954 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -537,3 +537,4 @@
>  448	common	process_mrelease		sys_process_mrelease
>  449	common  futex_waitv                     sys_futex_waitv
>  450 	nospu	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	nospu	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
> index 799147658dee..5e170dca81f6 100644
> --- a/arch/s390/kernel/syscalls/syscall.tbl
> +++ b/arch/s390/kernel/syscalls/syscall.tbl
> @@ -453,3 +453,4 @@
>  448  common	process_mrelease	sys_process_mrelease		sys_process_mrelease
>  449  common	futex_waitv		sys_futex_waitv			sys_futex_waitv
>  450  common	set_mempolicy_home_node	sys_set_mempolicy_home_node	sys_set_mempolicy_home_node
> +451  common	pidfd_set_mempolicy	sys_pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> index 2de85c977f54..bd3a24a9b41f 100644
> --- a/arch/sh/kernel/syscalls/syscall.tbl
> +++ b/arch/sh/kernel/syscalls/syscall.tbl
> @@ -453,3 +453,4 @@
>  448	common	process_mrelease		sys_process_mrelease
>  449	common  futex_waitv                     sys_futex_waitv
>  450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
> index 4398cc6fb68d..bea2b5c6314b 100644
> --- a/arch/sparc/kernel/syscalls/syscall.tbl
> +++ b/arch/sparc/kernel/syscalls/syscall.tbl
> @@ -496,3 +496,4 @@
>  448	common	process_mrelease		sys_process_mrelease
>  449	common  futex_waitv                     sys_futex_waitv
>  450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 320480a8db4f..97bc70f5a8f7 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -455,3 +455,4 @@
>  448	i386	process_mrelease	sys_process_mrelease
>  449	i386	futex_waitv		sys_futex_waitv
>  450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	i386	pidfd_set_mempolicy	sys_pidfd_set_mempolicy
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index c84d12608cd2..ba6d19dbd8f8 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -372,6 +372,7 @@
>  448	common	process_mrelease	sys_process_mrelease
>  449	common	futex_waitv		sys_futex_waitv
>  450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
> +451	common	pidfd_set_mempolicy	sys_pidfd_set_mempolicy
>  
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
> index 52c94ab5c205..9f7c563da4fb 100644
> --- a/arch/xtensa/kernel/syscalls/syscall.tbl
> +++ b/arch/xtensa/kernel/syscalls/syscall.tbl
> @@ -421,3 +421,4 @@
>  448	common	process_mrelease		sys_process_mrelease
>  449	common  futex_waitv                     sys_futex_waitv
>  450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
> +451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index d58e0476ee8e..e9db7e10f171 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -77,6 +77,7 @@ extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
>  extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
>  extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
>  #define cpuset_current_mems_allowed (current->mems_allowed)
> +#define cpuset_task_mems_allowed(task) ((task)->mems_allowed)
>  void cpuset_init_current_mems_allowed(void);
>  int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
>  
> @@ -216,6 +217,7 @@ static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
>  }
>  
>  #define cpuset_current_mems_allowed (node_states[N_MEMORY])
> +#define cpuset_task_mems_allowed(task) (node_states[N_MEMORY])
>  static inline void cpuset_init_current_mems_allowed(void) {}
>  
>  static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index d232de7cdc56..afb92020808e 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -13,6 +13,7 @@
>  #include <linux/spinlock.h>
>  #include <linux/nodemask.h>
>  #include <linux/pagemap.h>
> +#include <linux/task_work.h>
>  #include <uapi/linux/mempolicy.h>
>  
>  struct mm_struct;
> @@ -51,6 +52,7 @@ struct mempolicy {
>  	union {
>  		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
>  		nodemask_t user_nodemask;	/* nodemask passed by user */
> +		struct callback_head cb_head;   /* async free */
>  	} w;
>  };
>  
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a34b0f9a9972..e611c088050b 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,6 +1056,10 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>  					    unsigned long home_node,
>  					    unsigned long flags);
> +asmlinkage long sys_pidfd_set_mempolicy(int pidfd, int mode,
> +					const unsigned long __user *nmask,
> +					unsigned long maxnode,
> +					unsigned int flags);
>  
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 45fa180cc56a..c38013bbf5b0 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -886,8 +886,11 @@ __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
>  #define __NR_set_mempolicy_home_node 450
>  __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
>  
> +#define __NR_pidfd_set_mempolicy 451
> +__SYSCALL(__NR_pidfd_set_mempolicy, sys_pidfd_set_mempolicy)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 451
> +#define __NR_syscalls 452
>  
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 860b2dcf3ac4..5053deb888f7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -299,6 +299,7 @@ COND_SYSCALL(set_mempolicy);
>  COND_SYSCALL(migrate_pages);
>  COND_SYSCALL(move_pages);
>  COND_SYSCALL(set_mempolicy_home_node);
> +COND_SYSCALL(pidfd_set_mempolicy);
>  
>  COND_SYSCALL(perf_event_open);
>  COND_SYSCALL(accept4);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 61aa9aedb728..2ac307aba01c 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -221,7 +221,7 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
>   * Must be called holding task's alloc_lock to protect task's mems_allowed
>   * and mempolicy.  May also be called holding the mmap_lock for write.
>   */
> -static int mpol_set_nodemask(struct mempolicy *pol,
> +static int mpol_set_nodemask(struct task_struct *task, struct mempolicy *pol,
>  		     const nodemask_t *nodes, struct nodemask_scratch *nsc)
>  {
>  	int ret;
> @@ -236,7 +236,7 @@ static int mpol_set_nodemask(struct mempolicy *pol,
>  
>  	/* Check N_MEMORY */
>  	nodes_and(nsc->mask1,
> -		  cpuset_current_mems_allowed, node_states[N_MEMORY]);
> +		  cpuset_task_mems_allowed(task), node_states[N_MEMORY]);
>  
>  	VM_BUG_ON(!nodes);
>  
> @@ -248,7 +248,7 @@ static int mpol_set_nodemask(struct mempolicy *pol,
>  	if (mpol_store_user_nodemask(pol))
>  		pol->w.user_nodemask = *nodes;
>  	else
> -		pol->w.cpuset_mems_allowed = cpuset_current_mems_allowed;
> +		pol->w.cpuset_mems_allowed = cpuset_task_mems_allowed(task);
>  
>  	ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
>  	return ret;
> @@ -312,6 +312,36 @@ void __mpol_put(struct mempolicy *p)
>  	kmem_cache_free(policy_cache, p);
>  }
>  
> +static void mpol_free_async(struct callback_head *work)
> +{
> +	kmem_cache_free(policy_cache,
> +			container_of(work, struct mempolicy, w.cb_head));
> +}
> +
> +/*
> + * mpol destructor for pidfd_set_mempolicy().
> + * free mempolicy directly if task is null or task_work_add() failed.
> + */
> +void mpol_put_async(struct task_struct *task, struct mempolicy *p)

How about changing __mpol_put() directly?

> +{
> +	enum task_work_notify_mode notify = TWA_RESUME;
> +
> +	if (!atomic_dec_and_test(&p->refcnt))
> +		return;
> +
> +	if (!task)
> +		goto out;
> +
> +	init_task_work(&p->w.cb_head, mpol_free_async);
> +	if (task_work_pending(task))
> +		notify = TWA_SIGNAL; /* free memory in time */
> +
> +	if (!task_work_add(task, &p->w.cb_head, notify))
> +		return;

Why can we fall back to freeing directly if task_work_add() failed?
Should we check the return code and fall back only for -ESRCH, and WARN
for other cases?

> +out:
> +	kmem_cache_free(policy_cache, p);
> +}
> +
>  static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes)
>  {
>  }
> @@ -850,8 +880,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
>  }
>  
>  /* Set the process memory policy */
> -static long do_set_mempolicy(unsigned short mode, unsigned short flags,
> -			     nodemask_t *nodes)
> +static long do_set_mempolicy(struct task_struct *task, unsigned short mode,
> +		unsigned short flags, nodemask_t *nodes)
>  {
>  	struct mempolicy *new, *old;
>  	NODEMASK_SCRATCH(scratch);
> @@ -866,20 +896,24 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
>  		goto out;
>  	}
>  
> -	task_lock(current);
> -	ret = mpol_set_nodemask(new, nodes, scratch);
> +	task_lock(task);
> +	ret = mpol_set_nodemask(task, new, nodes, scratch);
>  	if (ret) {
> -		task_unlock(current);
> +		task_unlock(task);
>  		mpol_put(new);
>  		goto out;
>  	}
>  
> -	old = current->mempolicy;
> -	current->mempolicy = new;
> +	old = task->mempolicy;
> +	task->mempolicy = new;
>  	if (new && new->mode == MPOL_INTERLEAVE)
> -		current->il_prev = MAX_NUMNODES-1;
> -	task_unlock(current);
> -	mpol_put(old);
> +		task->il_prev = MAX_NUMNODES-1;
> +	task_unlock(task);
> +
> +	if (old && task != current)
> +		mpol_put_async(task, old);
> +	else
> +		mpol_put(old);
>  	ret = 0;
>  out:
>  	NODEMASK_SCRATCH_FREE(scratch);
> @@ -932,7 +966,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  	int err;
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma = NULL;
> -	struct mempolicy *pol = current->mempolicy, *pol_refcount = NULL;
> +	struct mempolicy *pol;
>  
>  	if (flags &
>  		~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
> @@ -947,6 +981,15 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  		task_unlock(current);
>  		return 0;
>  	}
> +	/*
> +	 * Take a refcount on the mpol, because it may be freed by
> +	 * pidfd_set_mempolicy() asynchronously, which will
> +	 * override w.user_nodemask used below.
> +	 */
> +	task_lock(current);
> +	pol = current->mempolicy;
> +	mpol_get(pol);
> +	task_unlock(current);
>  
>  	if (flags & MPOL_F_ADDR) {
>  		/*
> @@ -954,6 +997,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  		 * vma/shared policy at addr is NULL.  We
>  		 * want to return MPOL_DEFAULT in this case.
>  		 */
> +		mpol_put(pol);	/* put the refcount of task mpol*/
>  		mmap_read_lock(mm);
>  		vma = vma_lookup(mm, addr);
>  		if (!vma) {
> @@ -964,23 +1008,18 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  			pol = vma->vm_ops->get_policy(vma, addr);
>  		else
>  			pol = vma->vm_policy;
> -	} else if (addr)
> -		return -EINVAL;
> +		mpol_get(pol);
> +		mmap_read_unlock(mm);
> +	} else if (addr) {
> +		err = -EINVAL;
> +		goto out;
> +	}
>  
>  	if (!pol)
>  		pol = &default_policy;	/* indicates default behavior */
>  
>  	if (flags & MPOL_F_NODE) {
>  		if (flags & MPOL_F_ADDR) {
> -			/*
> -			 * Take a refcount on the mpol, because we are about to
> -			 * drop the mmap_lock, after which only "pol" remains
> -			 * valid, "vma" is stale.
> -			 */
> -			pol_refcount = pol;
> -			vma = NULL;
> -			mpol_get(pol);
> -			mmap_read_unlock(mm);
>  			err = lookup_node(mm, addr);
>  			if (err < 0)
>  				goto out;
> @@ -1004,21 +1043,17 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
>  
>  	err = 0;
>  	if (nmask) {
> -		if (mpol_store_user_nodemask(pol)) {
> +		if (mpol_store_user_nodemask(pol))
>  			*nmask = pol->w.user_nodemask;
> -		} else {
> -			task_lock(current);
> +		else
>  			get_policy_nodemask(pol, nmask);
> -			task_unlock(current);
> -		}
>  	}
>  
>   out:
>  	mpol_cond_put(pol);
> -	if (vma)
> -		mmap_read_unlock(mm);
> -	if (pol_refcount)
> -		mpol_put(pol_refcount);
> +
> +	if (pol != &default_policy)
> +		mpol_put(pol);
>  	return err;
>  }
>  
> @@ -1309,7 +1344,7 @@ static long do_mbind(unsigned long start, unsigned long len,
>  		NODEMASK_SCRATCH(scratch);
>  		if (scratch) {
>  			mmap_write_lock(mm);
> -			err = mpol_set_nodemask(new, nmask, scratch);
> +			err = mpol_set_nodemask(current, new, nmask, scratch);
>  			if (err)
>  				mmap_write_unlock(mm);
>  		} else
> @@ -1578,7 +1613,7 @@ static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
>  	if (err)
>  		return err;
>  
> -	return do_set_mempolicy(lmode, mode_flags, &nodes);
> +	return do_set_mempolicy(current, lmode, mode_flags, &nodes);
>  }
>  
>  SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
> @@ -1587,6 +1622,71 @@ SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
>  	return kernel_set_mempolicy(mode, nmask, maxnode);
>  }
>  
> +/* Set the memory policy of the process specified in pidfd. */
> +static long do_pidfd_set_mempolicy(int pidfd, int mode, const unsigned long __user *nmask,
> +		unsigned long maxnode, unsigned int flags)
> +{
> +	struct mm_struct *mm = NULL;
> +	struct task_struct *task;
> +	unsigned short mode_flags;
> +	int err, lmode = mode;
> +	unsigned int f_flags;
> +	nodemask_t nodes;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	err = get_nodes(&nodes, nmask, maxnode);
> +	if (err)
> +		return err;
> +
> +	err = sanitize_mpol_flags(&lmode, &mode_flags);
> +	if (err)
> +		return err;
> +
> +	task = pidfd_get_task(pidfd, &f_flags);
> +	if (IS_ERR(task))
> +		return PTR_ERR(task);
> +
> +	/*
> +	 * Require CAP_SYS_NICE for influencing process performance.
> +	 * User's task is allowed only.
> +	 */
> +	if (!capable(CAP_SYS_NICE) || (task->flags & PF_KTHREAD)) {
> +		err = -EPERM;
> +		goto out;
> +	}
> +
> +	/*
> +	 * Require PTRACE_MODE_ATTACH_REALCREDS  to avoid
> +	 * leaking ASLR metadata.
> +	 */
> +	mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS);
> +	if (IS_ERR_OR_NULL(mm)) {
> +		err = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> +		goto out;
> +	}
> +
> +	if (mmap_write_lock_killable(mm)) {
> +		err = -EINTR;
> +		goto release_mm;
> +	}

Why do we need to write lock mmap_sem?  IIUC, we don't touch vma.

> +
> +	err = do_set_mempolicy(task, lmode, mode_flags, &nodes);
> +	mmap_write_unlock(mm);
> +release_mm:
> +	mmput(mm);
> +out:
> +	put_task_struct(task);
> +	return err;
> +}
> +
> +SYSCALL_DEFINE5(pidfd_set_mempolicy, int, pidfd, int, mode, const unsigned long __user *, nmask,
> +		unsigned long, maxnode, unsigned int, flags)
> +{
> +	return do_pidfd_set_mempolicy(pidfd, mode, nmask, maxnode, flags);
> +}
> +
>  static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
>  				const unsigned long __user *old_nodes,
>  				const unsigned long __user *new_nodes)
> @@ -2790,7 +2890,7 @@ void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
>  			goto free_scratch; /* no valid nodemask intersection */
>  
>  		task_lock(current);
> -		ret = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch);
> +		ret = mpol_set_nodemask(current, new, &mpol->w.user_nodemask, scratch);
>  		task_unlock(current);
>  		if (ret)
>  			goto put_new;
> @@ -2946,7 +3046,7 @@ void __init numa_policy_init(void)
>  	if (unlikely(nodes_empty(interleave_nodes)))
>  		node_set(prefer, interleave_nodes);
>  
> -	if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
> +	if (do_set_mempolicy(current, MPOL_INTERLEAVE, 0, &interleave_nodes))
>  		pr_err("%s: interleaving failed\n", __func__);
>  
>  	check_numabalancing_enable();
> @@ -2955,7 +3055,7 @@ void __init numa_policy_init(void)
>  /* Reset policy of current process to default */
>  void numa_default_policy(void)
>  {
> -	do_set_mempolicy(MPOL_DEFAULT, 0, NULL);
> +	do_set_mempolicy(current, MPOL_DEFAULT, 0, NULL);
>  }
>  
>  /*

Because we will change task_struct->mempolicy from another task, we need
to use some kind of "load acquire" / "store release" memory ordering.  For
example, rcu_dereference() / rcu_assign_pointer(), etc.
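
A sketch of what that could look like, assuming task->mempolicy were
annotated __rcu (this patch does not do that, so the snippet below is
illustrative only):

	/* Writer (the pidfd_set_mempolicy() path): */
	task_lock(task);
	old = rcu_dereference_protected(task->mempolicy,
					lockdep_is_held(&task->alloc_lock));
	rcu_assign_pointer(task->mempolicy, new);
	task_unlock(task);
	synchronize_rcu();	/* wait for lockless readers, or use call_rcu() */
	mpol_put(old);

	/* Lockless reader (e.g. get_task_policy()): */
	rcu_read_lock();
	pol = rcu_dereference(task->mempolicy);
	if (pol)
		mpol_get(pol);	/* pin the policy across the allocation */
	rcu_read_unlock();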

Best Regards,
Huang, Ying
Zhongkun He Nov. 16, 2022, 9:38 a.m. UTC | #12
Hi Ying, thanks for your reply and suggestions.

> 
> I suggest moving the flags in the "mode" parameter (MPOL_F_STATIC_NODES,
> MPOL_F_RELATIVE_NODES, MPOL_F_NUMA_BALANCING, etc.) to the "flags"
> parameter; otherwise, why add it?

The "flags" is used for future extension if any, just like
process_madvise() and set_mempolicy_home_node().
Maybe it should be removed.

> 
> And how about adding a "home_node" parameter?  I don't think that it's a
> good idea to add another new syscall for pidfd_set_mempolicy_home_node()
> in the future.
> 

Good idea, but "home_node" is used for vma policy, not task policy.
It is possible to use it in pidfd_mbind() in the future.

> 
> IMHO, "The first four APIS" and "The last one" isn't easy to be
> understood.  How about
> 
> "sys_pidfd_set_mempolicy sets the mempolicy of task specified in the
> pidfd, the others affect only the calling task, ...".
> 

Got it.

> 
> Why add "sys_"?  I fount that there's no "sys_" before set_mempolicy()/mbind() etc.
> 

Got it.

>> +void mpol_put_async(struct task_struct *task, struct mempolicy *p)
> 
> How about changing __mpol_put() directly?

> 
> Why can we fall back to freeing directly if task_work_add() failed?
> Should we check the return code and fall back only for -ESRCH, and WARN
> for other cases?
> 

A task_work based solution has not been accepted yet; it will be
considered later if needed.


>> +	}
> 
> Why do we need to write lock mmap_sem?  IIUC, we don't touch vma.
> 

Yes, it should be removed.

>>   /*
> 
> Because we will change task_struct->mempolicy from another task, we need
> to use some kind of "load acquire" / "store release" memory ordering.  For
> example, rcu_dereference() / rcu_assign_pointer(), etc.
> 
Thanks again for your suggestion.

Best Regards,
Zhongkun
Michal Hocko Nov. 16, 2022, 9:44 a.m. UTC | #13
On Wed 16-11-22 17:38:09, Zhongkun He wrote:
> Hi Ying, thanks for your replay and suggestions.
> 
> > 
> > I suggest moving the flags in the "mode" parameter (MPOL_F_STATIC_NODES,
> > MPOL_F_RELATIVE_NODES, MPOL_F_NUMA_BALANCING, etc.) to the "flags"
> > parameter; otherwise, why add it?
> 
> The "flags" is used for future extension if any, just like
> process_madvise() and set_mempolicy_home_node().
> Maybe it should be removed.

No, please! Even if there is no use for the flags now, we are usually
terrible at predicting the future and potential extensions. MPOL_F_* are
kinda flags, but for historical reasons they live in the mode argument,
and we shouldn't create new confusion by treating this differently for
pidfd based APIs.

> > And how about adding a "home_node" parameter?  I don't think that it's a
> > good idea to add another new syscall for pidfd_set_mempolicy_home_node()
> > in the future.

Why would this be a bad idea?

> Good idea, but "home_node" is used for vma policy, not task policy.
> It is possible to use it in pidfd_mbind() in the future.

I would go with pidfd_set_mempolicy_home_node() as a counterpart to the
existing syscall.
Zhongkun He Nov. 16, 2022, 11:28 a.m. UTC | #14
Hi Michal, I've done the performance testing, please check it out.

>> Yes this is all understood but the level of the overhead is not really
>> clear. So the question is whether this will induce a visible overhead.
>> Because from the maintainability point of view it is much less costly to
>> have a clear life time model. Right now we have a mix of reference
>> counting and per-task requirements which is rather subtle and easy to
>> get wrong. In an ideal world we would have get_vma_policy always
>> returning a reference counted policy or NULL. If we really need to
>> optimize for cache line bouncing we can go with per cpu reference
>> counters (something that was not available at the time the mempolicy
>> code has been introduced).
>>
>> So I am not saying that the task_work based solution is not possible, I
>> just think that this looks like a good opportunity to get away from the
>> existing subtle model.

Test tools:
numactl -m 0-3 ./run-mmtests.sh -n -c configs/config-workload-
aim9-pagealloc  test_name

Modification:
get_vma_policy() and get_task_policy() now always return a
reference-counted policy, except for the static policies (default_policy
and preferred_node_policy[nid]).

All vma manipulation is protected by a down_read of mmap_lock, so
mpol_get() can be called directly to take a refcount on the mpol. But no
such lock is held around task->mempolicy, so task->mempolicy has to be
protected by task_lock.

struct mempolicy *get_task_policy(struct task_struct *p)
{
	struct mempolicy *pol;
	int node;

	if (p->mempolicy) {
		task_lock(p);
		pol = p->mempolicy;
		mpol_get(pol);
		task_unlock(p);
		if (pol)
			return pol;
	}
	.....
}

Test Case 1:
Description:
	Run the test directly, with no other user processes.
Result:
	This degrades performance by about 1% to 3%.
For more information, please see the attachment: mpol.txt

aim9

Hmean     page_test   484561.68 (   0.00%)   471039.34 *  -2.79%*
Hmean     brk_test   1400702.48 (   0.00%)  1388949.10 *  -0.84%*
Hmean     exec_test     2339.45 (   0.00%)     2278.41 *  -2.61%*
Hmean     fork_test     6500.02 (   0.00%)     6500.17 *   0.00%*



Test Case 2:
Description:
	Added a user process, top.
Result:
	This degrades performance by about 2.1%.
For more information, please see the attachment: mpol_top.txt

Hmean     page_test   477916.47 (   0.00%)   467829.01 *  -2.11%*
Hmean     brk_test   1351439.76 (   0.00%)  1373663.90 *   1.64%*
Hmean     exec_test     2312.24 (   0.00%)     2296.06 *  -0.70%*
Hmean     fork_test     6483.46 (   0.00%)     6472.06 *  -0.18%*


Test Case 3:

Description:
	Add a daemon that reads /proc/$test_pid/status, which acquires
task_lock: while :;do cat /proc/$(pidof singleuser)/status;done

Result:
	The baseline degrades from 484561 (case 1) to 438591 (about 10%)
when the daemon is added; against that baseline, the degradation in
case 3 is about 3.2%. For more information, please see the
attachment: mpol_status.txt

Hmean     page_test   438591.97 (   0.00%)   424251.22 *  -3.27%*
Hmean     brk_test   1268906.57 (   0.00%)  1278100.12 *   0.72%*
Hmean     exec_test     2301.19 (   0.00%)     2192.71 *  -4.71%*
Hmean     fork_test     6453.24 (   0.00%)     6090.48 *  -5.62%*


Thanks,
Zhongkun.
Runtime options
Run: /usr/bin/expect -f /home/mmtests/work/tmp/1333151/aim9-expect

aim9
                                 baseline_        mpol_count_ref_        mpol_count_ref_        mpol_count_ref_        mpol_count_ref_
                                baseline_8       mpol_count_ref_2       mpol_count_ref_3       mpol_count_ref_7       mpol_count_ref_9
Min       page_test   472285.14 (   0.00%)   453371.09 (  -4.00%)   470133.24 (  -0.46%)   469227.18 (  -0.65%)   463080.00 (  -1.95%)
Min       brk_test   1351165.89 (   0.00%)  1344133.33 (  -0.52%)  1357733.33 (   0.49%)  1355696.20 (   0.34%)  1347533.33 (  -0.27%)
Min       exec_test     2306.67 (   0.00%)     2214.00 (  -4.02%)     2263.67 (  -1.86%)     2259.00 (  -2.07%)     2305.67 (  -0.04%)
Min       fork_test     6278.30 (   0.00%)     6320.00 (   0.66%)     6295.80 (   0.28%)     6358.19 (   1.27%)     6442.37 (   2.61%)
Hmean     page_test   484561.68 (   0.00%)   471039.34 *  -2.79%*   476627.62 *  -1.64%*   481148.11 *  -0.70%*   476917.47 *  -1.58%*
Hmean     brk_test   1400702.48 (   0.00%)  1388949.10 *  -0.84%*  1394670.24 *  -0.43%*  1422099.32 *   1.53%*  1406177.46 *   0.39%*
Hmean     exec_test     2339.45 (   0.00%)     2278.41 *  -2.61%*     2298.16 *  -1.76%*     2304.95 *  -1.47%*     2357.82 *   0.79%*
Hmean     fork_test     6500.02 (   0.00%)     6500.17 *   0.00%*     6480.55 *  -0.30%*     6569.89 *   1.07%*     6591.39 *   1.41%*
Stddev    page_test     6600.36 (   0.00%)    10835.44 ( -64.16%)     5150.00 (  21.97%)     8711.07 ( -31.98%)     8053.32 ( -22.01%)
Stddev    brk_test     19970.69 (   0.00%)    25822.17 ( -29.30%)    21887.54 (  -9.60%)    36587.97 ( -83.21%)    35023.25 ( -75.37%)
Stddev    exec_test       22.55 (   0.00%)       33.23 ( -47.37%)       15.89 (  29.52%)       24.03 (  -6.60%)       21.13 (   6.27%)
Stddev    fork_test      136.68 (   0.00%)      117.65 (  13.93%)      134.49 (   1.60%)      131.78 (   3.59%)       87.57 (  35.93%)
CoeffVar  page_test        1.36 (   0.00%)        2.30 ( -68.82%)        1.08 (  20.67%)        1.81 ( -32.90%)        1.69 ( -23.96%)
CoeffVar  brk_test         1.43 (   0.00%)        1.86 ( -30.38%)        1.57 ( -10.07%)        2.57 ( -80.38%)        2.49 ( -74.62%)
CoeffVar  exec_test        0.96 (   0.00%)        1.46 ( -51.30%)        0.69 (  28.25%)        1.04 (  -8.19%)        0.90 (   7.00%)
CoeffVar  fork_test        2.10 (   0.00%)        1.81 (  13.92%)        2.07 (   1.30%)        2.01 (   4.61%)        1.33 (  36.81%)
Max       page_test   491640.00 (   0.00%)   490859.43 (  -0.16%)   486540.00 (  -1.04%)   497533.33 (   1.20%)   494370.42 (   0.56%)
Max       brk_test   1423650.90 (   0.00%)  1416855.43 (  -0.48%)  1440639.57 (   1.19%)  1478014.66 (   3.82%)  1466688.87 (   3.02%)
Max       exec_test     2378.00 (   0.00%)     2344.67 (  -1.40%)     2327.00 (  -2.14%)     2337.33 (  -1.71%)     2388.67 (   0.45%)
Max       fork_test     6715.52 (   0.00%)     6700.00 (  -0.23%)     6748.83 (   0.50%)     6826.67 (   1.66%)     6728.85 (   0.20%)
BHmean-50 page_test   489938.40 (   0.00%)   480237.62 (  -1.98%)   480639.35 (  -1.90%)   488179.53 (  -0.36%)   482965.63 (  -1.42%)
BHmean-50 brk_test   1414684.18 (   0.00%)  1409209.66 (  -0.39%)  1411192.84 (  -0.25%)  1448648.65 (   2.40%)  1430353.35 (   1.11%)
BHmean-50 exec_test     2357.00 (   0.00%)     2301.91 (  -2.34%)     2308.53 (  -2.06%)     2323.03 (  -1.44%)     2373.26 (   0.69%)
BHmean-50 fork_test     6607.04 (   0.00%)     6590.56 (  -0.25%)     6582.29 (  -0.37%)     6668.66 (   0.93%)     6661.09 (   0.82%)
BHmean-95 page_test   485709.45 (   0.00%)   472714.08 (  -2.68%)   477226.93 (  -1.75%)   482261.94 (  -0.71%)   478216.54 (  -1.54%)
BHmean-95 brk_test   1405386.52 (   0.00%)  1393171.90 (  -0.87%)  1398128.04 (  -0.52%)  1428459.97 (   1.64%)  1411762.86 (   0.45%)
BHmean-95 exec_test     2342.47 (   0.00%)     2284.46 (  -2.48%)     2301.35 (  -1.76%)     2309.22 (  -1.42%)     2362.68 (   0.86%)
BHmean-95 fork_test     6520.96 (   0.00%)     6517.06 (  -0.06%)     6497.89 (  -0.35%)     6589.83 (   1.06%)     6605.28 (   1.29%)
BHmean-99 page_test   485709.45 (   0.00%)   472714.08 (  -2.68%)   477226.93 (  -1.75%)   482261.94 (  -0.71%)   478216.54 (  -1.54%)
BHmean-99 brk_test   1405386.52 (   0.00%)  1393171.90 (  -0.87%)  1398128.04 (  -0.52%)  1428459.97 (   1.64%)  1411762.86 (   0.45%)
BHmean-99 exec_test     2342.47 (   0.00%)     2284.46 (  -2.48%)     2301.35 (  -1.76%)     2309.22 (  -1.42%)     2362.68 (   0.86%)
BHmean-99 fork_test     6520.96 (   0.00%)     6517.06 (  -0.06%)     6497.89 (  -0.35%)     6589.83 (   1.06%)     6605.28 (   1.29%)

                                 baseline_        mpol_count_ref_        mpol_count_ref_        mpol_count_ref_        mpol_count_ref_
                                baseline_8       mpol_count_ref_2       mpol_count_ref_3       mpol_count_ref_7       mpol_count_ref_9
Duration User           0.13        0.19        0.16        0.13        0.20
Duration System         0.33        0.58        0.41        0.34        0.59
Duration Elapsed      727.95      728.86      731.01      731.82      728.95

                                 baseline_        mpol_count_ref_        mpol_count_ref_        mpol_count_ref_        mpol_count_ref_
                                baseline_8       mpol_count_ref_2       mpol_count_ref_3       mpol_count_ref_7       mpol_count_ref_9
Ops Minor Faults                   148665940.00   146350657.00   147621496.00   148298368.00   148936433.00
Ops Major Faults                           0.00          19.00           0.00           0.00           0.00
Ops Swap Ins                               0.00           0.00           0.00           0.00           0.00
Ops Swap Outs                              0.00           0.00           0.00           0.00           0.00
Ops Allocation stalls                      0.00           0.00           0.00           0.00           0.00
Ops Fragmentation stalls                   0.00           0.00           0.00           0.00           0.00
Ops DMA allocs                             0.00           0.00           0.00           0.00           0.00
Ops DMA32 allocs                           0.00           0.00           0.00           0.00           0.00
Ops Normal allocs                  160063372.00   156683873.00   157754955.00   159087404.00   159293789.00
Ops Movable allocs                         0.00           0.00           0.00           0.00           0.00
Ops Direct pages scanned                   0.00           0.00           0.00           0.00           0.00
Ops Kswapd pages scanned                   0.00           0.00           0.00           0.00           0.00
Ops Kswapd pages reclaimed                 0.00           0.00           0.00           0.00           0.00
Ops Direct pages reclaimed                 0.00           0.00           0.00           0.00           0.00
Ops Kswapd efficiency %                  100.00         100.00         100.00         100.00         100.00
Ops Kswapd velocity                        0.00           0.00           0.00           0.00           0.00
Ops Direct efficiency %                  100.00         100.00         100.00         100.00         100.00
Ops Direct velocity                        0.00           0.00           0.00           0.00           0.00
Ops Percentage direct scans                0.00           0.00           0.00           0.00           0.00
Ops Page writes by reclaim                 0.00           0.00           0.00           0.00           0.00
Ops Page writes file                       0.00           0.00           0.00           0.00           0.00
Ops Page writes anon                       0.00           0.00           0.00           0.00           0.00
Ops Page reclaim immediate                 0.00           0.00           0.00           0.00           0.00
Ops Sector Reads                           0.00        3704.00         100.00           0.00           0.00
Ops Sector Writes                      71068.00       70508.00       70520.00       71108.00       70588.00
Ops Page rescued immediate                 0.00           0.00           0.00           0.00           0.00
Ops Slabs scanned                          0.00           0.00           0.00           0.00           0.00
Ops Direct inode steals                    0.00           0.00           0.00           0.00           0.00
Ops Kswapd inode steals                    0.00           0.00           0.00           0.00           0.00
Ops Kswapd skipped wait                    0.00           0.00           0.00           0.00           0.00
Ops THP fault alloc                        0.00           0.00           0.00           0.00           0.00
Ops THP fault fallback                     0.00           0.00           0.00           0.00           0.00
Ops THP collapse alloc                     0.00           0.00           0.00           0.00           0.00
Ops THP collapse fail                      0.00           0.00           0.00           0.00           0.00
Ops THP split                              0.00           0.00           0.00           0.00           0.00
Ops THP split failed                       0.00           0.00           0.00           0.00           0.00
Ops Compaction stalls                      0.00           0.00           0.00           0.00           0.00
Ops Compaction success                     0.00           0.00           0.00           0.00           0.00
Ops Compaction failures                    0.00           0.00           0.00           0.00           0.00
Ops Compaction efficiency                  0.00           0.00           0.00           0.00           0.00
Ops Page migrate success                   0.00           0.00           0.00           0.00           0.00
Ops Page migrate failure                   0.00           0.00           0.00           0.00           0.00
Ops Compaction pages isolated              0.00           0.00           0.00           0.00           0.00
Ops Compaction migrate scanned             0.00           0.00           0.00           0.00           0.00
Ops Compaction free scanned                0.00           0.00           0.00           0.00           0.00
Ops Compact scan efficiency                0.00           0.00           0.00           0.00           0.00
Ops Compaction cost                        0.00           0.00           0.00           0.00           0.00
Ops Kcompactd wake                         0.00           0.00           0.00           0.00           0.00
Ops Kcompactd migrate scanned              0.00           0.00           0.00           0.00           0.00
Ops Kcompactd free scanned                 0.00           0.00           0.00           0.00           0.00
Ops NUMA alloc hit                 158036190.00   154662349.00   155735982.00   157043059.00   157235471.00
Ops NUMA alloc miss                        0.00           0.00           0.00           0.00           0.00
Ops NUMA interleave hit                    0.00           0.00           0.00           0.00           0.00
Ops NUMA alloc local               158036188.00   154662339.00   155735974.00   157043059.00   157235466.00
Ops NUMA base-page range updates           0.00           0.00           0.00           0.00           0.00
Ops NUMA PTE updates                       0.00           0.00           0.00           0.00           0.00
Ops NUMA PMD updates                       0.00           0.00           0.00           0.00           0.00
Ops NUMA hint faults                       0.00           0.00           0.00           0.00           0.00
Ops NUMA hint local faults %               0.00           0.00           0.00           0.00           0.00
Ops NUMA hint local percent              100.00         100.00         100.00         100.00         100.00
Ops NUMA pages migrated                    0.00           0.00           0.00           0.00           0.00
Ops AutoNUMA cost                          0.00           0.00           0.00           0.00           0.00

                                 baseline_        mpol_count_ref_        mpol_count_ref_        mpol_count_ref_        mpol_count_ref_
                                baseline_8       mpol_count_ref_2       mpol_count_ref_3       mpol_count_ref_7       mpol_count_ref_9
Ops TTWU Count                        0.00           0.00           0.00           0.00           0.00
Ops TTWU Local                        0.00           0.00           0.00           0.00           0.00
Ops SIS Search                        0.00           0.00           0.00           0.00           0.00
Ops SIS Domain Search                 0.00           0.00           0.00           0.00           0.00
Ops SIS Scanned                       0.00           0.00           0.00           0.00           0.00
Ops SIS Domain Scanned                0.00           0.00           0.00           0.00           0.00
Ops SIS Failures                      0.00           0.00           0.00           0.00           0.00
Ops SIS Core Search                   0.00           0.00           0.00           0.00           0.00
Ops SIS Core Hit                      0.00           0.00           0.00           0.00           0.00
Ops SIS Core Miss                     0.00           0.00           0.00           0.00           0.00
Ops SIS Recent Used Hit               0.00           0.00           0.00           0.00           0.00
Ops SIS Recent Used Miss              0.00           0.00           0.00           0.00           0.00
Ops SIS Recent Attempts               0.00           0.00           0.00           0.00           0.00
Ops SIS Search Efficiency           100.00         100.00         100.00         100.00         100.00
Ops SIS Domain Search Eff           100.00         100.00         100.00         100.00         100.00
Ops SIS Fast Success Rate           100.00         100.00         100.00         100.00         100.00
Ops SIS Success Rate                100.00         100.00         100.00         100.00         100.00
Ops SIS Recent Success Rate           0.00           0.00           0.00           0.00           0.00
Runtime options
Run: /usr/bin/expect -f /home/mmtests/work/tmp/41623/aim9-expect

aim9
                             baseline_statu          mpol_cout_proc_status_po
                            baseline_status         mpol_cout_proc_status_pol
Min       page_test   420353.33 (   0.00%)   407501.67 (  -3.06%)
Min       brk_test   1194870.09 (   0.00%)  1226582.28 (   2.65%)
Min       exec_test     2285.00 (   0.00%)     2178.33 (  -4.67%)
Min       fork_test     6335.78 (   0.00%)     5936.04 (  -6.31%)
Hmean     page_test   438591.97 (   0.00%)   424251.22 *  -3.27%*
Hmean     brk_test   1268906.57 (   0.00%)  1278100.12 *   0.72%*
Hmean     exec_test     2301.19 (   0.00%)     2192.71 *  -4.71%*
Hmean     fork_test     6453.24 (   0.00%)     6090.48 *  -5.62%*
Stddev    page_test    12073.40 (   0.00%)     9619.28 (  20.33%)
Stddev    brk_test     45931.72 (   0.00%)    29084.39 (  36.68%)
Stddev    exec_test       11.74 (   0.00%)        7.55 (  35.69%)
Stddev    fork_test       52.69 (   0.00%)       71.34 ( -35.39%)
CoeffVar  page_test        2.75 (   0.00%)        2.27 (  17.62%)
CoeffVar  brk_test         3.62 (   0.00%)        2.27 (  37.09%)
CoeffVar  exec_test        0.51 (   0.00%)        0.34 (  32.50%)
CoeffVar  fork_test        0.82 (   0.00%)        1.17 ( -43.45%)
Max       page_test   459000.00 (   0.00%)   436382.41 (  -4.93%)
Max       brk_test   1355466.67 (   0.00%)  1320333.33 (  -2.59%)
Max       exec_test     2321.67 (   0.00%)     2204.33 (  -5.05%)
Max       fork_test     6526.67 (   0.00%)     6198.40 (  -5.03%)
BHmean-50 page_test   447612.82 (   0.00%)   432170.76 (  -3.45%)
BHmean-50 brk_test   1304644.85 (   0.00%)  1301840.22 (  -0.21%)
BHmean-50 exec_test     2310.30 (   0.00%)     2198.64 (  -4.83%)
BHmean-50 fork_test     6487.77 (   0.00%)     6146.59 (  -5.26%)
BHmean-95 page_test   440328.82 (   0.00%)   425842.44 (  -3.29%)
BHmean-95 brk_test   1276094.69 (   0.00%)  1282998.97 (   0.54%)
BHmean-95 exec_test     2302.67 (   0.00%)     2194.03 (  -4.72%)
BHmean-95 fork_test     6464.13 (   0.00%)     6104.92 (  -5.56%)
BHmean-99 page_test   440328.82 (   0.00%)   425842.44 (  -3.29%)
BHmean-99 brk_test   1276094.69 (   0.00%)  1282998.97 (   0.54%)
BHmean-99 exec_test     2302.67 (   0.00%)     2194.03 (  -4.72%)
BHmean-99 fork_test     6464.13 (   0.00%)     6104.92 (  -5.56%)

                 baseline_statu          mpol_cout_proc_status_po
                baseline_status         mpol_cout_proc_status_pol
Duration User           0.10        0.12
Duration System         0.27        0.28
Duration Elapsed      723.94      726.94

                                  baseline_statu          mpol_cout_proc_status_po
                                 baseline_status         mpol_cout_proc_status_pol
Ops Minor Faults                   186873421.00   193420839.00
Ops Major Faults                          19.00           0.00
Ops Swap Ins                               0.00           0.00
Ops Swap Outs                              0.00           0.00
Ops Allocation stalls                      0.00           0.00
Ops Fragmentation stalls                   0.00           0.00
Ops DMA allocs                             0.00           0.00
Ops DMA32 allocs                           0.00           0.00
Ops Normal allocs                  189934842.00   192623035.00
Ops Movable allocs                         0.00           0.00
Ops Direct pages scanned                   0.00           0.00
Ops Kswapd pages scanned                   0.00           0.00
Ops Kswapd pages reclaimed                 0.00           0.00
Ops Direct pages reclaimed                 0.00           0.00
Ops Kswapd efficiency %                  100.00         100.00
Ops Kswapd velocity                        0.00           0.00
Ops Direct efficiency %                  100.00         100.00
Ops Direct velocity                        0.00           0.00
Ops Percentage direct scans                0.00           0.00
Ops Page writes by reclaim                 0.00           0.00
Ops Page writes file                       0.00           0.00
Ops Page writes anon                       0.00           0.00
Ops Page reclaim immediate                 0.00           0.00
Ops Sector Reads                        4220.00           0.00
Ops Sector Writes                      71216.00       70892.00
Ops Page rescued immediate                 0.00           0.00
Ops Slabs scanned                          0.00           0.00
Ops Direct inode steals                    0.00           0.00
Ops Kswapd inode steals                    0.00           0.00
Ops Kswapd skipped wait                    0.00           0.00
Ops THP fault alloc                        0.00           0.00
Ops THP fault fallback                     0.00           0.00
Ops THP collapse alloc                     0.00           0.00
Ops THP collapse fail                      0.00           0.00
Ops THP split                              0.00           0.00
Ops THP split failed                       0.00           0.00
Ops Compaction stalls                      0.00           0.00
Ops Compaction success                     0.00           0.00
Ops Compaction failures                    0.00           0.00
Ops Compaction efficiency                  0.00           0.00
Ops Page migrate success                   0.00           0.00
Ops Page migrate failure                   0.00           0.00
Ops Compaction pages isolated              0.00           0.00
Ops Compaction migrate scanned             0.00           0.00
Ops Compaction free scanned                0.00           0.00
Ops Compact scan efficiency                0.00           0.00
Ops Compaction cost                        0.00           0.00
Ops Kcompactd wake                         0.00           0.00
Ops Kcompactd migrate scanned              0.00           0.00
Ops Kcompactd free scanned                 0.00           0.00
Ops NUMA alloc hit                 186860637.00   189746782.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local               186860608.00   189746775.00
Ops NUMA base-page range updates           0.00           0.00
Ops NUMA PTE updates                       0.00           0.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                       0.00           0.00
Ops NUMA hint local faults %               0.00           0.00
Ops NUMA hint local percent              100.00         100.00
Ops NUMA pages migrated                    0.00           0.00
Ops AutoNUMA cost                          0.00           0.00

                             baseline_statu          mpol_cout_proc_status_po
                            baseline_status         mpol_cout_proc_status_pol
Ops TTWU Count                        0.00           0.00
Ops TTWU Local                        0.00           0.00
Ops SIS Search                        0.00           0.00
Ops SIS Domain Search                 0.00           0.00
Ops SIS Scanned                       0.00           0.00
Ops SIS Domain Scanned                0.00           0.00
Ops SIS Failures                      0.00           0.00
Ops SIS Core Search                   0.00           0.00
Ops SIS Core Hit                      0.00           0.00
Ops SIS Core Miss                     0.00           0.00
Ops SIS Recent Used Hit               0.00           0.00
Ops SIS Recent Used Miss              0.00           0.00
Ops SIS Recent Attempts               0.00           0.00
Ops SIS Search Efficiency           100.00         100.00
Ops SIS Domain Search Eff           100.00         100.00
Ops SIS Fast Success Rate           100.00         100.00
Ops SIS Success Rate                100.00         100.00
Ops SIS Recent Success Rate           0.00           0.00
Runtime options
Run: /usr/bin/expect -f /home/mmtests/work/tmp/2095/aim9-expect

aim9
                               baseline_to            mpol_ref_to
                              baseline_top           mpol_ref_top
Min       page_test   462286.67 (   0.00%)   461153.33 (  -0.25%)
Min       brk_test   1302200.00 (   0.00%)  1336200.00 (   2.61%)
Min       exec_test     2234.33 (   0.00%)     2266.67 (   1.45%)
Min       fork_test     6335.78 (   0.00%)     6326.67 (  -0.14%)
Hmean     page_test   477916.47 (   0.00%)   467829.01 *  -2.11%*
Hmean     brk_test   1351439.76 (   0.00%)  1373663.90 *   1.64%*
Hmean     exec_test     2312.24 (   0.00%)     2296.06 *  -0.70%*
Hmean     fork_test     6483.46 (   0.00%)     6472.06 *  -0.18%*
Stddev    page_test     8729.22 (   0.00%)     5189.50 (  40.55%)
Stddev    brk_test     34969.33 (   0.00%)    25595.72 (  26.81%)
Stddev    exec_test       34.73 (   0.00%)       19.05 (  45.14%)
Stddev    fork_test       87.11 (   0.00%)       98.77 ( -13.39%)
CoeffVar  page_test        1.83 (   0.00%)        1.11 (  39.26%)
CoeffVar  brk_test         2.59 (   0.00%)        1.86 (  27.97%)
CoeffVar  exec_test        1.50 (   0.00%)        0.83 (  44.75%)
CoeffVar  fork_test        1.34 (   0.00%)        1.53 ( -13.58%)
Max       page_test   493680.00 (   0.00%)   475116.59 (  -3.76%)
Max       brk_test   1430446.37 (   0.00%)  1410059.96 (  -1.43%)
Max       exec_test     2357.00 (   0.00%)     2328.67 (  -1.20%)
Max       fork_test     6602.27 (   0.00%)     6655.56 (   0.81%)
BHmean-50 page_test   484400.69 (   0.00%)   472585.58 (  -2.44%)
BHmean-50 brk_test   1379144.94 (   0.00%)  1396565.13 (   1.26%)
BHmean-50 exec_test     2336.73 (   0.00%)     2310.79 (  -1.11%)
BHmean-50 fork_test     6555.14 (   0.00%)     6554.09 (  -0.02%)
BHmean-95 page_test   479389.93 (   0.00%)   468445.49 (  -2.28%)
BHmean-95 brk_test   1356101.38 (   0.00%)  1377174.15 (   1.55%)
BHmean-95 exec_test     2319.59 (   0.00%)     2298.77 (  -0.90%)
BHmean-95 fork_test     6497.23 (   0.00%)     6485.61 (  -0.18%)
BHmean-99 page_test   479389.93 (   0.00%)   468445.49 (  -2.28%)
BHmean-99 brk_test   1356101.38 (   0.00%)  1377174.15 (   1.55%)
BHmean-99 exec_test     2319.59 (   0.00%)     2298.77 (  -0.90%)
BHmean-99 fork_test     6497.23 (   0.00%)     6485.61 (  -0.18%)

                  baseline_to    mpol_ref_to
                 baseline_top   mpol_ref_top
Duration User           0.12        0.21
Duration System         0.38        0.58
Duration Elapsed      729.23      728.33

                                    baseline_to    mpol_ref_to
                                   baseline_top   mpol_ref_top
Ops Minor Faults                   148357637.00   145855290.00
Ops Major Faults                          19.00          21.00
Ops Swap Ins                               0.00           0.00
Ops Swap Outs                              0.00           0.00
Ops Allocation stalls                      0.00           0.00
Ops Fragmentation stalls                   0.00           0.00
Ops DMA allocs                             0.00           0.00
Ops DMA32 allocs                           0.00           0.00
Ops Normal allocs                  158702635.00   156776622.00
Ops Movable allocs                         0.00           0.00
Ops Direct pages scanned                   0.00           0.00
Ops Kswapd pages scanned                   0.00           0.00
Ops Kswapd pages reclaimed                 0.00           0.00
Ops Direct pages reclaimed                 0.00           0.00
Ops Kswapd efficiency %                  100.00         100.00
Ops Kswapd velocity                        0.00           0.00
Ops Direct efficiency %                  100.00         100.00
Ops Direct velocity                        0.00           0.00
Ops Percentage direct scans                0.00           0.00
Ops Page writes by reclaim                 0.00           0.00
Ops Page writes file                       0.00           0.00
Ops Page writes anon                       0.00           0.00
Ops Page reclaim immediate                 0.00           0.00
Ops Sector Reads                        3568.00        3160.00
Ops Sector Writes                      71048.00       70828.00
Ops Page rescued immediate                 0.00           0.00
Ops Slabs scanned                          0.00           0.00
Ops Direct inode steals                    0.00           0.00
Ops Kswapd inode steals                    0.00           0.00
Ops Kswapd skipped wait                    0.00           0.00
Ops THP fault alloc                        0.00           0.00
Ops THP fault fallback                     0.00           0.00
Ops THP collapse alloc                     0.00           0.00
Ops THP collapse fail                      0.00           0.00
Ops THP split                              0.00           0.00
Ops THP split failed                       0.00           0.00
Ops Compaction stalls                      0.00           0.00
Ops Compaction success                     0.00           0.00
Ops Compaction failures                    0.00           0.00
Ops Compaction efficiency                  0.00           0.00
Ops Page migrate success                   0.00           0.00
Ops Page migrate failure                   0.00           0.00
Ops Compaction pages isolated              0.00           0.00
Ops Compaction migrate scanned             0.00           0.00
Ops Compaction free scanned                0.00           0.00
Ops Compact scan efficiency                0.00           0.00
Ops Compaction cost                        0.00           0.00
Ops Kcompactd wake                         0.00           0.00
Ops Kcompactd migrate scanned              0.00           0.00
Ops Kcompactd free scanned                 0.00           0.00
Ops NUMA alloc hit                 156656490.00   154741172.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local               156656458.00   154741167.00
Ops NUMA base-page range updates           0.00           0.00
Ops NUMA PTE updates                       0.00           0.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                       0.00           0.00
Ops NUMA hint local faults %               0.00           0.00
Ops NUMA hint local percent              100.00         100.00
Ops NUMA pages migrated                    0.00           0.00
Ops AutoNUMA cost                          0.00           0.00

                               baseline_to    mpol_ref_to
                              baseline_top   mpol_ref_top
Ops TTWU Count                        0.00           0.00
Ops TTWU Local                        0.00           0.00
Ops SIS Search                        0.00           0.00
Ops SIS Domain Search                 0.00           0.00
Ops SIS Scanned                       0.00           0.00
Ops SIS Domain Scanned                0.00           0.00
Ops SIS Failures                      0.00           0.00
Ops SIS Core Search                   0.00           0.00
Ops SIS Core Hit                      0.00           0.00
Ops SIS Core Miss                     0.00           0.00
Ops SIS Recent Used Hit               0.00           0.00
Ops SIS Recent Used Miss              0.00           0.00
Ops SIS Recent Attempts               0.00           0.00
Ops SIS Search Efficiency           100.00         100.00
Ops SIS Domain Search Eff           100.00         100.00
Ops SIS Fast Success Rate           100.00         100.00
Ops SIS Success Rate                100.00         100.00
Ops SIS Recent Success Rate           0.00           0.00
Michal Hocko Nov. 16, 2022, 2:57 p.m. UTC | #15
On Wed 16-11-22 19:28:10, Zhongkun He wrote:
> Hi Michal, I've done the performance testing; please check it out.
> 
> > > Yes this is all understood but the level of the overhead is not really
> > > clear. So the question is whether this will induce a visible overhead.
> > > Because from the maintainability point of view it is much less costly to
> > > have a clear life time model. Right now we have a mix of reference
> > > counting and per-task requirements which is rather subtle and easy to
> > > get wrong. In an ideal world we would have get_vma_policy always
> > > returning a reference counted policy or NULL. If we really need to
> > > optimize for cache line bouncing we can go with per cpu reference
> > > counters (something that was not available at the time the mempolicy
> > > code has been introduced).
> > > 
> > > So I am not saying that the task_work based solution is not possible, I
> > > just think that this looks like a good opportunity to get away from the
> > > existing subtle model.
> 
> Test tools:
> numactl -m 0-3 ./run-mmtests.sh -n -c configs/config-workload-
> aim9-pagealloc  test_name
> 
> Modification:
> get_vma_policy() and get_task_policy() now always return a
> reference-counted policy, except for the static policies (default_policy
> and preferred_node_policy[nid]).

It would be better to add the patch that has been tested.

> All vma manipulation is protected by a down_read of mmap_lock, so
> mpol_get() can be called directly to take a refcount on the mpol. But no
> such lock is held around task->mempolicy, so task->mempolicy has to be
> protected by task_lock.
> 
> struct mempolicy *get_task_policy(struct task_struct *p)
> {
> 	struct mempolicy *pol;
> 	int node;
> 
> 	if (p->mempolicy) {
> 		task_lock(p);
> 		pol = p->mempolicy;
> 		mpol_get(pol);
> 		task_unlock(p);
> 		if (pol)
> 			return pol;
> 	}


One way to deal with that would be to use a model similar to css_tryget().
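
Something along these lines (sketch only; mpol_tryget() is a hypothetical
helper modeled on css_tryget(), not existing kernel code):

/* only take a reference if the refcount has not already dropped to zero */
static inline bool mpol_tryget(struct mempolicy *pol)
{
	return pol && atomic_inc_not_zero(&pol->refcnt);
}

/* reader side: RCU instead of task_lock() on the fast path */
rcu_read_lock();
pol = rcu_dereference(p->mempolicy);
if (pol && !mpol_tryget(pol))
	pol = NULL;	/* lost the race with the final mpol_put() */
rcu_read_unlock();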

Btw. have you tried to profile those slowdowns to identify hotspots?

Thanks
Huang, Ying Nov. 17, 2022, 6:29 a.m. UTC | #16
Michal Hocko <mhocko@suse.com> writes:

> On Wed 16-11-22 17:38:09, Zhongkun He wrote:
>> Hi Ying, thanks for your replay and suggestions.
>> 
>> > 
>> > I suggest moving the flags in the "mode" parameter (MPOL_F_STATIC_NODES,
>> > MPOL_F_RELATIVE_NODES, MPOL_F_NUMA_BALANCING, etc.) to the "flags"
>> > parameter; otherwise, why add it?
>> 
>> The "flags" is used for future extension if any, just like
>> process_madvise() and set_mempolicy_home_node().
>> Maybe it should be removed.
>
> No, please! Even if there is no use for the flags now, we are usually
> terrible at predicting the future and potential extensions. MPOL_F_* are
> kind of flags, but for historical reasons they are part of the mode, and
> we shouldn't create new confusion by treating this differently for the
> pidfd based APIs.
>
>> > And, how about adding a "home_node" parameter?  I don't think that it's a
>> > good idea to add another new syscall for pidfd_set_mempolicy_home_node()
>> > in the future.
>
> Why would this be a bad idea?
>
>> Good idea, but "home_node" is used for the vma policy, not the task policy.
>> It could be used in pidfd_mbind() in the future.
>
> I would go with pidfd_set_mempolicy_home_node as the counterpart of the
> existing syscall.

Yes.  I understand that it's important to keep the API consistent.

Just my two cents.

From another point of view, the new API also gives us an opportunity to
fix the issues in the existing API. For example, moving all flags into
the "flags" parameter and adding another "home_node" parameter? Maybe we
can switch to this new API completely in the future, after finding a way
to indicate the "current" task in the "pidfd" parameter.

Best Regards,
Huang, Ying
Zhongkun He Nov. 17, 2022, 7:19 a.m. UTC | #17
Hi Michal, thanks for your reply.

> 
> It would be better to add the patch that has been tested.

OK.

> 
> One way to deal with that would be to use a similar model as css_tryget

Percpu_ref is a good way to reduce the memory footprint on the fast path,
but it has the potential to make mempolicy heavy. The size of struct
mempolicy is 32 bytes, and a policy may not have a long lifetime, since
it is duplicated from the parent in fork(). If we change the atomic_t to
a percpu_ref, reads on the fast path get faster, but creation and
deletion get slower, and the occupied space increases significantly.
I am not really sure it is worth it.

sizeof(atomic_t): 4 bytes
sizeof(percpu_ref) + sizeof(percpu_ref_data) + nr_cpus * sizeof(unsigned long):
16 + 56 + nr_cpus * 8 bytes
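
For reference, a rough sketch of what a percpu_ref based lifetime could
look like (the "pref" field and all helpers below are hypothetical, not
existing kernel code):

static void mpol_release(struct percpu_ref *ref)
{
	struct mempolicy *pol = container_of(ref, struct mempolicy, pref);

	kmem_cache_free(policy_cache, pol);
}

/* creation: one extra percpu allocation per policy */
static int mpol_init_ref(struct mempolicy *pol)
{
	return percpu_ref_init(&pol->pref, mpol_release, 0, GFP_KERNEL);
}

/* fast path: percpu increment, no shared cache line bouncing */
static bool mpol_get_live(struct mempolicy *pol)
{
	return percpu_ref_tryget_live(&pol->pref);
}

/* teardown: drain to atomic mode; mpol_release() runs once refs hit zero */
static void mpol_kill(struct mempolicy *pol)
{
	percpu_ref_kill(&pol->pref);
}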

> 
> Btw. have you tried to profile those slowdowns to identify hotspots?
> 
> Thanks

Yes, it degrades performance by about 2%-3%, probably because of the
task_lock and the atomic operations on the reference count, as shown
in the previous email.

New hotspots in perf:
1.34%  [kernel]          [k] __mpol_put
0.53%  [kernel]          [k] _raw_spin_lock
0.44%  [kernel]          [k] get_task_policy


Tested patch:

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 8a74cdcc9af0..3f1b5c8329a8 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -105,10 +105,7 @@ static void hold_task_mempolicy(struct proc_maps_private *priv)
  {
         struct task_struct *task = priv->task;

-       task_lock(task);
         priv->task_mempolicy = get_task_policy(task);
-       mpol_get(priv->task_mempolicy);
-       task_unlock(task);
  }
  static void release_task_mempolicy(struct proc_maps_private *priv)
  {
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index d232de7cdc56..786481d7abfd 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -62,7 +62,7 @@ struct mempolicy {
  extern void __mpol_put(struct mempolicy *pol);
  static inline void mpol_put(struct mempolicy *pol)
  {
-       if (pol)
+       if (pol && !(pol->flags & MPOL_F_STATIC))
                 __mpol_put(pol);
  }

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 046d0ccba4cd..7c2068163a0c 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -63,7 +63,7 @@ enum {
  #define MPOL_F_SHARED  (1 << 0)        /* identify shared policies */
  #define MPOL_F_MOF     (1 << 3) /* this policy wants migrate on fault */
  #define MPOL_F_MORON   (1 << 4) /* Migrate On protnone Reference On Node */
-
+#define MPOL_F_STATIC (1 << 5)
  /*
   * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
   * ABI.  New bits are OK, but existing bits can never change.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 546df97c31e4..4cca96a40d04 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1247,6 +1247,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
         }

         mpol_cond_put(mpol);
+       mpol_put(mpol);
         return page;

  err:
@@ -2316,6 +2317,7 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
         if (!page)
                 page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
         mpol_cond_put(mpol);
+       mpol_put(mpol);
         return page;
  }

@@ -2352,6 +2354,7 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
         node = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
         page = alloc_huge_page_nodemask(h, node, nodemask, gfp_mask);
         mpol_cond_put(mpol);
+       mpol_put(mpol);

         return page;
  }
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 61aa9aedb728..ea670db6881f 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -126,6 +126,7 @@ enum zone_type policy_zone = 0;
  static struct mempolicy default_policy = {
         .refcnt = ATOMIC_INIT(1), /* never free it */
         .mode = MPOL_LOCAL,
+       .flags = MPOL_F_STATIC
  };

  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
@@ -160,11 +161,19 @@ EXPORT_SYMBOL_GPL(numa_map_to_online_node);

  struct mempolicy *get_task_policy(struct task_struct *p)
  {
-       struct mempolicy *pol = p->mempolicy;
+       struct mempolicy *pol;
         int node;

-       if (pol)
-               return pol;
+       if (p->mempolicy) {
+               task_lock(p);
+               pol = p->mempolicy;
+               mpol_get(pol);
+               task_unlock(p);
+
+               if (pol)
+                       return pol;
+       }

         node = numa_node_id();
         if (node != NUMA_NO_NODE) {
@@ -1764,10 +1773,12 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
                         * a pseudo vma whose vma->vm_ops=NULL. Take a reference
                          * count on these policies which will be dropped by
                          * mpol_cond_put() later
+                        *
+                        * if (mpol_needs_cond_ref(pol))
+                        *      mpol_get(pol);
                          */
-                       if (mpol_needs_cond_ref(pol))
-                               mpol_get(pol);
                 }
+               mpol_get(pol);
         }

         return pol;
@@ -1799,9 +1810,9 @@ static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
  bool vma_policy_mof(struct vm_area_struct *vma)
  {
         struct mempolicy *pol;
+       bool ret = false;

         if (vma->vm_ops && vma->vm_ops->get_policy) {
-               bool ret = false;

                 pol = vma->vm_ops->get_policy(vma, vma->vm_start);
                 if (pol && (pol->flags & MPOL_F_MOF))
@@ -1812,10 +1823,13 @@ bool vma_policy_mof(struct vm_area_struct *vma)
         }

         pol = vma->vm_policy;
+       mpol_get(pol);
         if (!pol)
                 pol = get_task_policy(current);
+       ret = pol && (pol->flags & MPOL_F_MOF);
+       mpol_put(pol);

-       return pol->flags & MPOL_F_MOF;
+       return ret;
  }

  bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
@@ -2179,7 +2193,6 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
                 unsigned nid;

                 nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
-               mpol_cond_put(pol);
                 gfp |= __GFP_COMP;
                 page = alloc_page_interleave(gfp, order, nid);
                 if (page && order > 1)
@@ -2194,7 +2207,6 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
                 node = policy_node(gfp, pol, node);
                 gfp |= __GFP_COMP;
                 page = alloc_pages_preferred_many(gfp, order, node, pol);
-               mpol_cond_put(pol);
                 if (page && order > 1)
                         prep_transhuge_page(page);
                 folio = (struct folio *)page;
@@ -2219,7 +2231,6 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,

                 nmask = policy_nodemask(gfp, pol);
                 if (!nmask || node_isset(hpage_node, *nmask)) {
-                       mpol_cond_put(pol);
                         /*
                         * First, try to allocate THP only on local node, but
                          * don't reclaim unnecessarily, just compact.
@@ -2244,8 +2255,9 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
         nmask = policy_nodemask(gfp, pol);
         preferred_nid = policy_node(gfp, pol, node);
         folio = __folio_alloc(gfp, order, preferred_nid, nmask);
-       mpol_cond_put(pol);
  out:
+       mpol_cond_put(pol);
+       mpol_put(pol);
         return folio;
  }
  EXPORT_SYMBOL(vma_alloc_folio);
@@ -2286,6 +2298,7 @@ struct page *alloc_pages(gfp_t gfp, unsigned order)
                                 policy_node(gfp, pol, numa_node_id()),
                                 policy_nodemask(gfp, pol));

+       mpol_put(pol);
         return page;
  }
  EXPORT_SYMBOL(alloc_pages);
@@ -2365,21 +2378,23 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
                 unsigned long nr_pages, struct page **page_array)
  {
         struct mempolicy *pol = &default_policy;
+       unsigned long allocated;

         if (!in_interrupt() && !(gfp & __GFP_THISNODE))
                 pol = get_task_policy(current);

-       if (pol->mode == MPOL_INTERLEAVE)
-               return alloc_pages_bulk_array_interleave(gfp, pol,
+       if (pol->mode == MPOL_INTERLEAVE) {
+               allocated =  alloc_pages_bulk_array_interleave(gfp, pol,
                                                          nr_pages, page_array);
-
-       if (pol->mode == MPOL_PREFERRED_MANY)
-               return alloc_pages_bulk_array_preferred_many(gfp,
+       } else if (pol->mode == MPOL_PREFERRED_MANY)
+               allocated = alloc_pages_bulk_array_preferred_many(gfp,
                                 numa_node_id(), pol, nr_pages, page_array);
-
-       return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
+       else
+              allocated = __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
                                  policy_nodemask(gfp, pol), nr_pages, NULL,
                                   page_array);
+       mpol_put(pol);
+       return allocated;
  }

  int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
@@ -2636,6 +2651,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
                 ret = polnid;
  out:
         mpol_cond_put(pol);
+       mpol_put(pol);

         return ret;
  }
@@ -2917,7 +2933,7 @@ void __init numa_policy_init(void)
                 preferred_node_policy[nid] = (struct mempolicy) {
                         .refcnt = ATOMIC_INIT(1),
                         .mode = MPOL_PREFERRED,
-                       .flags = MPOL_F_MOF | MPOL_F_MORON,
+                       .flags = MPOL_F_MOF | MPOL_F_MORON | MPOL_F_STATIC,
                         .nodes = nodemask_of_node(nid),
                 };
         }
Michal Hocko Nov. 21, 2022, 2:38 p.m. UTC | #18
On Thu 17-11-22 15:19:20, Zhongkun He wrote:
> Hi Michal, thanks for your reply.
> 
> > 
> > It would be better to add the patch that has been tested.
> 
> OK.
> 
> > 
> > One way to deal with that would be to use a similar model as css_tryget
> 
> Percpu_ref is a good way to reduce the memory footprint on the fast path,
> but it has the potential to make mempolicy heavy. The size of struct
> mempolicy is 32 bytes, and a policy may not have a long lifetime, since
> it is duplicated from the parent in fork(). If we change the atomic_t to
> a percpu_ref, reads on the fast path get faster, but creation and
> deletion get slower, and the occupied space increases significantly.
> I am not really sure it is worth it.
> 
> sizeof(atomic_t): 4 bytes
> sizeof(percpu_ref) + sizeof(percpu_ref_data) + nr_cpus * sizeof(unsigned long):
> 16 + 56 + nr_cpus * 8 bytes

Yes, the memory consumption is going to increase, but the question is
whether this is a real problem. Is it really common to
have many vmas with a dedicated policy?

What I am arguing here is that there are essentially 2 ways forward.
Either we continue to build up on top of the existing and arguably very
fragile code and make it even more subtle or follow a general pattern of
a proper reference counting (with usual tricks to reduce cache line
bouncing and similar issues). I do not really see why memory policies
should be any different and require very special treatment.

> > Btw. have you tried to profile those slowdowns to identify hotspots?
> > 
> > Thanks
> 
> Yes, it degrades performance by about 2%-3%, probably because of the
> task_lock and the atomic operations on the reference count, as shown
> in the previous email.
> 
> New hotspots in perf:
> 1.34%  [kernel]          [k] __mpol_put
> 0.53%  [kernel]          [k] _raw_spin_lock
> 0.44%  [kernel]          [k] get_task_policy

Thanks!
Zhongkun He Nov. 22, 2022, 8:33 a.m. UTC | #19
Hi Michal, thanks for your reply and suggestions.

> 
> > Yes, the memory consumption is going to increase, but the question is
> > whether this is a real problem. Is it really common to
> > have many vmas with a dedicated policy?

Yes, it is not really a problem.

> 
> What I am arguing here is that there are essentially 2 ways forward.
> Either we continue to build up on top of the existing and arguably very
> fragile code and make it even more subtle or follow a general pattern of
> a proper reference counting (with usual tricks to reduce cache line
> bouncing and similar issues). I do not really see why memory policies
> should be any different and require very special treatment.
> 

I got it. It is rather subtle and easy to get wrong if we push forward
with the existing way, and this is a good opportunity to get away from
the existing subtle model. I will try that in the next version.
Michal Hocko Nov. 22, 2022, 8:40 a.m. UTC | #20
On Tue 22-11-22 16:33:09, Zhongkun He wrote:
> Hi Michal, thanks for your reply and suggestions.
> 
> > 
> > Yes, the memory consumption is going to increase, but the question is
> > whether this is a real problem. Is it really common to
> > have many vmas with a dedicated policy?
> 
> Yes, it is not really a problem.
> 
> > 
> > What I am arguing here is that there are essentially 2 ways forward.
> > Either we continue to build up on top of the existing and arguably very
> > fragile code and make it even more subtle or follow a general pattern of
> > a proper reference counting (with usual tricks to reduce cache line
> > bouncing and similar issues). I do not really see why memory policies
> > should be any different and require very special treatment.
> > 
> 
> I got it. It is rather subtle and easy to get wrong if we push forward
> with the existing way, and this is a good opportunity to get away from
> the existing subtle model. I will try that in the next version.

Thanks for being receptive to the review feedback!
diff mbox series

Patch

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index 5a6afecbb0d0..b45ceb0b165c 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -408,9 +408,10 @@  follows:
 Memory Policy APIs
 ==================
 
-Linux supports 4 system calls for controlling memory policy.  These APIS
-always affect only the calling task, the calling task's address space, or
-some shared object mapped into the calling task's address space.
+Linux supports 5 system calls for controlling memory policy.  The first four
+APIs affect only the calling task, the calling task's address space, or
+some shared object mapped into the calling task's address space. The last
+one sets the mempolicy of the task specified in the pidfd.
 
 .. note::
    the headers that define these APIs and the parameter data types for
@@ -473,6 +474,18 @@  closest to which page allocation will come from. Specifying the home node overri
 the default allocation policy to allocate memory close to the local node for an
 executing CPU.
 
+Set [pidfd Task] Memory Policy::
+
+        long sys_pidfd_set_mempolicy(int pidfd, int mode,
+                                     const unsigned long __user *nmask,
+                                     unsigned long maxnode,
+                                     unsigned int flags);
+
+Sets the task/process memory policy for the [pidfd] task. The pidfd argument
+is a PID file descriptor (see pidfd_open(2) man page for details) that
+specifies the process to which the mempolicy is applied. The flags
+argument is reserved for future use; currently, it must be specified as 0.
+For the description of all other arguments, see set_mempolicy(2) man page.
 
 Memory Policy Command Line Interface
 ====================================
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 8ebacf37a8cf..3aeaefe7d45b 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -490,3 +490,4 @@ 
 558	common	process_mrelease		sys_process_mrelease
 559	common  futex_waitv                     sys_futex_waitv
 560	common	set_mempolicy_home_node		sys_ni_syscall
+561	common	pidfd_set_mempolicy		sys_ni_syscall
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index ac964612d8b0..a7ccab8aafef 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -464,3 +464,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common	futex_waitv			sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 037feba03a51..64a514f90131 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -39,7 +39,7 @@ 
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		451
+#define __NR_compat_syscalls		452
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 604a2053d006..2e178e9152e6 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -907,7 +907,8 @@  __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
 __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
-
+#define __NR_pidfd_set_mempolicy 451
+__SYSCALL(__NR_pidfd_set_mempolicy, sys_pidfd_set_mempolicy)
 /*
  * Please add new compat syscalls above this comment and update
  * __NR_compat_syscalls in asm/unistd.h.
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 72c929d9902b..6f60981592b4 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -371,3 +371,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index b1f3940bc298..8e50bf8ec41d 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -450,3 +450,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 820145e47350..f48e32532c5f 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -456,3 +456,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 253ff994ed2e..58e7c3140f4a 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -389,3 +389,4 @@ 
 448	n32	process_mrelease		sys_process_mrelease
 449	n32	futex_waitv			sys_futex_waitv
 450	n32	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	n32	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 3f1886ad9d80..70090c3ada16 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -365,3 +365,4 @@ 
 448	n64	process_mrelease		sys_process_mrelease
 449	n64	futex_waitv			sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	n64	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 8f243e35a7b2..b0b0bad64fa0 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -438,3 +438,4 @@ 
 448	o32	process_mrelease		sys_process_mrelease
 449	o32	futex_waitv			sys_futex_waitv
 450	o32	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	o32	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 8a99c998da9b..4dfd328490ad 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -448,3 +448,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common	futex_waitv			sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index a0be127475b1..34bd3cea5954 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -537,3 +537,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450 	nospu	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	nospu	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 799147658dee..5e170dca81f6 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -453,3 +453,4 @@ 
 448  common	process_mrelease	sys_process_mrelease		sys_process_mrelease
 449  common	futex_waitv		sys_futex_waitv			sys_futex_waitv
 450  common	set_mempolicy_home_node	sys_set_mempolicy_home_node	sys_set_mempolicy_home_node
+451  common	pidfd_set_mempolicy	sys_pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 2de85c977f54..bd3a24a9b41f 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -453,3 +453,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 4398cc6fb68d..bea2b5c6314b 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -496,3 +496,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 320480a8db4f..97bc70f5a8f7 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -455,3 +455,4 @@ 
 448	i386	process_mrelease	sys_process_mrelease
 449	i386	futex_waitv		sys_futex_waitv
 450	i386	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	i386	pidfd_set_mempolicy	sys_pidfd_set_mempolicy
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c84d12608cd2..ba6d19dbd8f8 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -372,6 +372,7 @@ 
 448	common	process_mrelease	sys_process_mrelease
 449	common	futex_waitv		sys_futex_waitv
 450	common	set_mempolicy_home_node	sys_set_mempolicy_home_node
+451	common	pidfd_set_mempolicy	sys_pidfd_set_mempolicy
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 52c94ab5c205..9f7c563da4fb 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -421,3 +421,4 @@ 
 448	common	process_mrelease		sys_process_mrelease
 449	common  futex_waitv                     sys_futex_waitv
 450	common	set_mempolicy_home_node		sys_set_mempolicy_home_node
+451	common	pidfd_set_mempolicy		sys_pidfd_set_mempolicy
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index d58e0476ee8e..e9db7e10f171 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -77,6 +77,7 @@  extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
 extern bool cpuset_cpus_allowed_fallback(struct task_struct *p);
 extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 #define cpuset_current_mems_allowed (current->mems_allowed)
+#define cpuset_task_mems_allowed(task) ((task)->mems_allowed)
 void cpuset_init_current_mems_allowed(void);
 int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
@@ -216,6 +217,7 @@  static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
 }
 
 #define cpuset_current_mems_allowed (node_states[N_MEMORY])
+#define cpuset_task_mems_allowed(task) (node_states[N_MEMORY])
 static inline void cpuset_init_current_mems_allowed(void) {}
 
 static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index d232de7cdc56..afb92020808e 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -13,6 +13,7 @@ 
 #include <linux/spinlock.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
+#include <linux/task_work.h>
 #include <uapi/linux/mempolicy.h>
 
 struct mm_struct;
@@ -51,6 +52,7 @@  struct mempolicy {
 	union {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
+		struct callback_head cb_head;   /* async free */
 	} w;
 };
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a34b0f9a9972..e611c088050b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1056,6 +1056,10 @@  asmlinkage long sys_memfd_secret(unsigned int flags);
 asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
 					    unsigned long home_node,
 					    unsigned long flags);
+asmlinkage long sys_pidfd_set_mempolicy(int pidfd, int mode,
+					const unsigned long __user *nmask,
+					unsigned long maxnode,
+					unsigned int flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 45fa180cc56a..c38013bbf5b0 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -886,8 +886,11 @@  __SYSCALL(__NR_futex_waitv, sys_futex_waitv)
 #define __NR_set_mempolicy_home_node 450
 __SYSCALL(__NR_set_mempolicy_home_node, sys_set_mempolicy_home_node)
 
+#define __NR_pidfd_set_mempolicy 451
+__SYSCALL(__NR_pidfd_set_mempolicy, sys_pidfd_set_mempolicy)
+
 #undef __NR_syscalls
-#define __NR_syscalls 451
+#define __NR_syscalls 452
 
 /*
  * 32 bit systems traditionally used different
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 860b2dcf3ac4..5053deb888f7 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -299,6 +299,7 @@  COND_SYSCALL(set_mempolicy);
 COND_SYSCALL(migrate_pages);
 COND_SYSCALL(move_pages);
 COND_SYSCALL(set_mempolicy_home_node);
+COND_SYSCALL(pidfd_set_mempolicy);
 
 COND_SYSCALL(perf_event_open);
 COND_SYSCALL(accept4);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 61aa9aedb728..2ac307aba01c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -221,7 +221,7 @@  static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
  * Must be called holding task's alloc_lock to protect task's mems_allowed
  * and mempolicy.  May also be called holding the mmap_lock for write.
  */
-static int mpol_set_nodemask(struct mempolicy *pol,
+static int mpol_set_nodemask(struct task_struct *task, struct mempolicy *pol,
 		     const nodemask_t *nodes, struct nodemask_scratch *nsc)
 {
 	int ret;
@@ -236,7 +236,7 @@  static int mpol_set_nodemask(struct mempolicy *pol,
 
 	/* Check N_MEMORY */
 	nodes_and(nsc->mask1,
-		  cpuset_current_mems_allowed, node_states[N_MEMORY]);
+		  cpuset_task_mems_allowed(task), node_states[N_MEMORY]);
 
 	VM_BUG_ON(!nodes);
 
@@ -248,7 +248,7 @@  static int mpol_set_nodemask(struct mempolicy *pol,
 	if (mpol_store_user_nodemask(pol))
 		pol->w.user_nodemask = *nodes;
 	else
-		pol->w.cpuset_mems_allowed = cpuset_current_mems_allowed;
+		pol->w.cpuset_mems_allowed = cpuset_task_mems_allowed(task);
 
 	ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
 	return ret;
@@ -312,6 +312,41 @@  void __mpol_put(struct mempolicy *p)
 	kmem_cache_free(policy_cache, p);
 }
 
+static void mpol_free_async(struct callback_head *work)
+{
+	kmem_cache_free(policy_cache,
+			container_of(work, struct mempolicy, w.cb_head));
+}
+
+/*
+ * mpol destructor for pidfd_set_mempolicy().
+ * Free the mempolicy directly if @task is NULL or task_work_add() fails.
+ */
+void mpol_put_async(struct task_struct *task, struct mempolicy *p)
+{
+	enum task_work_notify_mode notify = TWA_RESUME;
+
+	if (!atomic_dec_and_test(&p->refcnt))
+		return;
+
+	if (!task)
+		goto out;
+
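+	/*
+	 * Defer the free to task context: @task may still be dereferencing
+	 * the old policy in a lockless allocation path such as alloc_pages().
+	 */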
+	init_task_work(&p->w.cb_head, mpol_free_async);
+	if (task_work_pending(task))
+		notify = TWA_SIGNAL; /* ensure the policy is freed promptly */
+
+	if (!task_work_add(task, &p->w.cb_head, notify))
+		return;
+out:
+	kmem_cache_free(policy_cache, p);
+}
+
 static void mpol_rebind_default(struct mempolicy *pol, const nodemask_t *nodes)
 {
 }
@@ -850,8 +880,8 @@  static int mbind_range(struct mm_struct *mm, unsigned long start,
 }
 
 /* Set the process memory policy */
-static long do_set_mempolicy(unsigned short mode, unsigned short flags,
-			     nodemask_t *nodes)
+static long do_set_mempolicy(struct task_struct *task, unsigned short mode,
+		unsigned short flags, nodemask_t *nodes)
 {
 	struct mempolicy *new, *old;
 	NODEMASK_SCRATCH(scratch);
@@ -866,20 +896,25 @@  static long do_set_mempolicy(unsigned short mode,
 		goto out;
 	}
 
-	task_lock(current);
-	ret = mpol_set_nodemask(new, nodes, scratch);
+	task_lock(task);
+	ret = mpol_set_nodemask(task, new, nodes, scratch);
 	if (ret) {
-		task_unlock(current);
+		task_unlock(task);
 		mpol_put(new);
 		goto out;
 	}
 
-	old = current->mempolicy;
-	current->mempolicy = new;
+	old = task->mempolicy;
+	task->mempolicy = new;
 	if (new && new->mode == MPOL_INTERLEAVE)
-		current->il_prev = MAX_NUMNODES-1;
-	task_unlock(current);
-	mpol_put(old);
+		task->il_prev = MAX_NUMNODES-1;
+	task_unlock(task);
+
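+	/* A remote task may still be using "old" in a lockless path. */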
+	if (old && task != current)
+		mpol_put_async(task, old);
+	else
+		mpol_put(old);
 	ret = 0;
 out:
 	NODEMASK_SCRATCH_FREE(scratch);
@@ -932,7 +966,7 @@  static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 	int err;
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
-	struct mempolicy *pol = current->mempolicy, *pol_refcount = NULL;
+	struct mempolicy *pol;
 
 	if (flags &
 		~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
@@ -947,6 +981,15 @@  static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		task_unlock(current);
 		return 0;
 	}
+	/*
+	 * Take a refcount on the mpol: pidfd_set_mempolicy() may replace
+	 * and free it asynchronously, invalidating the w.user_nodemask
+	 * used below.
+	 */
+	task_lock(current);
+	pol = current->mempolicy;
+	mpol_get(pol);
+	task_unlock(current);
 
 	if (flags & MPOL_F_ADDR) {
 		/*
@@ -954,6 +997,7 @@  static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		 * vma/shared policy at addr is NULL.  We
 		 * want to return MPOL_DEFAULT in this case.
 		 */
+		mpol_put(pol);	/* drop the refcount on the task mpol */
 		mmap_read_lock(mm);
 		vma = vma_lookup(mm, addr);
 		if (!vma) {
@@ -964,23 +1008,19 @@  static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 			pol = vma->vm_ops->get_policy(vma, addr);
 		else
 			pol = vma->vm_policy;
-	} else if (addr)
-		return -EINVAL;
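+		/* Take a ref so the policy outlives mmap_read_unlock(). */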
+		mpol_get(pol);
+		mmap_read_unlock(mm);
+	} else if (addr) {
+		err = -EINVAL;
+		goto out;
+	}
 
 	if (!pol)
 		pol = &default_policy;	/* indicates default behavior */
 
 	if (flags & MPOL_F_NODE) {
 		if (flags & MPOL_F_ADDR) {
-			/*
-			 * Take a refcount on the mpol, because we are about to
-			 * drop the mmap_lock, after which only "pol" remains
-			 * valid, "vma" is stale.
-			 */
-			pol_refcount = pol;
-			vma = NULL;
-			mpol_get(pol);
-			mmap_read_unlock(mm);
 			err = lookup_node(mm, addr);
 			if (err < 0)
 				goto out;
@@ -1004,21 +1043,18 @@  static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 
 	err = 0;
 	if (nmask) {
-		if (mpol_store_user_nodemask(pol)) {
+		if (mpol_store_user_nodemask(pol))
 			*nmask = pol->w.user_nodemask;
-		} else {
-			task_lock(current);
+		else
 			get_policy_nodemask(pol, nmask);
-			task_unlock(current);
-		}
 	}
 
  out:
 	mpol_cond_put(pol);
-	if (vma)
-		mmap_read_unlock(mm);
-	if (pol_refcount)
-		mpol_put(pol_refcount);
+
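+	/* default_policy is statically allocated and never freed. */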
+	if (pol != &default_policy)
+		mpol_put(pol);
 	return err;
 }
 
@@ -1309,7 +1344,7 @@  static long do_mbind(unsigned long start, unsigned long len,
 		NODEMASK_SCRATCH(scratch);
 		if (scratch) {
 			mmap_write_lock(mm);
-			err = mpol_set_nodemask(new, nmask, scratch);
+			err = mpol_set_nodemask(current, new, nmask, scratch);
 			if (err)
 				mmap_write_unlock(mm);
 		} else
@@ -1578,7 +1613,7 @@  static long kernel_set_mempolicy(int mode, const unsigned long __user *nmask,
 	if (err)
 		return err;
 
-	return do_set_mempolicy(lmode, mode_flags, &nodes);
+	return do_set_mempolicy(current, lmode, mode_flags, &nodes);
 }
 
 SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
@@ -1587,6 +1622,73 @@  SYSCALL_DEFINE3(set_mempolicy, int, mode, const unsigned long __user *, nmask,
 	return kernel_set_mempolicy(mode, nmask, maxnode);
 }
 
+/* Set the memory policy of the task specified by @pidfd. */
+static long do_pidfd_set_mempolicy(int pidfd, int mode, const unsigned long __user *nmask,
+		unsigned long maxnode, unsigned int flags)
+{
+	struct mm_struct *mm = NULL;
+	struct task_struct *task;
+	unsigned short mode_flags;
+	int err, lmode = mode;
+	unsigned int f_flags;
+	nodemask_t nodes;
+
+	if (flags)
+		return -EINVAL;
+
+	err = get_nodes(&nodes, nmask, maxnode);
+	if (err)
+		return err;
+
+	err = sanitize_mpol_flags(&lmode, &mode_flags);
+	if (err)
+		return err;
+
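+	/* Resolve @pidfd to its task; f_flags is not used further. */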
+	task = pidfd_get_task(pidfd, &f_flags);
+	if (IS_ERR(task))
+		return PTR_ERR(task);
+
+	/*
+	 * Require CAP_SYS_NICE for influencing process performance;
+	 * only user-space tasks may be targeted.
+	 */
+	if (!capable(CAP_SYS_NICE) || (task->flags & PF_KTHREAD)) {
+		err = -EPERM;
+		goto out;
+	}
+
+	/*
+	 * Require PTRACE_MODE_ATTACH_REALCREDS to avoid
+	 * leaking ASLR metadata.
+	 */
+	mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+	if (IS_ERR_OR_NULL(mm)) {
+		err = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+		goto out;
+	}
+
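+	/* Fault-path users of the task policy hold mmap_lock for read. */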
+	if (mmap_write_lock_killable(mm)) {
+		err = -EINTR;
+		goto release_mm;
+	}
+
+	err = do_set_mempolicy(task, lmode, mode_flags, &nodes);
+	mmap_write_unlock(mm);
+release_mm:
+	mmput(mm);
+out:
+	put_task_struct(task);
+	return err;
+}
+
+SYSCALL_DEFINE5(pidfd_set_mempolicy, int, pidfd, int, mode, const unsigned long __user *, nmask,
+		unsigned long, maxnode, unsigned int, flags)
+{
+	return do_pidfd_set_mempolicy(pidfd, mode, nmask, maxnode, flags);
+}
+
 static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
 				const unsigned long __user *old_nodes,
 				const unsigned long __user *new_nodes)
@@ -2790,7 +2890,7 @@  void mpol_shared_policy_init(struct shared_policy *sp, struct mempolicy *mpol)
 			goto free_scratch; /* no valid nodemask intersection */
 
 		task_lock(current);
-		ret = mpol_set_nodemask(new, &mpol->w.user_nodemask, scratch);
+		ret = mpol_set_nodemask(current, new, &mpol->w.user_nodemask, scratch);
 		task_unlock(current);
 		if (ret)
 			goto put_new;
@@ -2946,7 +3046,7 @@  void __init numa_policy_init(void)
 	if (unlikely(nodes_empty(interleave_nodes)))
 		node_set(prefer, interleave_nodes);
 
-	if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
+	if (do_set_mempolicy(current, MPOL_INTERLEAVE, 0, &interleave_nodes))
 		pr_err("%s: interleaving failed\n", __func__);
 
 	check_numabalancing_enable();
@@ -2955,7 +3055,7 @@  void __init numa_policy_init(void)
 /* Reset policy of current process to default */
 void numa_default_policy(void)
 {
-	do_set_mempolicy(MPOL_DEFAULT, 0, NULL);
+	do_set_mempolicy(current, MPOL_DEFAULT, 0, NULL);
 }
 
 /*