[RFC,5/7] mm: introduce external memory hinting API

Message ID	20190520035254.57579-6-minchan@kernel.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: pass (google.com: domain of minchan.kim@gmail.com designates 209.85.220.65 as permitted sender) client-ip=209.85.220.65; From: Minchan Kim <minchan@kernel.org> To: Andrew Morton <akpm@linux-foundation.org> Cc: LKML <linux-kernel@vger.kernel.org>, linux-mm <linux-mm@kvack.org>, Michal Hocko <mhocko@suse.com>, Johannes Weiner <hannes@cmpxchg.org>, Tim Murray <timmurray@google.com>, Joel Fernandes <joel@joelfernandes.org>, Suren Baghdasaryan <surenb@google.com>, Daniel Colascione <dancol@google.com>, Shakeel Butt <shakeelb@google.com>, Sonny Rao <sonnyrao@google.com>, Brian Geffon <bgeffon@google.com>, Minchan Kim <minchan@kernel.org> Subject: [RFC 5/7] mm: introduce external memory hinting API Date: Mon, 20 May 2019 12:52:52 +0900 Message-Id: <20190520035254.57579-6-minchan@kernel.org> In-Reply-To: <20190520035254.57579-1-minchan@kernel.org> References: <20190520035254.57579-1-minchan@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	introduce memory hinting API for external process \| expand [RFC,0/7] introduce memory hinting API for external process [RFC,1/7] mm: introduce MADV_COOL [RFC,2/7] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM [RFC,3/7] mm: introduce MADV_COLD [RFC,4/7] mm: factor out madvise's core functionality [RFC,5/7] mm: introduce external memory hinting API [RFC,6/7] mm: extend process_madvise syscall to support vector arrary [RFC,7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER

Minchan Kim May 20, 2019, 3:52 a.m. UTC

There is some usecase that centralized userspace daemon want to give
a memory hint like MADV_[COOL|COLD] to other process. Android's
ActivityManagerService is one of them.

It's similar in spirit to madvise(MADV_WONTNEED), but the information
required to make the reclaim decision is not known to the app. Instead,
it is known to the centralized userspace daemon(ActivityManagerService),
and that daemon must be able to initiate reclaim on its own without
any app involvement.

To solve the issue, this patch introduces new syscall process_madvise(2)
which works based on pidfd so it could give a hint to the exeternal
process.

int process_madvise(int pidfd, void *addr, size_t length, int advise);

All advises madvise provides can be supported in process_madvise, too.
Since it could affect other process's address range, only privileged
process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
gives it the right to ptrrace the process could use it successfully.

Please suggest better idea if you have other idea about the permission.

* from v1r1
  * use ptrace capability - surenb, dancol

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/x86/entry/syscalls/syscall_32.tbl |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl |  1 +
 include/linux/proc_fs.h                |  1 +
 include/linux/syscalls.h               |  2 ++
 include/uapi/asm-generic/unistd.h      |  2 ++
 kernel/signal.c                        |  2 +-
 kernel/sys_ni.c                        |  1 +
 mm/madvise.c                           | 45 ++++++++++++++++++++++++++
 8 files changed, 54 insertions(+), 1 deletion(-)

Michal Hocko May 20, 2019, 9:18 a.m. UTC | #1

[Cc linux-api]

On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> There is some usecase that centralized userspace daemon want to give
> a memory hint like MADV_[COOL|COLD] to other process. Android's
> ActivityManagerService is one of them.
> 
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.

Could you expand some more about how this all works? How does the
centralized daemon track respective ranges? How does it synchronize
against parallel modification of the address space etc.

> To solve the issue, this patch introduces new syscall process_madvise(2)
> which works based on pidfd so it could give a hint to the exeternal
> process.
> 
> int process_madvise(int pidfd, void *addr, size_t length, int advise);

OK, this makes some sense from the API point of view. When we have
discussed that at LSFMM I was contemplating about something like that
except the fd would be a VMA fd rather than the process. We could extend
and reuse /proc/<pid>/map_files interface which doesn't support the
anonymous memory right now. 

I am not saying this would be a better interface but I wanted to mention
it here for a further discussion. One slight advantage would be that
you know the exact object that you are operating on because you have a
fd for the VMA and we would have a more straightforward way to reject
operation if the underlying object has changed (e.g. unmapped and reused
for a different mapping).

> All advises madvise provides can be supported in process_madvise, too.
> Since it could affect other process's address range, only privileged
> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> gives it the right to ptrrace the process could use it successfully.

proc_mem_open model we use for accessing address space via proc sounds
like a good mode. You are doing something similar.

> Please suggest better idea if you have other idea about the permission.
> 
> * from v1r1
>   * use ptrace capability - surenb, dancol
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  include/linux/proc_fs.h                |  1 +
>  include/linux/syscalls.h               |  2 ++
>  include/uapi/asm-generic/unistd.h      |  2 ++
>  kernel/signal.c                        |  2 +-
>  kernel/sys_ni.c                        |  1 +
>  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
>  8 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 4cd5f982b1e5..5b9dd55d6b57 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -438,3 +438,4 @@
>  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
>  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
>  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 64ca0d06259a..0e5ee78161c9 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -355,6 +355,7 @@
>  425	common	io_uring_setup		__x64_sys_io_uring_setup
>  426	common	io_uring_enter		__x64_sys_io_uring_enter
>  427	common	io_uring_register	__x64_sys_io_uring_register
> +428	common	process_madvise		__x64_sys_process_madvise
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 52a283ba0465..f8545d7c5218 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
>  
>  #endif /* CONFIG_PROC_FS */
>  
> +extern struct pid *pidfd_to_pid(const struct file *file);
>  struct net;
>  
>  static inline struct proc_dir_entry *proc_net_mkdir(
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..21c6c9a62006 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
>  asmlinkage long sys_mincore(unsigned long start, size_t len,
>  				unsigned char __user * vec);
>  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> +				size_t len, int behavior);
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>  			unsigned long prot, unsigned long pgoff,
>  			unsigned long flags);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index dee7292e1df6..7ee82ce04620 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
>  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
>  #define __NR_io_uring_register 427
>  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> +#define __NR_process_madvise 428
> +__SYSCALL(__NR_process_madvise, sys_process_madvise)
>  
>  #undef __NR_syscalls
>  #define __NR_syscalls 428
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 1c86b78a7597..04e75daab1f8 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
>  	return copy_siginfo_from_user(kinfo, info);
>  }
>  
> -static struct pid *pidfd_to_pid(const struct file *file)
> +struct pid *pidfd_to_pid(const struct file *file)
>  {
>  	if (file->f_op == &pidfd_fops)
>  		return file->private_data;
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 4d9ae5ea6caf..5277421795ab 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
>  COND_SYSCALL(munlockall);
>  COND_SYSCALL(mincore);
>  COND_SYSCALL(madvise);
> +COND_SYSCALL(process_madvise);
>  COND_SYSCALL(remap_file_pages);
>  COND_SYSCALL(mbind);
>  COND_SYSCALL_COMPAT(mbind);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 119e82e1f065..af02aa17e5c1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -9,6 +9,7 @@
>  #include <linux/mman.h>
>  #include <linux/pagemap.h>
>  #include <linux/page_idle.h>
> +#include <linux/proc_fs.h>
>  #include <linux/syscalls.h>
>  #include <linux/mempolicy.h>
>  #include <linux/page-isolation.h>
> @@ -16,6 +17,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/falloc.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/ksm.h>
>  #include <linux/fs.h>
>  #include <linux/file.h>
> @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
>  	return madvise_core(current, start, len_in, behavior);
>  }
> +
> +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> +		size_t, len_in, int, behavior)
> +{
> +	int ret;
> +	struct fd f;
> +	struct pid *pid;
> +	struct task_struct *tsk;
> +	struct mm_struct *mm;
> +
> +	f = fdget(pidfd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	pid = pidfd_to_pid(f.file);
> +	if (IS_ERR(pid)) {
> +		ret = PTR_ERR(pid);
> +		goto err;
> +	}
> +
> +	ret = -EINVAL;
> +	rcu_read_lock();
> +	tsk = pid_task(pid, PIDTYPE_PID);
> +	if (!tsk) {
> +		rcu_read_unlock();
> +		goto err;
> +	}
> +	get_task_struct(tsk);
> +	rcu_read_unlock();
> +	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
> +	if (!mm || IS_ERR(mm)) {
> +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> +		if (ret == -EACCES)
> +			ret = -EPERM;
> +		goto err;
> +	}
> +	ret = madvise_core(tsk, start, len_in, behavior);
> +	mmput(mm);
> +	put_task_struct(tsk);
> +err:
> +	fdput(f);
> +	return ret;
> +}
> -- 
> 2.21.0.1020.gf2820cf01a-goog
>

Minchan Kim May 21, 2019, 2:41 a.m. UTC | #2

On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote:
> [Cc linux-api]
> 
> On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> > There is some usecase that centralized userspace daemon want to give
> > a memory hint like MADV_[COOL|COLD] to other process. Android's
> > ActivityManagerService is one of them.
> > 
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon(ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> 
> Could you expand some more about how this all works? How does the
> centralized daemon track respective ranges? How does it synchronize
> against parallel modification of the address space etc.

Currently, we don't track each address ranges because we have two
policies at this moment:

	deactive file pages and reclaim anonymous pages of the app.

Since the daemon has a ability to let background apps resume(IOW, process
will be run by the daemon) and both hints are non-disruptive stabilty point
of view, we are okay for the race.

> 
> > To solve the issue, this patch introduces new syscall process_madvise(2)
> > which works based on pidfd so it could give a hint to the exeternal
> > process.
> > 
> > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> 
> OK, this makes some sense from the API point of view. When we have
> discussed that at LSFMM I was contemplating about something like that
> except the fd would be a VMA fd rather than the process. We could extend
> and reuse /proc/<pid>/map_files interface which doesn't support the
> anonymous memory right now. 
> 
> I am not saying this would be a better interface but I wanted to mention
> it here for a further discussion. One slight advantage would be that
> you know the exact object that you are operating on because you have a
> fd for the VMA and we would have a more straightforward way to reject
> operation if the underlying object has changed (e.g. unmapped and reused
> for a different mapping).

I agree your point. If I didn't miss something, such kinds of vma level
modify notification doesn't work even file mapped vma at this moment.
For anonymous vma, I think we could use userfaultfd, pontentially.
It would be great if someone want to do with disruptive hints like
MADV_DONTNEED.

I'd like to see it further enhancement after landing address range based
operation via limiting hints process_madvise supports to non-disruptive
only(e.g., MADV_[COOL|COLD]) so that we could catch up the usercase/workload
when someone want to extend the API.

> 
> > All advises madvise provides can be supported in process_madvise, too.
> > Since it could affect other process's address range, only privileged
> > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > gives it the right to ptrrace the process could use it successfully.
> 
> proc_mem_open model we use for accessing address space via proc sounds
> like a good mode. You are doing something similar.
> 
> > Please suggest better idea if you have other idea about the permission.
> > 
> > * from v1r1
> >   * use ptrace capability - surenb, dancol
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  include/linux/proc_fs.h                |  1 +
> >  include/linux/syscalls.h               |  2 ++
> >  include/uapi/asm-generic/unistd.h      |  2 ++
> >  kernel/signal.c                        |  2 +-
> >  kernel/sys_ni.c                        |  1 +
> >  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
> >  8 files changed, 54 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 4cd5f982b1e5..5b9dd55d6b57 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -438,3 +438,4 @@
> >  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
> >  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
> >  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> > +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 64ca0d06259a..0e5ee78161c9 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -355,6 +355,7 @@
> >  425	common	io_uring_setup		__x64_sys_io_uring_setup
> >  426	common	io_uring_enter		__x64_sys_io_uring_enter
> >  427	common	io_uring_register	__x64_sys_io_uring_register
> > +428	common	process_madvise		__x64_sys_process_madvise
> >  
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> > index 52a283ba0465..f8545d7c5218 100644
> > --- a/include/linux/proc_fs.h
> > +++ b/include/linux/proc_fs.h
> > @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
> >  
> >  #endif /* CONFIG_PROC_FS */
> >  
> > +extern struct pid *pidfd_to_pid(const struct file *file);
> >  struct net;
> >  
> >  static inline struct proc_dir_entry *proc_net_mkdir(
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e2870fe1be5b..21c6c9a62006 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
> >  asmlinkage long sys_mincore(unsigned long start, size_t len,
> >  				unsigned char __user * vec);
> >  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> > +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> > +				size_t len, int behavior);
> >  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> >  			unsigned long prot, unsigned long pgoff,
> >  			unsigned long flags);
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index dee7292e1df6..7ee82ce04620 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_process_madvise 428
> > +__SYSCALL(__NR_process_madvise, sys_process_madvise)
> >  
> >  #undef __NR_syscalls
> >  #define __NR_syscalls 428
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 1c86b78a7597..04e75daab1f8 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> >  	return copy_siginfo_from_user(kinfo, info);
> >  }
> >  
> > -static struct pid *pidfd_to_pid(const struct file *file)
> > +struct pid *pidfd_to_pid(const struct file *file)
> >  {
> >  	if (file->f_op == &pidfd_fops)
> >  		return file->private_data;
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 4d9ae5ea6caf..5277421795ab 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
> >  COND_SYSCALL(munlockall);
> >  COND_SYSCALL(mincore);
> >  COND_SYSCALL(madvise);
> > +COND_SYSCALL(process_madvise);
> >  COND_SYSCALL(remap_file_pages);
> >  COND_SYSCALL(mbind);
> >  COND_SYSCALL_COMPAT(mbind);
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 119e82e1f065..af02aa17e5c1 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/mman.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/proc_fs.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/mempolicy.h>
> >  #include <linux/page-isolation.h>
> > @@ -16,6 +17,7 @@
> >  #include <linux/hugetlb.h>
> >  #include <linux/falloc.h>
> >  #include <linux/sched.h>
> > +#include <linux/sched/mm.h>
> >  #include <linux/ksm.h>
> >  #include <linux/fs.h>
> >  #include <linux/file.h>
> > @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  {
> >  	return madvise_core(current, start, len_in, behavior);
> >  }
> > +
> > +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> > +		size_t, len_in, int, behavior)
> > +{
> > +	int ret;
> > +	struct fd f;
> > +	struct pid *pid;
> > +	struct task_struct *tsk;
> > +	struct mm_struct *mm;
> > +
> > +	f = fdget(pidfd);
> > +	if (!f.file)
> > +		return -EBADF;
> > +
> > +	pid = pidfd_to_pid(f.file);
> > +	if (IS_ERR(pid)) {
> > +		ret = PTR_ERR(pid);
> > +		goto err;
> > +	}
> > +
> > +	ret = -EINVAL;
> > +	rcu_read_lock();
> > +	tsk = pid_task(pid, PIDTYPE_PID);
> > +	if (!tsk) {
> > +		rcu_read_unlock();
> > +		goto err;
> > +	}
> > +	get_task_struct(tsk);
> > +	rcu_read_unlock();
> > +	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
> > +	if (!mm || IS_ERR(mm)) {
> > +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > +		if (ret == -EACCES)
> > +			ret = -EPERM;
> > +		goto err;
> > +	}
> > +	ret = madvise_core(tsk, start, len_in, behavior);
> > +	mmput(mm);
> > +	put_task_struct(tsk);
> > +err:
> > +	fdput(f);
> > +	return ret;
> > +}
> > -- 
> > 2.21.0.1020.gf2820cf01a-goog
> > 
> 
> -- 
> Michal Hocko
> SUSE Labs

Michal Hocko May 21, 2019, 6:17 a.m. UTC | #3

On Tue 21-05-19 11:41:07, Minchan Kim wrote:
> On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote:
> > [Cc linux-api]
> > 
> > On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> > > There is some usecase that centralized userspace daemon want to give
> > > a memory hint like MADV_[COOL|COLD] to other process. Android's
> > > ActivityManagerService is one of them.
> > > 
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > 
> > Could you expand some more about how this all works? How does the
> > centralized daemon track respective ranges? How does it synchronize
> > against parallel modification of the address space etc.
> 
> Currently, we don't track each address ranges because we have two
> policies at this moment:
> 
> 	deactive file pages and reclaim anonymous pages of the app.
> 
> Since the daemon has a ability to let background apps resume(IOW, process
> will be run by the daemon) and both hints are non-disruptive stabilty point
> of view, we are okay for the race.

Fair enough but the API should consider future usecases where this might
be a problem. So we should really think about those potential scenarios
now. If we are ok with that, fine, but then we should be explicit and
document it that way. Essentially say that any sort of synchronization
is supposed to be done by monitor. This will make the API less usable
but maybe that is enough.
 
> > > To solve the issue, this patch introduces new syscall process_madvise(2)
> > > which works based on pidfd so it could give a hint to the exeternal
> > > process.
> > > 
> > > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> > 
> > OK, this makes some sense from the API point of view. When we have
> > discussed that at LSFMM I was contemplating about something like that
> > except the fd would be a VMA fd rather than the process. We could extend
> > and reuse /proc/<pid>/map_files interface which doesn't support the
> > anonymous memory right now. 
> > 
> > I am not saying this would be a better interface but I wanted to mention
> > it here for a further discussion. One slight advantage would be that
> > you know the exact object that you are operating on because you have a
> > fd for the VMA and we would have a more straightforward way to reject
> > operation if the underlying object has changed (e.g. unmapped and reused
> > for a different mapping).
> 
> I agree your point. If I didn't miss something, such kinds of vma level
> modify notification doesn't work even file mapped vma at this moment.
> For anonymous vma, I think we could use userfaultfd, pontentially.
> It would be great if someone want to do with disruptive hints like
> MADV_DONTNEED.
> 
> I'd like to see it further enhancement after landing address range based
> operation via limiting hints process_madvise supports to non-disruptive
> only(e.g., MADV_[COOL|COLD]) so that we could catch up the usercase/workload
> when someone want to extend the API.

So do you think we want both interfaces (process_madvise and madvisefd)?

Christian Brauner May 21, 2019, 9:01 a.m. UTC | #4

Cc: Jann and Oleg too

On Mon, May 20, 2019 at 12:52:52PM +0900, Minchan Kim wrote:
> There is some usecase that centralized userspace daemon want to give
> a memory hint like MADV_[COOL|COLD] to other process. Android's
> ActivityManagerService is one of them.
> 
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
> 
> To solve the issue, this patch introduces new syscall process_madvise(2)
> which works based on pidfd so it could give a hint to the exeternal
> process.
> 
> int process_madvise(int pidfd, void *addr, size_t length, int advise);
> 
> All advises madvise provides can be supported in process_madvise, too.
> Since it could affect other process's address range, only privileged
> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> gives it the right to ptrrace the process could use it successfully.
> 
> Please suggest better idea if you have other idea about the permission.
> 
> * from v1r1
>   * use ptrace capability - surenb, dancol
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
>  include/linux/proc_fs.h                |  1 +
>  include/linux/syscalls.h               |  2 ++
>  include/uapi/asm-generic/unistd.h      |  2 ++
>  kernel/signal.c                        |  2 +-
>  kernel/sys_ni.c                        |  1 +
>  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
>  8 files changed, 54 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> index 4cd5f982b1e5..5b9dd55d6b57 100644
> --- a/arch/x86/entry/syscalls/syscall_32.tbl
> +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> @@ -438,3 +438,4 @@
>  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
>  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
>  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 64ca0d06259a..0e5ee78161c9 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -355,6 +355,7 @@
>  425	common	io_uring_setup		__x64_sys_io_uring_setup
>  426	common	io_uring_enter		__x64_sys_io_uring_enter
>  427	common	io_uring_register	__x64_sys_io_uring_register
> +428	common	process_madvise		__x64_sys_process_madvise
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> index 52a283ba0465..f8545d7c5218 100644
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
>  
>  #endif /* CONFIG_PROC_FS */
>  
> +extern struct pid *pidfd_to_pid(const struct file *file);
>  struct net;
>  
>  static inline struct proc_dir_entry *proc_net_mkdir(
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e2870fe1be5b..21c6c9a62006 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
>  asmlinkage long sys_mincore(unsigned long start, size_t len,
>  				unsigned char __user * vec);
>  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> +				size_t len, int behavior);
>  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
>  			unsigned long prot, unsigned long pgoff,
>  			unsigned long flags);
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index dee7292e1df6..7ee82ce04620 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
>  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
>  #define __NR_io_uring_register 427
>  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> +#define __NR_process_madvise 428
> +__SYSCALL(__NR_process_madvise, sys_process_madvise)
>  
>  #undef __NR_syscalls
>  #define __NR_syscalls 428
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 1c86b78a7597..04e75daab1f8 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
>  	return copy_siginfo_from_user(kinfo, info);
>  }
>  
> -static struct pid *pidfd_to_pid(const struct file *file)
> +struct pid *pidfd_to_pid(const struct file *file)
>  {
>  	if (file->f_op == &pidfd_fops)
>  		return file->private_data;
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 4d9ae5ea6caf..5277421795ab 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
>  COND_SYSCALL(munlockall);
>  COND_SYSCALL(mincore);
>  COND_SYSCALL(madvise);
> +COND_SYSCALL(process_madvise);
>  COND_SYSCALL(remap_file_pages);
>  COND_SYSCALL(mbind);
>  COND_SYSCALL_COMPAT(mbind);
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 119e82e1f065..af02aa17e5c1 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -9,6 +9,7 @@
>  #include <linux/mman.h>
>  #include <linux/pagemap.h>
>  #include <linux/page_idle.h>
> +#include <linux/proc_fs.h>
>  #include <linux/syscalls.h>
>  #include <linux/mempolicy.h>
>  #include <linux/page-isolation.h>
> @@ -16,6 +17,7 @@
>  #include <linux/hugetlb.h>
>  #include <linux/falloc.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/ksm.h>
>  #include <linux/fs.h>
>  #include <linux/file.h>
> @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
>  	return madvise_core(current, start, len_in, behavior);
>  }
> +
> +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> +		size_t, len_in, int, behavior)
> +{
> +	int ret;
> +	struct fd f;
> +	struct pid *pid;
> +	struct task_struct *tsk;
> +	struct mm_struct *mm;
> +
> +	f = fdget(pidfd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	pid = pidfd_to_pid(f.file);

pidfd_to_pid() should not be directly exported since this allows
/proc/<pid> fds to be used too. That's something we won't be going
forward with. All new syscalls should only allow to operate on pidfds
created through CLONE_PIDFD or pidfd_open() (cf. [1]).

So e.g. please export a simple helper like

struct pid *pidfd_to_pid(const struct file *file)
{
        if (file->f_op == &pidfd_fops)
                return file->private_data;

        return NULL;
}

turning the old pidfd_to_pid() into something like:

static struct pid *__fd_to_pid(const struct file *file)
{
        struct pid *pid;

        pid = pidfd_to_pid(file);
        if (pid)
                return pid;

        return tgid_pidfd_to_pid(file);
}

All new syscalls should only be using anon inode pidfds since they can
actually have a clean security model built around them in the future.
Note, pidfd_open() will be sent out together with making pidfds pollable
for the 5.3 merge window.

[1]: https://lore.kernel.org/lkml/20190520155630.21684-1-christian@brauner.io/

Thanks!
Christian

Minchan Kim May 21, 2019, 10:32 a.m. UTC | #5

On Tue, May 21, 2019 at 08:17:43AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 11:41:07, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote:
> > > [Cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:52, Minchan Kim wrote:
> > > > There is some usecase that centralized userspace daemon want to give
> > > > a memory hint like MADV_[COOL|COLD] to other process. Android's
> > > > ActivityManagerService is one of them.
> > > > 
> > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > > required to make the reclaim decision is not known to the app. Instead,
> > > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > > and that daemon must be able to initiate reclaim on its own without
> > > > any app involvement.
> > > 
> > > Could you expand some more about how this all works? How does the
> > > centralized daemon track respective ranges? How does it synchronize
> > > against parallel modification of the address space etc.
> > 
> > Currently, we don't track each address ranges because we have two
> > policies at this moment:
> > 
> > 	deactive file pages and reclaim anonymous pages of the app.
> > 
> > Since the daemon has a ability to let background apps resume(IOW, process
> > will be run by the daemon) and both hints are non-disruptive stabilty point
> > of view, we are okay for the race.
> 
> Fair enough but the API should consider future usecases where this might
> be a problem. So we should really think about those potential scenarios
> now. If we are ok with that, fine, but then we should be explicit and
> document it that way. Essentially say that any sort of synchronization
> is supposed to be done by monitor. This will make the API less usable
> but maybe that is enough.

Okay, I will add more about that in the description.

>  
> > > > To solve the issue, this patch introduces new syscall process_madvise(2)
> > > > which works based on pidfd so it could give a hint to the exeternal
> > > > process.
> > > > 
> > > > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> > > 
> > > OK, this makes some sense from the API point of view. When we have
> > > discussed that at LSFMM I was contemplating about something like that
> > > except the fd would be a VMA fd rather than the process. We could extend
> > > and reuse /proc/<pid>/map_files interface which doesn't support the
> > > anonymous memory right now. 
> > > 
> > > I am not saying this would be a better interface but I wanted to mention
> > > it here for a further discussion. One slight advantage would be that
> > > you know the exact object that you are operating on because you have a
> > > fd for the VMA and we would have a more straightforward way to reject
> > > operation if the underlying object has changed (e.g. unmapped and reused
> > > for a different mapping).
> > 
> > I agree your point. If I didn't miss something, such kinds of vma level
> > modify notification doesn't work even file mapped vma at this moment.
> > For anonymous vma, I think we could use userfaultfd, pontentially.
> > It would be great if someone want to do with disruptive hints like
> > MADV_DONTNEED.
> > 
> > I'd like to see it further enhancement after landing address range based
> > operation via limiting hints process_madvise supports to non-disruptive
> > only(e.g., MADV_[COOL|COLD]) so that we could catch up the usercase/workload
> > when someone want to extend the API.
> 
> So do you think we want both interfaces (process_madvise and madvisefd)?

What I have in mind is to extend process_madvise later like this

struct pr_madvise_param {
    int size;                       /* the size of this structure */
    union {
    	const struct iovec __user *vec; /* address range array */
	int fd;				/* supported from 6.0 */
    }
}

with introducing new hint Or-able PR_MADV_RANGE_FD, so that process_madvise
can go with fd instead of address range.

Minchan Kim May 21, 2019, 11:35 a.m. UTC | #6

On Tue, May 21, 2019 at 11:01:01AM +0200, Christian Brauner wrote:
> Cc: Jann and Oleg too
> 
> On Mon, May 20, 2019 at 12:52:52PM +0900, Minchan Kim wrote:
> > There is some usecase that centralized userspace daemon want to give
> > a memory hint like MADV_[COOL|COLD] to other process. Android's
> > ActivityManagerService is one of them.
> > 
> > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > required to make the reclaim decision is not known to the app. Instead,
> > it is known to the centralized userspace daemon(ActivityManagerService),
> > and that daemon must be able to initiate reclaim on its own without
> > any app involvement.
> > 
> > To solve the issue, this patch introduces new syscall process_madvise(2)
> > which works based on pidfd so it could give a hint to the exeternal
> > process.
> > 
> > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> > 
> > All advises madvise provides can be supported in process_madvise, too.
> > Since it could affect other process's address range, only privileged
> > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > gives it the right to ptrrace the process could use it successfully.
> > 
> > Please suggest better idea if you have other idea about the permission.
> > 
> > * from v1r1
> >   * use ptrace capability - surenb, dancol
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> >  include/linux/proc_fs.h                |  1 +
> >  include/linux/syscalls.h               |  2 ++
> >  include/uapi/asm-generic/unistd.h      |  2 ++
> >  kernel/signal.c                        |  2 +-
> >  kernel/sys_ni.c                        |  1 +
> >  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
> >  8 files changed, 54 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > index 4cd5f982b1e5..5b9dd55d6b57 100644
> > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > @@ -438,3 +438,4 @@
> >  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
> >  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
> >  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> > +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > index 64ca0d06259a..0e5ee78161c9 100644
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -355,6 +355,7 @@
> >  425	common	io_uring_setup		__x64_sys_io_uring_setup
> >  426	common	io_uring_enter		__x64_sys_io_uring_enter
> >  427	common	io_uring_register	__x64_sys_io_uring_register
> > +428	common	process_madvise		__x64_sys_process_madvise
> >  
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> > index 52a283ba0465..f8545d7c5218 100644
> > --- a/include/linux/proc_fs.h
> > +++ b/include/linux/proc_fs.h
> > @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
> >  
> >  #endif /* CONFIG_PROC_FS */
> >  
> > +extern struct pid *pidfd_to_pid(const struct file *file);
> >  struct net;
> >  
> >  static inline struct proc_dir_entry *proc_net_mkdir(
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index e2870fe1be5b..21c6c9a62006 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
> >  asmlinkage long sys_mincore(unsigned long start, size_t len,
> >  				unsigned char __user * vec);
> >  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> > +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> > +				size_t len, int behavior);
> >  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> >  			unsigned long prot, unsigned long pgoff,
> >  			unsigned long flags);
> > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > index dee7292e1df6..7ee82ce04620 100644
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_process_madvise 428
> > +__SYSCALL(__NR_process_madvise, sys_process_madvise)
> >  
> >  #undef __NR_syscalls
> >  #define __NR_syscalls 428
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index 1c86b78a7597..04e75daab1f8 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> >  	return copy_siginfo_from_user(kinfo, info);
> >  }
> >  
> > -static struct pid *pidfd_to_pid(const struct file *file)
> > +struct pid *pidfd_to_pid(const struct file *file)
> >  {
> >  	if (file->f_op == &pidfd_fops)
> >  		return file->private_data;
> > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > index 4d9ae5ea6caf..5277421795ab 100644
> > --- a/kernel/sys_ni.c
> > +++ b/kernel/sys_ni.c
> > @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
> >  COND_SYSCALL(munlockall);
> >  COND_SYSCALL(mincore);
> >  COND_SYSCALL(madvise);
> > +COND_SYSCALL(process_madvise);
> >  COND_SYSCALL(remap_file_pages);
> >  COND_SYSCALL(mbind);
> >  COND_SYSCALL_COMPAT(mbind);
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 119e82e1f065..af02aa17e5c1 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -9,6 +9,7 @@
> >  #include <linux/mman.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/page_idle.h>
> > +#include <linux/proc_fs.h>
> >  #include <linux/syscalls.h>
> >  #include <linux/mempolicy.h>
> >  #include <linux/page-isolation.h>
> > @@ -16,6 +17,7 @@
> >  #include <linux/hugetlb.h>
> >  #include <linux/falloc.h>
> >  #include <linux/sched.h>
> > +#include <linux/sched/mm.h>
> >  #include <linux/ksm.h>
> >  #include <linux/fs.h>
> >  #include <linux/file.h>
> > @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >  {
> >  	return madvise_core(current, start, len_in, behavior);
> >  }
> > +
> > +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> > +		size_t, len_in, int, behavior)
> > +{
> > +	int ret;
> > +	struct fd f;
> > +	struct pid *pid;
> > +	struct task_struct *tsk;
> > +	struct mm_struct *mm;
> > +
> > +	f = fdget(pidfd);
> > +	if (!f.file)
> > +		return -EBADF;
> > +
> > +	pid = pidfd_to_pid(f.file);
> 
> pidfd_to_pid() should not be directly exported since this allows
> /proc/<pid> fds to be used too. That's something we won't be going
> forward with. All new syscalls should only allow to operate on pidfds
> created through CLONE_PIDFD or pidfd_open() (cf. [1]).

Thanks for the information.

> 
> So e.g. please export a simple helper like
> 
> struct pid *pidfd_to_pid(const struct file *file)
> {
>         if (file->f_op == &pidfd_fops)
>                 return file->private_data;
> 
>         return NULL;
> }
> 
> turning the old pidfd_to_pid() into something like:
> 
> static struct pid *__fd_to_pid(const struct file *file)
> {
>         struct pid *pid;
> 
>         pid = pidfd_to_pid(file);
>         if (pid)
>                 return pid;
> 
>         return tgid_pidfd_to_pid(file);
> }

So, I want to clarify what you suggest here.

1. modify pidfd_to_pid as what you described above(ie, return NULL
instead of returning tgid_pidfd_to_pid(file);
2. never export pidfd_to_pid
3. create wrapper __fd_to_pid which calls pidfd_to_pid internally
4. export __fd_to_pid and use it

Correct?

Thanks.

> 
> All new syscalls should only be using anon inode pidfds since they can
> actually have a clean security model built around them in the future.
> Note, pidfd_open() will be sent out together with making pidfds pollable
> for the 5.3 merge window.
> 
> [1]: https://lore.kernel.org/lkml/20190520155630.21684-1-christian@brauner.io/
> 
> Thanks!
> Christian

Christian Brauner May 21, 2019, 11:51 a.m. UTC | #7

On Tue, May 21, 2019 at 08:35:10PM +0900, Minchan Kim wrote:
> On Tue, May 21, 2019 at 11:01:01AM +0200, Christian Brauner wrote:
> > Cc: Jann and Oleg too
> > 
> > On Mon, May 20, 2019 at 12:52:52PM +0900, Minchan Kim wrote:
> > > There is some usecase that centralized userspace daemon want to give
> > > a memory hint like MADV_[COOL|COLD] to other process. Android's
> > > ActivityManagerService is one of them.
> > > 
> > > It's similar in spirit to madvise(MADV_WONTNEED), but the information
> > > required to make the reclaim decision is not known to the app. Instead,
> > > it is known to the centralized userspace daemon(ActivityManagerService),
> > > and that daemon must be able to initiate reclaim on its own without
> > > any app involvement.
> > > 
> > > To solve the issue, this patch introduces new syscall process_madvise(2)
> > > which works based on pidfd so it could give a hint to the exeternal
> > > process.
> > > 
> > > int process_madvise(int pidfd, void *addr, size_t length, int advise);
> > > 
> > > All advises madvise provides can be supported in process_madvise, too.
> > > Since it could affect other process's address range, only privileged
> > > process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> > > gives it the right to ptrrace the process could use it successfully.
> > > 
> > > Please suggest better idea if you have other idea about the permission.
> > > 
> > > * from v1r1
> > >   * use ptrace capability - surenb, dancol
> > > 
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > >  arch/x86/entry/syscalls/syscall_32.tbl |  1 +
> > >  arch/x86/entry/syscalls/syscall_64.tbl |  1 +
> > >  include/linux/proc_fs.h                |  1 +
> > >  include/linux/syscalls.h               |  2 ++
> > >  include/uapi/asm-generic/unistd.h      |  2 ++
> > >  kernel/signal.c                        |  2 +-
> > >  kernel/sys_ni.c                        |  1 +
> > >  mm/madvise.c                           | 45 ++++++++++++++++++++++++++
> > >  8 files changed, 54 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
> > > index 4cd5f982b1e5..5b9dd55d6b57 100644
> > > --- a/arch/x86/entry/syscalls/syscall_32.tbl
> > > +++ b/arch/x86/entry/syscalls/syscall_32.tbl
> > > @@ -438,3 +438,4 @@
> > >  425	i386	io_uring_setup		sys_io_uring_setup		__ia32_sys_io_uring_setup
> > >  426	i386	io_uring_enter		sys_io_uring_enter		__ia32_sys_io_uring_enter
> > >  427	i386	io_uring_register	sys_io_uring_register		__ia32_sys_io_uring_register
> > > +428	i386	process_madvise		sys_process_madvise		__ia32_sys_process_madvise
> > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> > > index 64ca0d06259a..0e5ee78161c9 100644
> > > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > > @@ -355,6 +355,7 @@
> > >  425	common	io_uring_setup		__x64_sys_io_uring_setup
> > >  426	common	io_uring_enter		__x64_sys_io_uring_enter
> > >  427	common	io_uring_register	__x64_sys_io_uring_register
> > > +428	common	process_madvise		__x64_sys_process_madvise
> > >  
> > >  #
> > >  # x32-specific system call numbers start at 512 to avoid cache impact
> > > diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
> > > index 52a283ba0465..f8545d7c5218 100644
> > > --- a/include/linux/proc_fs.h
> > > +++ b/include/linux/proc_fs.h
> > > @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file)
> > >  
> > >  #endif /* CONFIG_PROC_FS */
> > >  
> > > +extern struct pid *pidfd_to_pid(const struct file *file);
> > >  struct net;
> > >  
> > >  static inline struct proc_dir_entry *proc_net_mkdir(
> > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > > index e2870fe1be5b..21c6c9a62006 100644
> > > --- a/include/linux/syscalls.h
> > > +++ b/include/linux/syscalls.h
> > > @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void);
> > >  asmlinkage long sys_mincore(unsigned long start, size_t len,
> > >  				unsigned char __user * vec);
> > >  asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
> > > +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start,
> > > +				size_t len, int behavior);
> > >  asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
> > >  			unsigned long prot, unsigned long pgoff,
> > >  			unsigned long flags);
> > > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> > > index dee7292e1df6..7ee82ce04620 100644
> > > --- a/include/uapi/asm-generic/unistd.h
> > > +++ b/include/uapi/asm-generic/unistd.h
> > > @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> > >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> > >  #define __NR_io_uring_register 427
> > >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > > +#define __NR_process_madvise 428
> > > +__SYSCALL(__NR_process_madvise, sys_process_madvise)
> > >  
> > >  #undef __NR_syscalls
> > >  #define __NR_syscalls 428
> > > diff --git a/kernel/signal.c b/kernel/signal.c
> > > index 1c86b78a7597..04e75daab1f8 100644
> > > --- a/kernel/signal.c
> > > +++ b/kernel/signal.c
> > > @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
> > >  	return copy_siginfo_from_user(kinfo, info);
> > >  }
> > >  
> > > -static struct pid *pidfd_to_pid(const struct file *file)
> > > +struct pid *pidfd_to_pid(const struct file *file)
> > >  {
> > >  	if (file->f_op == &pidfd_fops)
> > >  		return file->private_data;
> > > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> > > index 4d9ae5ea6caf..5277421795ab 100644
> > > --- a/kernel/sys_ni.c
> > > +++ b/kernel/sys_ni.c
> > > @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
> > >  COND_SYSCALL(munlockall);
> > >  COND_SYSCALL(mincore);
> > >  COND_SYSCALL(madvise);
> > > +COND_SYSCALL(process_madvise);
> > >  COND_SYSCALL(remap_file_pages);
> > >  COND_SYSCALL(mbind);
> > >  COND_SYSCALL_COMPAT(mbind);
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 119e82e1f065..af02aa17e5c1 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -9,6 +9,7 @@
> > >  #include <linux/mman.h>
> > >  #include <linux/pagemap.h>
> > >  #include <linux/page_idle.h>
> > > +#include <linux/proc_fs.h>
> > >  #include <linux/syscalls.h>
> > >  #include <linux/mempolicy.h>
> > >  #include <linux/page-isolation.h>
> > > @@ -16,6 +17,7 @@
> > >  #include <linux/hugetlb.h>
> > >  #include <linux/falloc.h>
> > >  #include <linux/sched.h>
> > > +#include <linux/sched/mm.h>
> > >  #include <linux/ksm.h>
> > >  #include <linux/fs.h>
> > >  #include <linux/file.h>
> > > @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > >  {
> > >  	return madvise_core(current, start, len_in, behavior);
> > >  }
> > > +
> > > +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start,
> > > +		size_t, len_in, int, behavior)
> > > +{
> > > +	int ret;
> > > +	struct fd f;
> > > +	struct pid *pid;
> > > +	struct task_struct *tsk;
> > > +	struct mm_struct *mm;
> > > +
> > > +	f = fdget(pidfd);
> > > +	if (!f.file)
> > > +		return -EBADF;
> > > +
> > > +	pid = pidfd_to_pid(f.file);
> > 
> > pidfd_to_pid() should not be directly exported since this allows
> > /proc/<pid> fds to be used too. That's something we won't be going
> > forward with. All new syscalls should only allow to operate on pidfds
> > created through CLONE_PIDFD or pidfd_open() (cf. [1]).
> 
> Thanks for the information.
> 
> > 
> > So e.g. please export a simple helper like
> > 
> > struct pid *pidfd_to_pid(const struct file *file)
> > {
> >         if (file->f_op == &pidfd_fops)
> >                 return file->private_data;
> > 
> >         return NULL;
> > }
> > 
> > turning the old pidfd_to_pid() into something like:
> > 
> > static struct pid *__fd_to_pid(const struct file *file)
> > {
> >         struct pid *pid;
> > 
> >         pid = pidfd_to_pid(file);
> >         if (pid)
> >                 return pid;
> > 
> >         return tgid_pidfd_to_pid(file);
> > }
> 
> So, I want to clarify what you suggest here.
> 
> 1. modify pidfd_to_pid as what you described above(ie, return NULL
> instead of returning tgid_pidfd_to_pid(file);
> 2. never export pidfd_to_pid
> 3. create wrapper __fd_to_pid which calls pidfd_to_pid internally
> 4. export __fd_to_pid and use it

Sorry, seems I was not clear enough. :)
What I meant was something along the lines of (not compile tested):

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 3c8ef5a199ca..3ccc07010603 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -68,6 +68,10 @@ extern struct pid init_struct_pid;
 
 extern const struct file_operations pidfd_fops;
 
+struct file;
+
+extern struct pid *pidfd_pid(const struct file *file);
+
 static inline struct pid *get_pid(struct pid *pid)
 {
 	if (pid)
diff --git a/kernel/fork.c b/kernel/fork.c
index b4cba953040a..0fcfffaf5fdc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1711,6 +1711,14 @@ const struct file_operations pidfd_fops = {
 #endif
 };
 
+struct pid *pidfd_pid(const struct file *file)
+{
+	if (file->f_op == &pidfd_fops)
+		return file->private_data;
+
+	return ERR_PTR(-EBADF);
+}
+
 /**
  * pidfd_create() - Create a new pid file descriptor.
  *
diff --git a/kernel/signal.c b/kernel/signal.c
index a1eb44dc9ff5..7fca9d7e1633 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3611,8 +3611,11 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
 
 static struct pid *pidfd_to_pid(const struct file *file)
 {
-	if (file->f_op == &pidfd_fops)
-		return file->private_data;
+	struct pid *pid;
+
+	pid = pidfd_pid(file);
+	if (pid)
+		return pid;
 
 	return tgid_pidfd_to_pid(file);
 }

Oleg Nesterov May 21, 2019, 3:31 p.m. UTC | #8

On 05/20, Minchan Kim wrote:
>
> +	rcu_read_lock();
> +	tsk = pid_task(pid, PIDTYPE_PID);
> +	if (!tsk) {
> +		rcu_read_unlock();
> +		goto err;
> +	}
> +	get_task_struct(tsk);
> +	rcu_read_unlock();
> +	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
> +	if (!mm || IS_ERR(mm)) {
> +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> +		if (ret == -EACCES)
> +			ret = -EPERM;
> +		goto err;
> +	}
> +	ret = madvise_core(tsk, start, len_in, behavior);

IIUC, madvise_core(tsk) plays with tsk->mm->mmap_sem. But this tsk can exit and
nullify its ->mm right after mm_access() succeeds.

another problem is that pid_task(pid) can return a zombie leader, in this case
mm_access() will fail while it shouldn't.

Oleg.

Minchan Kim May 27, 2019, 7:43 a.m. UTC | #9

Sorry for the late response. I miseed your comment. :(

On Tue, May 21, 2019 at 05:31:13PM +0200, Oleg Nesterov wrote:
> On 05/20, Minchan Kim wrote:
> >
> > +	rcu_read_lock();
> > +	tsk = pid_task(pid, PIDTYPE_PID);
> > +	if (!tsk) {
> > +		rcu_read_unlock();
> > +		goto err;
> > +	}
> > +	get_task_struct(tsk);
> > +	rcu_read_unlock();
> > +	mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS);
> > +	if (!mm || IS_ERR(mm)) {
> > +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> > +		if (ret == -EACCES)
> > +			ret = -EPERM;
> > +		goto err;
> > +	}
> > +	ret = madvise_core(tsk, start, len_in, behavior);
> 
> IIUC, madvise_core(tsk) plays with tsk->mm->mmap_sem. But this tsk can exit and
> nullify its ->mm right after mm_access() succeeds.

You're absolutely right. I will fix it via passing mm_struct instead of
task_struct.

Thanks!

> 
> another problem is that pid_task(pid) can return a zombie leader, in this case
> mm_access() will fail while it shouldn't.

I'm sorry. I didn't notice that. However, I couldn't understand your point. 
Why do you think mm_access shouldn't fail even though pid_task returns
a zombie leader? I thought it's okay since the target process is exiting
so hinting operation would be meaniness for the process.

Oleg Nesterov May 27, 2019, 3:12 p.m. UTC | #10

On 05/27, Minchan Kim wrote:
>
> > another problem is that pid_task(pid) can return a zombie leader, in this case
> > mm_access() will fail while it shouldn't.
>
> I'm sorry. I didn't notice that. However, I couldn't understand your point.
> Why do you think mm_access shouldn't fail even though pid_task returns
> a zombie leader?

The leader can exit (call sys_exit(), not sys_exit_group()), this won't affect
other threads. In this case the process is still alive even if the leader thread
is zombie. That is why we have find_lock_task_mm().

Oleg.

Minchan Kim May 27, 2019, 11:33 p.m. UTC | #11

On Mon, May 27, 2019 at 05:12:02PM +0200, Oleg Nesterov wrote:
> On 05/27, Minchan Kim wrote:
> >
> > > another problem is that pid_task(pid) can return a zombie leader, in this case
> > > mm_access() will fail while it shouldn't.
> >
> > I'm sorry. I didn't notice that. However, I couldn't understand your point.
> > Why do you think mm_access shouldn't fail even though pid_task returns
> > a zombie leader?
> 
> The leader can exit (call sys_exit(), not sys_exit_group()), this won't affect
> other threads. In this case the process is still alive even if the leader thread
> is zombie. That is why we have find_lock_task_mm().

Thanks for clarification, Oleg. Then, Let me have a further question.

It means process_vm_readv, move_pages have same problem too because find_task_by_vpid
can return a zomebie leader and next line checks for mm_struct validation makes a
failure. My understand is correct? If so, we need to fix all places.

Michal Hocko May 28, 2019, 7:23 a.m. UTC | #12

On Tue 28-05-19 08:33:06, Minchan Kim wrote:
> On Mon, May 27, 2019 at 05:12:02PM +0200, Oleg Nesterov wrote:
> > On 05/27, Minchan Kim wrote:
> > >
> > > > another problem is that pid_task(pid) can return a zombie leader, in this case
> > > > mm_access() will fail while it shouldn't.
> > >
> > > I'm sorry. I didn't notice that. However, I couldn't understand your point.
> > > Why do you think mm_access shouldn't fail even though pid_task returns
> > > a zombie leader?
> > 
> > The leader can exit (call sys_exit(), not sys_exit_group()), this won't affect
> > other threads. In this case the process is still alive even if the leader thread
> > is zombie. That is why we have find_lock_task_mm().
> 
> Thanks for clarification, Oleg. Then, Let me have a further question.
> 
> It means process_vm_readv, move_pages have same problem too because find_task_by_vpid
> can return a zomebie leader and next line checks for mm_struct validation makes a
> failure. My understand is correct? If so, we need to fix all places.

Isn't that a problem of most callers of get_task_mm? Shouldn't we fix it
turning it into find_lock_task_mm?

Minchan Kim May 30, 2019, 12:38 a.m. UTC | #13

On Wed, May 29, 2019 at 11:41:23AM +0800, Hillf Danton wrote:
> 
> On Mon, 20 May 2019 12:52:52 +0900 Minchan Kim wrote:
> > --- a/arch/x86/entry/syscalls/syscall_64.tbl
> > +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> > @@ -355,6 +355,7 @@
> >  425	common	io_uring_setup		__x64_sys_io_uring_setup
> >  426	common	io_uring_enter		__x64_sys_io_uring_enter
> >  427	common	io_uring_register	__x64_sys_io_uring_register
> > +428	common	process_madvise		__x64_sys_process_madvise
> >  
> Much better if something similar is added for arm64.

I will port every architecture once we figure out RFC and reaches the
conclusion for right interface.

> 
> >  #
> >  # x32-specific system call numbers start at 512 to avoid cache impact
> > --- a/include/uapi/asm-generic/unistd.h
> > +++ b/include/uapi/asm-generic/unistd.h
> > @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup)
> >  __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter)
> >  #define __NR_io_uring_register 427
> >  __SYSCALL(__NR_io_uring_register, sys_io_uring_register)
> > +#define __NR_process_madvise 428
> > +__SYSCALL(__NR_process_madvise, sys_process_madvise)
> >  
> >  #undef __NR_syscalls
> >  #define __NR_syscalls 428
> 
> Seems __NR_syscalls needs to increment by one.

Thanks. I will fix it.

[RFC,5/7] mm: introduce external memory hinting API

Commit Message

Comments

Patch