Message ID | 20210623192822.3072029-1-surenb@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [1/1] mm: introduce process_reap system call | expand |
On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote: > > In modern systems it's not unusual to have a system component monitoring > memory conditions of the system and tasked with keeping system memory > pressure under control. One way to accomplish that is to kill > non-essential processes to free up memory for more important ones. > Examples of this are Facebook's OOM killer daemon called oomd and > Android's low memory killer daemon called lmkd. > For such system component it's important to be able to free memory > quickly and efficiently. Unfortunately the time process takes to free > up its memory after receiving a SIGKILL might vary based on the state > of the process (uninterruptible sleep), size and OPP level of the core > the process is running. A mechanism to free resources of the target > process in a more predictable way would improve system's ability to > control its memory pressure. > Introduce process_reap system call that reclaims memory of a dying process > from the context of the caller. This way the memory in freed in a more > controllable way with CPU affinity and priority of the caller. The workload > of freeing the memory will also be charged to the caller. > The operation is allowed only on a dying process. > > Previously I proposed a number of alternatives to accomplish this: > - https://lore.kernel.org/patchwork/patch/1060407 extending > pidfd_send_signal to allow memory reaping using oom_reaper thread; > - https://lore.kernel.org/patchwork/patch/1338196 extending > pidfd_send_signal to reap memory of the target process synchronously from > the context of the caller; > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED > support for process_madvise implementing synchronous memory reaping. > > The end of the last discussion culminated with suggestion to introduce a > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875) > The reasoning was that the new variant of process_madvise > a) does not work on an address range > b) is destructive > c) doesn't share much code at all with the rest of process_madvise > From the userspace point of view it was awkward and inconvenient to provide > memory range for this operation that operates on the entire address space. > Using special flags or address values to specify the entire address space > was too hacky. > > The API is as follows, > > int process_reap(int pidfd, unsigned int flags); > > DESCRIPTION > The process_reap() system call is used to free the memory of a > dying process. > > The pidfd selects the process referred to by the PID file > descriptor. > (See pidofd_open(2) for further information) > > The flags argument is reserved for future use; currently, this > argument must be specified as 0. > > RETURN VALUE > On success, process_reap() returns 0. On error, -1 is returned > and errno is set to indicate the error. > I noticed that the patch does not apply to linux-next because of the new memfd_secret syscall introduced on x86 architecture only. It still applies to Linus' ToT. If needed I can change it to apply on top of linux-next. > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > --- > arch/alpha/kernel/syscalls/syscall.tbl | 1 + > arch/arm/tools/syscall.tbl | 1 + > arch/arm64/include/asm/unistd.h | 2 +- > arch/arm64/include/asm/unistd32.h | 2 + > arch/ia64/kernel/syscalls/syscall.tbl | 1 + > arch/m68k/kernel/syscalls/syscall.tbl | 1 + > arch/microblaze/kernel/syscalls/syscall.tbl | 1 + > arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + > arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + > arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + > arch/parisc/kernel/syscalls/syscall.tbl | 1 + > arch/powerpc/kernel/syscalls/syscall.tbl | 1 + > arch/s390/kernel/syscalls/syscall.tbl | 1 + > arch/sh/kernel/syscalls/syscall.tbl | 1 + > arch/sparc/kernel/syscalls/syscall.tbl | 1 + > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > arch/xtensa/kernel/syscalls/syscall.tbl | 1 + > include/linux/syscalls.h | 1 + > include/uapi/asm-generic/unistd.h | 4 +- > kernel/sys_ni.c | 1 + > mm/oom_kill.c | 50 +++++++++++++++++++++ > 22 files changed, 74 insertions(+), 2 deletions(-) > > diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl > index 3000a2e8ee21..14b9e81d2fc4 100644 > --- a/arch/alpha/kernel/syscalls/syscall.tbl > +++ b/arch/alpha/kernel/syscalls/syscall.tbl > @@ -486,3 +486,4 @@ > 554 common landlock_create_ruleset sys_landlock_create_ruleset > 555 common landlock_add_rule sys_landlock_add_rule > 556 common landlock_restrict_self sys_landlock_restrict_self > +557 common process_reap sys_process_reap > diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl > index 28e03b5fec00..889b78d0f63f 100644 > --- a/arch/arm/tools/syscall.tbl > +++ b/arch/arm/tools/syscall.tbl > @@ -460,3 +460,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h > index 727bfc3be99b..fb7a0be2f3d9 100644 > --- a/arch/arm64/include/asm/unistd.h > +++ b/arch/arm64/include/asm/unistd.h > @@ -38,7 +38,7 @@ > #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) > #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) > > -#define __NR_compat_syscalls 447 > +#define __NR_compat_syscalls 448 > #endif > > #define __ARCH_WANT_SYS_CLONE > diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h > index 5dab69d2c22b..80593454173e 100644 > --- a/arch/arm64/include/asm/unistd32.h > +++ b/arch/arm64/include/asm/unistd32.h > @@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset) > __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule) > #define __NR_landlock_restrict_self 446 > __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self) > +#define __NR_process_reap 447 > +__SYSCALL(__NR_process_reap, sys_process_reap) > > /* > * Please add new compat syscalls above this comment and update > diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl > index bb11fe4c875a..6c94feedf086 100644 > --- a/arch/ia64/kernel/syscalls/syscall.tbl > +++ b/arch/ia64/kernel/syscalls/syscall.tbl > @@ -367,3 +367,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl > index 79c2d24c89dd..e80a7fa55696 100644 > --- a/arch/m68k/kernel/syscalls/syscall.tbl > +++ b/arch/m68k/kernel/syscalls/syscall.tbl > @@ -446,3 +446,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl > index b11395a20c20..511b2bd61fc1 100644 > --- a/arch/microblaze/kernel/syscalls/syscall.tbl > +++ b/arch/microblaze/kernel/syscalls/syscall.tbl > @@ -452,3 +452,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl > index 9220909526f9..1775704c6a24 100644 > --- a/arch/mips/kernel/syscalls/syscall_n32.tbl > +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl > @@ -385,3 +385,4 @@ > 444 n32 landlock_create_ruleset sys_landlock_create_ruleset > 445 n32 landlock_add_rule sys_landlock_add_rule > 446 n32 landlock_restrict_self sys_landlock_restrict_self > +447 n32 process_reap sys_process_reap > diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl > index 9cd1c34f31b5..d769daca3f79 100644 > --- a/arch/mips/kernel/syscalls/syscall_n64.tbl > +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl > @@ -361,3 +361,4 @@ > 444 n64 landlock_create_ruleset sys_landlock_create_ruleset > 445 n64 landlock_add_rule sys_landlock_add_rule > 446 n64 landlock_restrict_self sys_landlock_restrict_self > +447 n64 process_reap sys_process_reap > diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl > index d560c467a8c6..1bd2fc056677 100644 > --- a/arch/mips/kernel/syscalls/syscall_o32.tbl > +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl > @@ -434,3 +434,4 @@ > 444 o32 landlock_create_ruleset sys_landlock_create_ruleset > 445 o32 landlock_add_rule sys_landlock_add_rule > 446 o32 landlock_restrict_self sys_landlock_restrict_self > +447 o32 process_reap sys_process_reap > diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl > index aabc37f8cae3..0012561ca557 100644 > --- a/arch/parisc/kernel/syscalls/syscall.tbl > +++ b/arch/parisc/kernel/syscalls/syscall.tbl > @@ -444,3 +444,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl > index 8f052ff4058c..89cbcc732b18 100644 > --- a/arch/powerpc/kernel/syscalls/syscall.tbl > +++ b/arch/powerpc/kernel/syscalls/syscall.tbl > @@ -526,3 +526,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl > index 0690263df1dd..7ebd4d809b5e 100644 > --- a/arch/s390/kernel/syscalls/syscall.tbl > +++ b/arch/s390/kernel/syscalls/syscall.tbl > @@ -449,3 +449,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap sys_process_reap > diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl > index 0b91499ebdcf..178fd47b372e 100644 > --- a/arch/sh/kernel/syscalls/syscall.tbl > +++ b/arch/sh/kernel/syscalls/syscall.tbl > @@ -449,3 +449,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl > index e34cc30ef22c..faee121b7ae2 100644 > --- a/arch/sparc/kernel/syscalls/syscall.tbl > +++ b/arch/sparc/kernel/syscalls/syscall.tbl > @@ -492,3 +492,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl > index 4bbc267fb36b..cbe070de9884 100644 > --- a/arch/x86/entry/syscalls/syscall_32.tbl > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > @@ -451,3 +451,4 @@ > 444 i386 landlock_create_ruleset sys_landlock_create_ruleset > 445 i386 landlock_add_rule sys_landlock_add_rule > 446 i386 landlock_restrict_self sys_landlock_restrict_self > +447 i386 process_reap sys_process_reap > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index ce18119ea0d0..e6765646731b 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -368,6 +368,7 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > > # > # Due to a historical design error, certain syscalls are numbered differently > diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl > index fd2f30227d96..f0e9dbee1a5b 100644 > --- a/arch/xtensa/kernel/syscalls/syscall.tbl > +++ b/arch/xtensa/kernel/syscalls/syscall.tbl > @@ -417,3 +417,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index 050511e8f1f8..b6659e09bf0d 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len, > asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); > asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec, > size_t vlen, int behavior, unsigned int flags); > +asmlinkage long sys_process_reap(int pidfd, unsigned int flags); > asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, > unsigned long prot, unsigned long pgoff, > unsigned long flags); > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > index d2a942086fcb..b3bf57b928af 100644 > --- a/include/uapi/asm-generic/unistd.h > +++ b/include/uapi/asm-generic/unistd.h > @@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset) > __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule) > #define __NR_landlock_restrict_self 446 > __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self) > +#define __NR_process_reap 447 > +__SYSCALL(__NR_process_reap, sys_process_reap) > > #undef __NR_syscalls > -#define __NR_syscalls 447 > +#define __NR_syscalls 448 > > /* > * 32 bit systems traditionally used different > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index 0ea8128468c3..56eb7c9f8356 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -289,6 +289,7 @@ COND_SYSCALL(munlockall); > COND_SYSCALL(mincore); > COND_SYSCALL(madvise); > COND_SYSCALL(process_madvise); > +COND_SYSCALL(process_reap); > COND_SYSCALL(remap_file_pages); > COND_SYSCALL(mbind); > COND_SYSCALL_COMPAT(mbind); > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index eefd3f5fde46..0f85a0442fa5 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -28,6 +28,7 @@ > #include <linux/sched/task.h> > #include <linux/sched/debug.h> > #include <linux/swap.h> > +#include <linux/syscalls.h> > #include <linux/timex.h> > #include <linux/jiffies.h> > #include <linux/cpuset.h> > @@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void) > out_of_memory(&oc); > mutex_unlock(&oom_lock); > } > + > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags) > +{ > + struct pid *pid; > + struct task_struct *task; > + struct mm_struct *mm = NULL; > + unsigned int f_flags; > + long ret = 0; > + > + if (flags != 0) > + return -EINVAL; > + > + pid = pidfd_get_pid(pidfd, &f_flags); > + if (IS_ERR(pid)) > + return PTR_ERR(pid); > + > + task = get_pid_task(pid, PIDTYPE_PID); > + if (!task) { > + ret = -ESRCH; > + goto put_pid; > + } > + > + /* > + * If the task is dying and in the process of releasing its memory > + * then get its mm. > + */ > + task_lock(task); > + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) { > + mm = task->mm; > + mmget(mm); > + } > + task_unlock(task); > + if (!mm) { > + ret = -EINVAL; > + goto put_task; > + } > + > + mmap_read_lock(mm); > + if (!__oom_reap_task_mm(mm)) > + ret = -EAGAIN; > + mmap_read_unlock(mm); > + > + mmput(mm); > +put_task: > + put_task_struct(task); > +put_pid: > + put_pid(pid); > + return ret; > +} > -- > 2.32.0.93.g670b81a890-goog >
On Wed, Jun 23, 2021 at 12:28:22PM -0700, Suren Baghdasaryan wrote: > In modern systems it's not unusual to have a system component monitoring > memory conditions of the system and tasked with keeping system memory > pressure under control. One way to accomplish that is to kill > non-essential processes to free up memory for more important ones. > Examples of this are Facebook's OOM killer daemon called oomd and > Android's low memory killer daemon called lmkd. > For such system component it's important to be able to free memory > quickly and efficiently. Unfortunately the time process takes to free > up its memory after receiving a SIGKILL might vary based on the state > of the process (uninterruptible sleep), size and OPP level of the core > the process is running. A mechanism to free resources of the target > process in a more predictable way would improve system's ability to > control its memory pressure. > Introduce process_reap system call that reclaims memory of a dying process > from the context of the caller. This way the memory in freed in a more > controllable way with CPU affinity and priority of the caller. The workload > of freeing the memory will also be charged to the caller. > The operation is allowed only on a dying process. > > Previously I proposed a number of alternatives to accomplish this: > - https://lore.kernel.org/patchwork/patch/1060407 extending > pidfd_send_signal to allow memory reaping using oom_reaper thread; > - https://lore.kernel.org/patchwork/patch/1338196 extending > pidfd_send_signal to reap memory of the target process synchronously from > the context of the caller; > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED > support for process_madvise implementing synchronous memory reaping. > > The end of the last discussion culminated with suggestion to introduce a > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875) > The reasoning was that the new variant of process_madvise > a) does not work on an address range > b) is destructive > c) doesn't share much code at all with the rest of process_madvise > From the userspace point of view it was awkward and inconvenient to provide > memory range for this operation that operates on the entire address space. > Using special flags or address values to specify the entire address space > was too hacky. > > The API is as follows, > > int process_reap(int pidfd, unsigned int flags); > > DESCRIPTION > The process_reap() system call is used to free the memory of a > dying process. > > The pidfd selects the process referred to by the PID file > descriptor. > (See pidofd_open(2) for further information) > > The flags argument is reserved for future use; currently, this > argument must be specified as 0. > > RETURN VALUE > On success, process_reap() returns 0. On error, -1 is returned > and errno is set to indicate the error. > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > --- > arch/alpha/kernel/syscalls/syscall.tbl | 1 + > arch/arm/tools/syscall.tbl | 1 + > arch/arm64/include/asm/unistd.h | 2 +- > arch/arm64/include/asm/unistd32.h | 2 + > arch/ia64/kernel/syscalls/syscall.tbl | 1 + > arch/m68k/kernel/syscalls/syscall.tbl | 1 + > arch/microblaze/kernel/syscalls/syscall.tbl | 1 + > arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + > arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + > arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + > arch/parisc/kernel/syscalls/syscall.tbl | 1 + > arch/powerpc/kernel/syscalls/syscall.tbl | 1 + > arch/s390/kernel/syscalls/syscall.tbl | 1 + > arch/sh/kernel/syscalls/syscall.tbl | 1 + > arch/sparc/kernel/syscalls/syscall.tbl | 1 + > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > arch/xtensa/kernel/syscalls/syscall.tbl | 1 + > include/linux/syscalls.h | 1 + > include/uapi/asm-generic/unistd.h | 4 +- > kernel/sys_ni.c | 1 + > mm/oom_kill.c | 50 +++++++++++++++++++++ > 22 files changed, 74 insertions(+), 2 deletions(-) > > diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl > index 3000a2e8ee21..14b9e81d2fc4 100644 > --- a/arch/alpha/kernel/syscalls/syscall.tbl > +++ b/arch/alpha/kernel/syscalls/syscall.tbl > @@ -486,3 +486,4 @@ > 554 common landlock_create_ruleset sys_landlock_create_ruleset > 555 common landlock_add_rule sys_landlock_add_rule > 556 common landlock_restrict_self sys_landlock_restrict_self > +557 common process_reap sys_process_reap > diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl > index 28e03b5fec00..889b78d0f63f 100644 > --- a/arch/arm/tools/syscall.tbl > +++ b/arch/arm/tools/syscall.tbl > @@ -460,3 +460,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h > index 727bfc3be99b..fb7a0be2f3d9 100644 > --- a/arch/arm64/include/asm/unistd.h > +++ b/arch/arm64/include/asm/unistd.h > @@ -38,7 +38,7 @@ > #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) > #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) > > -#define __NR_compat_syscalls 447 > +#define __NR_compat_syscalls 448 > #endif > > #define __ARCH_WANT_SYS_CLONE > diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h > index 5dab69d2c22b..80593454173e 100644 > --- a/arch/arm64/include/asm/unistd32.h > +++ b/arch/arm64/include/asm/unistd32.h > @@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset) > __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule) > #define __NR_landlock_restrict_self 446 > __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self) > +#define __NR_process_reap 447 > +__SYSCALL(__NR_process_reap, sys_process_reap) > > /* > * Please add new compat syscalls above this comment and update > diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl > index bb11fe4c875a..6c94feedf086 100644 > --- a/arch/ia64/kernel/syscalls/syscall.tbl > +++ b/arch/ia64/kernel/syscalls/syscall.tbl > @@ -367,3 +367,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl > index 79c2d24c89dd..e80a7fa55696 100644 > --- a/arch/m68k/kernel/syscalls/syscall.tbl > +++ b/arch/m68k/kernel/syscalls/syscall.tbl > @@ -446,3 +446,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl > index b11395a20c20..511b2bd61fc1 100644 > --- a/arch/microblaze/kernel/syscalls/syscall.tbl > +++ b/arch/microblaze/kernel/syscalls/syscall.tbl > @@ -452,3 +452,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl > index 9220909526f9..1775704c6a24 100644 > --- a/arch/mips/kernel/syscalls/syscall_n32.tbl > +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl > @@ -385,3 +385,4 @@ > 444 n32 landlock_create_ruleset sys_landlock_create_ruleset > 445 n32 landlock_add_rule sys_landlock_add_rule > 446 n32 landlock_restrict_self sys_landlock_restrict_self > +447 n32 process_reap sys_process_reap > diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl > index 9cd1c34f31b5..d769daca3f79 100644 > --- a/arch/mips/kernel/syscalls/syscall_n64.tbl > +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl > @@ -361,3 +361,4 @@ > 444 n64 landlock_create_ruleset sys_landlock_create_ruleset > 445 n64 landlock_add_rule sys_landlock_add_rule > 446 n64 landlock_restrict_self sys_landlock_restrict_self > +447 n64 process_reap sys_process_reap > diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl > index d560c467a8c6..1bd2fc056677 100644 > --- a/arch/mips/kernel/syscalls/syscall_o32.tbl > +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl > @@ -434,3 +434,4 @@ > 444 o32 landlock_create_ruleset sys_landlock_create_ruleset > 445 o32 landlock_add_rule sys_landlock_add_rule > 446 o32 landlock_restrict_self sys_landlock_restrict_self > +447 o32 process_reap sys_process_reap > diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl > index aabc37f8cae3..0012561ca557 100644 > --- a/arch/parisc/kernel/syscalls/syscall.tbl > +++ b/arch/parisc/kernel/syscalls/syscall.tbl > @@ -444,3 +444,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl > index 8f052ff4058c..89cbcc732b18 100644 > --- a/arch/powerpc/kernel/syscalls/syscall.tbl > +++ b/arch/powerpc/kernel/syscalls/syscall.tbl > @@ -526,3 +526,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl > index 0690263df1dd..7ebd4d809b5e 100644 > --- a/arch/s390/kernel/syscalls/syscall.tbl > +++ b/arch/s390/kernel/syscalls/syscall.tbl > @@ -449,3 +449,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap sys_process_reap > diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl > index 0b91499ebdcf..178fd47b372e 100644 > --- a/arch/sh/kernel/syscalls/syscall.tbl > +++ b/arch/sh/kernel/syscalls/syscall.tbl > @@ -449,3 +449,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl > index e34cc30ef22c..faee121b7ae2 100644 > --- a/arch/sparc/kernel/syscalls/syscall.tbl > +++ b/arch/sparc/kernel/syscalls/syscall.tbl > @@ -492,3 +492,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl > index 4bbc267fb36b..cbe070de9884 100644 > --- a/arch/x86/entry/syscalls/syscall_32.tbl > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > @@ -451,3 +451,4 @@ > 444 i386 landlock_create_ruleset sys_landlock_create_ruleset > 445 i386 landlock_add_rule sys_landlock_add_rule > 446 i386 landlock_restrict_self sys_landlock_restrict_self > +447 i386 process_reap sys_process_reap > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index ce18119ea0d0..e6765646731b 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -368,6 +368,7 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > > # > # Due to a historical design error, certain syscalls are numbered differently > diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl > index fd2f30227d96..f0e9dbee1a5b 100644 > --- a/arch/xtensa/kernel/syscalls/syscall.tbl > +++ b/arch/xtensa/kernel/syscalls/syscall.tbl > @@ -417,3 +417,4 @@ > 444 common landlock_create_ruleset sys_landlock_create_ruleset > 445 common landlock_add_rule sys_landlock_add_rule > 446 common landlock_restrict_self sys_landlock_restrict_self > +447 common process_reap sys_process_reap > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index 050511e8f1f8..b6659e09bf0d 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len, > asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); > asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec, > size_t vlen, int behavior, unsigned int flags); > +asmlinkage long sys_process_reap(int pidfd, unsigned int flags); > asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, > unsigned long prot, unsigned long pgoff, > unsigned long flags); > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > index d2a942086fcb..b3bf57b928af 100644 > --- a/include/uapi/asm-generic/unistd.h > +++ b/include/uapi/asm-generic/unistd.h > @@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset) > __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule) > #define __NR_landlock_restrict_self 446 > __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self) > +#define __NR_process_reap 447 > +__SYSCALL(__NR_process_reap, sys_process_reap) > > #undef __NR_syscalls > -#define __NR_syscalls 447 > +#define __NR_syscalls 448 > > /* > * 32 bit systems traditionally used different > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index 0ea8128468c3..56eb7c9f8356 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -289,6 +289,7 @@ COND_SYSCALL(munlockall); > COND_SYSCALL(mincore); > COND_SYSCALL(madvise); > COND_SYSCALL(process_madvise); > +COND_SYSCALL(process_reap); > COND_SYSCALL(remap_file_pages); > COND_SYSCALL(mbind); > COND_SYSCALL_COMPAT(mbind); > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index eefd3f5fde46..0f85a0442fa5 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -28,6 +28,7 @@ > #include <linux/sched/task.h> > #include <linux/sched/debug.h> > #include <linux/swap.h> > +#include <linux/syscalls.h> > #include <linux/timex.h> > #include <linux/jiffies.h> > #include <linux/cpuset.h> > @@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void) > out_of_memory(&oc); > mutex_unlock(&oom_lock); > } > + > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags) Hey Suren, Wouldn't - process_memory_reap() - process_reap_memory() - process_mreap() be better names? > +{ > + struct pid *pid; > + struct task_struct *task; > + struct mm_struct *mm = NULL; > + unsigned int f_flags; > + long ret = 0; > + > + if (flags != 0) > + return -EINVAL; > + > + pid = pidfd_get_pid(pidfd, &f_flags); > + if (IS_ERR(pid)) > + return PTR_ERR(pid); > + > + task = get_pid_task(pid, PIDTYPE_PID); > + if (!task) { > + ret = -ESRCH; > + goto put_pid; > + } You have a similar pattern in process_madvise(): pid = pidfd_get_pid(pidfd, &f_flags); if (IS_ERR(pid)) { ret = PTR_ERR(pid); goto free_iov; } task = get_pid_task(pid, PIDTYPE_PID); if (!task) { ret = -ESRCH; goto put_pid; } I'd suggest you add a tiny helper to kernel/pid.c and call it in both places.
On Tue, Jun 29, 2021 at 6:14 AM Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Wed, Jun 23, 2021 at 12:28:22PM -0700, Suren Baghdasaryan wrote: > > In modern systems it's not unusual to have a system component monitoring > > memory conditions of the system and tasked with keeping system memory > > pressure under control. One way to accomplish that is to kill > > non-essential processes to free up memory for more important ones. > > Examples of this are Facebook's OOM killer daemon called oomd and > > Android's low memory killer daemon called lmkd. > > For such system component it's important to be able to free memory > > quickly and efficiently. Unfortunately the time process takes to free > > up its memory after receiving a SIGKILL might vary based on the state > > of the process (uninterruptible sleep), size and OPP level of the core > > the process is running. A mechanism to free resources of the target > > process in a more predictable way would improve system's ability to > > control its memory pressure. > > Introduce process_reap system call that reclaims memory of a dying process > > from the context of the caller. This way the memory in freed in a more > > controllable way with CPU affinity and priority of the caller. The workload > > of freeing the memory will also be charged to the caller. > > The operation is allowed only on a dying process. > > > > Previously I proposed a number of alternatives to accomplish this: > > - https://lore.kernel.org/patchwork/patch/1060407 extending > > pidfd_send_signal to allow memory reaping using oom_reaper thread; > > - https://lore.kernel.org/patchwork/patch/1338196 extending > > pidfd_send_signal to reap memory of the target process synchronously from > > the context of the caller; > > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED > > support for process_madvise implementing synchronous memory reaping. > > > > The end of the last discussion culminated with suggestion to introduce a > > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875) > > The reasoning was that the new variant of process_madvise > > a) does not work on an address range > > b) is destructive > > c) doesn't share much code at all with the rest of process_madvise > > From the userspace point of view it was awkward and inconvenient to provide > > memory range for this operation that operates on the entire address space. > > Using special flags or address values to specify the entire address space > > was too hacky. > > > > The API is as follows, > > > > int process_reap(int pidfd, unsigned int flags); > > > > DESCRIPTION > > The process_reap() system call is used to free the memory of a > > dying process. > > > > The pidfd selects the process referred to by the PID file > > descriptor. > > (See pidofd_open(2) for further information) > > > > The flags argument is reserved for future use; currently, this > > argument must be specified as 0. > > > > RETURN VALUE > > On success, process_reap() returns 0. On error, -1 is returned > > and errno is set to indicate the error. > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > --- > > arch/alpha/kernel/syscalls/syscall.tbl | 1 + > > arch/arm/tools/syscall.tbl | 1 + > > arch/arm64/include/asm/unistd.h | 2 +- > > arch/arm64/include/asm/unistd32.h | 2 + > > arch/ia64/kernel/syscalls/syscall.tbl | 1 + > > arch/m68k/kernel/syscalls/syscall.tbl | 1 + > > arch/microblaze/kernel/syscalls/syscall.tbl | 1 + > > arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + > > arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + > > arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + > > arch/parisc/kernel/syscalls/syscall.tbl | 1 + > > arch/powerpc/kernel/syscalls/syscall.tbl | 1 + > > arch/s390/kernel/syscalls/syscall.tbl | 1 + > > arch/sh/kernel/syscalls/syscall.tbl | 1 + > > arch/sparc/kernel/syscalls/syscall.tbl | 1 + > > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > > arch/xtensa/kernel/syscalls/syscall.tbl | 1 + > > include/linux/syscalls.h | 1 + > > include/uapi/asm-generic/unistd.h | 4 +- > > kernel/sys_ni.c | 1 + > > mm/oom_kill.c | 50 +++++++++++++++++++++ > > 22 files changed, 74 insertions(+), 2 deletions(-) > > > > diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl > > index 3000a2e8ee21..14b9e81d2fc4 100644 > > --- a/arch/alpha/kernel/syscalls/syscall.tbl > > +++ b/arch/alpha/kernel/syscalls/syscall.tbl > > @@ -486,3 +486,4 @@ > > 554 common landlock_create_ruleset sys_landlock_create_ruleset > > 555 common landlock_add_rule sys_landlock_add_rule > > 556 common landlock_restrict_self sys_landlock_restrict_self > > +557 common process_reap sys_process_reap > > diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl > > index 28e03b5fec00..889b78d0f63f 100644 > > --- a/arch/arm/tools/syscall.tbl > > +++ b/arch/arm/tools/syscall.tbl > > @@ -460,3 +460,4 @@ > > 444 common landlock_create_ruleset sys_landlock_create_ruleset > > 445 common landlock_add_rule sys_landlock_add_rule > > 446 common landlock_restrict_self sys_landlock_restrict_self > > +447 common process_reap sys_process_reap > > diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h > > index 727bfc3be99b..fb7a0be2f3d9 100644 > > --- a/arch/arm64/include/asm/unistd.h > > +++ b/arch/arm64/include/asm/unistd.h > > @@ -38,7 +38,7 @@ > > #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) > > #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) > > > > -#define __NR_compat_syscalls 447 > > +#define __NR_compat_syscalls 448 > > #endif > > > > #define __ARCH_WANT_SYS_CLONE > > diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h > > index 5dab69d2c22b..80593454173e 100644 > > --- a/arch/arm64/include/asm/unistd32.h > > +++ b/arch/arm64/include/asm/unistd32.h > > @@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset) > > __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule) > > #define __NR_landlock_restrict_self 446 > > __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self) > > +#define __NR_process_reap 447 > > +__SYSCALL(__NR_process_reap, sys_process_reap) > > > > /* > > * Please add new compat syscalls above this comment and update > > diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl > > index bb11fe4c875a..6c94feedf086 100644 > > --- a/arch/ia64/kernel/syscalls/syscall.tbl > > +++ b/arch/ia64/kernel/syscalls/syscall.tbl > > @@ -367,3 +367,4 @@ > > 444 common landlock_create_ruleset sys_landlock_create_ruleset > > 445 common landlock_add_rule sys_landlock_add_rule > > 446 common landlock_restrict_self sys_landlock_restrict_self > > +447 common process_reap sys_process_reap > > diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl > > index 79c2d24c89dd..e80a7fa55696 100644 > > --- a/arch/m68k/kernel/syscalls/syscall.tbl > > +++ b/arch/m68k/kernel/syscalls/syscall.tbl > > @@ -446,3 +446,4 @@ > > 444 common landlock_create_ruleset sys_landlock_create_ruleset > > 445 common landlock_add_rule sys_landlock_add_rule > > 446 common landlock_restrict_self sys_landlock_restrict_self > > +447 common process_reap sys_process_reap > > diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl > > index b11395a20c20..511b2bd61fc1 100644 > > --- a/arch/microblaze/kernel/syscalls/syscall.tbl > > +++ b/arch/microblaze/kernel/syscalls/syscall.tbl > > @@ -452,3 +452,4 @@ > > 444 common landlock_create_ruleset sys_landlock_create_ruleset > > 445 common landlock_add_rule sys_landlock_add_rule > > 446 common landlock_restrict_self sys_landlock_restrict_self > > +447 common process_reap sys_process_reap > > diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl > > index 9220909526f9..1775704c6a24 100644 > > --- a/arch/mips/kernel/syscalls/syscall_n32.tbl > > +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl > > @@ -385,3 +385,4 @@ > > 444 n32 landlock_create_ruleset sys_landlock_create_ruleset > > 445 n32 landlock_add_rule sys_landlock_add_rule > > 446 n32 landlock_restrict_self sys_landlock_restrict_self > > +447 n32 process_reap sys_process_reap > > diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl > > index 9cd1c34f31b5..d769daca3f79 100644 > > --- a/arch/mips/kernel/syscalls/syscall_n64.tbl > > +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl > > @@ -361,3 +361,4 @@ > > 444 n64 landlock_create_ruleset sys_landlock_create_ruleset > > 445 n64 landlock_add_rule sys_landlock_add_rule > > 446 n64 landlock_restrict_self sys_landlock_restrict_self > > +447 n64 process_reap sys_process_reap > > diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl > > index d560c467a8c6..1bd2fc056677 100644 > > --- a/arch/mips/kernel/syscalls/syscall_o32.tbl > > +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl > > @@ -434,3 +434,4 @@ > > 444 o32 landlock_create_ruleset sys_landlock_create_ruleset > > 445 o32 landlock_add_rule sys_landlock_add_rule > > 446 o32 landlock_restrict_self sys_landlock_restrict_self > > +447 o32 process_reap sys_process_reap > > diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl > > index aabc37f8cae3..0012561ca557 100644 > > --- a/arch/parisc/kernel/syscalls/syscall.tbl > > +++ b/arch/parisc/kernel/syscalls/syscall.tbl > > @@ -444,3 +444,4 @@ > > 444 common landlock_create_ruleset sys_landlock_create_ruleset > > 445 common landlock_add_rule sys_landlock_add_rule > > 446 common landlock_restrict_self sys_landlock_restrict_self > > +447 common process_reap sys_process_reap > > diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl > > index 8f052ff4058c..89cbcc732b18 100644 > > --- a/arch/powerpc/kernel/syscalls/syscall.tbl > > +++ b/arch/powerpc/kernel/syscalls/syscall.tbl > > @@ -526,3 +526,4 @@ > > 444 common landlock_create_ruleset sys_landlock_create_ruleset > > 445 common landlock_add_rule sys_landlock_add_rule > > 446 common landlock_restrict_self sys_landlock_restrict_self > > +447 common process_reap sys_process_reap > > diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl > > index 0690263df1dd..7ebd4d809b5e 100644 > > --- a/arch/s390/kernel/syscalls/syscall.tbl > > +++ b/arch/s390/kernel/syscalls/syscall.tbl > > @@ -449,3 +449,4 @@ > > 444 common landlock_create_ruleset sys_landlock_create_ruleset sys_landlock_create_ruleset > > 445 common landlock_add_rule sys_landlock_add_rule sys_landlock_add_rule > > 446 common landlock_restrict_self sys_landlock_restrict_self sys_landlock_restrict_self > > +447 common process_reap sys_process_reap sys_process_reap > > diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl > > index 0b91499ebdcf..178fd47b372e 100644 > > --- a/arch/sh/kernel/syscalls/syscall.tbl > > +++ b/arch/sh/kernel/syscalls/syscall.tbl > > @@ -449,3 +449,4 @@ > > 444 common landlock_create_ruleset sys_landlock_create_ruleset > > 445 common landlock_add_rule sys_landlock_add_rule > > 446 common landlock_restrict_self sys_landlock_restrict_self > > +447 common process_reap sys_process_reap > > diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl > > index e34cc30ef22c..faee121b7ae2 100644 > > --- a/arch/sparc/kernel/syscalls/syscall.tbl > > +++ b/arch/sparc/kernel/syscalls/syscall.tbl > > @@ -492,3 +492,4 @@ > > 444 common landlock_create_ruleset sys_landlock_create_ruleset > > 445 common landlock_add_rule sys_landlock_add_rule > > 446 common landlock_restrict_self sys_landlock_restrict_self > > +447 common process_reap sys_process_reap > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl > > index 4bbc267fb36b..cbe070de9884 100644 > > --- a/arch/x86/entry/syscalls/syscall_32.tbl > > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > > @@ -451,3 +451,4 @@ > > 444 i386 landlock_create_ruleset sys_landlock_create_ruleset > > 445 i386 landlock_add_rule sys_landlock_add_rule > > 446 i386 landlock_restrict_self sys_landlock_restrict_self > > +447 i386 process_reap sys_process_reap > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > > index ce18119ea0d0..e6765646731b 100644 > > --- a/arch/x86/entry/syscalls/syscall_64.tbl > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > > @@ -368,6 +368,7 @@ > > 444 common landlock_create_ruleset sys_landlock_create_ruleset > > 445 common landlock_add_rule sys_landlock_add_rule > > 446 common landlock_restrict_self sys_landlock_restrict_self > > +447 common process_reap sys_process_reap > > > > # > > # Due to a historical design error, certain syscalls are numbered differently > > diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl > > index fd2f30227d96..f0e9dbee1a5b 100644 > > --- a/arch/xtensa/kernel/syscalls/syscall.tbl > > +++ b/arch/xtensa/kernel/syscalls/syscall.tbl > > @@ -417,3 +417,4 @@ > > 444 common landlock_create_ruleset sys_landlock_create_ruleset > > 445 common landlock_add_rule sys_landlock_add_rule > > 446 common landlock_restrict_self sys_landlock_restrict_self > > +447 common process_reap sys_process_reap > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > > index 050511e8f1f8..b6659e09bf0d 100644 > > --- a/include/linux/syscalls.h > > +++ b/include/linux/syscalls.h > > @@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len, > > asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); > > asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec, > > size_t vlen, int behavior, unsigned int flags); > > +asmlinkage long sys_process_reap(int pidfd, unsigned int flags); > > asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, > > unsigned long prot, unsigned long pgoff, > > unsigned long flags); > > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > > index d2a942086fcb..b3bf57b928af 100644 > > --- a/include/uapi/asm-generic/unistd.h > > +++ b/include/uapi/asm-generic/unistd.h > > @@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset) > > __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule) > > #define __NR_landlock_restrict_self 446 > > __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self) > > +#define __NR_process_reap 447 > > +__SYSCALL(__NR_process_reap, sys_process_reap) > > > > #undef __NR_syscalls > > -#define __NR_syscalls 447 > > +#define __NR_syscalls 448 > > > > /* > > * 32 bit systems traditionally used different > > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > > index 0ea8128468c3..56eb7c9f8356 100644 > > --- a/kernel/sys_ni.c > > +++ b/kernel/sys_ni.c > > @@ -289,6 +289,7 @@ COND_SYSCALL(munlockall); > > COND_SYSCALL(mincore); > > COND_SYSCALL(madvise); > > COND_SYSCALL(process_madvise); > > +COND_SYSCALL(process_reap); > > COND_SYSCALL(remap_file_pages); > > COND_SYSCALL(mbind); > > COND_SYSCALL_COMPAT(mbind); > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > > index eefd3f5fde46..0f85a0442fa5 100644 > > --- a/mm/oom_kill.c > > +++ b/mm/oom_kill.c > > @@ -28,6 +28,7 @@ > > #include <linux/sched/task.h> > > #include <linux/sched/debug.h> > > #include <linux/swap.h> > > +#include <linux/syscalls.h> > > #include <linux/timex.h> > > #include <linux/jiffies.h> > > #include <linux/cpuset.h> > > @@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void) > > out_of_memory(&oc); > > mutex_unlock(&oom_lock); > > } > > + > > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags) > > Hey Suren, > > Wouldn't > - process_memory_reap() > - process_reap_memory() > - process_mreap() > be better names? Hi Christian, I'm open to other names, whichever sounds better. From the list process_reap_memory() sounds best to me but I'm open to others as well. > > > +{ > > + struct pid *pid; > > + struct task_struct *task; > > + struct mm_struct *mm = NULL; > > + unsigned int f_flags; > > + long ret = 0; > > + > > + if (flags != 0) > > + return -EINVAL; > > + > > + pid = pidfd_get_pid(pidfd, &f_flags); > > + if (IS_ERR(pid)) > > + return PTR_ERR(pid); > > + > > + task = get_pid_task(pid, PIDTYPE_PID); > > + if (!task) { > > + ret = -ESRCH; > > + goto put_pid; > > + } > > You have a similar pattern in process_madvise(): > > pid = pidfd_get_pid(pidfd, &f_flags); > if (IS_ERR(pid)) { > ret = PTR_ERR(pid); > goto free_iov; > } > > task = get_pid_task(pid, PIDTYPE_PID); > if (!task) { > ret = -ESRCH; > goto put_pid; > } > > I'd suggest you add a tiny helper to kernel/pid.c and call it in both > places. Agree. I'll post the new rev next week to give some more time for reviews of this version. Thanks! > > -- > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com. >
Hi Suren, On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote: > > In modern systems it's not unusual to have a system component monitoring > memory conditions of the system and tasked with keeping system memory > pressure under control. One way to accomplish that is to kill > non-essential processes to free up memory for more important ones. > Examples of this are Facebook's OOM killer daemon called oomd and > Android's low memory killer daemon called lmkd. > For such system component it's important to be able to free memory > quickly and efficiently. Unfortunately the time process takes to free > up its memory after receiving a SIGKILL might vary based on the state > of the process (uninterruptible sleep), size and OPP level of the core > the process is running. A mechanism to free resources of the target > process in a more predictable way would improve system's ability to > control its memory pressure. > Introduce process_reap system call that reclaims memory of a dying process > from the context of the caller. This way the memory in freed in a more > controllable way with CPU affinity and priority of the caller. The workload > of freeing the memory will also be charged to the caller. > The operation is allowed only on a dying process. > > Previously I proposed a number of alternatives to accomplish this: > - https://lore.kernel.org/patchwork/patch/1060407 extending > pidfd_send_signal to allow memory reaping using oom_reaper thread; > - https://lore.kernel.org/patchwork/patch/1338196 extending > pidfd_send_signal to reap memory of the target process synchronously from > the context of the caller; > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED > support for process_madvise implementing synchronous memory reaping. > > The end of the last discussion culminated with suggestion to introduce a > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875) > The reasoning was that the new variant of process_madvise > a) does not work on an address range > b) is destructive > c) doesn't share much code at all with the rest of process_madvise > From the userspace point of view it was awkward and inconvenient to provide > memory range for this operation that operates on the entire address space. > Using special flags or address values to specify the entire address space > was too hacky. > > The API is as follows, > > int process_reap(int pidfd, unsigned int flags); > > DESCRIPTION > The process_reap() system call is used to free the memory of a > dying process. > > The pidfd selects the process referred to by the PID file > descriptor. > (See pidofd_open(2) for further information) *pidfd_open > > The flags argument is reserved for future use; currently, this > argument must be specified as 0. > > RETURN VALUE > On success, process_reap() returns 0. On error, -1 is returned > and errno is set to indicate the error. > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> Thanks for continuously pushing this. One question I have is how do you envision this syscall to be used for the cgroup based workloads. Traverse the target tree, read pids from cgroup.procs files, pidfd_open them, send SIGKILL and then process_reap them. Is that right? Orthogonal to this patch I wonder if we should have an optimized way to reap processes from a cgroup. Something similar to cgroup.kill (or maybe overload cgroup.kill with reaping as well). [...] > + > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags) > +{ > + struct pid *pid; > + struct task_struct *task; > + struct mm_struct *mm = NULL; > + unsigned int f_flags; > + long ret = 0; > + > + if (flags != 0) > + return -EINVAL; > + > + pid = pidfd_get_pid(pidfd, &f_flags); > + if (IS_ERR(pid)) > + return PTR_ERR(pid); > + > + task = get_pid_task(pid, PIDTYPE_PID); > + if (!task) { > + ret = -ESRCH; > + goto put_pid; > + } > + > + /* > + * If the task is dying and in the process of releasing its memory > + * then get its mm. > + */ > + task_lock(task); > + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) { task_will_free_mem() is fine here but I think in parallel we should optimize this function. At the moment it is traversing all the processes on the machine. It is very normal to have tens of thousands of processes on big machines, so it would be really costly when reaping a bunch of processes.
On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote: > > In modern systems it's not unusual to have a system component monitoring > memory conditions of the system and tasked with keeping system memory > pressure under control. One way to accomplish that is to kill > non-essential processes to free up memory for more important ones. > Examples of this are Facebook's OOM killer daemon called oomd and > Android's low memory killer daemon called lmkd. > For such system component it's important to be able to free memory > quickly and efficiently. Unfortunately the time process takes to free > up its memory after receiving a SIGKILL might vary based on the state > of the process (uninterruptible sleep), size and OPP level of the core > the process is running. A mechanism to free resources of the target > process in a more predictable way would improve system's ability to > control its memory pressure. > Introduce process_reap system call that reclaims memory of a dying process > from the context of the caller. This way the memory in freed in a more > controllable way with CPU affinity and priority of the caller. The workload > of freeing the memory will also be charged to the caller. > The operation is allowed only on a dying process. At the risk of asking a potentially silly question, should this just be a file in procfs? Also, please consider removing all mention of the word "reap" from the user API. For better or for worse, "reap" in UNIX refers to what happens when a dead task gets wait()ed. I sincerely wish I could go back in time and gently encourage whomever invented that particular abomination to change their mind, but my time machine doesn't work. --Andy
On Wed, Jun 30, 2021 at 11:01 AM Shakeel Butt <shakeelb@google.com> wrote: > > Hi Suren, > > On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > In modern systems it's not unusual to have a system component monitoring > > memory conditions of the system and tasked with keeping system memory > > pressure under control. One way to accomplish that is to kill > > non-essential processes to free up memory for more important ones. > > Examples of this are Facebook's OOM killer daemon called oomd and > > Android's low memory killer daemon called lmkd. > > For such system component it's important to be able to free memory > > quickly and efficiently. Unfortunately the time process takes to free > > up its memory after receiving a SIGKILL might vary based on the state > > of the process (uninterruptible sleep), size and OPP level of the core > > the process is running. A mechanism to free resources of the target > > process in a more predictable way would improve system's ability to > > control its memory pressure. > > Introduce process_reap system call that reclaims memory of a dying process > > from the context of the caller. This way the memory in freed in a more > > controllable way with CPU affinity and priority of the caller. The workload > > of freeing the memory will also be charged to the caller. > > The operation is allowed only on a dying process. > > > > Previously I proposed a number of alternatives to accomplish this: > > - https://lore.kernel.org/patchwork/patch/1060407 extending > > pidfd_send_signal to allow memory reaping using oom_reaper thread; > > - https://lore.kernel.org/patchwork/patch/1338196 extending > > pidfd_send_signal to reap memory of the target process synchronously from > > the context of the caller; > > - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED > > support for process_madvise implementing synchronous memory reaping. > > > > The end of the last discussion culminated with suggestion to introduce a > > dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875) > > The reasoning was that the new variant of process_madvise > > a) does not work on an address range > > b) is destructive > > c) doesn't share much code at all with the rest of process_madvise > > From the userspace point of view it was awkward and inconvenient to provide > > memory range for this operation that operates on the entire address space. > > Using special flags or address values to specify the entire address space > > was too hacky. > > > > The API is as follows, > > > > int process_reap(int pidfd, unsigned int flags); > > > > DESCRIPTION > > The process_reap() system call is used to free the memory of a > > dying process. > > > > The pidfd selects the process referred to by the PID file > > descriptor. > > (See pidofd_open(2) for further information) > > *pidfd_open Ack > > > > > The flags argument is reserved for future use; currently, this > > argument must be specified as 0. > > > > RETURN VALUE > > On success, process_reap() returns 0. On error, -1 is returned > > and errno is set to indicate the error. > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > > Thanks for continuously pushing this. One question I have is how do > you envision this syscall to be used for the cgroup based workloads. > Traverse the target tree, read pids from cgroup.procs files, > pidfd_open them, send SIGKILL and then process_reap them. Is that > right? Yes, at least that's how Android does that. It's a bit more involved but it's a technical detail. Userspace low memory killer kills a process (sends SIGKILL and calls process_reap) and another system component detects that a process died and will kill all processes belonging to the same cgroup (that's how we identify related processes). > > Orthogonal to this patch I wonder if we should have an optimized way > to reap processes from a cgroup. Something similar to cgroup.kill (or > maybe overload cgroup.kill with reaping as well). Seems reasonable to me. We could use that in the above scenario. > > [...] > > > + > > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags) > > +{ > > + struct pid *pid; > > + struct task_struct *task; > > + struct mm_struct *mm = NULL; > > + unsigned int f_flags; > > + long ret = 0; > > + > > + if (flags != 0) > > + return -EINVAL; > > + > > + pid = pidfd_get_pid(pidfd, &f_flags); > > + if (IS_ERR(pid)) > > + return PTR_ERR(pid); > > + > > + task = get_pid_task(pid, PIDTYPE_PID); > > + if (!task) { > > + ret = -ESRCH; > > + goto put_pid; > > + } > > + > > + /* > > + * If the task is dying and in the process of releasing its memory > > + * then get its mm. > > + */ > > + task_lock(task); > > + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) { > > task_will_free_mem() is fine here but I think in parallel we should > optimize this function. At the moment it is traversing all the > processes on the machine. It is very normal to have tens of thousands > of processes on big machines, so it would be really costly when > reaping a bunch of processes. Hmm. But I think we still need to make sure that the mm is not shared with another non-dying process. IIUC that's the point of that traversal. Am I mistaken?
On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote: > > On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > In modern systems it's not unusual to have a system component monitoring > > memory conditions of the system and tasked with keeping system memory > > pressure under control. One way to accomplish that is to kill > > non-essential processes to free up memory for more important ones. > > Examples of this are Facebook's OOM killer daemon called oomd and > > Android's low memory killer daemon called lmkd. > > For such system component it's important to be able to free memory > > quickly and efficiently. Unfortunately the time process takes to free > > up its memory after receiving a SIGKILL might vary based on the state > > of the process (uninterruptible sleep), size and OPP level of the core > > the process is running. A mechanism to free resources of the target > > process in a more predictable way would improve system's ability to > > control its memory pressure. > > Introduce process_reap system call that reclaims memory of a dying process > > from the context of the caller. This way the memory in freed in a more > > controllable way with CPU affinity and priority of the caller. The workload > > of freeing the memory will also be charged to the caller. > > The operation is allowed only on a dying process. > > At the risk of asking a potentially silly question, should this just > be a file in procfs? Hmm. I guess it's doable if procfs will not disappear too soon before memory is released... syscall also supports parameters, in this case flags can be used in the future to support PIDs in addition to PIDFDs for example. Before looking more in that direction, a silly question from my side: why procfs interface would be preferable to a syscall? > > Also, please consider removing all mention of the word "reap" from the > user API. For better or for worse, "reap" in UNIX refers to what > happens when a dead task gets wait()ed. I sincerely wish I could go > back in time and gently encourage whomever invented that particular > abomination to change their mind, but my time machine doesn't work. I see. Thanks for the note. How about process_mem_release() and replacing reap with release everywhere? > > --Andy
On Wed, Jun 30, 2021 at 11:44 AM Suren Baghdasaryan <surenb@google.com> wrote: > [...] > > > + /* > > > + * If the task is dying and in the process of releasing its memory > > > + * then get its mm. > > > + */ > > > + task_lock(task); > > > + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) { > > > > task_will_free_mem() is fine here but I think in parallel we should > > optimize this function. At the moment it is traversing all the > > processes on the machine. It is very normal to have tens of thousands > > of processes on big machines, so it would be really costly when > > reaping a bunch of processes. > > Hmm. But I think we still need to make sure that the mm is not shared > with another non-dying process. IIUC that's the point of that > traversal. Am I mistaken? You are right. I am talking about efficiently finding all processes which are sharing mm (maybe linked into another list) instead of traversing all the processes on the system.
On Wed, Jun 30, 2021 at 12:00 PM Shakeel Butt <shakeelb@google.com> wrote: > > On Wed, Jun 30, 2021 at 11:44 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > [...] > > > > + /* > > > > + * If the task is dying and in the process of releasing its memory > > > > + * then get its mm. > > > > + */ > > > > + task_lock(task); > > > > + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) { > > > > > > task_will_free_mem() is fine here but I think in parallel we should > > > optimize this function. At the moment it is traversing all the > > > processes on the machine. It is very normal to have tens of thousands > > > of processes on big machines, so it would be really costly when > > > reaping a bunch of processes. > > > > Hmm. But I think we still need to make sure that the mm is not shared > > with another non-dying process. IIUC that's the point of that > > traversal. Am I mistaken? > > You are right. I am talking about efficiently finding all processes > which are sharing mm (maybe linked into another list) instead of > traversing all the processes on the system. Oh, I see. I think that's a good idea but belongs to a separate patch as an optimization for task_will_free_mem(). Thanks for reviewing and for good suggestions!
On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote: > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote: > > Also, please consider removing all mention of the word "reap" from the > > user API. For better or for worse, "reap" in UNIX refers to what > > happens when a dead task gets wait()ed. I sincerely wish I could go > > back in time and gently encourage whomever invented that particular > > abomination to change their mind, but my time machine doesn't work. > > I see. Thanks for the note. How about process_mem_release() and > replacing reap with release everywhere? I don't quite understand the objection. This syscall works on tasks that are at the end of their life, right? Isn't something like process_mreap() establishing exactly the mental link we want here? Release is less descriptive for what this thing is to be used for.
On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote: > > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote: > > > Also, please consider removing all mention of the word "reap" from the > > > user API. For better or for worse, "reap" in UNIX refers to what > > > happens when a dead task gets wait()ed. I sincerely wish I could go > > > back in time and gently encourage whomever invented that particular > > > abomination to change their mind, but my time machine doesn't work. > > > > I see. Thanks for the note. How about process_mem_release() and > > replacing reap with release everywhere? > > I don't quite understand the objection. This syscall works on tasks > that are at the end of their life, right? Isn't something like > process_mreap() establishing exactly the mental link we want here? > Release is less descriptive for what this thing is to be used for. For better or for worse, "reap" means to make a zombie pid go away. From the description, this new operation takes a dying process (not necessarily a zombie yet) and aggressively frees its memory. This is a different optioneration. How about "free_dying_process_memory"?
On Wed, Jun 30, 2021 at 11:51 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote: > > > > On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > In modern systems it's not unusual to have a system component monitoring > > > memory conditions of the system and tasked with keeping system memory > > > pressure under control. One way to accomplish that is to kill > > > non-essential processes to free up memory for more important ones. > > > Examples of this are Facebook's OOM killer daemon called oomd and > > > Android's low memory killer daemon called lmkd. > > > For such system component it's important to be able to free memory > > > quickly and efficiently. Unfortunately the time process takes to free > > > up its memory after receiving a SIGKILL might vary based on the state > > > of the process (uninterruptible sleep), size and OPP level of the core > > > the process is running. A mechanism to free resources of the target > > > process in a more predictable way would improve system's ability to > > > control its memory pressure. > > > Introduce process_reap system call that reclaims memory of a dying process > > > from the context of the caller. This way the memory in freed in a more > > > controllable way with CPU affinity and priority of the caller. The workload > > > of freeing the memory will also be charged to the caller. > > > The operation is allowed only on a dying process. > > > > At the risk of asking a potentially silly question, should this just > > be a file in procfs? > > Hmm. I guess it's doable if procfs will not disappear too soon before > memory is released... syscall also supports parameters, in this case > flags can be used in the future to support PIDs in addition to PIDFDs > for example. > Before looking more in that direction, a silly question from my side: > why procfs interface would be preferable to a syscall? It avoids using a syscall nr. (Admittedly a syscall nr is not *that* precious of a resource.) It also makes it possible to use a shell script to do this, which is maybe useful. --Andy
On Wed, Jun 30, 2021 at 5:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote: > > > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote: > > > > Also, please consider removing all mention of the word "reap" from the > > > > user API. For better or for worse, "reap" in UNIX refers to what > > > > happens when a dead task gets wait()ed. I sincerely wish I could go > > > > back in time and gently encourage whomever invented that particular > > > > abomination to change their mind, but my time machine doesn't work. > > > > > > I see. Thanks for the note. How about process_mem_release() and > > > replacing reap with release everywhere? > > > > I don't quite understand the objection. This syscall works on tasks > > that are at the end of their life, right? Isn't something like > > process_mreap() establishing exactly the mental link we want here? > > Release is less descriptive for what this thing is to be used for. > > For better or for worse, "reap" means to make a zombie pid go away. > From the description, this new operation takes a dying process (not > necessarily a zombie yet) and aggressively frees its memory. This is > a different optioneration. > > How about "free_dying_process_memory"? process_mreap sounds definitely better and in line with names like process_madvise. So maybe we can use it?
) On Wed, Jun 30, 2021 at 5:46 PM Andy Lutomirski <luto@kernel.org> wrote: > > On Wed, Jun 30, 2021 at 11:51 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote: > > > > > > On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > In modern systems it's not unusual to have a system component monitoring > > > > memory conditions of the system and tasked with keeping system memory > > > > pressure under control. One way to accomplish that is to kill > > > > non-essential processes to free up memory for more important ones. > > > > Examples of this are Facebook's OOM killer daemon called oomd and > > > > Android's low memory killer daemon called lmkd. > > > > For such system component it's important to be able to free memory > > > > quickly and efficiently. Unfortunately the time process takes to free > > > > up its memory after receiving a SIGKILL might vary based on the state > > > > of the process (uninterruptible sleep), size and OPP level of the core > > > > the process is running. A mechanism to free resources of the target > > > > process in a more predictable way would improve system's ability to > > > > control its memory pressure. > > > > Introduce process_reap system call that reclaims memory of a dying process > > > > from the context of the caller. This way the memory in freed in a more > > > > controllable way with CPU affinity and priority of the caller. The workload > > > > of freeing the memory will also be charged to the caller. > > > > The operation is allowed only on a dying process. > > > > > > At the risk of asking a potentially silly question, should this just > > > be a file in procfs? > > > > Hmm. I guess it's doable if procfs will not disappear too soon before > > memory is released... syscall also supports parameters, in this case > > flags can be used in the future to support PIDs in addition to PIDFDs > > for example. > > Before looking more in that direction, a silly question from my side: > > why procfs interface would be preferable to a syscall? > > It avoids using a syscall nr. (Admittedly a syscall nr is not *that* > precious of a resource.) It also makes it possible to use a shell > script to do this, which is maybe useful. I see. Not really sure if the shell usage is a big usecase for this operation but let's see if more people like that approach. For my specific usecase one syscall (process_reap) is better than three syscalls (open, write, close) and the possibility to extend the functionality using flags might be of value for the future. > > --Andy
On Thu, Jul 01, 2021 at 03:59:48PM -0700, Suren Baghdasaryan wrote: > On Wed, Jun 30, 2021 at 5:44 PM Andy Lutomirski <luto@kernel.org> wrote: > > > > On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote: > > > > On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote: > > > > > Also, please consider removing all mention of the word "reap" from the > > > > > user API. For better or for worse, "reap" in UNIX refers to what > > > > > happens when a dead task gets wait()ed. I sincerely wish I could go > > > > > back in time and gently encourage whomever invented that particular > > > > > abomination to change their mind, but my time machine doesn't work. > > > > > > > > I see. Thanks for the note. How about process_mem_release() and > > > > replacing reap with release everywhere? > > > > > > I don't quite understand the objection. This syscall works on tasks > > > that are at the end of their life, right? Isn't something like > > > process_mreap() establishing exactly the mental link we want here? > > > Release is less descriptive for what this thing is to be used for. > > > > For better or for worse, "reap" means to make a zombie pid go away. > > From the description, this new operation takes a dying process (not > > necessarily a zombie yet) and aggressively frees its memory. This is > > a different optioneration. > > > > How about "free_dying_process_memory"? > > process_mreap sounds definitely better and in line with names like > process_madvise. So maybe we can use it? That one was my favorite from the list I gave too but maybe we can satisfy Andy too if we use one of: - process_mfree() - process_mrelease()
On 02.07.21 17:27, Christian Brauner wrote: > On Thu, Jul 01, 2021 at 03:59:48PM -0700, Suren Baghdasaryan wrote: >> On Wed, Jun 30, 2021 at 5:44 PM Andy Lutomirski <luto@kernel.org> wrote: >>> >>> On Wed, Jun 30, 2021 at 2:45 PM Johannes Weiner <hannes@cmpxchg.org> wrote: >>>> >>>> On Wed, Jun 30, 2021 at 11:51:36AM -0700, Suren Baghdasaryan wrote: >>>>> On Wed, Jun 30, 2021 at 11:26 AM Andy Lutomirski <luto@kernel.org> wrote: >>>>>> Also, please consider removing all mention of the word "reap" from the >>>>>> user API. For better or for worse, "reap" in UNIX refers to what >>>>>> happens when a dead task gets wait()ed. I sincerely wish I could go >>>>>> back in time and gently encourage whomever invented that particular >>>>>> abomination to change their mind, but my time machine doesn't work. >>>>> >>>>> I see. Thanks for the note. How about process_mem_release() and >>>>> replacing reap with release everywhere? >>>> >>>> I don't quite understand the objection. This syscall works on tasks >>>> that are at the end of their life, right? Isn't something like >>>> process_mreap() establishing exactly the mental link we want here? >>>> Release is less descriptive for what this thing is to be used for. >>> >>> For better or for worse, "reap" means to make a zombie pid go away. >>> From the description, this new operation takes a dying process (not >>> necessarily a zombie yet) and aggressively frees its memory. This is >>> a different optioneration. >>> >>> How about "free_dying_process_memory"? >> >> process_mreap sounds definitely better and in line with names like >> process_madvise. So maybe we can use it? > > That one was my favorite from the list I gave too but maybe we can > satisfy Andy too if we use one of: > - process_mfree() > - process_mrelease() > FWIW, I tend to like process_mrelease(), due to the implied "release" ("free the memory if there are no other references") semantics. Further, a new syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents.
* Suren Baghdasaryan: > The API is as follows, > > int process_reap(int pidfd, unsigned int flags); > > DESCRIPTION > The process_reap() system call is used to free the memory of a > dying process. > > The pidfd selects the process referred to by the PID file > descriptor. > (See pidofd_open(2) for further information) > > The flags argument is reserved for future use; currently, this > argument must be specified as 0. > > RETURN VALUE > On success, process_reap() returns 0. On error, -1 is returned > and errno is set to indicate the error. I think the manual page should mention what it means for a process to be “dying”, and how to move a process to this state. Thanks, Florian
On Mon 05-07-21 09:41:54, David Hildenbrand wrote: > On 02.07.21 17:27, Christian Brauner wrote: [...] > > That one was my favorite from the list I gave too but maybe we can > > satisfy Andy too if we use one of: > > - process_mfree() > > - process_mrelease() > > > > FWIW, I tend to like process_mrelease(), due to the implied "release" ("free > the memory if there are no other references") semantics. Agreed. > Further, a new > syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents. Yeah, proc based interface is both tricky to use and kinda ugly now that pidfd can solve all at in once. My original preference was a more generic kill syscall to allow flags but a dedicated syscall doesn't look really bad either.
On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <fweimer@redhat.com> wrote: > > * Suren Baghdasaryan: > > > The API is as follows, > > > > int process_reap(int pidfd, unsigned int flags); > > > > DESCRIPTION > > The process_reap() system call is used to free the memory of a > > dying process. > > > > The pidfd selects the process referred to by the PID file > > descriptor. > > (See pidofd_open(2) for further information) > > > > The flags argument is reserved for future use; currently, this > > argument must be specified as 0. > > > > RETURN VALUE > > On success, process_reap() returns 0. On error, -1 is returned > > and errno is set to indicate the error. > > I think the manual page should mention what it means for a process to be > “dying”, and how to move a process to this state. Thanks for the suggestion, Florian! Would replacing "dying process" with "process which was sent a SIGKILL signal" be sufficient? > > Thanks, > Florian >
On Wed, Jul 7, 2021 at 5:38 AM Michal Hocko <mhocko@suse.com> wrote: > > On Mon 05-07-21 09:41:54, David Hildenbrand wrote: > > On 02.07.21 17:27, Christian Brauner wrote: > [...] > > > That one was my favorite from the list I gave too but maybe we can > > > satisfy Andy too if we use one of: > > > - process_mfree() > > > - process_mrelease() > > > > > > > FWIW, I tend to like process_mrelease(), due to the implied "release" ("free > > the memory if there are no other references") semantics. > > Agreed. Ok, sounds like process_mrelease() would be an acceptable compromise. > > > Further, a new > > syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents. > > Yeah, proc based interface is both tricky to use and kinda ugly now that > pidfd can solve all at in once. Sounds good. Will keep it as is then. > My original preference was a more generic kill syscall to allow flags > but a dedicated syscall doesn't look really bad either. Yeah, I have tried that direction unsuccessfully before arriving at this one. Hopefully it represents the right compromise which can satisfy everyone's usecase. > -- > Michal Hocko > SUSE Labs
* Suren Baghdasaryan: > On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <fweimer@redhat.com> wrote: >> >> * Suren Baghdasaryan: >> >> > The API is as follows, >> > >> > int process_reap(int pidfd, unsigned int flags); >> > >> > DESCRIPTION >> > The process_reap() system call is used to free the memory of a >> > dying process. >> > >> > The pidfd selects the process referred to by the PID file >> > descriptor. >> > (See pidofd_open(2) for further information) >> > >> > The flags argument is reserved for future use; currently, this >> > argument must be specified as 0. >> > >> > RETURN VALUE >> > On success, process_reap() returns 0. On error, -1 is returned >> > and errno is set to indicate the error. >> >> I think the manual page should mention what it means for a process to be >> “dying”, and how to move a process to this state. > > Thanks for the suggestion, Florian! Would replacing "dying process" > with "process which was sent a SIGKILL signal" be sufficient? That explains very clearly the requirement, but it raises the question why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing system call. Thanks, Florian
On Wed, Jul 7, 2021 at 10:41 PM Florian Weimer <fweimer@redhat.com> wrote: > > * Suren Baghdasaryan: > > > On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <fweimer@redhat.com> wrote: > >> > >> * Suren Baghdasaryan: > >> > >> > The API is as follows, > >> > > >> > int process_reap(int pidfd, unsigned int flags); > >> > > >> > DESCRIPTION > >> > The process_reap() system call is used to free the memory of a > >> > dying process. > >> > > >> > The pidfd selects the process referred to by the PID file > >> > descriptor. > >> > (See pidofd_open(2) for further information) > >> > > >> > The flags argument is reserved for future use; currently, this > >> > argument must be specified as 0. > >> > > >> > RETURN VALUE > >> > On success, process_reap() returns 0. On error, -1 is returned > >> > and errno is set to indicate the error. > >> > >> I think the manual page should mention what it means for a process to be > >> “dying”, and how to move a process to this state. > > > > Thanks for the suggestion, Florian! Would replacing "dying process" > > with "process which was sent a SIGKILL signal" be sufficient? > > That explains very clearly the requirement, but it raises the question > why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing > system call. I think you are suggesting to use sigqueue() to deliver the signal and perform the reaping when a special value accompanies it. This would be somewhat similar to my early suggestion to use a flag in pidfd_send_signal() (see: https://lore.kernel.org/patchwork/patch/1060407) to implement memory reaping which has another advantage of operation on PIDFDs instead of PIDs which can be recycled. kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the signal and return without blocking. Changing that behavior was considered unacceptable in these discussions. On the other hand using some kthread to do the reaping asynchronously has its disadvantages: userspace can't control the priority and cpu affinity of the thread doing the reaping and this work is not charged towards the caller. In the end a separate blocking syscall was deemed appropriate for this operation. More details can be found in the links I posted in the description of the patch. > > Thanks, > Florian >
* Suren Baghdasaryan: > On Wed, Jul 7, 2021 at 10:41 PM Florian Weimer <fweimer@redhat.com> wrote: >> >> * Suren Baghdasaryan: >> >> > On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <fweimer@redhat.com> wrote: >> >> >> >> * Suren Baghdasaryan: >> >> >> >> > The API is as follows, >> >> > >> >> > int process_reap(int pidfd, unsigned int flags); >> >> > >> >> > DESCRIPTION >> >> > The process_reap() system call is used to free the memory of a >> >> > dying process. >> >> > >> >> > The pidfd selects the process referred to by the PID file >> >> > descriptor. >> >> > (See pidofd_open(2) for further information) >> >> > >> >> > The flags argument is reserved for future use; currently, this >> >> > argument must be specified as 0. >> >> > >> >> > RETURN VALUE >> >> > On success, process_reap() returns 0. On error, -1 is returned >> >> > and errno is set to indicate the error. >> >> >> >> I think the manual page should mention what it means for a process to be >> >> “dying”, and how to move a process to this state. >> > >> > Thanks for the suggestion, Florian! Would replacing "dying process" >> > with "process which was sent a SIGKILL signal" be sufficient? >> >> That explains very clearly the requirement, but it raises the question >> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing >> system call. > > I think you are suggesting to use sigqueue() to deliver the signal and > perform the reaping when a special value accompanies it. This would be > somewhat similar to my early suggestion to use a flag in > pidfd_send_signal() (see: > https://lore.kernel.org/patchwork/patch/1060407) to implement memory > reaping which has another advantage of operation on PIDFDs instead of > PIDs which can be recycled. > kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the > signal and return without blocking. Changing that behavior was > considered unacceptable in these discussions. Does this mean that you need two threads, one that sends SIGKILL, and one that calls process_reap? Given that sending SIGKILL is blocking with the existing interfaces? Please also note that asynchronous deallocation of resources leads to bugs and can cause unrelated workloads to fail. For example, in some configurations, clone can fail with EAGAIN even in cases where the total number of tasks is clearly bounded because the kernel signals task exit to applications before all resources are deallocated. I'm worried that the new interface makes things quite a bit worse in this regard. Thanks, Florian
On Wed, Jul 7, 2021 at 11:15 PM Florian Weimer <fweimer@redhat.com> wrote: > > * Suren Baghdasaryan: > > > On Wed, Jul 7, 2021 at 10:41 PM Florian Weimer <fweimer@redhat.com> wrote: > >> > >> * Suren Baghdasaryan: > >> > >> > On Wed, Jul 7, 2021 at 2:47 AM Florian Weimer <fweimer@redhat.com> wrote: > >> >> > >> >> * Suren Baghdasaryan: > >> >> > >> >> > The API is as follows, > >> >> > > >> >> > int process_reap(int pidfd, unsigned int flags); > >> >> > > >> >> > DESCRIPTION > >> >> > The process_reap() system call is used to free the memory of a > >> >> > dying process. > >> >> > > >> >> > The pidfd selects the process referred to by the PID file > >> >> > descriptor. > >> >> > (See pidofd_open(2) for further information) > >> >> > > >> >> > The flags argument is reserved for future use; currently, this > >> >> > argument must be specified as 0. > >> >> > > >> >> > RETURN VALUE > >> >> > On success, process_reap() returns 0. On error, -1 is returned > >> >> > and errno is set to indicate the error. > >> >> > >> >> I think the manual page should mention what it means for a process to be > >> >> “dying”, and how to move a process to this state. > >> > > >> > Thanks for the suggestion, Florian! Would replacing "dying process" > >> > with "process which was sent a SIGKILL signal" be sufficient? > >> > >> That explains very clearly the requirement, but it raises the question > >> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing > >> system call. > > > > I think you are suggesting to use sigqueue() to deliver the signal and > > perform the reaping when a special value accompanies it. This would be > > somewhat similar to my early suggestion to use a flag in > > pidfd_send_signal() (see: > > https://lore.kernel.org/patchwork/patch/1060407) to implement memory > > reaping which has another advantage of operation on PIDFDs instead of > > PIDs which can be recycled. > > kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the > > signal and return without blocking. Changing that behavior was > > considered unacceptable in these discussions. > > Does this mean that you need two threads, one that sends SIGKILL, and > one that calls process_reap? Given that sending SIGKILL is blocking > with the existing interfaces? Sending SIGKILL is blocking in terms of delivering the signal, but it does not block waiting for SIGKILL to be processed by the signal recipient and memory to be released. When I was talking about "blocking", I meant that current kill() and friends do not block to wait for SIGKILL to be processed. process_reap() will block until the memory is released. Whether the userspace caller is using it right after sending a SIGKILL to reclaim the memory synchronously or spawns a separate thread to reclaim memory asynchronously is up to the user. Both patterns are supported. > Please also note that asynchronous deallocation of resources leads to > bugs and can cause unrelated workloads to fail. For example, in some > configurations, clone can fail with EAGAIN even in cases where the total > number of tasks is clearly bounded because the kernel signals task exit > to applications before all resources are deallocated. I'm worried that > the new interface makes things quite a bit worse in this regard. The process_reap() releases memory synchronously, no kthreads are being used. If asynchronous release is required, the userspace would need to spawn a userspace thread and issue this syscall from it. I hope this clears your concerns, which I think are about asynchronous deallocations within the kernel. Thanks! > > Thanks, > Florian >
* Suren Baghdasaryan: > Sending SIGKILL is blocking in terms of delivering the signal, but it > does not block waiting for SIGKILL to be processed by the signal > recipient and memory to be released. When I was talking about > "blocking", I meant that current kill() and friends do not block to > wait for SIGKILL to be processed. > process_reap() will block until the memory is released. Whether the > userspace caller is using it right after sending a SIGKILL to reclaim > the memory synchronously or spawns a separate thread to reclaim memory > asynchronously is up to the user. Both patterns are supported. I see, this makes sense. Considering that the pidfd sticks around after process_reap returns, the issue described in bug 154011 probably does not apply to process_reap. (This relates to asynchronous resource deallocation, as discussed before.) Thanks, Florian
On Wed, Jul 07, 2021 at 02:14:23PM -0700, Suren Baghdasaryan wrote: > On Wed, Jul 7, 2021 at 5:38 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Mon 05-07-21 09:41:54, David Hildenbrand wrote: > > > On 02.07.21 17:27, Christian Brauner wrote: > > [...] > > > > That one was my favorite from the list I gave too but maybe we can > > > > satisfy Andy too if we use one of: > > > > - process_mfree() > > > > - process_mrelease() > > > > > > > > > > FWIW, I tend to like process_mrelease(), due to the implied "release" ("free > > > the memory if there are no other references") semantics. > > > > Agreed. > > Ok, sounds like process_mrelease() would be an acceptable compromise. > > > > > > Further, a new > > > syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents. > > > > Yeah, proc based interface is both tricky to use and kinda ugly now that > > pidfd can solve all at in once. > > Sounds good. Will keep it as is then. > > > My original preference was a more generic kill syscall to allow flags > > but a dedicated syscall doesn't look really bad either. > > Yeah, I have tried that direction unsuccessfully before arriving at > this one. Hopefully it represents the right compromise which can > satisfy everyone's usecase. I think a syscall is fine and it's not we're running out of numbers (anymore). :) Christian
On Fri, Jul 9, 2021 at 1:59 AM Christian Brauner <christian.brauner@ubuntu.com> wrote: > > On Wed, Jul 07, 2021 at 02:14:23PM -0700, Suren Baghdasaryan wrote: > > On Wed, Jul 7, 2021 at 5:38 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Mon 05-07-21 09:41:54, David Hildenbrand wrote: > > > > On 02.07.21 17:27, Christian Brauner wrote: > > > [...] > > > > > That one was my favorite from the list I gave too but maybe we can > > > > > satisfy Andy too if we use one of: > > > > > - process_mfree() > > > > > - process_mrelease() > > > > > > > > > > > > > FWIW, I tend to like process_mrelease(), due to the implied "release" ("free > > > > the memory if there are no other references") semantics. > > > > > > Agreed. > > > > Ok, sounds like process_mrelease() would be an acceptable compromise. > > > > > > > > > Further, a new > > > > syscall feels cleaner than some magic sysfs/procfs toggle. Just my 2 cents. > > > > > > Yeah, proc based interface is both tricky to use and kinda ugly now that > > > pidfd can solve all at in once. > > > > Sounds good. Will keep it as is then. > > > > > My original preference was a more generic kill syscall to allow flags > > > but a dedicated syscall doesn't look really bad either. > > > > Yeah, I have tried that direction unsuccessfully before arriving at > > this one. Hopefully it represents the right compromise which can > > satisfy everyone's usecase. > > I think a syscall is fine and it's not we're running out of numbers > (anymore). :) Thanks everyone for the input! So far I collected: 1. rename the syscall to process_mrelease() 2. replace "dying process" with "process which was sent a SIGKILL signal" in the manual page text I'll respin a v2 with these changes next week. Have a great weekend! Suren. > > Christian
On Mon, Jul 12, 2021 at 5:51 AM Jan Engelhardt <jengelh@inai.de> wrote: > > > On Thursday 2021-07-08 08:05, Suren Baghdasaryan wrote: > >> > >> That explains very clearly the requirement, but it raises the question > >> why this isn't an si_code flag for rt_sigqueueinfo, reusing the existing > >> system call. > > > >I think you are suggesting to use sigqueue() to deliver the signal and > >perform the reaping when a special value accompanies it. This would be > >somewhat similar to my early suggestion to use a flag in > >pidfd_send_signal() (see: > >https://lore.kernel.org/patchwork/patch/1060407) to implement memory > >reaping which has another advantage of operation on PIDFDs instead of > >PIDs which can be recycled. > >kill()/pidfd_send_signal()/sigqueue() are supposed to deliver the > >signal and return without blocking. Changing that behavior was > >considered unacceptable in these discussions. > > The way I understood the request is that a userspace program (or perhaps two, > if so desired) should issue _two_ calls, one to deliver the signal, > one to perform the reap portion: > > uinfo.si_code = SI_QUEUE; > sigqueue(pid, SIGKILL, &uinfo); > uinfo.si_code = SI_REAP; > sigqueue(pid, SIGKILL, &uinfo); This approach would still lead to the same discussion: by design, sigqueue/kill/pidfd_send_signal deliver the signal but do not wait for the signal to be processed by the recipient. Changing that would be a behavior change. Therefore we would have to follow this pattern and implement memory reaping in an asynchronous manner using a kthread/workqueue and it won't be done in the context of the calling process. This is undesirable because we lose the ability to control priority and cpu affinity for this operation and work won't be charged to the caller. That's why the proposed syscall performs memory reaping in the caller's context and blocks until the operation is done. In this proposal, your sequence looks like this: pidfd_send_signal(pidfd, SIGKILL, NULL, 0); process_reap(pidfd, 0); except we decided to rename process_reap() to process_mrelease() in the next revision.
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index 3000a2e8ee21..14b9e81d2fc4 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -486,3 +486,4 @@ 554 common landlock_create_ruleset sys_landlock_create_ruleset 555 common landlock_add_rule sys_landlock_add_rule 556 common landlock_restrict_self sys_landlock_restrict_self +557 common process_reap sys_process_reap diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index 28e03b5fec00..889b78d0f63f 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -460,3 +460,4 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447 common process_reap sys_process_reap diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index 727bfc3be99b..fb7a0be2f3d9 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -38,7 +38,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) -#define __NR_compat_syscalls 447 +#define __NR_compat_syscalls 448 #endif #define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index 5dab69d2c22b..80593454173e 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -900,6 +900,8 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset) __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule) #define __NR_landlock_restrict_self 446 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self) +#define __NR_process_reap 447 +__SYSCALL(__NR_process_reap, sys_process_reap) /* * Please add new compat syscalls above this comment and update diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index bb11fe4c875a..6c94feedf086 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -367,3 +367,4 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447 common process_reap sys_process_reap diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 79c2d24c89dd..e80a7fa55696 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -446,3 +446,4 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447 common process_reap sys_process_reap diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index b11395a20c20..511b2bd61fc1 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -452,3 +452,4 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447 common process_reap sys_process_reap diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 9220909526f9..1775704c6a24 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -385,3 +385,4 @@ 444 n32 landlock_create_ruleset sys_landlock_create_ruleset 445 n32 landlock_add_rule sys_landlock_add_rule 446 n32 landlock_restrict_self sys_landlock_restrict_self +447 n32 process_reap sys_process_reap diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index 9cd1c34f31b5..d769daca3f79 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -361,3 +361,4 @@ 444 n64 landlock_create_ruleset sys_landlock_create_ruleset 445 n64 landlock_add_rule sys_landlock_add_rule 446 n64 landlock_restrict_self sys_landlock_restrict_self +447 n64 process_reap sys_process_reap diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl index d560c467a8c6..1bd2fc056677 100644 --- a/arch/mips/kernel/syscalls/syscall_o32.tbl +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl @@ -434,3 +434,4 @@ 444 o32 landlock_create_ruleset sys_landlock_create_ruleset 445 o32 landlock_add_rule sys_landlock_add_rule 446 o32 landlock_restrict_self sys_landlock_restrict_self +447 o32 process_reap sys_process_reap diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index aabc37f8cae3..0012561ca557 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -444,3 +444,4 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447 common process_reap sys_process_reap diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 8f052ff4058c..89cbcc732b18 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -526,3 +526,4 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447 common process_reap sys_process_reap diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 0690263df1dd..7ebd4d809b5e 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -449,3 +449,4 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self sys_landlock_restrict_self +447 common process_reap sys_process_reap sys_process_reap diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index 0b91499ebdcf..178fd47b372e 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -449,3 +449,4 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447 common process_reap sys_process_reap diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index e34cc30ef22c..faee121b7ae2 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -492,3 +492,4 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447 common process_reap sys_process_reap diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl index 4bbc267fb36b..cbe070de9884 100644 --- a/arch/x86/entry/syscalls/syscall_32.tbl +++ b/arch/x86/entry/syscalls/syscall_32.tbl @@ -451,3 +451,4 @@ 444 i386 landlock_create_ruleset sys_landlock_create_ruleset 445 i386 landlock_add_rule sys_landlock_add_rule 446 i386 landlock_restrict_self sys_landlock_restrict_self +447 i386 process_reap sys_process_reap diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index ce18119ea0d0..e6765646731b 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -368,6 +368,7 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447 common process_reap sys_process_reap # # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index fd2f30227d96..f0e9dbee1a5b 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -417,3 +417,4 @@ 444 common landlock_create_ruleset sys_landlock_create_ruleset 445 common landlock_add_rule sys_landlock_add_rule 446 common landlock_restrict_self sys_landlock_restrict_self +447 common process_reap sys_process_reap diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 050511e8f1f8..b6659e09bf0d 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -915,6 +915,7 @@ asmlinkage long sys_mincore(unsigned long start, size_t len, asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); asmlinkage long sys_process_madvise(int pidfd, const struct iovec __user *vec, size_t vlen, int behavior, unsigned int flags); +asmlinkage long sys_process_reap(int pidfd, unsigned int flags); asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, unsigned long prot, unsigned long pgoff, unsigned long flags); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index d2a942086fcb..b3bf57b928af 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -871,9 +871,11 @@ __SYSCALL(__NR_landlock_create_ruleset, sys_landlock_create_ruleset) __SYSCALL(__NR_landlock_add_rule, sys_landlock_add_rule) #define __NR_landlock_restrict_self 446 __SYSCALL(__NR_landlock_restrict_self, sys_landlock_restrict_self) +#define __NR_process_reap 447 +__SYSCALL(__NR_process_reap, sys_process_reap) #undef __NR_syscalls -#define __NR_syscalls 447 +#define __NR_syscalls 448 /* * 32 bit systems traditionally used different diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 0ea8128468c3..56eb7c9f8356 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -289,6 +289,7 @@ COND_SYSCALL(munlockall); COND_SYSCALL(mincore); COND_SYSCALL(madvise); COND_SYSCALL(process_madvise); +COND_SYSCALL(process_reap); COND_SYSCALL(remap_file_pages); COND_SYSCALL(mbind); COND_SYSCALL_COMPAT(mbind); diff --git a/mm/oom_kill.c b/mm/oom_kill.c index eefd3f5fde46..0f85a0442fa5 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -28,6 +28,7 @@ #include <linux/sched/task.h> #include <linux/sched/debug.h> #include <linux/swap.h> +#include <linux/syscalls.h> #include <linux/timex.h> #include <linux/jiffies.h> #include <linux/cpuset.h> @@ -1141,3 +1142,52 @@ void pagefault_out_of_memory(void) out_of_memory(&oc); mutex_unlock(&oom_lock); } + +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags) +{ + struct pid *pid; + struct task_struct *task; + struct mm_struct *mm = NULL; + unsigned int f_flags; + long ret = 0; + + if (flags != 0) + return -EINVAL; + + pid = pidfd_get_pid(pidfd, &f_flags); + if (IS_ERR(pid)) + return PTR_ERR(pid); + + task = get_pid_task(pid, PIDTYPE_PID); + if (!task) { + ret = -ESRCH; + goto put_pid; + } + + /* + * If the task is dying and in the process of releasing its memory + * then get its mm. + */ + task_lock(task); + if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) { + mm = task->mm; + mmget(mm); + } + task_unlock(task); + if (!mm) { + ret = -EINVAL; + goto put_task; + } + + mmap_read_lock(mm); + if (!__oom_reap_task_mm(mm)) + ret = -EAGAIN; + mmap_read_unlock(mm); + + mmput(mm); +put_task: + put_task_struct(task); +put_pid: + put_pid(pid); + return ret; +}
In modern systems it's not unusual to have a system component monitoring memory conditions of the system and tasked with keeping system memory pressure under control. One way to accomplish that is to kill non-essential processes to free up memory for more important ones. Examples of this are Facebook's OOM killer daemon called oomd and Android's low memory killer daemon called lmkd. For such system component it's important to be able to free memory quickly and efficiently. Unfortunately the time process takes to free up its memory after receiving a SIGKILL might vary based on the state of the process (uninterruptible sleep), size and OPP level of the core the process is running. A mechanism to free resources of the target process in a more predictable way would improve system's ability to control its memory pressure. Introduce process_reap system call that reclaims memory of a dying process from the context of the caller. This way the memory in freed in a more controllable way with CPU affinity and priority of the caller. The workload of freeing the memory will also be charged to the caller. The operation is allowed only on a dying process. Previously I proposed a number of alternatives to accomplish this: - https://lore.kernel.org/patchwork/patch/1060407 extending pidfd_send_signal to allow memory reaping using oom_reaper thread; - https://lore.kernel.org/patchwork/patch/1338196 extending pidfd_send_signal to reap memory of the target process synchronously from the context of the caller; - https://lore.kernel.org/patchwork/patch/1344419/ to add MADV_DONTNEED support for process_madvise implementing synchronous memory reaping. The end of the last discussion culminated with suggestion to introduce a dedicated system call (https://lore.kernel.org/patchwork/patch/1344418/#1553875) The reasoning was that the new variant of process_madvise a) does not work on an address range b) is destructive c) doesn't share much code at all with the rest of process_madvise From the userspace point of view it was awkward and inconvenient to provide memory range for this operation that operates on the entire address space. Using special flags or address values to specify the entire address space was too hacky. The API is as follows, int process_reap(int pidfd, unsigned int flags); DESCRIPTION The process_reap() system call is used to free the memory of a dying process. The pidfd selects the process referred to by the PID file descriptor. (See pidofd_open(2) for further information) The flags argument is reserved for future use; currently, this argument must be specified as 0. RETURN VALUE On success, process_reap() returns 0. On error, -1 is returned and errno is set to indicate the error. Signed-off-by: Suren Baghdasaryan <surenb@google.com> --- arch/alpha/kernel/syscalls/syscall.tbl | 1 + arch/arm/tools/syscall.tbl | 1 + arch/arm64/include/asm/unistd.h | 2 +- arch/arm64/include/asm/unistd32.h | 2 + arch/ia64/kernel/syscalls/syscall.tbl | 1 + arch/m68k/kernel/syscalls/syscall.tbl | 1 + arch/microblaze/kernel/syscalls/syscall.tbl | 1 + arch/mips/kernel/syscalls/syscall_n32.tbl | 1 + arch/mips/kernel/syscalls/syscall_n64.tbl | 1 + arch/mips/kernel/syscalls/syscall_o32.tbl | 1 + arch/parisc/kernel/syscalls/syscall.tbl | 1 + arch/powerpc/kernel/syscalls/syscall.tbl | 1 + arch/s390/kernel/syscalls/syscall.tbl | 1 + arch/sh/kernel/syscalls/syscall.tbl | 1 + arch/sparc/kernel/syscalls/syscall.tbl | 1 + arch/x86/entry/syscalls/syscall_32.tbl | 1 + arch/x86/entry/syscalls/syscall_64.tbl | 1 + arch/xtensa/kernel/syscalls/syscall.tbl | 1 + include/linux/syscalls.h | 1 + include/uapi/asm-generic/unistd.h | 4 +- kernel/sys_ni.c | 1 + mm/oom_kill.c | 50 +++++++++++++++++++++ 22 files changed, 74 insertions(+), 2 deletions(-)