diff mbox series

[RFC,v1,2/4] kernel/fork.c: implement new process_mmput_async syscall

Message ID 20211111095008.264412-4-imbrenda@linux.ibm.com (mailing list archive)
State New
Headers show
Series Two alternatives for mm async teardown | expand

Commit Message

Claudio Imbrenda Nov. 11, 2021, 9:50 a.m. UTC
The goal of this new syscall is to be able to asynchronously free the
mm of a dying process. This is especially useful for processes that use
huge amounts of memory (e.g. databases or KVM guests). The process is
allowed to terminate immediately, while its mm is cleaned/reclaimed
asynchronously.

A separate process needs use the process_mmput_async syscall to attach
itself to the mm of a running target process. The process will then
sleep until the last user of the target mm has gone.

When the last user of the mm has gone, instead of synchronously free
the mm, the attached process is awoken. The syscall will then continue
and clean up the target mm.

This solution has the advantage that the cleanup of the target mm can
happen both be asynchronous and properly accounted for (e.g. cgroups).

Tested on s390x.

A separate patch will actually wire up the syscall.

Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
---
 include/linux/mm_types.h |   1 +
 kernel/fork.c            | 103 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 102 insertions(+), 2 deletions(-)

Comments

Eric W. Biederman Nov. 11, 2021, 7:20 p.m. UTC | #1
Claudio Imbrenda <imbrenda@linux.ibm.com> writes:

> The goal of this new syscall is to be able to asynchronously free the
> mm of a dying process. This is especially useful for processes that use
> huge amounts of memory (e.g. databases or KVM guests). The process is
> allowed to terminate immediately, while its mm is cleaned/reclaimed
> asynchronously.
>
> A separate process needs use the process_mmput_async syscall to attach
> itself to the mm of a running target process. The process will then
> sleep until the last user of the target mm has gone.
>
> When the last user of the mm has gone, instead of synchronously free
> the mm, the attached process is awoken. The syscall will then continue
> and clean up the target mm.
>
> This solution has the advantage that the cleanup of the target mm can
> happen both be asynchronous and properly accounted for (e.g. cgroups).
>
> Tested on s390x.
>
> A separate patch will actually wire up the syscall.

I am a bit confused.

You want the process report that it has finished immediately,
and you want the cleanup work to continue on in the background.

Why do you need a separate process?

Why not just modify the process cleanup code to keep the task_struct
running while allowing waitpid to reap the process (aka allowing
release_task to run)?  All tasks can be already be reaped after
exit_notify in do_exit.

I can see some reasons for wanting an opt-in.  It is nice to know all of
a processes resources have been freed when waitpid succeeds.

Still I don't see why this whole thing isn't exit_mm returning
the mm_sturct when a flag is set, and then having an exit_mm_late
being called and passed the returned mm after exit_notify.

Or maybe something with schedule_work or task_work, instead of an
exit_mm_late.  I don't see any practical difference.

I really don't see why this needs a whole other process to connect to
the process you care about asynchronously.

This whole thing seems an exercise in spending lots of resources to free
resources much later.

Eric
Claudio Imbrenda Nov. 12, 2021, 9:27 a.m. UTC | #2
On Thu, 11 Nov 2021 13:20:11 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Claudio Imbrenda <imbrenda@linux.ibm.com> writes:
> 
> > The goal of this new syscall is to be able to asynchronously free the
> > mm of a dying process. This is especially useful for processes that use
> > huge amounts of memory (e.g. databases or KVM guests). The process is
> > allowed to terminate immediately, while its mm is cleaned/reclaimed
> > asynchronously.
> >
> > A separate process needs use the process_mmput_async syscall to attach
> > itself to the mm of a running target process. The process will then
> > sleep until the last user of the target mm has gone.
> >
> > When the last user of the mm has gone, instead of synchronously free
> > the mm, the attached process is awoken. The syscall will then continue
> > and clean up the target mm.
> >
> > This solution has the advantage that the cleanup of the target mm can
> > happen both be asynchronous and properly accounted for (e.g. cgroups).
> >
> > Tested on s390x.
> >
> > A separate patch will actually wire up the syscall.  
> 
> I am a bit confused.
> 
> You want the process report that it has finished immediately,
> and you want the cleanup work to continue on in the background.

yes

> Why do you need a separate process?
> 
> Why not just modify the process cleanup code to keep the task_struct
> running while allowing waitpid to reap the process (aka allowing
> release_task to run)?  All tasks can be already be reaped after
> exit_notify in do_exit.
> 
> I can see some reasons for wanting an opt-in.  It is nice to know all of
> a processes resources have been freed when waitpid succeeds.
> 
> Still I don't see why this whole thing isn't exit_mm returning
> the mm_sturct when a flag is set, and then having an exit_mm_late
> being called and passed the returned mm after exit_notify.

so if I understand correctly you are saying exit_mm would skip the
mmput, set a flag, then I should introduce a new function
"exit_mm_late" after exit_notify, to check the flag and do the mmput if
needed

and that would mean that the cleanup would still be done in the context
of the exiting process, but without holding back anyone waiting for the
process to terminate (so the process appears to exit immediately)

sounds clean, I will do it

> Or maybe something with schedule_work or task_work, instead of an
> exit_mm_late.  I don't see any practical difference.
> 
> I really don't see why this needs a whole other process to connect to
> the process you care about asynchronously.

accounting. workqueues or kernel threads are not properly accounted to
the right cgroups; by using a userspace process, things get accounted
properly.

this was a major point that was made last month when a similar
discussion came up

> This whole thing seems an exercise in spending lots of resources to free
> resources much later.

there are some usecases for this (huge processes like databases, or huge
secure VMs where the teardown is significantly slower than normal
processes)

> 
> Eric
Claudio Imbrenda Nov. 12, 2021, 9:34 a.m. UTC | #3
On Thu, 11 Nov 2021 13:20:11 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Claudio Imbrenda <imbrenda@linux.ibm.com> writes:
> 
> > The goal of this new syscall is to be able to asynchronously free the
> > mm of a dying process. This is especially useful for processes that use
> > huge amounts of memory (e.g. databases or KVM guests). The process is
> > allowed to terminate immediately, while its mm is cleaned/reclaimed
> > asynchronously.
> >
> > A separate process needs use the process_mmput_async syscall to attach
> > itself to the mm of a running target process. The process will then
> > sleep until the last user of the target mm has gone.
> >
> > When the last user of the mm has gone, instead of synchronously free
> > the mm, the attached process is awoken. The syscall will then continue
> > and clean up the target mm.
> >
> > This solution has the advantage that the cleanup of the target mm can
> > happen both be asynchronous and properly accounted for (e.g. cgroups).
> >
> > Tested on s390x.
> >
> > A separate patch will actually wire up the syscall.  
> 
> I am a bit confused.
> 
> You want the process report that it has finished immediately,
> and you want the cleanup work to continue on in the background.
> 
> Why do you need a separate process?
> 
> Why not just modify the process cleanup code to keep the task_struct
> running while allowing waitpid to reap the process (aka allowing
> release_task to run)?  All tasks can be already be reaped after
> exit_notify in do_exit.
> 
> I can see some reasons for wanting an opt-in.  It is nice to know all of
> a processes resources have been freed when waitpid succeeds.
> 
> Still I don't see why this whole thing isn't exit_mm returning
> the mm_sturct when a flag is set, and then having an exit_mm_late
> being called and passed the returned mm after exit_notify.

nevermind, exit_notify is done after cgroup_exit, the teardown would
then not be accounted properly

> 
> Or maybe something with schedule_work or task_work, instead of an
> exit_mm_late.  I don't see any practical difference.
> 
> I really don't see why this needs a whole other process to connect to
> the process you care about asynchronously.
> 
> This whole thing seems an exercise in spending lots of resources to free
> resources much later.
> 
> Eric
Eric W. Biederman Nov. 12, 2021, 2:57 p.m. UTC | #4
Claudio Imbrenda <imbrenda@linux.ibm.com> writes:

> On Thu, 11 Nov 2021 13:20:11 -0600
> ebiederm@xmission.com (Eric W. Biederman) wrote:
>
>> Claudio Imbrenda <imbrenda@linux.ibm.com> writes:
>> 
>> > The goal of this new syscall is to be able to asynchronously free the
>> > mm of a dying process. This is especially useful for processes that use
>> > huge amounts of memory (e.g. databases or KVM guests). The process is
>> > allowed to terminate immediately, while its mm is cleaned/reclaimed
>> > asynchronously.
>> >
>> > A separate process needs use the process_mmput_async syscall to attach
>> > itself to the mm of a running target process. The process will then
>> > sleep until the last user of the target mm has gone.
>> >
>> > When the last user of the mm has gone, instead of synchronously free
>> > the mm, the attached process is awoken. The syscall will then continue
>> > and clean up the target mm.
>> >
>> > This solution has the advantage that the cleanup of the target mm can
>> > happen both be asynchronous and properly accounted for (e.g. cgroups).
>> >
>> > Tested on s390x.
>> >
>> > A separate patch will actually wire up the syscall.  
>> 
>> I am a bit confused.
>> 
>> You want the process report that it has finished immediately,
>> and you want the cleanup work to continue on in the background.
>> 
>> Why do you need a separate process?
>> 
>> Why not just modify the process cleanup code to keep the task_struct
>> running while allowing waitpid to reap the process (aka allowing
>> release_task to run)?  All tasks can be already be reaped after
>> exit_notify in do_exit.
>> 
>> I can see some reasons for wanting an opt-in.  It is nice to know all of
>> a processes resources have been freed when waitpid succeeds.
>> 
>> Still I don't see why this whole thing isn't exit_mm returning
>> the mm_sturct when a flag is set, and then having an exit_mm_late
>> being called and passed the returned mm after exit_notify.
>
> nevermind, exit_notify is done after cgroup_exit, the teardown would
> then not be accounted properly

So you want this new mechanism so you can separate the cleanup from
the exit notification, and so that things are accounted properly.

It would have helped if you had included a link to the previous
conversation.

I think Michal Hoko has a point.  This looks like a job for
"clone(CLONE_VM)" and "prctl(PR_SET_PDEATH_SIG)".  Maybe using a pidfd
instead of the prctl.

AKA just create a child that shares the parents memory, and waits for
the parent to exit and then cleans things up.

That should not need any new kernel mechanisms.



There is the other question: why this is disastrously slow on s390?
Is this a s390 specific issue?  Can the issue be fixed by optimizing
what is happening on s390?

Eric
Claudio Imbrenda Nov. 12, 2021, 4:53 p.m. UTC | #5
On Fri, 12 Nov 2021 08:57:13 -0600
ebiederm@xmission.com (Eric W. Biederman) wrote:

> Claudio Imbrenda <imbrenda@linux.ibm.com> writes:
> 
> > On Thu, 11 Nov 2021 13:20:11 -0600
> > ebiederm@xmission.com (Eric W. Biederman) wrote:
> >  
> >> Claudio Imbrenda <imbrenda@linux.ibm.com> writes:
> >>   
> >> > The goal of this new syscall is to be able to asynchronously free the
> >> > mm of a dying process. This is especially useful for processes that use
> >> > huge amounts of memory (e.g. databases or KVM guests). The process is
> >> > allowed to terminate immediately, while its mm is cleaned/reclaimed
> >> > asynchronously.
> >> >
> >> > A separate process needs use the process_mmput_async syscall to attach
> >> > itself to the mm of a running target process. The process will then
> >> > sleep until the last user of the target mm has gone.
> >> >
> >> > When the last user of the mm has gone, instead of synchronously free
> >> > the mm, the attached process is awoken. The syscall will then continue
> >> > and clean up the target mm.
> >> >
> >> > This solution has the advantage that the cleanup of the target mm can
> >> > happen both be asynchronous and properly accounted for (e.g. cgroups).
> >> >
> >> > Tested on s390x.
> >> >
> >> > A separate patch will actually wire up the syscall.    
> >> 
> >> I am a bit confused.
> >> 
> >> You want the process report that it has finished immediately,
> >> and you want the cleanup work to continue on in the background.
> >> 
> >> Why do you need a separate process?
> >> 
> >> Why not just modify the process cleanup code to keep the task_struct
> >> running while allowing waitpid to reap the process (aka allowing
> >> release_task to run)?  All tasks can be already be reaped after
> >> exit_notify in do_exit.
> >> 
> >> I can see some reasons for wanting an opt-in.  It is nice to know all of
> >> a processes resources have been freed when waitpid succeeds.
> >> 
> >> Still I don't see why this whole thing isn't exit_mm returning
> >> the mm_sturct when a flag is set, and then having an exit_mm_late
> >> being called and passed the returned mm after exit_notify.  
> >
> > nevermind, exit_notify is done after cgroup_exit, the teardown would
> > then not be accounted properly  
> 
> So you want this new mechanism so you can separate the cleanup from
> the exit notification, and so that things are accounted properly.
> 
> It would have helped if you had included a link to the previous
> conversation.
> 
> I think Michal Hoko has a point.  This looks like a job for
> "clone(CLONE_VM)" and "prctl(PR_SET_PDEATH_SIG)".  Maybe using a pidfd
> instead of the prctl.
> 
> AKA just create a child that shares the parents memory, and waits for
> the parent to exit and then cleans things up.
> 
> That should not need any new kernel mechanisms.

Of course, but this also means that it's not possible to stop the OOM
killer while the teardown is in progress, and will require userspace
changes on a case-by-case basis.

Anyway, I will try to kludge something together with clone

> 
> 
> 
> There is the other question: why this is disastrously slow on s390?
> Is this a s390 specific issue?  Can the issue be fixed by optimizing

It's a hardware issue with protected VMs, which are achieved on s390x
without memory encryption. When a protected VM terminates, the
secure/trusted firmware needs to clear all protected memory, and change
the security properties to make it accessible. This last step in
particular takes more time than just clearing memory.

> what is happening on s390?
> 
> Eric
Michal Hocko Nov. 15, 2021, 10:43 a.m. UTC | #6
On Fri 12-11-21 17:53:09, Claudio Imbrenda wrote:
[...]
> Of course, but this also means that it's not possible to stop the OOM
> killer while the teardown is in progress,

Blocking the OOM killer and depend on the userspace to make a forward
progress is not acceptable solution.
diff mbox series

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index bb8c6f5f19bc..adc62cba3e91 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -631,6 +631,7 @@  struct mm_struct {
 		atomic_long_t hugetlb_usage;
 #endif
 		struct work_struct async_put_work;
+		struct task_struct *mmput_async_task;
 
 #ifdef CONFIG_IOMMU_SUPPORT
 		u32 pasid;
diff --git a/kernel/fork.c b/kernel/fork.c
index 5de23f3e08bf..0da39b76005c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1062,6 +1062,7 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 #endif
 	mm_init_uprobes_state(mm);
 	hugetlb_count_init(mm);
+	mm->mmput_async_task = NULL;
 
 	if (current->mm) {
 		mm->flags = current->mm->flags & MMF_INIT_MASK;
@@ -1130,7 +1131,12 @@  void mmput(struct mm_struct *mm)
 {
 	might_sleep();
 
-	if (atomic_dec_and_test(&mm->mm_users))
+	if (!atomic_dec_and_test(&mm->mm_users))
+		return;
+
+	if (READ_ONCE(mm->mmput_async_task))
+		wake_up_process(mm->mmput_async_task);
+	else
 		__mmput(mm);
 }
 EXPORT_SYMBOL_GPL(mmput);
@@ -1146,7 +1152,12 @@  static void mmput_async_fn(struct work_struct *work)
 
 void mmput_async(struct mm_struct *mm)
 {
-	if (atomic_dec_and_test(&mm->mm_users)) {
+	if (!atomic_dec_and_test(&mm->mm_users))
+		return;
+
+	if (READ_ONCE(mm->mmput_async_task)) {
+		wake_up_process(mm->mmput_async_task);
+	} else {
 		INIT_WORK(&mm->async_put_work, mmput_async_fn);
 		schedule_work(&mm->async_put_work);
 	}
@@ -3191,3 +3202,91 @@  int sysctl_max_threads(struct ctl_table *table, int write,
 
 	return 0;
 }
+
+SYSCALL_DEFINE2(process_mmput_async, int, pidfd, unsigned int, flags)
+{
+#ifdef CONFIG_MMU
+	struct mm_struct *mm = NULL;
+	struct task_struct *task;
+	unsigned int tmp;
+	struct pid *pid;
+
+	if (flags)
+		return -EINVAL;
+
+	pid = pidfd_get_pid(pidfd, &tmp);
+	if (IS_ERR(pid))
+		return PTR_ERR(pid);
+
+	task = get_pid_task(pid, PIDTYPE_TGID);
+	/* The PID is not needed once we have the tast_struct */
+	put_pid(pid);
+	if (!task)
+		return -ESRCH;
+
+	/*
+	 * The struct_mm is guaranteed to be there as long as we are holding
+	 * a reference to the task_struct. This also guarantees that we
+	 * will not race with mmput, since we are now holding one additional
+	 * reference.
+	 */
+	if (mmget_not_zero(task->mm))
+		mm = task->mm;
+	/* The task_struct is not needed once we have the mm_struct */
+	put_task_struct(task);
+	/* If the target process no longer has an mm, there is nothing we can do. */
+	if (!mm)
+		return -ENXIO;
+
+	/* Use TASK_IDLE instead of TASK_UNINTERRUPTIBLE to avoid stall notifications. */
+	set_current_state(TASK_IDLE);
+	/*
+	 * Return an error if another task had already set itself as async
+	 * cleanup for the target mm.
+	 */
+	if (cmpxchg(&mm->mmput_async_task, NULL, current) != NULL) {
+		set_current_state(TASK_RUNNING);
+		return -EEXIST;
+	}
+
+	/*
+	 * The target mm now has an extra reference to current.
+	 * Is this useless?
+	 */
+	get_task_struct(current);
+	/*
+	 * We will now do mmput, and we no longer have a reference to the
+	 * task; we need mmgrab to be able to reference the mm_struct even
+	 * after its last user is gone.
+	 */
+	mmgrab(mm);
+	/*
+	 * If the target mm has been discarded after we got its reference,
+	 * then we are holding the last reference to it, and doing mmput
+	 * here will cause us to be schedulable again.
+	 */
+	mmput(mm);
+
+	/*
+	 * Go to sleep; we are set to TASK_IDLE, so nothing will wake us up
+	 * except an explicit wake_up_process from mmput.
+	 */
+	schedule();
+
+	/* Put the extra reference taken above */
+	put_task_struct(current);
+
+	/* Should this be a warning instead? */
+	if (atomic_read(&mm->mm_users))
+		panic("mm_users not 0 but trying to __mmput anyway!");
+
+	/* Do the actual work */
+	__mmput(mm);
+	/* And put the extra reference taken above */
+	mmdrop(mm);
+
+	return 0;
+#else
+	return -ENOSYS;
+#endif /* CONFIG_MMU */
+}