
[v5,3/3] mm/oom_kill: allow process_mrelease to run under mmap_lock protection

Message ID 20211209191325.3069345-3-surenb@google.com (mailing list archive)
State New
Series [v5,1/3] mm: protect free_pgtables with mmap_lock write lock in exit_mmap

Commit Message

Suren Baghdasaryan Dec. 9, 2021, 7:13 p.m. UTC
With exit_mmap holding mmap_write_lock during the free_pgtables call,
process_mrelease does not need to elevate mm->mm_users in order to
prevent exit_mmap from destroying pagetables while __oom_reap_task_mm
is walking the VMA tree. The change prevents process_mrelease from
calling the last mmput, which could otherwise lead to waiting for IO
completion in exit_aio.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
changes in v5
- Removed Fixes: tag, per Michal Hocko
- Added Acked-by's

 mm/oom_kill.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)
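
Background for the refcounting change (not part of the patch): mmget_not_zero()/mmput() pin mm->mm_users, so if process_mrelease ends up doing the last mmput() it runs the full address-space teardown, including exit_aio(), which may sleep waiting for IO. mmgrab()/mmdrop() pin only mm->mm_count, keeping the mm_struct allocated without taking over that teardown. A hedged sketch of the difference (the helper name is made up; the real change is in the diff below):

/* Illustrative only; not part of the patch. */
#include <linux/sched/mm.h>

static void pin_mm_sketch(struct mm_struct *mm)
{
        /*
         * Old scheme: pin mm->mm_users.  If the owning task exits
         * meanwhile, the mmput() here can be the last one and will run
         * exit_mmap()/exit_aio() in this context, possibly sleeping on
         * IO completion.
         */
        if (mmget_not_zero(mm)) {
                /* ... work on the address space ... */
                mmput(mm);
        }

        /*
         * New scheme: pin only mm->mm_count.  The owner's exit_mmap()
         * proceeds normally, and mmdrop() merely frees the mm_struct
         * once the last reference is gone; it never performs the
         * address-space teardown itself.
         */
        mmgrab(mm);
        /* ... work that only needs the mm_struct to stay allocated ... */
        mmdrop(mm);
}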

Comments

Jason Gunthorpe Dec. 9, 2021, 8:48 p.m. UTC | #1
On Thu, Dec 09, 2021 at 11:13:25AM -0800, Suren Baghdasaryan wrote:
> With exit_mmap holding mmap_write_lock during the free_pgtables call,
> process_mrelease does not need to elevate mm->mm_users in order to
> prevent exit_mmap from destroying pagetables while __oom_reap_task_mm
> is walking the VMA tree. The change prevents process_mrelease from
> calling the last mmput, which could otherwise lead to waiting for IO
> completion in exit_aio.
> 
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> changes in v5
> - Removed Fixes: tag, per Michal Hocko
> - Added Acked-by's
> 
>  mm/oom_kill.c | 27 +++++++++++++++------------
>  1 file changed, 15 insertions(+), 12 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

There are mmget_not_zero()s all over the place; can others be cleaned up
after this series goes ahead too?

It seems like anything doing the mmget just to look at the vma list
under the mmap lock is now fine with only a mmgrab?

A few I know about:

drivers/infiniband/core/umem_odp.c:     if (!mmget_not_zero(umem->owning_mm)) {

This is because mmu_interval_notifier_insert() might call
mm_take_all_locks() which was unsafe with concurrent exit_mmap

drivers/infiniband/core/umem_odp.c:     if (!owning_process || !mmget_not_zero(owning_mm)) {

This is because it calls hmm_range_fault(), which iterates over the vma
list; that is safe now.

drivers/iommu/iommu-sva-lib.c:  return mmget_not_zero(mm);
drivers/iommu/iommu-sva-lib.c:  return ioasid_find(&iommu_sva_pasid, pasid, __mmget_not_zero);

It calls find_extend_vma() - but it also doesn't seem to hold an mmgrab when it
does that mmget. The RCU usage is messed up here too, so humm.

Jason
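
To make the conversion Jason is suggesting concrete, here is a hedged sketch for a hypothetical caller that only needs to walk the VMA list under mmap_read_lock() (none of the call sites above are actually converted here; the function and its body are made up for illustration):

#include <linux/mm.h>
#include <linux/sched/mm.h>
#include <linux/mmap_lock.h>

static int count_owner_vmas_sketch(struct mm_struct *mm)
{
        struct vm_area_struct *vma;
        int nr = 0;

        /* Before: if (!mmget_not_zero(mm)) return -ESRCH; ... mmput(mm); */
        mmgrab(mm);             /* only keep the mm_struct allocated */

        /*
         * exit_mmap() now unlinks VMAs and frees page tables under
         * mmap_write_lock(), so the read lock is enough to walk the
         * list safely.  If the owner has already exited, the list is
         * simply empty.
         */
        mmap_read_lock(mm);
        for (vma = mm->mmap; vma; vma = vma->vm_next)
                nr++;           /* stand-in for the real per-VMA work */
        mmap_read_unlock(mm);

        mmdrop(mm);
        return nr;
}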
Suren Baghdasaryan Dec. 10, 2021, 3:54 p.m. UTC | #2
On Thu, Dec 9, 2021 at 12:48 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Dec 09, 2021 at 11:13:25AM -0800, Suren Baghdasaryan wrote:
> > With exit_mmap holding mmap_write_lock during the free_pgtables call,
> > process_mrelease does not need to elevate mm->mm_users in order to
> > prevent exit_mmap from destroying pagetables while __oom_reap_task_mm
> > is walking the VMA tree. The change prevents process_mrelease from
> > calling the last mmput, which could otherwise lead to waiting for IO
> > completion in exit_aio.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> > ---
> > changes in v5
> > - Removed Fixes: tag, per Michal Hocko
> > - Added Acked-by's
> >
> >  mm/oom_kill.c | 27 +++++++++++++++------------
> >  1 file changed, 15 insertions(+), 12 deletions(-)
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks!

>
> There are mmget_not_zero()s all over the place; can others be cleaned up
> after this series goes ahead too?
>
> It seems like anything doing the mmget just to look at the vma list
> under the mmap lock is now fine with only a mmgrab?

Sounds reasonable to me. I'll try to carve out some time to look into
it, but no promises. Lots of loose ends to tie up at the end of the
year :(

>
> A few I know about:
>
> drivers/infiniband/core/umem_odp.c:     if (!mmget_not_zero(umem->owning_mm)) {
>
> This is because mmu_interval_notifier_insert() might call
> mm_take_all_locks() which was unsafe with concurrent exit_mmap
>
> drivers/infiniband/core/umem_odp.c:     if (!owning_process || !mmget_not_zero(owning_mm)) {
>
> This is because it calls hmm_range_fault(), which iterates over the vma
> list; that is safe now.
>
> drivers/iommu/iommu-sva-lib.c:  return mmget_not_zero(mm);
> drivers/iommu/iommu-sva-lib.c:  return ioasid_find(&iommu_sva_pasid, pasid, __mmget_not_zero);
>
> It calls find_extend_vma() - but it also doesn't seem to hold an mmgrab when it
> does that mmget. The RCU usage is messed up here too, so humm.
>
> Jason

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1ddabefcfb5a..67780386f478 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1169,15 +1169,15 @@  SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 		goto put_task;
 	}
 
-	if (mmget_not_zero(p->mm)) {
-		mm = p->mm;
-		if (task_will_free_mem(p))
-			reap = true;
-		else {
-			/* Error only if the work has not been done already */
-			if (!test_bit(MMF_OOM_SKIP, &mm->flags))
-				ret = -EINVAL;
-		}
+	mm = p->mm;
+	mmgrab(mm);
+
+	if (task_will_free_mem(p))
+		reap = true;
+	else {
+		/* Error only if the work has not been done already */
+		if (!test_bit(MMF_OOM_SKIP, &mm->flags))
+			ret = -EINVAL;
 	}
 	task_unlock(p);
 
@@ -1188,13 +1188,16 @@  SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 		ret = -EINTR;
 		goto drop_mm;
 	}
-	if (!__oom_reap_task_mm(mm))
+	/*
+	 * Check MMF_OOM_SKIP again under mmap_read_lock protection to ensure
+	 * possible change in exit_mmap is seen
+	 */
+	if (!test_bit(MMF_OOM_SKIP, &mm->flags) && !__oom_reap_task_mm(mm))
 		ret = -EAGAIN;
 	mmap_read_unlock(mm);
 
 drop_mm:
-	if (mm)
-		mmput(mm);
+	mmdrop(mm);
 put_task:
 	put_task_struct(task);
 	return ret;
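
For reference, a userspace usage sketch of the syscall this patch touches (illustrative only; the helper name is made up, __NR_pidfd_open/__NR_process_mrelease are assumed to come from recent kernel headers, and glibc provides no wrapper at the time of writing):

#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434             /* unified syscall table */
#endif
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448       /* unified syscall table */
#endif

/* Kill a process and reap its memory without waiting for it to exit. */
static int kill_and_release(pid_t pid)
{
        int pidfd = syscall(__NR_pidfd_open, pid, 0);

        if (pidfd < 0)
                return -1;

        if (kill(pid, SIGKILL)) {       /* the target must be dying */
                close(pidfd);
                return -1;
        }
        if (syscall(__NR_process_mrelease, pidfd, 0)) { /* flags must be 0 */
                perror("process_mrelease");
                close(pidfd);
                return -1;
        }
        close(pidfd);
        return 0;
}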