[RFC,3/3] mm, oom: hand over MMF_OOM_SKIP to exit path if it is guaranteed to finish

Message ID 20180910125513.311-4-mhocko@kernel.org (mailing list archive)
State New, archived
Series rework mmap-exit vs. oom_reaper handover

Commit Message

Michal Hocko Sept. 10, 2018, 12:55 p.m. UTC
From: Michal Hocko <mhocko@suse.com>

David Rientjes has noted that certain user space memory allocators leave
a lot of page tables behind and the current implementation of oom_reaper
doesn't deal with those workloads very well. In order to improve handling
of these workloads, define a point at which exit_mmap is guaranteed to
finish the tear down without any further blocking. This is right after we
unlink the vmas (those still depend on locks which are held while
performing memory allocations from other contexts) and before we start
releasing page tables.

Open code free_pgtables and explicitly unlink all vmas first. Then set
mm->mmap to NULL (there shouldn't be anybody looking at it at this
stage) and check for a NULL mm->mmap in the oom_reaper path. If mm->mmap
is NULL we rely on the exit path and do not set MMF_OOM_SKIP from the
reaper.
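
For illustration, the following is a minimal userspace sketch of the handover
protocol this establishes between exit_mmap and the oom_reaper. The struct
layout, the pthread rwlock standing in for mmap_sem, and helpers such as
unlink_vma() are made-up stand-ins for the purpose of the sketch, not kernel
APIs:

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct vma { struct vma *vm_next; };

struct mm {
	pthread_rwlock_t mmap_sem;	/* stands in for mm->mmap_sem */
	struct vma *mmap;		/* NULL means the exit path owns the teardown */
	bool oom_skip;			/* stands in for MMF_OOM_SKIP */
};

/* illustrative stand-in for unlink_anon_vmas() + unlink_file_vma() */
static void unlink_vma(struct vma *vma) { (void)vma; }

/* tail of exit_mmap() for an oom victim */
static void exit_mmap_tail(struct mm *mm)
{
	struct vma *vma;

	pthread_rwlock_wrlock(&mm->mmap_sem);
	/* unlink all vmas; nothing can block the teardown past this point */
	for (vma = mm->mmap; vma; vma = vma->vm_next)
		unlink_vma(vma);
	mm->mmap = NULL;		/* tell the reaper the exit path will finish */
	pthread_rwlock_unlock(&mm->mmap_sem);

	/* page tables and vmas are freed here, outside of mmap_sem */
}

/* oom_reaper side */
static void oom_reap_mm(struct mm *mm)
{
	if (pthread_rwlock_tryrdlock(&mm->mmap_sem) != 0)
		return;			/* lock contended, retry later */

	if (mm->mmap) {
		/* ... unmap reapable ranges ... */
		mm->oom_skip = true;	/* hide the mm from further oom kills */
	}
	/* mm->mmap == NULL: leave MMF_OOM_SKIP to the exit path */
	pthread_rwlock_unlock(&mm->mmap_sem);
}

The point of the protocol is that once mm->mmap is NULL the reaper backs off
without touching the mm, while a reaper that got in before the handover still
runs under mmap_sem and therefore cannot overlap the page table teardown.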

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mmap.c     | 24 ++++++++++++++++++++----
 mm/oom_kill.c | 13 +++++++------
 2 files changed, 27 insertions(+), 10 deletions(-)

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
index 3481424717ac..99bb9ce29bc5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3085,8 +3085,27 @@  void exit_mmap(struct mm_struct *mm)
 	/* oom_reaper cannot race with the page tables teardown */
 	if (oom)
 		down_write(&mm->mmap_sem);
+	/*
+	 * Hide vma from rmap and truncate_pagecache before freeing
+	 * pgtables
+	 */
+	while (vma) {
+		unlink_anon_vmas(vma);
+		unlink_file_vma(vma);
+		vma = vma->vm_next;
+	}
+	vma = mm->mmap;
+	if (oom) {
+		/*
+	 * The exit path is guaranteed to finish without any unbound
+	 * blocking at this stage, so make it clear to the caller.
+		 */
+		mm->mmap = NULL;
+		up_write(&mm->mmap_sem);
+	}
 
-	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+	free_pgd_range(&tlb, vma->vm_start, vma->vm_prev->vm_end,
+			FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
 	/*
@@ -3099,9 +3118,6 @@  void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 	}
 	vm_unacct_memory(nr_accounted);
-
-	if (oom)
-		up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 049e67dc039b..0ebf93c76c81 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -570,12 +570,10 @@  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
-	 * under mmap_sem for reading because it serializes against the
-	 * down_write();up_write() cycle in exit_mmap().
+	 * If the exit path cleared mm->mmap then we know it will finish the
+	 * tear down and we can bail out here.
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+	if (!mm->mmap) {
 		trace_skip_task_reaping(tsk->pid);
 		goto out_unlock;
 	}
@@ -624,8 +622,11 @@  static void oom_reap_task(struct task_struct *tsk)
 	/*
 	 * Hide this mm from OOM killer because it has been either reaped or
 	 * somebody can't call up_write(mmap_sem).
+	 * Leave MMF_OOM_SKIP to the exit path if it managed to reach the point
+	 * where it is guaranteed to finish without any blocking.
 	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (mm->mmap)
+		set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);