diff mbox

[v3] mm, oom: fix unnecessary killing of additional processes

Message ID alpine.DEB.2.21.1807200133310.119737@chino.kir.corp.google.com (mailing list archive)
State New, archived
Headers show

Commit Message

David Rientjes July 20, 2018, 8:41 a.m. UTC
On Thu, 19 Jul 2018, Tetsuo Handa wrote:

> Sigh...
> 
> Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> 
> because David is not aware what is wrong.
> 

Hmm, didn't you incorporate this exact patch into your own patch series 
that you proposed? :)

I'm coming to this stark realization that all of these theater is only for 
effect.  Perhaps other observers have come to that understanding earlier 
and I was late to the party.

You're nacking a patch because it does a double set_bit() and jiffies can 
wraparound and we can add a process to the oom reaper list twice if the 
oom happens at the exact same moment.  Ok.  These are extremely trivial to 
fix.

> Let's call "A" as a thread doing exit_mmap(), and "B" as the OOM reaper kernel thread.
> 
> (1) "A" finds that unlikely(mm_is_oom_victim(mm)) == true.
> (2) "B" finds that test_bit(MMF_OOM_SKIP, &mm->flags) in oom_reap_task() is false.
> (3) "B" finds that !test_bit(MMF_UNSTABLE, &mm->flags) in oom_reap_task() is true.
> (4) "B" enters into oom_reap_task_mm(tsk, mm).
> (5) "B" finds that !down_read_trylock(&mm->mmap_sem) is false.
> (6) "B" finds that mm_has_blockable_invalidate_notifiers(mm) is false.
> (7) "B" finds that test_bit(MMF_UNSTABLE, &mm->flags) is false.
> (8) "B" enters into __oom_reap_task_mm(mm).
> (9) "A" finds that test_and_set_bit(MMF_UNSTABLE, &mm->flags) is false.
> (10) "A" is preempted by somebody else.
> (11) "B" finds that test_and_set_bit(MMF_UNSTABLE, &mm->flags) is true.
> (12) "B" leaves __oom_reap_task_mm(mm).
> (13) "B" leaves oom_reap_task_mm().
> (14) "B" finds that time_after_eq(jiffies, mm->oom_free_expire) became true.
> (15) "B" finds that !test_bit(MMF_OOM_SKIP, &mm->flags) is true.
> (16) "B" calls set_bit(MMF_OOM_SKIP, &mm->flags).
> (17) "B" finds that test_bit(MMF_OOM_SKIP, &mm->flags) is true.
> (18) select_bad_process() finds that MMF_OOM_SKIP is already set.
> (19) out_of_memory() kills a new OOM victim.
> (20) "A" resumes execution and start reclaiming memory.
> 
> because oom_lock serialization was already removed.
> 

Absent oom_lock serialization, this is exactly working as intended.  You 
could argue that once the thread has reached exit_mmap() and begins oom 
reaping that it should be allowed to finish before the oom reaper declares 
MMF_OOM_SKIP.  That could certainly be helpful, I simply haven't 
encountered a usecase where it were needed.  Or, we could restart the oom 
expiration when MMF_UNSTABLE is set and deem that progress is being made 
so it give it some extra time.  In practice, again, we haven't seen this 
needed.  But either of those are very easy to add in as well.  Which would 
you prefer?


mm, oom: fix unnecessary killing of additional processes fix

Fix double set_bit() per Tetsuo.

Fix jiffies wraparound per Tetsuo.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/mmap.c     | 13 ++++++-------
 mm/oom_kill.c |  7 +++++--
 2 files changed, 11 insertions(+), 9 deletions(-)

Comments

Tetsuo Handa July 20, 2018, 9:57 a.m. UTC | #1
On 2018/07/20 17:41, David Rientjes wrote:
> Absent oom_lock serialization, this is exactly working as intended.  You 
> could argue that once the thread has reached exit_mmap() and begins oom 
> reaping that it should be allowed to finish before the oom reaper declares 
> MMF_OOM_SKIP.  That could certainly be helpful, I simply haven't 
> encountered a usecase where it were needed.  Or, we could restart the oom 
> expiration when MMF_UNSTABLE is set and deem that progress is being made 
> so it give it some extra time.  In practice, again, we haven't seen this 
> needed.  But either of those are very easy to add in as well.  Which would 
> you prefer?

I don't think we need to introduce user-visible knob interface (even if it is in
debugfs), for I think that my approach can solve your problem. Please try OOM lockup
(CVE-2016-10723) mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 )
and my cleanup patch ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 )
on top of linux.git . And please reply how was the result, for I'm currently asking
Roman whether we can apply these patches before applying the cgroup-aware OOM killer.
David Rientjes July 20, 2018, 8:19 p.m. UTC | #2
On Fri, 20 Jul 2018, Tetsuo Handa wrote:

> > Absent oom_lock serialization, this is exactly working as intended.  You 
> > could argue that once the thread has reached exit_mmap() and begins oom 
> > reaping that it should be allowed to finish before the oom reaper declares 
> > MMF_OOM_SKIP.  That could certainly be helpful, I simply haven't 
> > encountered a usecase where it were needed.  Or, we could restart the oom 
> > expiration when MMF_UNSTABLE is set and deem that progress is being made 
> > so it give it some extra time.  In practice, again, we haven't seen this 
> > needed.  But either of those are very easy to add in as well.  Which would 
> > you prefer?
> 
> I don't think we need to introduce user-visible knob interface (even if it is in
> debugfs), for I think that my approach can solve your problem. Please try OOM lockup
> (CVE-2016-10723) mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 )

The issue I am fixing has nothing to do with contention on oom_lock, it 
has to do with the inability of the oom reaper to free memory for one or 
more of several reasons: mlock, blockable mmus, ptes, mm->mmap_sem 
contention, and then the setting of MMF_OOM_SKIP to choose another victim 
before the original victim even reaches exit_mmap().  Thus, removing 
oom_lock from exit_mmap() will not fix this issue.

I agree that oom_lock can be removed from exit_mmap() and it would be 
helpful to do so, and may address a series of problems that we have yet to 
encounter, but this would not fix the almost immediate setting of 
MMF_OOM_SKIP that occurs with minimal memory freeing due to the oom 
reaper.
Tetsuo Handa July 20, 2018, 8:47 p.m. UTC | #3
On 2018/07/21 5:19, David Rientjes wrote:
> On Fri, 20 Jul 2018, Tetsuo Handa wrote:
> 
>>> Absent oom_lock serialization, this is exactly working as intended.  You 
>>> could argue that once the thread has reached exit_mmap() and begins oom 
>>> reaping that it should be allowed to finish before the oom reaper declares 
>>> MMF_OOM_SKIP.  That could certainly be helpful, I simply haven't 
>>> encountered a usecase where it were needed.  Or, we could restart the oom 
>>> expiration when MMF_UNSTABLE is set and deem that progress is being made 
>>> so it give it some extra time.  In practice, again, we haven't seen this 
>>> needed.  But either of those are very easy to add in as well.  Which would 
>>> you prefer?
>>
>> I don't think we need to introduce user-visible knob interface (even if it is in
>> debugfs), for I think that my approach can solve your problem. Please try OOM lockup
>> (CVE-2016-10723) mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 )
> 
> The issue I am fixing has nothing to do with contention on oom_lock, it 
> has to do with the inability of the oom reaper to free memory for one or 
> more of several reasons: mlock, blockable mmus, ptes, mm->mmap_sem 
> contention, and then the setting of MMF_OOM_SKIP to choose another victim 
> before the original victim even reaches exit_mmap().  Thus, removing 
> oom_lock from exit_mmap() will not fix this issue.
> 
> I agree that oom_lock can be removed from exit_mmap() and it would be 
> helpful to do so, and may address a series of problems that we have yet to 
> encounter, but this would not fix the almost immediate setting of 
> MMF_OOM_SKIP that occurs with minimal memory freeing due to the oom 
> reaper.
> 

Why [PATCH 2/2] in https://marc.info/?l=linux-mm&m=153119509215026&w=4 does not
solve your problem?
David Rientjes July 20, 2018, 10:19 p.m. UTC | #4
On Sat, 21 Jul 2018, Tetsuo Handa wrote:

> Why [PATCH 2/2] in https://marc.info/?l=linux-mm&m=153119509215026&w=4 does not
> solve your problem?
> 

Such an invasive patch, and completely rewrites the oom reaper.  I now 
fully understand your frustration with the cgroup aware oom killer being 
merged into -mm without any roadmap to actually being merged.  I agree 
with you that it should be dropped, not sure why it has not been since 
there is no active review on the proposed patchset from four months ago, 
posted twice, that fixes the issues with it, or those patches being merged 
so the damn thing can actually make progress.
diff mbox

Patch

diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3069,23 +3069,22 @@  void exit_mmap(struct mm_struct *mm)
 		 * Nothing can be holding mm->mmap_sem here and the above call
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
 		 * __oom_reap_task_mm() will not block.
-		 */
-		__oom_reap_task_mm(mm);
-
-		/*
-		 * Now, set MMF_UNSTABLE to avoid racing with the oom reaper.
+		 *
+		 * This sets MMF_UNSTABLE to avoid racing with the oom reaper.
 		 * This needs to be done before calling munlock_vma_pages_all(),
 		 * which clears VM_LOCKED, otherwise the oom reaper cannot
 		 * reliably test for it.  If the oom reaper races with
 		 * munlock_vma_pages_all(), this can result in a kernel oops if
 		 * a pmd is zapped, for example, after follow_page_mask() has
 		 * checked pmd_none().
-		 *
+		 */
+		__oom_reap_task_mm(mm);
+
+		/*
 		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
 		 * guarantee that the oom reaper will not run on this mm again
 		 * after mmap_sem is dropped.
 		 */
-		set_bit(MMF_UNSTABLE, &mm->flags);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -666,12 +666,15 @@  static int oom_reaper(void *unused)
 static u64 oom_free_timeout_ms = 1000;
 static void wake_oom_reaper(struct task_struct *tsk)
 {
+	unsigned long expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms);
+
+	if (!expire)
+		expire++;
 	/*
 	 * Set the reap timeout; if it's already set, the mm is enqueued and
 	 * this tsk can be ignored.
 	 */
-	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL,
-			jiffies + msecs_to_jiffies(oom_free_timeout_ms)))
+	if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, expire))
 		return;
 
 	get_task_struct(tsk);