Message ID | alpine.DEB.2.21.1807200133310.119737@chino.kir.corp.google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On 2018/07/20 17:41, David Rientjes wrote: > Absent oom_lock serialization, this is exactly working as intended. You > could argue that once the thread has reached exit_mmap() and begins oom > reaping that it should be allowed to finish before the oom reaper declares > MMF_OOM_SKIP. That could certainly be helpful, I simply haven't > encountered a usecase where it were needed. Or, we could restart the oom > expiration when MMF_UNSTABLE is set and deem that progress is being made > so it give it some extra time. In practice, again, we haven't seen this > needed. But either of those are very easy to add in as well. Which would > you prefer? I don't think we need to introduce user-visible knob interface (even if it is in debugfs), for I think that my approach can solve your problem. Please try OOM lockup (CVE-2016-10723) mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ) and my cleanup patch ( [PATCH 1/2] at https://marc.info/?l=linux-mm&m=153119509215026&w=4 ) on top of linux.git . And please reply how was the result, for I'm currently asking Roman whether we can apply these patches before applying the cgroup-aware OOM killer.
On Fri, 20 Jul 2018, Tetsuo Handa wrote: > > Absent oom_lock serialization, this is exactly working as intended. You > > could argue that once the thread has reached exit_mmap() and begins oom > > reaping that it should be allowed to finish before the oom reaper declares > > MMF_OOM_SKIP. That could certainly be helpful, I simply haven't > > encountered a usecase where it were needed. Or, we could restart the oom > > expiration when MMF_UNSTABLE is set and deem that progress is being made > > so it give it some extra time. In practice, again, we haven't seen this > > needed. But either of those are very easy to add in as well. Which would > > you prefer? > > I don't think we need to introduce user-visible knob interface (even if it is in > debugfs), for I think that my approach can solve your problem. Please try OOM lockup > (CVE-2016-10723) mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ) The issue I am fixing has nothing to do with contention on oom_lock, it has to do with the inability of the oom reaper to free memory for one or more of several reasons: mlock, blockable mmus, ptes, mm->mmap_sem contention, and then the setting of MMF_OOM_SKIP to choose another victim before the original victim even reaches exit_mmap(). Thus, removing oom_lock from exit_mmap() will not fix this issue. I agree that oom_lock can be removed from exit_mmap() and it would be helpful to do so, and may address a series of problems that we have yet to encounter, but this would not fix the almost immediate setting of MMF_OOM_SKIP that occurs with minimal memory freeing due to the oom reaper.
On 2018/07/21 5:19, David Rientjes wrote: > On Fri, 20 Jul 2018, Tetsuo Handa wrote: > >>> Absent oom_lock serialization, this is exactly working as intended. You >>> could argue that once the thread has reached exit_mmap() and begins oom >>> reaping that it should be allowed to finish before the oom reaper declares >>> MMF_OOM_SKIP. That could certainly be helpful, I simply haven't >>> encountered a usecase where it were needed. Or, we could restart the oom >>> expiration when MMF_UNSTABLE is set and deem that progress is being made >>> so it give it some extra time. In practice, again, we haven't seen this >>> needed. But either of those are very easy to add in as well. Which would >>> you prefer? >> >> I don't think we need to introduce user-visible knob interface (even if it is in >> debugfs), for I think that my approach can solve your problem. Please try OOM lockup >> (CVE-2016-10723) mitigation patch ( https://marc.info/?l=linux-mm&m=153112243424285&w=4 ) > > The issue I am fixing has nothing to do with contention on oom_lock, it > has to do with the inability of the oom reaper to free memory for one or > more of several reasons: mlock, blockable mmus, ptes, mm->mmap_sem > contention, and then the setting of MMF_OOM_SKIP to choose another victim > before the original victim even reaches exit_mmap(). Thus, removing > oom_lock from exit_mmap() will not fix this issue. > > I agree that oom_lock can be removed from exit_mmap() and it would be > helpful to do so, and may address a series of problems that we have yet to > encounter, but this would not fix the almost immediate setting of > MMF_OOM_SKIP that occurs with minimal memory freeing due to the oom > reaper. > Why [PATCH 2/2] in https://marc.info/?l=linux-mm&m=153119509215026&w=4 does not solve your problem?
On Sat, 21 Jul 2018, Tetsuo Handa wrote: > Why [PATCH 2/2] in https://marc.info/?l=linux-mm&m=153119509215026&w=4 does not > solve your problem? > Such an invasive patch, and completely rewrites the oom reaper. I now fully understand your frustration with the cgroup aware oom killer being merged into -mm without any roadmap to actually being merged. I agree with you that it should be dropped, not sure why it has not been since there is no active review on the proposed patchset from four months ago, posted twice, that fixes the issues with it, or those patches being merged so the damn thing can actually make progress.
diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3069,23 +3069,22 @@ void exit_mmap(struct mm_struct *mm) * Nothing can be holding mm->mmap_sem here and the above call * to mmu_notifier_release(mm) ensures mmu notifier callbacks in * __oom_reap_task_mm() will not block. - */ - __oom_reap_task_mm(mm); - - /* - * Now, set MMF_UNSTABLE to avoid racing with the oom reaper. + * + * This sets MMF_UNSTABLE to avoid racing with the oom reaper. * This needs to be done before calling munlock_vma_pages_all(), * which clears VM_LOCKED, otherwise the oom reaper cannot * reliably test for it. If the oom reaper races with * munlock_vma_pages_all(), this can result in a kernel oops if * a pmd is zapped, for example, after follow_page_mask() has * checked pmd_none(). - * + */ + __oom_reap_task_mm(mm); + + /* * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will * guarantee that the oom reaper will not run on this mm again * after mmap_sem is dropped. */ - set_bit(MMF_UNSTABLE, &mm->flags); down_write(&mm->mmap_sem); up_write(&mm->mmap_sem); } diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -666,12 +666,15 @@ static int oom_reaper(void *unused) static u64 oom_free_timeout_ms = 1000; static void wake_oom_reaper(struct task_struct *tsk) { + unsigned long expire = jiffies + msecs_to_jiffies(oom_free_timeout_ms); + + if (!expire) + expire++; /* * Set the reap timeout; if it's already set, the mm is enqueued and * this tsk can be ignored. */ - if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, - jiffies + msecs_to_jiffies(oom_free_timeout_ms))) + if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL, expire)) return; get_task_struct(tsk);